
Series ISSN: 1935-4185

Synthesis Lectures on Communication Networks
Series Editor: R. Srikant, University of Illinois at Urbana-Champaign

Multi-Armed Bandits
Theory and Applications to Online Learning in Networks
Qing Zhao, Cornell University

Multi-armed bandit problems pertain to optimal sequential decision making and learning in
unknown environments. Since the first bandit problem posed by Thompson in 1933 for the
application of clinical trials, bandit problems have enjoyed lasting attention from multiple
research communities and have found a wide range of applications across diverse domains. This
book covers classic results and recent development on both Bayesian and frequentist bandit
problems. We start in Chapter 1 with a brief overview on the history of bandit problems,
contrasting the two schools—Bayesian and frequentist—of approaches and highlighting
foundational results and key applications. Chapters 2 and 4 cover, respectively, the canonical
Bayesian and frequentist bandit models. In Chapters 3 and 5, we discuss major variants of the
canonical bandit models that lead to new directions, bring in new techniques, and broaden
the applications of this classical problem. In Chapter 6, we present several representative
application examples in communication networks and social-economic systems, aiming to
illuminate the connections between the Bayesian and the frequentist formulations of bandit
problems and how structural results pertaining to one may be leveraged to obtain solutions
under the other.

About SYNTHESIS
This volume is a printed version of a work that appears in the Synthesis
Digital Library of Engineering and Computer Science. Synthesis
books provide concise, original presentations of important research and
development topics, published quickly, in digital and print formats.

Morgan & Claypool Publishers
store.morganclaypool.com
Multi-Armed Bandits
Theory and Applications to
Online Learning in Networks
Synthesis Lectures on
Communication Networks
Editor
R. Srikant, University of Illinois at Urbana-Champaign
Founding Editor Emeritus
Jean Walrand, University of California, Berkeley
Synthesis Lectures on Communication Networks is an ongoing series of 75- to 150-page publications
on topics on the design, implementation, and management of communication networks. Each
lecture is a self-contained presentation of one topic by a leading expert. The topics range from
algorithms to hardware implementations and cover a broad spectrum of issues from security to
multiple-access protocols. The series addresses technologies from sensor networks to reconfigurable
optical networks.
The series is designed to:
• Provide the best available presentations of important aspects of communication networks.
• Help engineers and advanced students keep up with recent developments in a rapidly
evolving technology.
• Facilitate the development of courses in this field.

Multi-Armed Bandits: Theory and Applications to Online Learning in Networks


Qing Zhao
2019

Diffusion Source Localization in Large Networks


Lei Ying and Kai Zhu
2018

Communications Networks: A Concise Introduction, Second Edition


Jean Walrand and Shyam Parekh
2017

BATS Codes: Theory and Practice


Shenghao Yang and Raymond W. Yeung
2017
Analytical Methods for Network Congestion Control
Steven H. Low
2017

Advances in Multi-Channel Resource Allocation: Throughput, Delay, and Complexity


Bo Ji, Xiaojun Lin, and Ness B. Shroff
2016

A Primer on Physical-Layer Network Coding


Soung Chang Liew, Lu Lu, and Shengli Zhang
2015

Sharing Network Resources


Abhay Parekh and Jean Walrand
2014

Wireless Network Pricing


Jianwei Huang and Lin Gao
2013

Performance Modeling, Stochastic Networks, and Statistical Multiplexing, Second


Edition
Ravi R. Mazumdar
2013

Packets with Deadlines: A Framework for Real-Time Wireless Networks


I-Hong Hou and P.R. Kumar
2013

Energy-Efficient Scheduling under Delay Constraints for Wireless Networks


Randall Berry, Eytan Modiano, and Murtaza Zafer
2012

NS Simulator for Beginners


Eitan Altman and Tania Jiménez
2012

Network Games: Theory, Models, and Dynamics


Ishai Menache and Asuman Ozdaglar
2011

An Introduction to Models of Online Peer-to-Peer Social Networking


George Kesidis
2010
Stochastic Network Optimization with Application to Communication and Queueing
Systems
Michael J. Neely
2010

Scheduling and Congestion Control for Wireless and Processing Networks


Libin Jiang and Jean Walrand
2010

Performance Modeling of Communication Networks with Markov Chains


Jeonghoon Mo
2010

Communication Networks: A Concise Introduction


Jean Walrand and Shyam Parekh
2010

Path Problems in Networks


John S. Baras and George Theodorakopoulos
2010

Performance Modeling, Loss Networks, and Statistical Multiplexing


Ravi R. Mazumdar
2009

Network Simulation
Richard M. Fujimoto, Kalyan S. Perumalla, and George F. Riley
2006
Copyright © 2020 by Morgan & Claypool

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.

Multi-Armed Bandits: Theory and Applications to Online Learning in Networks


Qing Zhao
www.morganclaypool.com

ISBN: 9781627056380 paperback


ISBN: 9781627058711 ebook
ISBN: 9781681736372 hardcover

DOI 10.2200/S00941ED2V01Y201907CNT022

A Publication in the Morgan & Claypool Publishers series


SYNTHESIS LECTURES ON COMMUNICATION NETWORKS

Lecture #22
Series Editor: R. Srikant, University of Illinois at Urbana-Champaign
Founding Editor Emeritus: Jean Walrand, University of California, Berkeley
Series ISSN
Print 1935-4185 Electronic 1935-4193
Multi-Armed Bandits
Theory and Applications to
Online Learning in Networks

Qing Zhao
Cornell University

SYNTHESIS LECTURES ON COMMUNICATION NETWORKS #22

Morgan & Claypool Publishers
ABSTRACT
Multi-armed bandit problems pertain to optimal sequential decision making and learning in
unknown environments. Since the first bandit problem posed by Thompson in 1933 for the ap-
plication of clinical trials, bandit problems have enjoyed lasting attention from multiple research
communities and have found a wide range of applications across diverse domains. This book cov-
ers classic results and recent development on both Bayesian and frequentist bandit problems. We
start in Chapter 1 with a brief overview on the history of bandit problems, contrasting the two
schools—Bayesian and frequentist—of approaches and highlighting foundational results and
key applications. Chapters 2 and 4 cover, respectively, the canonical Bayesian and frequentist
bandit models. In Chapters 3 and 5, we discuss major variants of the canonical bandit models
that lead to new directions, bring in new techniques, and broaden the applications of this clas-
sical problem. In Chapter 6, we present several representative application examples in commu-
nication networks and social-economic systems, aiming to illuminate the connections between
the Bayesian and the frequentist formulations of bandit problems and how structural results
pertaining to one may be leveraged to obtain solutions under the other.

KEYWORDS
multi-armed bandit, machine learning, online learning, reinforcement learning,
Markov decision processes

To Peter Whittle
and to Lang and Everett.

Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Multi-Armed Bandit Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 An Essential Conflict: Exploration vs. Exploitation . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Two Formulations: Bayesian and Frequentist . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.1 The Bayesian Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.2 The Frequentist Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Bayesian Bandit Model and Gittins Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7


2.1 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Policy and the Value of a Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Optimality Equation and Dynamic Programming . . . . . . . . . . . . . . . . 9
2.2 The Bayesian Bandit Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Gittins Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Gittins Index and Forward Induction . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2 Interpretations of Gittins Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.3 The Index Process, Lower Envelop, and Monotonicity of the
Stopping Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Optimality of the Gittins Index Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5 Computing Gittins Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5.1 Offline Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5.2 Online Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.6 Semi-Markov Bandit Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3 Variants of the Bayesian Bandit Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31


3.1 Necessary Assumptions for the Index Theorem . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.1 Modeling Assumptions on the Action Space . . . . . . . . . . . . . . . . . . . . 32
3.1.2 Modeling Assumptions on the System Dynamics . . . . . . . . . . . . . . . . 33
3.1.3 Modeling Assumptions on the Reward Structure . . . . . . . . . . . . . . . . 34
3.1.4 Modeling Assumptions on the Performance Measure . . . . . . . . . . . . . 34
3.2 Variations in the Action Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1 Multitasking: The Bandit Superprocess Model . . . . . . . . . . . . . . . . . . 35
3.2.2 Bandits with Precedence Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.3 Open Bandit Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Variations in the System Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.1 The Restless Bandit Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.2 Indexability and Whittle Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.3 Optimality of Whittle Index Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.4 Computational Approaches to Restless Bandits . . . . . . . . . . . . . . . . . 50
3.4 Variations in the Reward Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4.1 Bandits with Rewards under Passivity . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4.2 Bandits with Switching Cost and Switching Delay . . . . . . . . . . . . . . . 51
3.5 Variations in Performance Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.5.1 Stochastic Shortest Path Bandit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.5.2 Average-Reward and Sensitive-Discount Criteria . . . . . . . . . . . . . . . . 55
3.5.3 Finite-Horizon Criterion: Bandits with Deadlines . . . . . . . . . . . . . . . 56

4 Frequentist Bandit Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57


4.1 Basic Formulations and Regret Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1.1 Uniform Dominance vs. Minimax . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.1.2 Problem-Specific Regret and Worst-Case Regret . . . . . . . . . . . . . . . . 59
4.1.3 Reward Distribution Families and Admissible Policy Classes . . . . . . 60
4.2 Lower Bounds on Regret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.1 The Problem-Specific Regret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.2 The Minimax Regret . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3 Online Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.1 Asymptotically Optimal Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.3.2 Order-Optimal Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4 Connections between Bayesian and Frequentist Bandit Models . . . . . . . . . . . 79
4.4.1 Frequentist Approaches to Bayesian Bandits . . . . . . . . . . . . . . . . . . . . 79
4.4.2 Bayesian Approaches to Frequentist Bandits . . . . . . . . . . . . . . . . . . . . 80

5 Variants of the Frequentist Bandit Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85


5.1 Variations in the Reward Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.1.1 Rested Markov Reward Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.1.2 Restless Markov Reward Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.1.3 Nonstationary Reward Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.1.4 Nonstochastic Reward Processes: Adversarial Bandits . . . . . . . . . . . . 92
5.2 Variations in the Action Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.2.1 Large-Scale Bandits with Structured Action Space . . . . . . . . . . . . . . . 94
5.2.2 Constrained Action Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3 Variations in the Observation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3.1 Full-Information Feedback: The Expert Setting . . . . . . . . . . . . . . . . . 99
5.3.2 Graph-Structured Feedback: Bandits with Side Observations . . . . . 100
5.3.3 Constrained and Controlled Feedback: Label-Efficient Bandits . . . 101
5.3.4 Comparative Feedback: Dueling Bandits . . . . . . . . . . . . . . . . . . . . . . 101
5.4 Variations in the Performance Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4.1 Risk-Averse Bandits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.4.2 Pure-Exploration Bandits: Active Inference . . . . . . . . . . . . . . . . . . . 108
5.5 Learning in Context: Bandits with Side Information . . . . . . . . . . . . . . . . . . . 112
5.6 Learning under Competition: Bandits with Multiple Players . . . . . . . . . . . . 115
5.6.1 Centralized Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.6.2 Distributed Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

6 Application Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117


6.1 Communication and Computer Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.1.1 Dynamic Multichannel Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.1.2 Adaptive Routing under Unknown Link States . . . . . . . . . . . . . . . . 120
6.1.3 Heavy Hitter and Hierarchical Heavy Hitter Detection . . . . . . . . . . 121
6.2 Social-Economic Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.2.1 Dynamic Pricing and the Pursuit of Complete Learning . . . . . . . . . 123
6.2.2 Web Search, Ads Display, and Recommendation Systems:
Learning to Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

Author’s Biography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147



Preface
The term “multi-armed bandit” comes from likening an archetypal online learning problem to
playing a slot machine that has multiple arms (slot machines are also known as bandits due to
their ability to empty the player’s pockets). Each arm, when pulled, generates random rewards
drawn from an unknown distribution or a known distribution with an unknown mean. The
player chooses one arm to pull at each time, with the objective of accumulating, in expectation,
as much reward as possible over a given time horizon. The tradeoff facing the player is a classic
one, that is, to explore a less observed arm which may hold a greater potential for the future or
to exploit an arm with a history of offering good rewards. It is this tension between learning and
earning that lends complexity and richness to the bandit problems.
As in many problems involving unknowns, bandit problems can be treated within the
Bayesian or frequentist frameworks, depending on whether the unknowns are viewed as random
variables with known prior distributions or as deterministic quantities. These two schools have
largely evolved independently. In recent years, we have witnessed increased interest and much success
in cross-pollination between the two schools. It is my hope that by covering both the Bayesian
and frequentist bandit models, this book further stimulates research interests in this direction.
We start in Chapter 1 with an overview on the history and foundational results of the
bandit problems within both frameworks. In Chapters 2 and 4, we devote our attention to the
canonical Bayesian and frequentist formulations. Major results are treated in detail. Proofs for
key theorems are provided.
New and emerging applications in computer science, engineering, and social-economic
systems give rise to a diverse set of variants of the classical models, generating new directions
and bringing in new techniques to this classical problem. We discuss major variants under the
Bayesian framework and the frequentist framework in Chapters 3 and 5, respectively. The cover-
age, inevitably incomplete, focuses on the general formulations and major results with technical
details often omitted. Special attention is given to the unique challenges and additional struc-
tures these variants bring to the original bandit models. Being derivative to the original models,
these variants also offer a deeper appreciation and understanding of the core theory and tech-
niques. In addition to bringing awareness of new bandit models and providing reference points,
these two chapters point out unexplored directions and open questions.
In Chapter 6, we present application examples of the bandit models in communication
networks and social-economic systems. While these examples provide only a glimpse of the
expansive range of potential applications of bandit models, it is my hope that they illustrate two
fruitful research directions: applications with additional structures that admit stronger results
than what can be offered by the general theory, and applications bringing in new objectives and
constraints that push the boundaries of the bandit models. These examples are chosen also to
show the connections between the Bayesian and frequentist formulations and how structural
results pertaining to one may be leveraged to obtain solutions under the other.

Qing Zhao
Ithaca, NY, August 2019

Acknowledgments
In December 2015, Srikant, the editor of the series, asked whether I would be interested in
writing a book on multi-armed bandits. By that time, I had worked on bandit problems for
a decade, starting with the Bayesian and then the frequentist. I was quite confident in taking
on the task and excited with the ambition of bringing together the two schools of approaches
within one book, which I felt was lacking in the literature and was much needed. When
asked for a timeframe for finishing the book, I gave an estimate of one year. “That ought to leave
me plenty of margin,” I thought. My son, Everett, was one year old then.
Everett is starting kindergarten next week.
Writing this book has been a humbling experience. The vast landscape of the existing
literature, both classical and new, reincarnations of ideas, often decades apart, and quite a few
reinventions of wheels (with contributions from myself in that regard), have made the original
goal of giving a comprehensive coverage and respecting the historical roots of all results seem
unattainable at times. If it were not for the encouragement and persistent nudging from Srikant
and the publisher Michael Morgan, the book would have remained unfinished forever. I do not
think I have achieved the original goal. This is a version I can, at least, live with.
Many people have helped me in learning this fascinating subject. The first paper I read on
bandit problems was “Playing Golf with Two Balls,” pointed out to me by Vikram Krishnamurthy,
then a professor at UBC and now my colleague at Cornell. It was the summer of 2005 when I
visited Vikram. We were working on a sensor scheduling problem under the objective of network
lifetime maximization, which leads to a stochastic shortest-path bandit. The appeal of the bandit
problems was instantaneous and has never faded, and I must admit a stronger affection towards
the Bayesian models, likely due to the earlier exposure. Special thanks go to Peter Whittle of
the University of Cambridge. I am forever grateful for his tremendous encouragement throughout
my career and his generous comments on our results on restless bandits. His incisive writing has
always been an inspiration. Many thanks to my students, past and current, who taught me most
things I know about bandits through presentations in our endless group meetings and through
their research, in particular, Keqin Liu and Sattar Vakili whose dissertations focused almost
exclusively on bandit problems.
My deepest appreciation goes to my husband, Lang, for letting me hide away for weeks
finishing up a first draft while he took care of Everett, for agreeing to read the draft and provide
comments, and for actually doing so for the Introduction! I thank my dear Everett for the many hours
sitting patiently next to me, copying on his iPad every letter I typed. It was this July in Sweden
when there was no daycare and I was trying to wrap up the book. He has been the most faithful
reader of the book, who read, not word by word, but letter by letter.
I thank the U.S. National Science Foundation, the Army Research Office, and the Army
Research Lab for supporting the research of my group and the talented Ph.D. students I worked
with. I am grateful for the support during my 2018–2019 sabbatical leave in Sweden from the
European Union through a Marie Skłodowska-Curie grant1 and from the Chalmers University
of Technology through a Jubilee Professorship. Their generous support has allowed me to put
in continuous effort on the book and finally push it over the finish line.

Qing Zhao
Ithaca, NY, August 2019

1 Current supporting grants include the National Science Foundation Grant CCF-1815559, the Army Research Labo-
ratory Network Science CTA under Cooperative Agreement W911NF-09-2-0053, and the European Union's Horizon 2020
research and innovation programme under the Marie Skłodowska-Curie grant agreement No 754412.

CHAPTER 1

Introduction
We start with a brief overview of the multi-armed bandit problems, contrasting two schools of
approaches—Bayesian and frequentist—and highlighting foundational results and key applica-
tions. Concepts introduced informally here will be treated with care and rigor in later chapters.

1.1 MULTI-ARMED BANDIT PROBLEMS

The first multi-armed bandit problem was posed by Thompson, 1933 [190], for the application
of clinical trials. The problem considered there is as follows. Suppose that two experimental treat-
ments are available for a certain disease. The effectiveness of these two treatments is unknown.
The decision on which treatment to use on each patient is made sequentially in time based on
responses of past patients to their prescribed treatment. In Thompson’s study, a patient’s re-
sponse to a treatment is assumed to be dichotomous: success or failure. The two treatments are
then equated with two Bernoulli random variables with unknown parameters θ_1 and θ_2. The
objective is to prescribe to as many patients as possible the treatment that has a higher mean
value.
In earlier studies, this type of sequential learning problems was referred to as “sequential
design of experiments” (see, for example, Robbins, 1952 [168] and Bellman, 1956 [30]). The
term “multi-armed bandit” first appeared in several papers published in the late 1950s to early
1960s (see Bradt, Johnson, and Karlin, 1956 [43], Vogel, 1960a, 1960b [203, 204], and Feldman,
1962 [79]), by likening the problem to playing a bandit slot machine with multiple arms, each
generating random rewards drawn from a distribution with an unknown mean θ_i.
Bandit problems arise in a diverse range of applications. For instance, in communication
networks, the problem may be in the form of which channels to access, which paths to route
packets, which queues to serve, or which sensors to activate. For applications in social and eco-
nomic systems, arms may represent products or movies to recommend, prices to set for a new
product, or documents and sponsored ads to display on a search engine. The classical problem of
stochastic optimization concerned with the minimization of an unknown random loss function
can also be viewed as a bandit problem in disguise. In this case, arms, given by the domain of
the loss function, are no longer discrete.
1.2 AN ESSENTIAL CONFLICT: EXPLORATION VS.
EXPLOITATION
The essence of the multi-armed bandit problems is in the tradeoff between exploitation and ex-
ploration where the player faces the conflicting objectives of playing the arm with the best reward
history and playing a less explored arm to learn its reward mean so that better decisions can be
made in the future. This essential conflict was well articulated by Whittle in the foreword to
the first edition of Gittins’s monograph on multi-armed bandits (1989 [89]): “Bandit problems
embody in essential form a conflict evident in all human action: choosing actions which yield
immediate reward vs. choosing actions (e.g., acquiring information or preparing the ground)
whose benefit will come only later.”
This tradeoff between exploitation and exploration was sharply illustrated with a simple
example in the monograph by Berry and Fristedt, 1985 [33]. Suppose there are two coins with
biases θ_1 and θ_2. It is known that θ_1 = 1/2 (a fair coin) and θ_2 is either 1 (a coin with two heads)
or 0 (a coin with two tails) with probability 1/4 and 3/4, respectively. At each time, one coin is
selected and flipped. The objective is to maximize the expected number of heads in T coin flips.
Since the immediate reward from flipping coin 1 is 1/2 compared with 1/4 from flipping coin 2, a
myopic strategy that aims solely at maximizing immediate reward would flip coin 1 indefinitely.
However, it is easy to see that for T > 3, a better strategy is to flip coin 2 initially and then, upon
observing a tail, switch to coin 1 for the remaining time horizon. It is not difficult to show that
this strategy is optimal and the improvement over the myopic strategy is (T − 3)/8.
In this simple example, arm 1 is known (thus no information to be gained from playing it)
whereas a single play of arm 2 reveals complete information. The tradeoff between exploitation
(playing arm 1) and exploration (playing arm 2) is thus easy to appreciate and simple to quantify.
In a general bandit problem, however, balancing immediate payoff and information is seldom
this simple.
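As a quick sanity check on the (T − 3)/8 improvement, the short Python sketch below compares the expected number of heads under the myopic strategy and under the switch-on-tail strategy by direct computation (the setup and function names here are ours, for illustration only):

    # Two-coin example: coin 1 is fair (theta_1 = 1/2); coin 2 has theta_2 = 1
    # with probability 1/4 and theta_2 = 0 with probability 3/4.
    def myopic_value(T):
        # Always flip coin 1: expected heads = T/2.
        return T / 2

    def switch_on_tail_value(T):
        # Flip coin 2 first; upon observing a tail, switch to coin 1 forever.
        # With prob 1/4 coin 2 is all-heads: T heads.
        # With prob 3/4 coin 2 is all-tails: 0 + (T - 1)/2 heads from coin 1.
        return 0.25 * T + 0.75 * (T - 1) / 2

    for T in [4, 10, 100]:
        gap = switch_on_tail_value(T) - myopic_value(T)
        print(T, gap, (T - 3) / 8)   # the gap matches (T - 3)/8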

1.3 TWO FORMULATIONS: BAYESIAN AND FREQUENTIST
Similar to other problems with unknowns, there are two schools of approaches to formulating
the bandit problems. One is the so-called Bayesian point of view in which the unknown values
{θ_i} are random variables with given prior distributions. In this case, the learning algorithm aims
to learn the realizations of {θ_i} by sharpening the prior distributions using each observation.
The performance of the algorithm is measured by averaging over all possible realizations of
{θ_i} under the given prior distributions. The other is the frequentist point of view in which
{θ_i} are deterministic unknown parameters. Under the frequentist formulation, the problem is
a special class of reinforcement learning, where the decision maker learns by coupling actions
with observations. Not averaged over all possible values of {θ_i}, the performance of a learning
algorithm is inherently dependent on the specific values of the unknown parameters {θ_i}.
To contrast these two approaches, consider again the archetypal problem of flipping two
coins with unknown biases. The Bayesian approach can be interpreted as addressing the follow-
ing scenario: we are given two buckets of coins, each containing a known composition of coins
with different biases; we randomly select one coin from each bucket and start flipping the
two chosen coins. Under the frequentist approach, however, we face two coins that we have no
prior knowledge about whether or which buckets they may come from.

1.3.1 THE BAYESIAN FRAMEWORK


The first bandit problem posed by Thompson was studied under the Bayesian approach, where
Thompson adopted the uniform distribution as the prior and focused on maximizing the ex-
pected number of successes in T trials. Known as Thompson sampling, the solution he proposed
has been studied and applied within both Bayesian and non-Bayesian frameworks and continues
to receive great attention.
Bellman, 1956 [30] formulated the Bayesian bandit problem as a Markov decision process
(MDP) and proposed an optimal solution based on dynamic programming. To see the Bayesian
bandit problem as an MDP, one only needs to realize that a sufficient statistic for making op-
timal decisions at any given time is the posterior distributions of {θ_i} computed from the prior
distributions and past observations using the Bayes rule. The resulting MDP formulation is
readily seen by treating the posterior distributions as the information state which evolves as a
Markov process under a given strategy. The Bayesian bandit model defined as a class of MDP
problems, however, represents a much broader family of decision problems than the original
problem of sampling unknown processes: the state of the MDP is not necessarily restricted to
being informational, and the state transitions do not necessarily obey the Bayes rule.
While dynamic programming offers a general technique for obtaining the optimal solu-
tion, its computational complexity grows exponentially with the number of arms since the state
space of the resulting MDP is the Cartesian product of the state spaces of all arms. A natural
question is whether bandit problems, as a special class of MDP, possess sufficient structures that
admit simple optimal solutions.
The problem fascinated the research community for decades, while the answer eluded
them until the early 1970s. A legend, as told by Whittle,¹ 1980, put it this way: “The problem is
a classic one; it was formulated during the war, and efforts to solve it so sapped the energies
and minds of Allied analysts that the suggestion was made that the problem be dropped over
Germany, as the ultimate instrument of intellectual sabotage.”
The problem was eventually solved by British mathematicians, after all. The breakthrough
was made by Gittins and Jones in 1972, who first read their result at the European Meeting
of Statisticians with proceedings published in 1974 [87]. The result, however, became widely
known only after the publication of Gittins’ 1979 paper [88].

1 This Monty Python-styled story was told by Peter Whittle, a Churchill Professor of Mathematics for Operational
Research at the University of Cambridge, in the discussion section of Gittins’s 1979 paper [88].
Referred to as the Gittins index theorem, the result states that the optimal policy to the
bandit problem exhibits a structure of strong decomposability: a priority index, depending solely
on the characterizations of each individual arm, can be attached to each state of each arm, and
playing the arm with the currently greatest index is optimal. Such an index policy, by fully
decoupling the arms, reduces an N-dimensional problem to N independent one-dimensional
problems, consequently reducing the computational complexity from an exponential order to a
linear order in the number of arms.
This breakthrough generated a flurry of activities in two directions, as fittingly de-
scribed by Whittle as exploitation—seeking proofs that provide additional insights and different
perspectives—and exploration—examining to what extent this class of MDP can be generalized
while preserving the index theorem. We highlight below a couple of celebrated results, leaving
the detailed discussion to Chapters 2 and 3.
One notable result in the pursuit of a better understanding was the proof by Weber in
1992 [208], which is “expressible in a single paragraph of verbal reasoning” as described by
Whittle in his foreword to the second edition of Gittins’ monograph with the addition of two
coauthors, Glazebrook and Weber, 2011, [90]. Such a result makes one appreciate better the
quote by Richard Feynman: “If a topic cannot be explained in a freshman lecture, it is not yet
fully understood.”
Along the direction of exploration, a significant extension is the restless bandit problems
posed and studied by Whittle in 1988 [212]. This class of more general MDP problems allows
for system dynamics that cannot be directly controlled. More specifically, the state of an arm may
continue to evolve when it is not engaged. In the original bandit model for sampling unknown
processes, a passive arm, with no new observations to update its information state, remains
frozen. Under the more general MDP model, however, the state of an arm may represent a
certain physical state that continues to evolve regardless of the decision maker’s action. For
instance, the quality of a communication channel may change even when it is not accessed;
queues continue to grow due to new arrivals even when they are not served; targets continue to
move even when they are not monitored.
The Gittins index policy loses its optimality for restless bandit problems. Based on a La-
grangian relaxation of the problem, Whittle proposed an index policy, known as the Whittle
index. It was later shown by Weber and Weiss in 1990 [206] that the Whittle index policy is
asymptotically (as the number of arms approaches infinity) optimal under certain conditions.
In the finite regime, the optimality of the Whittle index has been established in a number of special
cases motivated by specific engineering applications, and the strong performance of the Whittle in-
dex has been observed in extensive numerical studies. The optimal solution to a general restless
bandit problem remains open.
1.3.2 THE FREQUENTIST FRAMEWORK
Within the frequentist framework, since the performance of a strategy in general depends on
the unknown parameters {θ_i}, an immediate question is how to compare strategies given that,
in general, there does not exist a strategy that dominates all other strategies for all values of {θ_i}.
Two approaches have been considered. The first one, referred to as the uniform-dominance
approach, restricts the set of admissible strategies to uniformly good policies: policies that offer
“good” (to be made precise later) performance for all {θ_i}. This excludes from consideration those
heavily biased strategies, such as one that always plays arm 1 and offers the best possible return
when θ_1 is indeed the largest. Among thus defined admissible strategies, it is then possible to
compare performance as a function of {θ_i} and define the associated optimality.
The second approach is the minimax approach: every strategy is admissible, but the per-
formance of a strategy is measured against the worst possible {θ_i} specific to this strategy and the
horizon length T. The worst-case performance of a strategy no longer depends on {θ_i}. Mean-
ingful comparison of all strategies can be carried out, nontrivial bounds on the best achievable
performance can be characterized, and optimality can be defined. This approach can also be
viewed through the lens of a two-player zero-sum game where {θ_i} are chosen by an opponent.
Under both approaches, the main questions of concern are whether the maximum return
enjoyed by an omniscient player (the oracle, so to speak) with prior knowledge of {θ_i} can be
approached via online learning, and if yes, what the fastest rate of convergence to this maximum
return would be. With {θ_i} known, the oracle would obviously be playing the arm with the
largest mean value at all times and obtain the maximum expected reward of max_i θ_i per play.
The first attempt at answering the above questions was made by Robbins in 1952 [168],
in which he considered a two-armed bandit problem and provided a positive answer to the
first question. To address the second question on the convergence rate, in 1985, Lai and Rob-
bins [126] proposed a finer performance measure of regret, defined as the expected cumulative
loss over the entire horizon of length T with respect to the oracle. All online learning strategies
with a regret growing sublinearly in T have a diminishing reward loss per play and asymptot-
ically achieve the maximum average reward of max_i θ_i as T approaches infinity. The specific
sublinear regret growth rates, however, differentiate them in their effectiveness of learning.
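To fix ideas, the regret of a policy π over a horizon of length T, measured against the oracle that always plays the best arm, can be written as

    R_\pi(T; \{\theta_i\}) \;=\; T \max_i \theta_i \;-\; \mathbb{E}_\pi\Big[\sum_{t=1}^{T} X_{a(t)}(t)\Big],

where a(t) denotes the arm played at time t and X_{a(t)}(t) the random reward it yields. This notation is introduced here only for illustration; the formal definitions and the different regret measures are given in Chapter 4.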
In this much celebrated work, Lai and Robbins, 1985 [126] took the uniform-dominance
approach by restricting admissible strategies to consistent policies. They showed that the min-
imum regret feasible among consistent policies has a logarithmic order in T and constructed
asymptotically optimal policies for several reward distributions. Following this seminal work,
simpler sample-mean based index-type policies were developed by Agrawal, 1995 [4] and Auer,
Cesa-Bianchi, and Fischer, 2002 [23] that achieve the optimal logarithmic regret order under
different conditions on the reward distributions. A couple of open-loop strategies with a prede-
termined control of the exploration-exploitation tradeoff have also been shown to be sufficient
to offer order optimality. These simple strategies, less adaptive to random observations, are re-
cently shown to have advantages over fully adaptive strategies when risk is factored into the
performance measure (see Section 5.4.1).
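As a concrete instance of the sample-mean based index policies mentioned above, the following minimal Python sketch implements the UCB1 rule of Auer, Cesa-Bianchi, and Fischer, 2002 [23], assuming rewards in [0, 1] (the simulation scaffolding and variable names are ours):

    import math, random

    def ucb1(arms, T):
        # `arms` is a list of functions, each returning a random reward in [0, 1].
        N = len(arms)
        counts = [0] * N      # number of times each arm has been played
        means = [0.0] * N     # sample-mean reward of each arm
        total = 0.0
        for t in range(1, T + 1):
            if t <= N:
                a = t - 1     # initialization: play each arm once
            else:
                # index = sample mean + confidence term sqrt(2 ln t / n_i)
                a = max(range(N),
                        key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
            x = arms[a]()
            counts[a] += 1
            means[a] += (x - means[a]) / counts[a]
            total += x
        return total

    # Example: two Bernoulli arms with means 0.4 and 0.6.
    arms = [lambda: float(random.random() < 0.4), lambda: float(random.random() < 0.6)]
    print(ucb1(arms, 10000))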
Early results under the minimax approach can be traced back to the work by Vogel in 1960,
who studied a two-armed bandit with Bernoulli rewards. It was shown by Vogel, 1960b [204]
that the minimum regret growth rate under the minimax approach is of the order Ω(√T). This
fundamental result, established more than two decades earlier than its counterpart under the
uniform-dominance approach by Lai and Robbins in 1985, seems to have largely escaped the
attention of the research community. One notable exception is the detailed and more general
coverage in Chapter 9 of the book by Berry and Fristedt, 1985 [33]. In most of the literature,
however, the Ω(√T) lower bound on the minimax regret is credited to much later studies that ap-
peared in the 2000s.
Stemming from these foundational results, there is an extensive and fast-growing liter-
ature, in both theory and applications, on the frequentist bandit model and its variant of ad-
versarial bandits. Chapters 4 and 5 give a detailed coverage of both classical results and recent
development.

1.4 NOTATION
Notations used in this book are relatively standard. Random variables and their realizations are
denoted by capital and lowercase letters, respectively. Vectors and matrixes are in bold face. Sets
are in script style, with R denoting the set of real numbers.
Probability measure and expectation are denoted by P[·] and E[·], respectively. The indi-
cator function is denoted by I[·].
The technical treatment mostly deals with discrete time, in which case, time starts at t = 1.
For continuous time, the origin is set at t = 0.
We follow the Bachmann–Landau notation of O(·) and Ω(·) for describing the limiting
behavior of a function. Throughout the book, the terms strategy, policy, and algorithm are used
interchangeably.

CHAPTER 2

Bayesian Bandit Model and
Gittins Index
In this chapter, we formulate the Bayesian bandit problem as a Markov decision process (MDP).
We then devote our attention to the definition, interpretations, and computation of the Gittins
index and the index theorem establishing the optimality of the Gittins index policy.

2.1 MARKOV DECISION PROCESSES


We briefly review basic concepts of MDP. We restrict our attention to discrete-time MDPs
with countable state space to illustrate the basic formulation and solution techniques. Readers
are referred to texts by Ross, 1995 [171] and Puterman, 2005 [164] for more details.

2.1.1 POLICY AND THE VALUE OF A POLICY


The state S(t) ∈ 𝒮 of a process is observed at discrete time instants t = 1, 2, …, where 𝒮 is
referred to as the state space. Based on the observed state, an action a ∈ 𝒜 is chosen from the
action space 𝒜. If the process is in state S(t) = s and action a is taken, then the following two
events occur.

• A reward r(s, a) is obtained.

• The process transits to state s′ ∈ 𝒮 with probability p(s, s′; a).

This decision process is called an MDP. The objective is to design a policy that governs the
sequential selection of actions to optimize a certain form of the rewards accrued over a given
time horizon.
A policy π is given by a sequence of decision rules d_t, one for each time t:

π = (d_1, d_2, …).          (2.1)

Based on the information used for selecting the actions, a decision rule can be either Markov
or history dependent: a Markov decision rule d_t depends only on the state of the process at
time t; a history-dependent decision rule uses the entire history of the process (both the states
and the actions) up to t. Based on how the chosen action is specified, a decision rule can be
either deterministic or randomized. The former returns a specific action to be taken. The latter
determines a probability distribution on 𝒜 based on which a random action will be drawn. Thus,
a Markov deterministic decision rule is a mapping from 𝒮 to 𝒜. A Markov randomized decision
rule is a mapping from 𝒮 to the set of probability distributions on 𝒜. A policy is Markov if d_t is
Markov for all t. A policy is deterministic if d_t is deterministic for all t. A policy is stationary if it
is Markov and employs the same decision rule for all t. Under a stationary policy, the sequence of
the induced states {S(t)}_{t≥1} forms a homogeneous Markov process. The sequence of the states
and the rewards {S(t), r(S(t), d_t(S(t)))}_{t≥1} induced by a Markov policy π = (d_1, d_2, …) is a
Markov reward process.
We are interested in policies that are optimal in some sense. Commonly adopted optimal-
ity criteria include the total expected reward over a finite horizon of length T , the total expected
discounted reward over an infinite horizon, and the average reward over an infinite horizon.
Under each of these three criteria, the value of a policy π for an initial state s is given as follows.

Total reward criterion:        V_π(s) = E_π[ Σ_{t=1}^{T} r(S(t), a(t)) | S(1) = s ],                       (2.2)

Discounted reward criterion:   V_π(s) = E_π[ Σ_{t=1}^{∞} β^{t−1} r(S(t), a(t)) | S(1) = s ],               (2.3)

Average reward criterion:      V_π(s) = lim inf_{T→∞} (1/T) E_π[ Σ_{t=1}^{T} r(S(t), a(t)) | S(1) = s ],   (2.4)

where a(t) is the action taken at time t under π, E_π represents the conditional expectation
given that policy π is employed, and β (0 < β < 1) is the discount factor. The discounted reward
criterion is economically motivated: a reward to be earned in the future is less valuable than one
earned at the present time.
With the value of a policy defined, the value of an MDP with an initial state s is given by

V(s) = sup_π V_π(s).          (2.5)

A policy π* is optimal if, for all s ∈ 𝒮,

V_{π*}(s) = V(s).          (2.6)
We shall assume the existence of an optimal policy (see, for example, Puterman, 2005 [164] on
sufficient conditions that guarantee the existence of an optimal policy). It is known that if an
optimal policy exists, then there exists an optimal policy which is Markov and deterministic.
The sufficiency of Markov policies for achieving optimality can be seen from the fact that the
immediate rewards and the transition probabilities depend on the history only through the cur-
rent state of the process. The sufficiency of deterministic policies can be easily understood since
the return of a random action is a weighted average over the returns of all actions, which cannot
exceed the maximum return achieved by a specific action. Based on this result, we can restrict
our attention to Markov and deterministic policies.
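To make the discounted-reward criterion (2.3) concrete for a stationary deterministic policy, the following short Python sketch evaluates such a policy on a small toy MDP by solving the linear system V_π = r_π + β P_π V_π (the MDP data, the discount factor, and all names below are ours, chosen only for illustration):

    import numpy as np

    beta = 0.9                                   # discount factor
    # A toy MDP with 3 states and 2 actions: p[a][s][s'] and r[s][a].
    p = np.array([[[0.7, 0.3, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],   # action 0
                  [[0.2, 0.5, 0.3], [0.0, 0.5, 0.5], [0.6, 0.2, 0.2]]])  # action 1
    r = np.array([[1.0, 0.0], [0.5, 0.8], [0.0, 2.0]])

    def policy_value(policy):
        # Discounted value V_pi(s) of a stationary deterministic policy,
        # given as a list mapping each state to an action.
        n = len(policy)
        P_pi = np.array([p[policy[s]][s] for s in range(n)])   # induced transition matrix
        r_pi = np.array([r[s][policy[s]] for s in range(n)])   # induced reward vector
        return np.linalg.solve(np.eye(n) - beta * P_pi, r_pi)  # solve (I - beta P_pi) V = r_pi

    print(policy_value([0, 1, 1]))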
2.1.2 OPTIMALITY EQUATION AND DYNAMIC PROGRAMMING
It is instructive to consider first the finite-horizon problem in (2.2). We present a technique,
known as backward induction, for solving for the optimal policy π* recursively in time, starting
from the last decision period T.
In decision period t, given the current state S(t) = s, irrespective of past actions and
rewards, the optimal decision rules for the remaining time horizon are those that maximize the
expected total reward summed over t to T. In other words, we face a “new” finite-horizon MDP
with a horizon length of T − t + 1 and an initial state s.
Let V_n(s) (n = 1, …, T) denote the value of the n-stage MDP with an initial state s. We
can solve for the value function V_T(s) and the corresponding optimal policy π* = (d_1*, …, d_T*)
recursively in terms of the horizon length starting from n = 1. This can be viewed as going
backward in time starting from the last decision period T which faces a simple one-stage decision
problem. Specifically, at t = T, we readily have, for all s ∈ 𝒮,

V_1(s) = max_{a∈𝒜} r(s, a).          (2.7)

The optimal decision rule for this stage maps from each state s ∈ 𝒮 to an action a*(s) that
achieves the above maximum value:

a*(s) = arg max_{a∈𝒜} r(s, a).          (2.8)

Now consider an n-stage problem with an initial state s. If action a is taken initially, then
a reward of r(s, a) is immediately accrued and the state transits to s′ with probability p(s, s′; a).
We are then facing an (n − 1)-stage problem, from which the maximum total remaining reward
is given by V_{n−1}(s′) under a realized state transition to s′. Thus, if we take action a initially and
then follow the optimal decision rules for the subsequent (n − 1)-stage problem, the total return
is

r(s, a) + Σ_{s′∈𝒮} p(s, s′; a) V_{n−1}(s′).

The optimal initial action needs to strike a balance between the above two terms: the result-
ing immediate reward r(s, a) and the impact on future state evolutions (hence future values)
characterized by {p(s, s′; a)}. The value for the n-stage problem is thus given by

V_n(s) = max_{a∈𝒜} { r(s, a) + Σ_{s′∈𝒮} p(s, s′; a) V_{n−1}(s′) }.          (2.9)

The optimal decision rule can be obtained by finding an action attaining the above maximum
value for each s ∈ 𝒮.
Equation (2.9) is known as the optimality equation or the dynamic programming equation.
The above also gives a numerical technique, referred to as backward induction, for solving for the
T-stage value function V_T(s) and the optimal policy π* = (d_1*, …, d_T*) recursively.
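A minimal Python sketch of this backward induction, written against the toy arrays p and r from the earlier policy-evaluation sketch (again, purely illustrative):

    import numpy as np

    def backward_induction(p, r, T):
        # Finite-horizon dynamic programming per (2.7)-(2.9).
        # Returns the T-stage values and one optimal decision rule per stage.
        n_states, n_actions = r.shape
        V = np.max(r, axis=1)                      # V_1(s) = max_a r(s, a), eq. (2.7)
        rules = [np.argmax(r, axis=1)]             # decision rule for the last stage
        for _ in range(2, T + 1):
            # Q[s, a] = r(s, a) + sum_{s'} p(s, s'; a) V_{n-1}(s'), eq. (2.9)
            Q = r + np.array([[p[a][s] @ V for a in range(n_actions)]
                              for s in range(n_states)])
            V = Q.max(axis=1)
            rules.append(Q.argmax(axis=1))
        rules.reverse()                            # rules[0] is the decision rule for t = 1
        return V, rules

    # V_T, rules = backward_induction(p, r, T=5)   # using the p, r arrays defined earlier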
Consider next the criterion of discounted reward given in (2.3). The value function V(s)
satisfies, for all s ∈ 𝒮, the following optimality equation:

V(s) = max_{a∈𝒜} { r(s, a) + β Σ_{s′∈𝒮} p(s, s′; a) V(s′) }.          (2.10)

The above functional equation of V(s) can be understood in a similar way as (2.9). After taking
action a under the initial state s at t = 1, we face, at t = 2, the same infinite-horizon discounted
MDP problem, except with a potentially different initial state s′ and an additional multiplicative
factor β for all rewards (due to a starting point of t = 2). The maximum total discounted reward
under this initial action a is thus given by the sum of the immediate reward r(s, a) at t = 1 and
a weighted average of V(s′) times β that represents the optimal return from t = 2 onward with
the weight given by the probability of seeing state s′ at t = 2. Optimizing over the initial action
a gives us V(s).
The following theorem summarizes several important results on MDP with discounted
rewards. We assume here the rewards {r(s, a)} are bounded for all s and a.
Theorem 2.1 MDP with Discounted Rewards:
1. There exists an optimal policy which is stationary.
2. The value function is the unique solution to the optimality equation (2.10).
3. A policy is optimal if and only if, for all s ∈ 𝒮, it chooses an action that achieves the maximum
on the right-hand side of (2.10).

As defined in Section 2.1.1, a stationary policy employs the same decision rule in all
decision periods: π = [d, d, …]. We can thus equate a stationary policy with its decision rule
and directly consider a stationary policy π as a mapping from the state space 𝒮 to the action
space 𝒜. Theorem 2.1-1 states that stationary policies suffice for achieving optimality for MDPs
under the infinite-horizon discounted-reward criterion.
Several solution techniques exist for solving for or approximating the value V(s) and an
optimal policy. We present here the method of value iteration, also referred to as successive ap-
proximation. This method can be viewed as computing the infinite-horizon value V(s) as the
limit of the finite-horizon values V_n(s) as the horizon length n tends to infinity. Following (2.7)
and (2.9) and incorporating the discounting, we obtain the following iterative algorithm for
computing an ε-optimal policy π_ε whose value is within ε of V(s) for all s ∈ 𝒮.
The initial value V_1(s) can be set to arbitrary values, not necessarily the value of the one-
stage problem as given in Algorithm 2.1. The global convergence of the value iteration algorithm
is guaranteed by the fixed point theorem under the contraction mapping given in (2.10) where
the contraction is due to the discounting of β < 1. For details, see, for example, Puterman,
2005 [164].
Algorithm 2.1 Value Iteration
Input: ε > 0.

1: Initialization: set TERMINATE = 0, n = 1, and

       V_1(s) = max_{a∈𝒜} r(s, a).

2: while TERMINATE = 0 do
3:    n = n + 1.
4:    Compute, for all s ∈ 𝒮,

       V_n(s) = max_{a∈𝒜} { r(s, a) + β Σ_{s′∈𝒮} p(s, s′; a) V_{n−1}(s′) }.

5:    if max_s |V_n(s) − V_{n−1}(s)| < ε then
6:       TERMINATE = 1.
7:       Compute, for all s ∈ 𝒮,

          π_ε(s) = arg max_{a∈𝒜} { r(s, a) + β Σ_{s′∈𝒮} p(s, s′; a) V_n(s′) }.

8:    end if
9: end while
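A direct Python transcription of Algorithm 2.1, again for the toy arrays p and r used in the earlier sketches (illustrative only; the stopping threshold simply mirrors the pseudocode above):

    import numpy as np

    def value_iteration(p, r, beta, eps):
        # Algorithm 2.1: successive approximation of V(s) and an eps-optimal policy.
        n_states, n_actions = r.shape

        def backup(V):
            # Q[s, a] = r(s, a) + beta * sum_{s'} p(s, s'; a) V(s')
            return r + beta * np.array([[p[a][s] @ V for a in range(n_actions)]
                                        for s in range(n_states)])

        V = np.max(r, axis=1)                      # step 1: V_1(s) = max_a r(s, a)
        while True:
            V_new = backup(V).max(axis=1)          # step 4
            if np.max(np.abs(V_new - V)) < eps:    # step 5: stopping criterion
                V = V_new
                break
            V = V_new
        policy = backup(V).argmax(axis=1)          # step 7: eps-optimal policy
        return V, policy

    # V, pi_eps = value_iteration(p, r, beta=0.9, eps=1e-6)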

2.2 THE BAYESIAN BANDIT MODEL


Gittins and Jones, 1974 [87] studied the following sequential decision model, which generalizes
the Bayesian formulation of multi-armed bandit problems.

Definition 2.2 The Multi-Armed Bandit Model:


Consider N arms, each with state space 𝒮_i (i = 1, …, N). At time t = 1, 2, …, based on the
observed states [S_1(t), …, S_N(t)] of all arms, one arm is selected for activation. The active arm,
say arm i, offers a reward r_i(S_i(t)) dependent on its current state and changes state according to
a transition law p_i(s, s′) (s, s′ ∈ 𝒮_i). The states of all passive arms remain frozen. The objective
is an arm activation policy that maximizes the expected total discounted reward with a discount
factor β (0 < β < 1).
The MDP formulation of the bandit problem is readily seen. The state space 𝒮 is the
Cartesian product of the state spaces of all arms:

𝒮 = 𝒮_1 × 𝒮_2 × ⋯ × 𝒮_N.          (2.11)

The action space consists of the indexes of the N arms: 𝒜 = {1, 2, …, N}. The reward and
transition probabilities under state [s_1, …, s_N] and action a are given by

r([s_1, …, s_N], a) = r_a(s_a),          (2.12)

p([s_1, …, s_N], [s′_1, …, s′_N]; a) = p_a(s_a, s′_a) if s_i = s′_i for all i ≠ a, and 0 otherwise.          (2.13)
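The following small Python sketch makes the structure of (2.12) and (2.13) explicit: the reward depends only on the active arm, and only the active arm's component of the joint state can change (the two-arm model and all names below are ours, for illustration):

    import random

    # Each arm i is described by its own transition law p_i[s][s'] and reward r_i[s].
    arm_p = [[[0.9, 0.1], [0.4, 0.6]],          # arm 0: two states
             [[0.5, 0.5], [0.2, 0.8]]]          # arm 1: two states
    arm_r = [[0.0, 1.0],
             [0.3, 0.7]]

    def step(joint_state, a):
        # Play arm a in joint state [s_1, ..., s_N]; passive arms stay frozen (2.13).
        reward = arm_r[a][joint_state[a]]                        # eq. (2.12)
        next_state = list(joint_state)
        probs = arm_p[a][joint_state[a]]
        next_state[a] = random.choices(range(len(probs)), weights=probs)[0]
        return reward, next_state

    print(step([0, 1], a=1))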

In the example below, we show that the first bandit problem posed by Thompson,
1933 [190] falls under the above MDP formulation, except that Thompson adopted the total
reward criterion over a finite horizon.

Example 2.3 The First Bandit Problem and Thompson Sampling:


Consider two Bernoulli-distributed arms with unknown parameters Θ_1 and Θ_2. Suppose that
Θ_1 and Θ_2 are independent with a uniform distribution over [0, 1]. The objective is to maximize
the expected total reward over a given horizon by sequentially selecting one arm to play at each
time.
The MDP formulation of this bandit problem is as follows. At time t, the state S_i(t) of
arm i is the posterior distribution of Θ_i given past observations. In other words, the state S_i(t)
takes the form of a non-negative real-valued function with support on [0, 1]. In particular, the
initial state of each arm is a constant function of 1 on [0, 1]. Given its current state S_i(t) = f(θ)
(a probability density function), the reward and the state transition of an active arm are given as

r_i(f) = E_f[Θ_i],          (2.14)

S_i(t + 1) = θ f(θ)/E_f[Θ_i]                 with probability E_f[Θ_i],
S_i(t + 1) = (1 − θ) f(θ)/(1 − E_f[Θ_i])     with probability 1 − E_f[Θ_i],          (2.15)

where E_f[Θ_i] = ∫_0^1 θ f(θ) dθ is the expected value of Θ_i under distribution f. The state tran-
sition follows the Bayes rule: the updated posterior distribution takes two possible forms, de-
pending on whether 1 is observed (with probability E_f[Θ_i]) or 0 is observed (with probability
1 − E_f[Θ_i]) after playing arm i at time t. In other words, for an active arm i, the state transition
law is given by

p_i( f(·), θ f(θ)/E_f[Θ_i] ) = E_f[Θ_i],          (2.16)
p_i( f(·), (1 − θ) f(θ)/(1 − E_f[Θ_i]) ) = 1 − E_f[Θ_i].          (2.17)
While the MDP formulation of the Bayesian bandit problem is clear, the resulting MDP
model is computationally prohibitive due to the complexity of the state space. Thompson pro-
posed a heuristic randomized stationary policy, later known as Thompson Sampling. The policy
is as follows. At time t, given the current states (f_1(θ_1), f_2(θ_2)) of the two arms, the probability
q(f_1(θ_1), f_2(θ_2)) that arm 1 is better than arm 2 is computed:

q(f_1(θ_1), f_2(θ_2)) = P[Θ_1 > Θ_2 | Θ_1 ∼ f_1(θ_1), Θ_2 ∼ f_2(θ_2)]          (2.18)
                     = ∫_{θ_1=0}^{1} ∫_{θ_2=0}^{θ_1} f_1(θ_1) f_2(θ_2) dθ_2 dθ_1.          (2.19)

A randomized action is then generated that plays arm 1 with probability q(f_1(θ_1), f_2(θ_2)) and
plays arm 2 with probability 1 − q(f_1(θ_1), f_2(θ_2)). An equivalent and computationally simpler
implementation is to draw two random values of θ_1 and θ_2 according to their posterior distri-
butions (i.e., their current states) f_1(θ_1) and f_2(θ_2), respectively, and then play arm 1 if θ_1 > θ_2
and play arm 2 otherwise.
Based on the realization of the random action and the resulting random observation, the
state of the chosen arm is updated. The next action is then chosen under the same randomized
decision rule.

The above example shows that the original bandit problem posed by Thompson is a special,
albeit complex, case of the MDP model specified in Definition 2.2. They are special in the sense
that the state of each arm is a probability distribution and the state transitions follow Bayes
rule. The general model given in Definition 2.2 is now referred to as the multi-armed bandit
problem, while the original Bayesian bandit as posed by Thompson is often referred to as the
bandit sampling processes (see Gittins, Glazebrook, and Weber, 2011 [90]).

2.3 GITTINS INDEX


Given that the bandit model is a special class of MDP problems, standard methods for solving
MDPs directly apply. The computational complexity of such methods, being polynomial in the
size of the state space, would grow exponentially with $N$, since the state space of the resulting
MDP is the Cartesian product of the state spaces of all arms. Whether this special class of
MDPs exhibits sufficient structure to allow simple optimal solutions was a classical challenge
that intrigued the research community for decades. It was not until the early 1970s that Gittins and Jones
revealed an elegant solution, almost 40 years after Thompson posed the first bandit problem.

2.3.1 GITTINS INDEX AND FORWARD INDUCTION


Gittins and Jones, 1974 [87] and Gittins, 1979 [88] showed that an index policy is optimal for
the bandit model. Specifically, there exists a real-valued function $\nu_i$ that assigns an index $\nu_i(s)$
to each state $s$ of each arm $i$, and the optimal policy is simply to compare the indexes of all the
arms at their current states and play the one with the greatest index. The state of the chosen arm
then evolves. The index of its new state at the next time instant is compared with the other arms
(whose states, hence index values, are frozen) to determine the next action.
The optimality of such an index-type policy reveals that there exists a “figure of merit”
that fully captures the value of playing a particular arm at a particular state. What is remark-
able is that the index function $\nu_i$ of arm $i$ depends solely on the Markov reward process
$\{p_i(s, s'),\ r_i(s)\}_{s, s' \in S_i}$ that defines arm $i$. That is, in computing the optimal policy, the $N$ arms
are decoupled and can be treated separately. The computational complexity is thus reduced from
being exponential to being linear with N .
This index, known as the Gittins index, has the following form:

$$\nu_i(s) = \max_{\tau \ge 1} \frac{\mathbb{E}\left[\sum_{t=1}^{\tau} \beta^{t-1}\, r_i(S_i(t)) \,\middle|\, S_i(1) = s\right]}{\mathbb{E}\left[\sum_{t=1}^{\tau} \beta^{t-1} \,\middle|\, S_i(1) = s\right]}, \qquad (2.20)$$

where $\tau$ is a stopping time.


To gain an intuitive understanding of the above expression, let us first ignore the maximization over $\tau$ and set $\tau = 1$. In this case, the right-hand side of (2.20) becomes $r_i(s)$, the
immediate reward obtained by playing arm $i$ at its current state $s$. If we define the index in this
way for all states of all arms, the resulting index policy is simply a myopic policy that considers
only the present with no regard for the future.

Myopic policies are suboptimal in general, since a state with a low immediate reward but
a high probability of transitioning to highly rewarding states in the future may have more to offer. A
straightforward improvement to the myopic policy is the so-called $T$-step-look-ahead policy,
obtained by defining an index with $\tau = T$ in the right-hand side¹ of (2.20). As $T$ increases, the performance of a $T$-step-look-ahead policy improves, but at the cost of increasing computational
complexity (often exponential in $T$).
A further improvement is to look ahead to an optimally chosen random time $\tau$ dependent
on the evolution of the state rather than a pre-fixed $T$ steps into the future. More specifically,
$\tau$ is a stopping time defined on the Markov process of the arm states $\{S_i(t)\}_{t \ge 1}$. An example
stopping time $\sigma_Q$ is the hitting time of a subset $Q$ of states ($Q \subseteq S_i$), that is, the first time the
Markov chain enters the set $Q$ (referred to as the stopping set):

$$\sigma_Q = \min\left\{t : S_i(t) \in Q\right\}. \qquad (2.21)$$

As will be made clear later, it is sufficient to consider only this type of stopping times (for all
possible stopping sets $Q \subseteq S_i$) in the maximization of (2.20). This extension to $T$-step-look-ahead is referred to as forward induction, in contrast to the backward induction discussed in Section 2.1.2.
1 The commonly adopted definition of a T -step-look-ahead policy corresponds to an index measuring the expected total
discounted reward accrued over T decision periods, i.e., the numerator of the right-hand side of (2.20). For a deterministic
T , however, the denominator in (2.20) is a constant independent of i and s , thus inconsequential in action selection based on
the index.
The numerator of the right-hand side of (2.20) is the total discounted reward
obtained up to the stopping time $\tau$, and the denominator is the total discounted time to reach $\tau$
from the initial state $s$. The index $\nu_i(s)$ can thus be interpreted as the maximum rate (by choosing
the optimal stopping time) at which rewards can be accrued from arm $i$ starting at state $s$. This
notion of maximum equivalent reward rate will be made more rigorous in the next subsection.
The following example, based on one in the monograph by Gittins, Glazebrook, and Weber, 2011 [90], is perhaps the most illuminating in understanding the index form given in (2.20).

Example 2.4 Consider $N$ biased coins, each with a known bias $\theta_i$ ($i = 1, \ldots, N$). We are
allowed to toss each coin once, and receive a unit reward for each head tossed. In what order
should the coins be tossed in order to maximize the expected total discounted reward with a
discount factor $\beta$ ($0 < \beta < 1$)?

The answer is rather obvious. Without discounting, every ordering of coin tossing gives
the same expected total reward. With discounting, the coins should be tossed in decreasing order
of $\theta_i$ to minimize the reward loss due to discounting.
Now consider $N$ stacks of coins, each consisting of an infinite number of biased coins. Let
$\theta_{i,j}$ denote the known bias of the $j$th coin in the $i$th stack. At each time, we choose a stack, toss
the coin at the top, and remove it from the stack. The objective remains the same: to maximize
the expected total discounted reward.

We can quickly recognize this problem as a bandit problem with each stack of coins being
an arm and the bias of the coin currently at the top of the stack being the current state of the
corresponding arm. This is, however, a much simpler version of the problem in the sense that the
state of an active arm evolves deterministically. More specifically, the reward process associated
with each arm/stack $i$ is a known deterministic sequence $\{\theta_{i,j}\}_{j \ge 1}$.

Consider first the case in which, for each stack $i$, the coin biases are monotonically decreasing ($\theta_{i,j} \ge \theta_{i,j+1}$ for all $j$). The optimal strategy is a myopic one: flip the top coin with the greatest bias
among the $N$ stacks. Such a strategy interleaves the given $N$ monotonic reward sequences into a
single monotonic sequence, thus minimizing the reward loss due to discounting. This can be readily
seen by contradiction: if the resulting sequence of rewards is not monotonically decreasing, then at least one greater reward is more heavily discounted in place of a smaller
reward, resulting in a smaller total discounted reward.
If, however, the coin biases are not monotone, the myopic strategy is in general suboptimal,
and the optimal policy is not immediately obvious. Applying the Gittins index in (2.20), we
obtain the following strategy. For each stack $i$, we compute the following index (for simplicity,
we relabel coins from top to bottom starting at 1 after each coin removal):

$$\nu_i = \max_{\tau \ge 1} \frac{\sum_{j=1}^{\tau} \beta^{j-1}\, \theta_{i,j}}{\sum_{j=1}^{\tau} \beta^{j-1}}, \qquad (2.22)$$

where the maximization is over all positive integers $\tau$. Let $\tau_i$ denote the positive integer that
achieves the maximum of the right-hand side above. The index $\nu_i$ can be seen as the maximum
reward rate achieved by tossing, consecutively, the top $\tau_i$ coins in the $i$th stack. The Gittins
index policy is to choose the stack, say $k$, whose index $\nu_k$ is the greatest, toss and remove the
top $\tau_k$ coins in this stack, recompute the index for stack $k$, and repeat the process.
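
For a deterministic reward sequence such as a coin stack, the maximization in (2.22) is over prefix lengths and can be computed directly. The sketch below is my own illustration (not from the original text); it operates on a finite prefix of the stack, which is exact whenever all rewards beyond that prefix are zero, and the example sequence is the third stack of CounterExample 3.1 in Chapter 3.

```python
def coin_stack_index(biases, beta):
    """Gittins index (2.22) of a coin stack with a known (finite-prefix) bias sequence."""
    best_rate, num, den = float("-inf"), 0.0, 0.0
    for j, theta in enumerate(biases):
        num += beta**j * theta   # discounted reward from tossing coins 1..j+1
        den += beta**j           # discounted time spent tossing them
        best_rate = max(best_rate, num / den)
    return best_rate

# The non-monotone stack (0.6, 0.96, 0, ...) with beta = 0.5 has index 0.72,
# attained by tossing its top two coins.
print(coin_stack_index([0.6, 0.96, 0.0], beta=0.5))
```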
One may immediately notice a discrepancy between the implementation of the Gittins index policy
described here and the one given at the beginning of this subsection. Instead of recomputing and
re-comparing the indexes after each coin toss, the top $\tau_k$ coins in the currently chosen stack are
tossed consecutively, without worrying that the index of this stack may drop below those of other
stacks after some of these $\tau_k$ coins have been tossed and removed. We can show that these two
implementations are equivalent. This is due to a monotone property of the Gittins index. In
particular, all the $\tau_k - 1$ coins immediately following the top coin in stack $k$ have indexes no
smaller than that of the top coin. This can be intuitively explained as follows. Had a coin among
these $\tau_k - 1$ coins had a smaller index (i.e., a lower reward rate), a higher reward rate of the top
coin would have been achieved by stopping before tossing this "inferior" coin. This contradicts
the definition of the index $\nu_k$ and the optimal stopping time $\tau_k$. This monotone property
of the Gittins index, which also plays a central role in developing numerical algorithms for computing the index, will be made precise in Property 2.5 in Section 2.3.3.

2.3.2 INTERPRETATIONS OF GITTINS INDEX


Before formally stating and proving the optimality of the Gittins index policy, we discuss below
several interpretations of the Gittins index to gain insights from different perspectives. Each of
these interpretations led to a major proof of the index theorem in the literature, documenting a
collective pursuit of understanding spanning multiple decades since Gittins’s original proposal
of the index theorem.

Calibration with a standard arm: The role of the index in an index-type policy is to provide a
comparative measure for deciding which arm to activate. One way to compare the arms without
considering them jointly is to compare each arm with a standard one that offers rewards at a
constant rate. This standard arm serves the role of calibration.

Consider arm $i$ and a standard arm that offers a constant reward $\lambda$ each time it is activated.
At each time, we may choose to play either arm $i$ or the standard arm. The objective is to
maximize the expected total discounted reward. This decision problem is often referred to as the
1.5-arm bandit problem.

The above problem is an MDP with state space $S_i$. Based on Theorem 2.1, there exists
an optimal policy which is stationary (in addition to being deterministic and Markov). Such an
optimal policy, being a mapping from $S_i$ to the two possible actions, partitions the state space
$S_i$ into two disjoint subsets. Let $Q_i(\lambda)$ denote the set of states in which playing the standard
arm is optimal. The set $S_i \setminus Q_i(\lambda)$ thus contains the states in which playing arm $i$ is optimal.
We first note that the optimal policy for choosing which arm to play can also be viewed
as an optimal stopping rule for quitting playing arm $i$. The reason is that once the state of arm $i$
enters $Q_i(\lambda)$, for which the standard arm is chosen, the state of arm $i$ stays frozen and thus remains
in $Q_i(\lambda)$. Consequently, the standard arm will be chosen for the rest of the time horizon. Thus,
the optimal policy specifies an optimal stopping time for playing arm $i$, which is given by the
hitting time of $Q_i(\lambda)$. The dependency of the optimal stopping set $Q_i(\lambda)$ on $\lambda$ is quite obvious:
when the reward rate $\lambda$ of the standard arm is sufficiently small ($\lambda \to -\infty$), it is never optimal
to play the standard arm, and $Q_i(\lambda) = \emptyset$; when $\lambda$ is sufficiently large, we should always play the
standard arm, and $Q_i(\lambda) = S_i$.

Consider a specific initial state $S_i(1) = s$ of arm $i$. A critical value of $\lambda$ is given by the
case when it is equally optimal to play the standard arm all through (i.e., to include $s$ in $Q_i(\lambda)$)
or to play arm $i$ initially until the optimal stopping time given by the hitting time of $Q_i(\lambda)$ with $s$
excluded from $Q_i(\lambda)$. In other words, playing arm $i$ starting from the initial state $s$ up to the hitting time of $Q_i(\lambda)$
produces the same expected total discounted reward as a standard arm with a constant reward
rate $\lambda$. This critical value of $\lambda$ is thus an obvious candidate for the index of $s$.
Next we show that this definition of the index leads to the characterization given in (2.20).
At the critical value of $\lambda$, we have, by definition,

$$\frac{\lambda}{1-\beta} = \max_{\tau} \mathbb{E}\left[\sum_{t=1}^{\tau} \beta^{t-1} r_i(S_i(t)) + \beta^{\tau}\frac{\lambda}{1-\beta} \,\middle|\, S_i(1) = s\right], \qquad (2.23)$$

where the left-hand side is the total discounted reward for playing the standard arm all through,
and the right-hand side is the total discounted reward for playing arm $i$ initially until an optimal
stopping time and then switching to the standard arm. Equivalently, for any stopping time $\tau$,
we have

$$\frac{\lambda}{1-\beta} \ \ge\ \mathbb{E}\left[\sum_{t=1}^{\tau} \beta^{t-1} r_i(S_i(t)) + \beta^{\tau}\frac{\lambda}{1-\beta} \,\middle|\, S_i(1) = s\right], \qquad (2.24)$$

which leads to

$$\lambda \ \ge\ \frac{\mathbb{E}\left[\sum_{t=1}^{\tau} \beta^{t-1} r_i(S_i(t)) \,\middle|\, S_i(1) = s\right]}{\mathbb{E}\left[1 - \beta^{\tau} \mid S_i(1) = s\right]/(1-\beta)}, \qquad (2.25)$$

with equality for the optimal stopping time. The characterization of the Gittins index given
in (2.20) thus follows.
Calibrating arm $i$ by a standard arm shows that the index $\nu_i(s)$ given in (2.20) represents
the maximum equivalent constant reward rate offered by arm $i$ with an initial state $s$. It also
shows that it suffices to consider in (2.20) only the hitting times of all possible subsets of $S_i$ in
the maximization over all stopping times $\tau$, due to the sufficiency of stationary policies for the
1.5-arm bandit problem. Recall that $Q_i(\lambda)$ denotes the set of states of arm $i$ in which playing
the standard arm with reward rate $\lambda$ is optimal. The stopping time $\tau$ that attains the maximum
in (2.20) is thus the hitting time of the set $Q_i(\nu_i(s))$, where we adopt the convention that arm $i$ is
chosen over the standard arm when the two actions are equally optimal (thus $s$ is excluded from
$Q_i(\nu_i(s))$, so that arm $i$ is played at least once, as required in (2.20)). We simplify
the notation $Q_i(\nu_i(s))$ to $Q_i(s)$ for the stopping set that attains the Gittins index of state $s$
of arm $i$.
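
The calibration view also suggests a direct numerical procedure for a finite-state arm: for a given $\lambda$, solve the 1.5-arm MDP by value iteration, and bisect on $\lambda$ until playing arm $i$ and retiring to the standard arm are equally attractive at the state of interest. The sketch below is a rough illustration under these assumptions and is not the author's algorithm; the function name and tolerances are made up for the example.

```python
import numpy as np

def gittins_by_calibration(P, r, beta, s, sweeps=2000, tol=1e-9):
    """Gittins index of state s via the 1.5-arm bandit: bisection on the standard
    arm's rate lambda, with value iteration as the inner optimal-stopping solver."""
    lo, hi = r.min(), r.max()              # the index always lies between these rates
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        V = np.zeros(len(r))
        for _ in range(sweeps):
            # Either retire to the standard arm forever, or play arm i once more.
            V = np.maximum(lam / (1 - beta), r + beta * P @ V)
        if r[s] + beta * P[s] @ V >= lam / (1 - beta):
            lo = lam                       # arm i still worth playing at s: raise lambda
        else:
            hi = lam                       # retiring is strictly better: lower lambda
    return 0.5 * (lo + hi)
```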
The 1.5-arm bandit problem and the idea of calibration using a known arm trace back
to the work by Bradt, Johnson, and Karlin, 1956 [43], who considered the problem of play-
ing an unknown Bernoulli arm against a known one. The finite-horizon total-reward criterion
was adopted. They showed that the optimal action at each time is determined by comparing
the success rate of the known arm with a critical value attached to the posterior probability of
success (i.e., the state) of the unknown arm. This critical value, however, is also a function of
the remaining time horizon due to the finite-horizon formulation. Bellman, 1956 [30], con-
sidered the same problem under the infinite-horizon discounted-reward criterion, adopting a
beta-distributed prior for the unknown arm. The existence of an index form that characterizes
the optimal policy was established using the technique of value iteration (although the term
“index” was not introduced there). The analytical form of the index, however, was not obtained,
and Bellman remarked in the conclusion that “it seems to be a very difficult problem...”
Gittins index as the fair charge: In the proof of the index theorem given by Weber, 1992 [208],
the Gittins index was interpreted as a fair charge for playing an arm.

Consider a single arm $i$. At each decision time, if we decide to play the arm (and collect
the reward offered by the arm), a fixed charge $\lambda$, referred to as the prevailing charge, must be paid.
If the charge $\lambda$ is too great, it will not be worthwhile to play the arm at all; the optimal profit
we can gain is 0. If the charge is sufficiently small, the optimal strategy would be to play arm $i$
until an optimal stopping time for a positive profit.

Define the fair charge as the critical value of the prevailing charge for which it is equally
optimal to not play at all and to play until an optimal stopping time. In other words, under the fair
charge, the best possible outcome is to break even, and the game thus defined is a fair game.
Consider a given initial state $s$ of the arm. Let $\nu_i(s)$ denote the fair charge for state $s$. The total
discounted reward obtained from playing the arm up to a stopping time $\tau$ is upper bounded by
the total discounted charges paid:

$$\nu_i(s)\, \mathbb{E}\left[\sum_{t=1}^{\tau} \beta^{t-1} \,\middle|\, S_i(1) = s\right] \ \ge\ \mathbb{E}\left[\sum_{t=1}^{\tau} \beta^{t-1}\, r_i(S_i(t)) \,\middle|\, S_i(1) = s\right], \qquad (2.26)$$

where equality holds for the optimal stopping time (i.e., when we play optimally and break even).
The same characterization of the index as in (2.20) thus follows.
It is not difficult to see the equivalence between the 1.5-arm bandit problem and this
single-arm bandit with a prevailing charge. If we subtract a constant $\lambda$ from the rewards offered
by the standard arm and by arm $i$ at every state, the 1.5-arm bandit problem remains mathematically the same. In this case, the reward $r_i(s) - \lambda$ of playing arm $i$ at any state $s$ can be
considered as the profit after paying a prevailing charge of $\lambda$, and playing the standard arm, now
offering zero reward, can be considered as terminating the game. Viewing the index as the fair
charge, however, leads to a proof of the index theorem using simple verbal reasoning, as detailed
in Section 2.4.
Gittins index as the equitable retirement reward: In Whittle's proof of the index theorem,
1980 [210], he considered a single-arm bandit problem with an option of retiring. At each
decision time, we can either play arm $i$ or retire with a lump-sum retirement reward of $u$. For a
given initial state $s$ of arm $i$, the index of this state is defined as the equitable retirement reward
for which it is equally optimal to retire immediately or to play the arm until an optimal stopping
time and then retire.

The equivalence between the 1.5-arm bandit problem and the single-arm bandit with a
retirement option is fairly obvious: playing the standard arm is equivalent to retiring, except
that the retirement reward is paid over an infinite horizon at a constant rate of $\lambda$ rather than as
a lump-sum amount of $u$ paid in full at the time of retirement. It is easy to see that a reward
rate of $\lambda = u(1-\beta)$ is equivalent to a lump-sum amount of $u$. As a result, the index defined as
the equitable retirement reward differs from the one in (2.20) by a constant factor of $(1-\beta)$.
The resulting index policies, however, are identical, since the actions determined by the indexes
depend only on the relative order, not the exact values, of the indexes.
Gittins index as the maximum return with restarting: Katehakis and Veinott, 1987 [112]
characterized the index based on a restart-in-state formulation of an arm. This characterization
leads to an efficient online algorithm for computing the index, as will be discussed in Section 2.5.

Consider arm $i$ with initial state $s$. Suppose that at each subsequent decision time $t > 1$,
we have the option of setting the arm state back to $s$, thus restarting the arm process. The problem
is to decide whether to restart at each time so as to maximize the expected total discounted reward.
Let $V_i(s)$ denote the value of this MDP. Note that if the action of restarting is taken at time
$t$, then the future value from that point on would be $V_i(s)$ discounted by $\beta^{t-1}$. Optimizing
over the restarting time $\tau$, we arrive at

$$V_i(s) = \max_{\tau > 1} \mathbb{E}\left[\sum_{t=1}^{\tau-1} \beta^{t-1}\, r_i(S_i(t)) + \beta^{\tau-1} V_i(s) \,\middle|\, S_i(1) = s\right], \qquad (2.27)$$

which leads to

$$V_i(s) = \max_{\tau > 1} \frac{\mathbb{E}\left[\sum_{t=1}^{\tau-1} \beta^{t-1}\, r_i(S_i(t)) \,\middle|\, S_i(1) = s\right]}{1 - \mathbb{E}\left[\beta^{\tau-1} \mid S_i(1) = s\right]} = \frac{\nu_i(s)}{1-\beta}. \qquad (2.28)$$

Thus, the index of state $s$ can be defined as the maximum return from arm $i$ with the restart
option. This definition leads to the same index value as the equitable retirement reward, with
both differing from the original definition in (2.20) by a constant factor of $(1-\beta)$. The advantage
of this characterization is that $V_i(s)$, as the value function of an MDP, can be computed via
standard solution techniques for MDPs. This leads to an algorithm for online computation of the
Gittins index, as discussed in Section 2.5.
2.3.3 THE INDEX PROCESS, LOWER ENVELOPE, AND MONOTONICITY OF THE STOPPING SETS
We discuss here a monotone property of the stopping sets that attain the Gittins index. This
property leads to a numerical method for computing the index, as will be discussed in Section 2.5, as well as an alternative implementation of the index policy, as we have seen in Example 2.4. We then introduce the concepts of the index process and its lower envelope developed
by Mandelbaum, 1986 [142] in his proof of the index theorem. These concepts provide an intuitive understanding of the index theorem. In the next section, we will see that the prevailing
charge process in Weber's proof of the index theorem is precisely the lower envelope of the index
process.
Monotonicity of the stopping sets: The following property characterizes the stopping sets
that attain the Gittins index.

Property 2.5 Recall that $Q_i(s)$ denotes the stopping set that attains the Gittins index $\nu_i(s)$ of
state $s$ of arm $i$. We have

$$\{s' : \nu_i(s') < \nu_i(s)\} \ \subseteq\ Q_i(s) \ \subseteq\ \{s' : \nu_i(s') \le \nu_i(s)\}. \qquad (2.29)$$

Property 2.5 states that the optimal stopping time that attains the index $\nu_i(s)$ of
state $s$ is to stop once the arm hits a state with an index smaller than $\nu_i(s)$. We give an intuitive
explanation of the above property in place of a formal proof. Based on the derivation of the
index $\nu_i(s)$ from the 1.5-arm bandit problem, $Q_i(s)$ is the set of states in which playing the
standard arm with a constant reward rate of $\nu_i(s)$ is optimal. It is thus easy to see that all states
with maximum equivalent constant reward rates (i.e., their Gittins indexes) smaller than $\nu_i(s)$
should be in $Q_i(s)$. Similarly, all states with indexes greater than $\nu_i(s)$ should not be in $Q_i(s)$.
States with the same index value as $s$, including $s$ itself, can be included in or excluded from $Q_i(s)$
without affecting the index value.

A direct consequence of Property 2.5 is a monotone property of the optimal stopping sets
in terms of the index values they attain: if $\nu_i(s') \ge \nu_i(s)$, then $Q_i(s') \supseteq Q_i(s)$. This monotone
property leads to a numerical method for computing the Gittins index when the state space is
finite (see Section 2.5). Another consequence of Property 2.5 is an equivalent implementation
of the Gittins index policy, as detailed below.
of the Gittins index policy as detailed below.
The usual implementation of the Gittins index is to compute the index of the current state
of each arm at each given time and play the arm with the greatest index. Property 2.5 shows that
the following implementation is identical. At t D 1, we identify the arm with the greatest index,
say it is arm i with an initial state s . We play arm i until the hitting time of Qi .s/. Based on
Property 2.5, this is the first time that the index of arm i drops below the index value i .s/ at
t D 1. We compute the index value of arm i at this time, compare it with the index values of all
other arms (which remain unchanged), and repeat the procedure.
The index process and its lower envelope: Property 2.5 implies that if we observe arm $i$ only
at the hitting time of $Q_i(s)$ that attains the index of the previously observed state $s$,
the indexes of the sampled states form a decreasing sequence. We formalize this statement by
introducing the concepts of the index process and the lower envelope of an index process.

The most salient feature of the bandit problem is that the player's actions do not affect
the state evolution of any arm; they merely pause the evolution of an arm when it is not played.
More specifically, each arm is a fixed Markov reward process, and the decision problem is how
to interleave² these $N$ independent reward processes into a single reward process. The essence
of the problem is to interleave these reward processes in such a way that reward segments with
higher rates are collected earlier in time to mitigate discounting. This perspective on the bandit
problem is most clearly reflected in Example 2.4, where each arm (i.e., a coin stack) is a fixed
deterministic reward process and a policy merely decides, repeatedly, which segments of rewards
to collect next.

We then equate each arm $i$ with a fixed stochastic reward process $\{r_i(S_i(t))\}_{t \ge 1}$. The index
process $\{\nu_i(t)\}_{t \ge 1}$ associated with arm $i$ is the induced stochastic process of the Gittins index
values, i.e., $\nu_i(t) = \nu_i(S_i(t))$. The lower envelope $\{\underline{\nu}_i(t)\}_{t \ge 1}$ of the index process $\{\nu_i(t)\}_{t \ge 1}$ is
given by the running minimum of the index values:

$$\underline{\nu}_i(t) = \min_{k \le t} \nu_i(k). \qquad (2.30)$$

Figure 2.1 illustrates a sample path of an index process (the solid line) and its lower envelope
(the dashed line).

It is easy to see that each sample path of the lower envelope, as a running minimum, is
piecewise constant and monotonically decreasing. Each constant segment of the lower envelope
corresponds to the stopping time attaining the index value represented by this segment. We
explain this using the sample path illustration in Figure 2.1. The index value of the initial state $s$
determines the first entry of the index process as well as of the lower envelope. In the subsequent
three time instants, the arm evolves to states with greater values of the index; the lower envelope
thus remains at the constant level of $\nu_i(s)$. At $t = 5$, the state $s'$ has a smaller index. This ends
the first constant segment of the lower envelope, which now takes the value of $\nu_i(s') < \nu_i(s)$.
Based on Property 2.5, this is also the stopping time for attaining the index value $\nu_i(s)$. The
same argument applies to each constant segment of each sample path of the lower envelope.

The definition of the Gittins index in (2.20) applies to general stochastic reward processes beyond Markovian ones. The property of the lower envelope process and its relation with the
optimal stopping times hold for general reward processes as well. As a result, the index theorem holds without the Markovian assumption on the reward processes, which was first shown
by Varaiya, Walrand, and Buyukkoc, 1985 [200], and later crystallized in Mandelbaum's analysis (1986 [142]) within the framework of optimal stopping in a multi-parameter process where
time is replaced by a partially ordered set.
2 Note that the interleaving of the N reward processes resulting from an arm selection policy is stochastic and sample-path
dependent in general.

Figure 2.1: The index process and the lower envelope.

The proof of the index theorem given in the next section
is also from the perspective that casts the bandit problem as interleaving $N$ general stochastic
reward processes rather than as an MDP.

2.4 OPTIMALITY OF THE GITTINS INDEX POLICY


We now formally state the index theorem and give the proof due to Weber, 1992 [208].

Theorem 2.6 Optimality of the Gittins Index Policy:

Consider the multi-armed bandit model defined in Definition 2.2. Let the index $\nu_i(s)$ of every state $s$
of every arm $i$ be defined as in (2.20). The index policy that, at each decision time $t$ for given arm
states $\{s_i(t)\}_{i=1}^N$, plays an arm $i^*$ with the greatest index, $\nu_{i^*}(s_{i^*}(t)) = \max_{i=1,\ldots,N} \nu_i(s_i(t))$, is optimal.

Since the original work by Gittins and Jones, 1974 [87] and Gittins, 1979 [88], that proved
the index theorem based on an interchange argument, a number of proofs using a diverse array
of techniques appeared over a span of two decades. The monograph by Gittins, Glazebrook, and
Weber, 2011 [90] gave a detailed account of these proofs. We include here the proof by Weber,
1992 [208], based on the fair charge interpretation of the index. We will see that the prevailing
charge used in Weber’s proof is precisely the lower envelope of the index process as discussed in
the previous subsection.

Proof. Consider first a single arm $i$ with an initial state $s_0$. Recall that the Gittins index $\nu_i(s_0)$
can be interpreted as the fair charge for playing arm $i$ in this initial state (see Section 2.3.2). The
prevailing charge for each play of the arm is fixed to the fair charge $\nu_i(s_0)$ of the initial state.
The player can choose to play or stop at any given time with the objective of maximizing the
expected total discounted profit. The game thus defined is a fair game in which the maximum
expected total discounted profit is zero. Assume that the player prefers to continue the game as
long as there is no loss. The optimal strategy of the player is to play the arm until it enters a
state $s'$ in which the fixed prevailing charge $\nu_i(s_0)$ exceeds the current fair charge $\nu_i(s')$ (i.e., the
prevailing charge is too great to continue playing). This optimal stopping time is the hitting time of $Q_i(s_0)$, which
yields the maximum profit of zero. Note that if the player stops at a state whose fair charge is
greater than the prevailing charge $\nu_i(s_0)$, the player has stopped too early without reaping the
profit offered by this advantageous state; a positive loss would be incurred.

To incentivize continuous play of the game, consider that the prevailing charge is reduced
to the fair charge of the current state whenever it is too great for the player to continue. That is,
the prevailing charge $\underline{\nu}_i(t)$ at time $t$ is the minimum value of the fair charges of the states seen
up to $t$:

$$\underline{\nu}_i(t) = \min_{k=1,\ldots,t} \nu_i(S_i(k)). \qquad (2.31)$$

That is, the prevailing charge process $\{\underline{\nu}_i(t)\}_{t \ge 1}$ is the lower envelope of the index process defined in (2.30), and every sample path of the prevailing charge process is piecewise constant and
monotonically decreasing (see Figure 2.1). This modified game remains a fair game.
Now suppose there are $N$ arms and at each time the player chooses one of them to play.
The prevailing charge of each arm traces the lower envelope of the index process associated with
this arm. The resulting game remains a fair game, since a strictly positive profit from any
arm is impossible, whereas the maximum expected total discounted profit of zero is achievable
(by, for example, playing a single arm forever).

Any policy $\pi$ for the original bandit problem (i.e., without prevailing charges) is a valid
strategy for this fair game and, when employed, produces at most zero expected profit. The
maximum expected total discounted reward in the original bandit problem is thus upper bounded
by the maximum expected total discounted prevailing charges that can be collected in this fair
game. What remains to be shown is that this upper bound is achieved by the Gittins index
policy.

Since the player's actions do not change the state evolution of any arm, we can imagine
that the entire sample paths of the states and the associated rewards and prevailing charges are
prefixed, although only revealed to the player one value at a time, following the predetermined
order, from the arm chosen by the player. Since every sample path of the prevailing charges on
each arm is monotonically decreasing, the maximum expected total discounted charge is collected when the player always plays the arm with the greatest prevailing charge, thus interleaving
the given $N$ monotonic sequences of prevailing charges into a single monotonic sequence (see
Example 2.4 for a deterministic version of the problem and the argument for the optimality
of a myopic strategy for interleaving monotone reward sequences). By the construction of the
prevailing charges, we arrive at Theorem 2.6. □

2.5 COMPUTING GITTINS INDEX


The Gittins index given in (2.20) may be solved analytically for certain problem instances. In
other cases, generic algorithms exist for computing the index numerically.
Various algorithms have been developed for computing the Gittins index. They fall into
two categories: offline algorithms and online algorithms. The former compute, in advance, the
indices for every possible state of every arm and store them in a lookup table for carrying out
the index policy. This approach is mostly suited for bandit processes with a relatively small state
space. The latter compute the indices on the fly as each new state is realized. They are particularly
efficient in applications that involve a large (potentially infinite) state space, but with a sparse
transition matrix promising that only a small fraction of states get realized online. We present
in this section one representative algorithm from each category. Chakravorty and Mahajan,
2014 [59] gave an overview of algorithms for computing Gittins index with detailed examples.
It is worth pointing out that the exact values of the Gittins indices are not essential to
the implementation of the index policy. What is needed is simply which arm has the greatest
index at each given time. There are problems with sufficient structures in which the arm with
the greatest index can be identified without explicitly computing the indices. We see such an
example in Section 6.1.1 in the context of the Whittle index for a restless bandit problem. There
are also iterative algorithms that successively refine the intervals containing the indices until
the intervals are disjoint and reveal the ranking of the indices (see, for example, Ben-Israel and
Flam, 1990 [32] and Krishnamurthy and Wahlberg, 2009 [121]).

2.5.1 OFFLINE COMPUTATION


The algorithm developed by Varaiya, Walrand, and Buyukkoc, 1985 [200], referred to as the
largest-remaining-index algorithm, computes the Gittins indices in decreasing order by exploit-
ing the monotone property of the optimal stopping sets as specified in Property 2.5.
Since arms are decoupled when computing their indices, we consider a single arm and
omit the arm index from the notation. Suppose that the arm has $m$ states, $S = \{1, 2, \ldots, m\}$, with
transition matrix $P$. Let $(s_1, s_2, \ldots, s_m)$ be a permutation of these $m$ states such that their Gittins
indices are in decreasing order:

$$\nu(s_1) \ge \nu(s_2) \ge \cdots \ge \nu(s_m). \qquad (2.32)$$

The algorithm proceeds by first identifying the state $s_1$ and computing its index $\nu(s_1)$. Based on
Property 2.5, it is easy to see that the stopping set $Q(s_1)$ can be set to $S$, the entire state space.
The index of $s_1$ is thus simply the immediate reward offered by $s_1$. Since $s_1$ is the state with the
greatest index value (hence the greatest immediate reward), we have

$$s_1 = \arg\max_{s \in S} r(s), \qquad (2.33)$$
$$\nu(s_1) = r(s_1), \qquad (2.34)$$

where ties are broken arbitrarily.


Suppose that the top $k-1$ largest Gittins indices have been computed and the corresponding states $\{s_1, \ldots, s_{k-1}\}$ identified. The next step is to identify $s_k$ and compute $\nu(s_k)$. It is
known from Property 2.5 that $Q(s_k) = S \setminus \{s_1, \ldots, s_{k-1}\}$. If we use the hitting time
of $Q(s_k)$ as the stopping time in the right-hand side of (2.20) for calculating a hypothetical
index for each of the remaining states in $S \setminus \{s_1, \ldots, s_{k-1}\}$, we know that the index thus obtained for $s_k$ equals its Gittins index $\nu(s_k)$, since the optimal stopping time is used
in the calculation. On the other hand, the calculated indices for all other states are no greater
than their Gittins indices (due to the potentially suboptimal stopping time used in the calculation), which in turn are no greater than $\nu(s_k)$. We thus obtain $\nu(s_k)$ as the largest value among
these hypothetical indices calculated using the hitting time of $Q(s_k)$. Specifically, let $v_k$ and $w_k$ denote, respectively, the $m \times 1$ vectors of the expected discounted reward and the expected discounted time
until the hitting time of $Q(s_k)$ for each initial state $s \in S$. Based on basic results on calculating the expected hitting
time of a Markov chain, we have

$$v_k = \left(I - \beta \tilde{P}_k\right)^{-1} r, \qquad (2.35)$$
$$w_k = \left(I - \beta \tilde{P}_k\right)^{-1} \mathbf{1}, \qquad (2.36)$$

where $r = \{r(s)\}_{s \in S}$ is the $m \times 1$ vector of the immediate reward of each state, $\mathbf{1}$ is a vector of
all 1's, and $\tilde{P}_k$ is a modified transition matrix obtained by replacing all columns corresponding to states
in $Q(s_k)$ with 0, i.e., for all $i = 1, \ldots, m$,

$$\tilde{P}_k(i, j) = \begin{cases} p(i, j) & \text{if } j \in \{s_1, \ldots, s_{k-1}\}, \\ 0 & \text{otherwise.} \end{cases} \qquad (2.37)$$

We then have

$$\nu(s_k) = \max_{j \in S \setminus \{s_1, \ldots, s_{k-1}\}} \frac{v_k(j)}{w_k(j)}, \qquad (2.38)$$
$$s_k = \arg\max_{j \in S \setminus \{s_1, \ldots, s_{k-1}\}} \frac{v_k(j)}{w_k(j)}, \qquad (2.39)$$

where $v_k(j)$ and $w_k(j)$ are the elements of $v_k$ and $w_k$ that correspond to state $j$. The procedure
continues until all $m$ indices have been computed.

Example 2.7 Consider an arm with three states $S = \{a, b, c\}$ and the following reward vector
and transition matrix:

$$r = \begin{pmatrix} 2 \\ 5 \\ 3 \end{pmatrix}, \qquad P = \begin{pmatrix} \tfrac{1}{4} & \tfrac{1}{2} & \tfrac{1}{4} \\ \tfrac{1}{2} & 0 & \tfrac{1}{2} \\ 1 & 0 & 0 \end{pmatrix}. \qquad (2.40)$$

Suppose that $\beta = \tfrac{4}{5}$. Compute the Gittins indices of the three states.

State $b$ has the highest immediate reward of 5. We thus have $\nu(b) = r(b) = 5$, which is
the largest index value among the three states.

To identify the state $s_2$ with the second largest index and its index value, we use the
stopping set $Q(s_2) = S \setminus \{b\} = \{a, c\}$. Rewrite (2.35) in the more intuitive form

$$v_k = r + \beta \tilde{P}_k v_k, \qquad (2.41)$$

which corresponds to the following three equations, easily seen from the modified Markovian
state transitions that dictate termination at states $a$ and $c$:

$$v_2(a) = r(a) + \beta\, p(a, b)\, v_2(b), \qquad (2.42)$$
$$v_2(b) = r(b) + \beta\, p(b, b)\, v_2(b), \qquad (2.43)$$
$$v_2(c) = r(c) + \beta\, p(c, b)\, v_2(b). \qquad (2.44)$$

A similar set of equations for $w_2(a), w_2(b), w_2(c)$ can be written (by replacing the immediate
reward in each state with 1, so as to count the discounted time rather than the rewards accrued until stopping).
Solving these equations gives

$$v_2 = \begin{pmatrix} 4 \\ 5 \\ 3 \end{pmatrix}, \qquad w_2 = \begin{pmatrix} \tfrac{7}{5} \\ 1 \\ 1 \end{pmatrix}.$$

State $c$ thus has the second largest index, with $\nu(c) = v_2(c)/w_2(c) = 3$. For the last state $a$, which has the smallest
index value, the stopping set is $\{a\}$, which leads to $\nu(a) = 127/44$.
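
The largest-remaining-index procedure of (2.35)–(2.39) amounts to a few lines of linear algebra. The sketch below is my own implementation, not the author's; running it on the data of Example 2.7 reproduces the index values derived above.

```python
import numpy as np

def gittins_indices_offline(P, r, beta):
    """Largest-remaining-index algorithm: Gittins index of every state of one arm."""
    m = len(r)
    index = np.empty(m)
    ranked = []                      # states s_1, s_2, ... identified so far
    remaining = set(range(m))
    for _ in range(m):
        # Continue only into already-ranked (higher-index) states:
        # zero out the columns belonging to the current stopping set Q(s_k).
        P_tilde = np.zeros_like(P)
        if ranked:
            P_tilde[:, ranked] = P[:, ranked]
        A = np.linalg.inv(np.eye(m) - beta * P_tilde)
        v = A @ r                    # expected discounted reward until stopping, (2.35)
        w = A @ np.ones(m)           # expected discounted time until stopping, (2.36)
        s_k = max(remaining, key=lambda j: v[j] / w[j])
        index[s_k] = v[s_k] / w[s_k]
        ranked.append(s_k)
        remaining.remove(s_k)
    return index

P = np.array([[0.25, 0.5, 0.25], [0.5, 0.0, 0.5], [1.0, 0.0, 0.0]])
r = np.array([2.0, 5.0, 3.0])
print(gittins_indices_offline(P, r, beta=0.8))   # indices of states a, b, c
```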

Other offline algorithms for computing the Gittins index include variations of the largest-remaining-index algorithm by Denardo, Park, and Rothblum, 2007 [73], Sonin, 2008 [182], and Denardo,
Feinberg, and Rothblum, 2013 [74], and algorithms based on linear programming by Chen and
Katehakis, 1986 [61], and parametric linear programming by Kallenberg, 1986 [106] and Nino-Mora, 2007a [157]. All these algorithms have computational complexity on the order of $O(m^3)$,
with differences in the leading constants, where $m$ is the size of the state space.
2.5.2 ONLINE COMPUTATION
The above largest-remaining-index algorithm computes the indexes of all states in descending
order of their index values. It does not give an efficient solution when only the indexes of a subset
of states are needed.

The interpretation of the Gittins index of a state $s_0$ as the value of an MDP with the option
of restarting at $s_0$ leads to an online computation of the indexes for those states seen by the system
in real time. Standard MDP solution techniques such as value iteration, policy iteration, and
linear programming can be used to compute the value of this two-action MDP. Specifically, the
optimality equations of this MDP are as follows. For all $s \in S$,

$$V(s) = \max\left\{ r(s_0) + \beta \sum_{s' \in S} p(s_0, s') V(s'),\ \ r(s) + \beta \sum_{s' \in S} p(s, s') V(s') \right\}, \qquad (2.45)$$

where the first term inside the maximization corresponds to the action of restarting to state $s_0$
and the second term to the action of continuing with the current state $s$. An $\epsilon$-approximation of the
Gittins index of $s_0$, given by $V(s_0)$ (with the constant factor of $\frac{1}{1-\beta}$ ignored), can be computed
using the value iteration algorithm given in Algorithm 2.1.
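
Algorithm 2.1 itself is not reproduced in this extract; the sketch below is my own minimal stand-in. It runs plain value iteration on the restart-in-state MDP (2.45) and rescales the value by $(1-\beta)$ to recover the Gittins index of the queried state.

```python
import numpy as np

def gittins_index_restart(P, r, beta, s0, tol=1e-10):
    """Gittins index of state s0 via value iteration on the restart-in-state MDP (2.45)."""
    V = np.zeros(len(r))
    while True:
        restart = r[s0] + beta * P[s0] @ V    # value of resetting the arm to s0
        cont = r + beta * P @ V               # value of continuing from each state
        V_new = np.maximum(restart, cont)
        if np.max(np.abs(V_new - V)) < tol:
            return (1 - beta) * V_new[s0]     # undo the 1/(1 - beta) scaling
        V = V_new

# On the arm of Example 2.7 (states a, b, c), this recovers the indices computed offline.
P = np.array([[0.25, 0.5, 0.25], [0.5, 0.0, 0.5], [1.0, 0.0, 0.0]])
r = np.array([2.0, 5.0, 3.0])
print([round(gittins_index_restart(P, r, 0.8, s), 4) for s in range(3)])
```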

2.6 SEMI-MARKOV BANDIT PROCESSES


We have thus far focused on bandit problems where the decision times are discrete and pre-
determined, which, without loss of generality, can be set to the positive integers $t = 1, 2, \ldots$. In
this section, we consider bandit problems where the inter-decision times are random and state
dependent. We show that the index theorem carries through to such semi-Markov bandit prob-
lems.
Definition 2.8 The Semi-Markov Bandit Model:
Consider $N$ arms, each with state space $S_i$ ($i = 1, \ldots, N$). At each time $t \in \mathbb{R}_+$, only one arm
may be active. Time $t_0 = 0$ is the first decision time for choosing one arm to activate. Let $t_k$
denote the $k$th decision time. Suppose that at time $t_k$, arm $i$ is in state $s$ and is chosen for
activation. Then the following occur.
• A reward $r_i(s)$ is immediately obtained at time $t_k$.
• The state of arm $i$ at the next decision time $t_{k+1}$ is $s'$ with probability $p_i(s, s')$.
• The inter-decision time $T_i = t_{k+1} - t_k$ of arm $i$ is drawn, independently of previous inter-decision times, according to a distribution³ $F_i(\cdot \mid s, s')$ that depends on the states $s$ and $s'$
at times $t_k$ and $t_{k+1}$, i.e.,

$$\Pr\left[T_i \le x \mid s, s'\right] = F_i(x \mid s, s'). \qquad (2.46)$$

³We assume the usual technical condition on $F_i$ for semi-Markov decision processes to ensure that an infinite number of
state transitions will not occur in finite time (see, for example, Chapter 5 of Ross, 1970 [170] and Section 11.1 of Puterman,
2005 [164]).
The states of all passive arms remain frozen. At the next decision time $t_{k+1}$, determined by $T_i$
drawn from $F_i(\cdot \mid s, s')$, with arm $i$ in state $s'$, a decision on which arm to activate next is made,
and the process continues. The objective is to maximize the expected total discounted reward
with a discount factor $\beta$ ($0 < \beta < 1$).

We give below two examples of the semi-Markov bandit model.


Example 2.9 Continuous-Time Bandit Processes:
A bandit problem in which each arm, when activated, evolves according to a continuous-time
Markov process is an example of the semi-Markov bandit model, where the decision times
are given by the state transition times of activated arms. Let $Q_i = \{q_i(s, s')\}_{s, s' \in S_i}$ denote the
transition rate matrix of the continuous-time Markov process associated with
arm $i$. Then the state transition probabilities and the inter-decision times in the resulting semi-Markov bandit model are given by

$$p_i(s, s') = \frac{q_i(s, s')}{-q_i(s, s)}, \qquad T_i(s, s') \sim \exp\left(-q_i(s, s)\right). \qquad (2.47)$$

Note that in this case, the inter-decision time $T_i(s, s')$ depends only on the state $s$ in the current
decision period.
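
As a concrete illustration of (2.47) (my own, not from the original text), the embedded transition probabilities and exponential holding times can be read off a transition rate matrix mechanically; the two-state rate matrix below is an arbitrary example.

```python
import numpy as np

def embed_ctmc(Q, rng=np.random.default_rng(0)):
    """Embedded jump probabilities and one sampled holding time per state, per (2.47)."""
    rates = -np.diag(Q)                    # exit rate -q(s, s) of each state
    P = Q / rates[:, None]                 # p(s, s') = q(s, s') / (-q(s, s))
    np.fill_diagonal(P, 0.0)
    holding = rng.exponential(1.0 / rates) # T_i ~ exp(-q(s, s)), one draw per state
    return P, holding

Q = np.array([[-2.0, 2.0], [0.5, -0.5]])   # hypothetical transition rate matrix
P, T = embed_ctmc(Q)
print(P)   # [[0, 1], [1, 0]]
print(T)   # holding-time draws with rates 2 and 0.5
```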

Example 2.10 Bandit Processes with Restricted Switching States:

Consider a standard discrete-time bandit model as defined in Definition 2.2, with the exception that we cannot switch out of an active arm $i$ until it enters a state in a set $W_i \subseteq S_i$
($i = 1, 2, \ldots, N$). If $W_i = S_i$ for all $i$, we are back to the standard model. If each arm represents
a job and $W_i$ contains the completion state of job $i$, then this model represents a non-preemptive
job scheduling problem (we discuss the job scheduling problem further in Section 3.2.2).

For this problem, the decision time following an activation of an arm (say, $i$) is when the
arm enters a state in $W_i$. The state transition probabilities in the resulting semi-Markov bandit
model are $p_i(s, s') = 0$ for $s' \notin W_i$. For $s' \in W_i$, the transition probability $p_i(s, s')$ and the
inter-decision time $T_i(s, s')$ are given by the corresponding absorption probability and absorption
time in the modified Markov process with the states in $W_i$ made absorbing.

Analogous to (2.20), the index of a semi-Markov bandit process in state $s$ is given by

$$\nu_i(s) = \max_{\tau > 0} \frac{\mathbb{E}\left[\sum_{t_k < \tau} \beta^{t_k}\, r_i(S_i(t_k)) \,\middle|\, S_i(0) = s\right]}{\mathbb{E}\left[\int_{t=0}^{\tau} \beta^{t}\, dt \,\middle|\, S_i(0) = s\right]}, \qquad (2.48)$$

where the stopping time $\tau$ is constrained to the set of decision times $\{t_1, t_2, \ldots\}$, which are the
transition times of the semi-Markov process defined by $\{p_i(s, s')\}_{s, s' \in S_i}$ and $F_i(\cdot \mid s, s')$.

With relatively minor modifications to the proof given in Section 2.4, it can be shown
that playing the arm with the greatest index given in (2.48) at each decision time is optimal
for semi-Markov bandit problems. A detailed proof can be found in Section 2.8 of Gittins,
Glazebrook, and Weber, 2011 [90].

CHAPTER 3

Variants of the Bayesian Bandit


Model
It is natural to ask to what extent the bandit model can be generalized without
breaking the optimality of the Gittins index policy. Extensive studies have been carried out since
Gittins’s seminal work, aiming to push the applicability boundary of the index theorem. This
chapter is devoted to these efforts. We start by identifying delimiting features of the bandit
model that are necessary for the index theorem. We then cover a number of variants of the
bandit model that are significant in their technical extensions and general in their applicability.
Based on which key feature of the bandit model is being extended, we group these variants into
four classes: variations in the action space, in the system dynamics, in the reward structure, and
in the performance measure.

3.1 NECESSARY ASSUMPTIONS FOR THE INDEX THEOREM
We list below the key features that delimit the bandit model, followed by a discussion on which
modeling assumptions are necessary and which are not for the index theorem.

Modeling Assumptions of the Bandit Model:


• On the action space:

A1.1 Only one arm can be activated at each time.

A1.2 The arms are otherwise uncontrolled, that is, the decision maker chooses which arm
to operate, but not how to operate it.

A1.3 The set of arms is fixed a priori, and all arms are available for activation at all times.

• On the system dynamics:

A2.1 An active arm changes state according to a homogeneous Markov process.

A2.2 State evolutions are independent across arms.

A2.3 The states of passive arms are frozen.

• On the reward structure:

A3.1 An active arm offers a reward dependent only on its own state.

A3.2 Passive arms offer no reward.

A3.3 Arm switching incurs no cost.

• On the performance measure:

A4 Expected total (geometrically) discounted reward over an infinite horizon, as given
in (2.3).

3.1.1 MODELING ASSUMPTIONS ON THE ACTION SPACE


A1.1 is necessary for the index theorem. If $K > 1$ arms can be activated simultaneously, it is in
general suboptimal to activate the $K$ arms with the top $K$ greatest Gittins indices. This can be
easily seen from the following counterexample.

CounterExample 3.1 Consider Example 2.4 with $N = 3$ stacks of coins. The expected reward
sequence, starting from the top coin, is $(0.8, 0, 0, \ldots)$ for the first and the second stacks, and
$(0.6, 0.96, 0, 0, \ldots)$ for the third stack. At each time, we are allowed to flip $K = 2$ coins simultaneously. The discount factor $\beta$ is set to 0.5.

It is easy to see that at $t = 1$, the Gittins indices of the three arms are 0.8, 0.8, and
0.72, respectively. If we choose the two arms with the top two greatest Gittins indices to activate
at each time, we would flip the top coins of the first and the second stacks at $t = 1$,
accruing a total expected reward of $0.8 + 0.8 = 1.6$, then flip the top coin with bias 0.6 in
the third stack along with a zero-reward coin from either of the first two stacks at $t = 2$, and
finally flip the second coin with bias 0.96 in the third stack at $t = 3$. There is no reward to be
earned afterward. The total discounted reward is $1.6 + 0.6\beta + 0.96\beta^2 = 2.14$.

If, on the other hand, we flip the top coins of the first and the third stacks at $t = 1$ and
then the top coin of the second stack along with the second coin of the third stack at $t = 2$, we
obtain a total discounted reward of $(0.8 + 0.6) + (0.8 + 0.96)\beta = 2.28$. It is easy
to see that this is the optimal strategy.

Rather than activating the arms with the top $K$ greatest Gittins indices, one might wonder
whether there are other ways of extending the index policy that preserve optimality in the
case of simultaneous multi-arm activation. For instance, we may consider defining each subset
of $K$ arms as a super-arm with a $K$-dimensional state space and obtaining its index in a similar way
as (2.20). The hope here is that such an index would capture the maximum rate of the rewards
jointly offered by the $K$ arms, so that A1.1 is satisfied in terms of super-arms. Unfortunately, for such
a bandit model, A2.2 and A2.3 no longer hold. In particular, a passive super-arm may change
state if it shares common arms with the activated super-arm. As discussed below, both A2.2 and
A2.3 are necessary for the optimality of the index policy. The general structure of the optimal
policy in the case of simultaneous multi-arm activation remains open.
A1.2 is also necessary. When there exist multiple ways to activate an arm, each result-
ing in different rewards and associated with different state transition probabilities, the problem
is referred to as the bandit superprocess model. This variant in general does not admit an index
structure. Sufficient conditions, albeit restrictive, have been established for the optimality of the
index policy. We discuss this variant in detail in Section 3.2.1.
There are two general directions to relax A1.3. One is to allow precedence constraints on
available arms: an arm may not be available for activation until several other arms reach certain
states. Such precedence constraints are common in job scheduling where a job cannot start
until the completion of several other jobs. The other direction is to allow arrivals of new arms
throughout the decision horizon. We discuss in Section 3.2.2 classes of precedence constraints
and arrival processes of new arms that preserve the optimality of the index policy.

3.1.2 MODELING ASSUMPTIONS ON THE SYSTEM DYNAMICS


A2.1 on the Markovian dynamics of state evolution under activation is crucial to the MDP
formulation of the problem. It is, however, not necessary for the optimality of the index policy
as discussed in Section 2.3.3. While the Markovian assumption of the arm evolutions is not
necessary for the index theorem, it is crucial to the numerical algorithms for computing the
index as seen in both the offline and online algorithms discussed in Section 2.5. For arbitrary
reward processes, the index, given by the supremum value over all stopping times on an arbitrary
stochastic process (see (2.20)), is generally intractable to compute even numerically.
A2.2 on the statistical independence across arms is necessary for the optimality of the
index policy, as intuition suggests. Problems involving localized arm dependencies can be trans-
lated into the bandit superprocess model. Consider, for example, that arms 1 and 2 are dependent in
an $N$-arm bandit problem. We can treat arms 1 and 2 jointly as a superprocess with two control
actions (playing arm 1 or 2 when this superprocess is activated). The value of such a formula-
tion, however, depends on how local the dependencies are across arms and whether the resulting
superprocesses admit an index rule as an optimal solution.
A2.3 is necessary for the optimality of the index policy. It is perhaps the most crucial fea-
ture of the bandit model used in the proof of the optimality of the index policy. This assumption
ensures that an arm, if not played at the current time, promises the same future rewards except
for the exogenous discounting. Its reward process is simply frozen in time, waiting to be col-
lected in the future. Arm selection policies thus affect only how the N reward sequences offered
by these N arms are interleaved, but not the reward sequences themselves. The best interleaving
is the one that mitigates, to the maximum extent, the impact of discounting. It is then intuitive
that the optimal policy chooses the arm with the largest reward rate (given by the Gittins index)
to ensure that the reward segment with the highest equivalent constant rate is collected first,
hence experiencing the least discounting. We further illustrate the necessity of A2.3 with the
following counterexample.
CounterExample 3.2 Consider Example 2.4 with $N = 2$ stacks of coins and $\beta = 0.5$. The
expected reward sequences offered by the two arms are, respectively, $(0.8, 0, 0, \ldots)$ and
$(0.7, 0.6, 0, 0, \ldots)$. We now assume that A2.3 is violated by considering a scenario in which, each
time the second arm is not played, the top coin of its stack is removed without yielding any reward.

It is easy to see that if we follow the Gittins index policy, which plays arm 1 at $t = 1$, the
reward of 0.7 offered by the top coin in the second stack cannot be realized, and the total discounted reward is $0.8 + 0.6\beta = 1.1$. If, instead, we collect the two positive rewards offered by
arm 2 first, the total discounted reward would be $0.7 + 0.6\beta + 0.8\beta^2 = 1.2$.

While the Gittins index policy is generally suboptimal when A2.3 is violated, one may
question whether there are index forms that take into account the dynamics of passive arms and
offer optimality. This question was first explored by Whittle, 1988 [212], when he proposed
the restless bandit model and developed an index known as the Whittle index. We discuss this
important variant of the bandit model in Section 3.3.

3.1.3 MODELING ASSUMPTIONS ON THE REWARD STRUCTURE


A3.1 is necessary. Consider, for example, that arm 1 offers an exceptionally high reward when arm 2
is in a particular state. It is then desirable to first drive arm 2 to this particular state even though
its current state has a small index. In general, any dependency across arms, be it in the state
dynamics or the reward structure, threatens the optimality of an index policy that fundamentally
decouples the arms.
A3.2 is not necessary. When passive arms also offer rewards, a simple modification to the
Gittins index preserves the index theorem as discussed in Section 3.4. This variant often arises
in queueing networks where unserved queues incur holding costs (negative rewards).
A3.3 is necessary. We show in Section 3.4 that a bandit with switching cost can be cast
as a special restless bandit problem.

3.1.4 MODELING ASSUMPTIONS ON THE PERFORMANCE MEASURE


The Gittins index policy is optimal in maximizing the (geometrically) discounted reward over an
infinite horizon. A change in the performance measure may render it suboptimal. In particular,
consider a general discount sequence $\{\alpha_t\}_{t \ge 1}$ and the objective of maximizing the total reward
discounted according to $\{\alpha_t\}_{t \ge 1}$:

$$\mathbb{E}\left[\sum_{t=1}^{\infty} \alpha_t\, r\big(S(t), \pi(S(t))\big) \,\middle|\, S(1) = s\right]. \qquad (3.1)$$

It has been shown by Berry and Fristedt, 1985 [33], that for a general index theorem to hold,
the discount sequence must be geometric¹ in the form of $\alpha_t = \beta^{t-1}$ for some $\beta \in (0, 1)$ (see also
Section 3.4.2 of Gittins, Glazebrook, and Weber, 2011 [90]).
¹For continuous-time problems, the discount function must be exponential in the form of $\alpha_t = e^{-\gamma t}$ for some $\gamma > 0$.
This result also excludes the possibility of a general index theorem for the finite-horizon
problem, which corresponds to a discount sequence of $\alpha_t = 1$ for $t = 1, 2, \ldots, T$ and $\alpha_t = 0$ for
$t > T$ (see (2.2)). A finite-horizon MDP problem generally dictates a nonstationary policy for
optimality: as time approaches the end of the horizon, the optimal decision tends to be more
myopic and less forward planning. Being a stationary policy, the index policy is expected to be suboptimal, except in special cases. In particular, the luxury of achieving the maximum
equivalent constant reward rate (i.e., the index) by activating an arm until the optimal stopping
time $\tau^*$ may no longer be feasible if $\tau^*$ exceeds the remaining horizon. The optimal action needs
to strike a balance between the reward rate and the time required to achieve that rate. The
problem has an interesting connection with a stochastic version of the classical NP-hard knapsack problem, in which a burglar, with a knapsack of a given weight capacity (corresponding to $T$),
needs to choose from items of different values (in this case, the index values as the maximum
reward rates) and weights (the stopping times for achieving the index values). It is clear that
the optimal solution requires a joint consideration of all available items; any index policy that
decouples the items is suboptimal in general.

There are optimality criteria other than the geometrically discounted reward under which
the index theorem holds. In Section 3.5, we discuss two such criteria: the total undiscounted
reward over a policy-dependent random horizon and the average reward over an infinite horizon.
The former leads to the so-called stochastic shortest path bandit model arising in applications
concerned with finding the most rewarding (or the least costly) path to a terminating goal, such
as finding an object or accomplishing a task.

3.2 VARIATIONS IN THE ACTION SPACE


In this section, we discuss several variants of the bandit model that have a more complex action
space.

3.2.1 MULTITASKING: THE BANDIT SUPERPROCESS MODEL


The bandit superprocess model relaxes assumption A1.2 and allows multiple control options for
activating an arm. This variant and the restless bandit model discussed in Section 3.3 are ar-
guably the two most significant extensions to the bandit model in terms of their generality and
applicability. In particular, all forms of arm dependencies, be it dependencies in state evolution,
in reward structure, or in terms of precedence constraints, result in special cases of bandit su-
perprocess models. Specifically, dependent arms can be grouped together as a superprocess with
control options consisting of playing each constituent arm of the chosen superprocess.

Definition 3.3 The Superprocess Model:


Consider N Markov decision processes referred to as superprocesses, with state space
Si , action space Ai , reward function fri .s; a/gs2Si Ia2Ai , and state transition probabilities
fpi .s; s 0 I a/gs;s 0 2Si Ia2Ai for i D 1; 2; : : : ; N . At each time, two control decisions are made. First,
one of the superprocesses, say $i$, is chosen for continuation; the other $N-1$ superprocesses are frozen and offer no reward. Second, a control $a \in \mathcal{A}_i$ is selected and applied to superprocess $i$. The objective is to maximize the expected total discounted reward with a discount factor $\beta$. It is assumed that the control set $\mathcal{A}_i$ is finite for all $i$.

A juxtaposition of the superprocess model with the bandit model would illuminate the
significantly increased complexity of the former. An N -armed bandit problem consists of N
Markov reward processes, one associated with each arm. The objective is to interleave these
N reward processes into a single stream that has a maximum discounted sum. A superprocess
model consists of N Markov decision processes. It is a composite decision problem that controls,
at the inner level, the evolution of each of these N MDPs, and at the outer level, the interleave
of these constituent decision processes.
To see the connection between the bandit and the superprocess models, suppose that the inner-level control is fixed with a specific (stationary deterministic Markov) policy $\pi_i$ applied to superprocess $i$ for $i = 1, \ldots, N$. The superprocess problem then reduces to an $N$-armed bandit problem, with each arm given by a Markov reward process characterized by $\{r_i(s, \pi_i(s))\}_{s \in \mathcal{S}_i}$ and $\{p_i(s, s'; \pi_i(s))\}_{s, s' \in \mathcal{S}_i}$. This leads to another way of viewing the two levels of control in the superprocess model. Each superprocess consists of a collection of Markov reward processes, each corresponding to a stationary deterministic Markov policy. The inner control determines which $N$ reward processes, one from each superprocess, to interleave; the outer-level control determines how to interleave them. It is perhaps not surprising that in most cases, the optimal composition of the $N$ reward processes can only be determined by considering all $N$ superprocesses jointly. In other words, which reward process (equivalently, which control policy $\pi_i$) of a superprocess contributes the most to the discounted sum of the final interleaved single reward stream depends on the other $N-1$ reward processes that it will be interleaved with. This in general
renders an index-type policy suboptimal.
We give below a simple example showing that even when a superprocess is to be interleaved with a reward sequence of a constant value $\lambda$, which reward process of the superprocess should be chosen depends on the value of $\lambda$.

Example 3.4 Consider a bandit superprocess model consisting of a superprocess and a standard arm offering a constant reward $\lambda$. The superprocess has three states: $\mathcal{S} = \{1, 2, 3\}$. When in state 1, there are two control alternatives: $a_1$ and $a_2$. Control $a_1$ changes the state from 1 to 2 and offers a reward of $r(1, a_1)$. Control $a_2$ changes the state from 1 to 3 and offers a reward of $r(1, a_2)$. States 2 and 3 are absorbing states (i.e., with a self-loop of probability 1) with no control alternatives and offering rewards $r(2)$ and $r(3)$, respectively. Suppose that $r(1, a_2) > r(1, a_1) > r(2) \gg r(3)$ and the initial state is $S(1) = 1$.
It is easy to see that the superprocess with initial state 1 consists of two reward processes $\{r(1,a_1), r(2), r(2), \ldots\}$ and $\{r(1,a_2), r(3), r(3), \ldots\}$, corresponding to applying $a_1$ and $a_2$ at state 1, respectively. When the constant reward $\lambda$ of the standard arm satisfies $\lambda < r(3)$, the first reward sequence of the superprocess should be selected (i.e., the optimal control at state 1 is $a_1$) and the standard arm will never be activated. This leads to a total discounted reward of $r(1,a_1) + \beta\frac{r(2)}{1-\beta}$. When $r(1,a_2) > \lambda > r(1,a_1)$, it is optimal to select the second reward sequence of the superprocess by applying $a_2$ at the initial state 1 and then switching to the standard arm. This yields a total discounted reward of $r(1,a_2) + \beta\frac{\lambda}{1-\beta}$.
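As a quick numerical illustration of this dependence on $\lambda$, the sketch below evaluates the two options of Example 3.4 for two subsidy values; the specific reward values and discount factor are hypothetical choices satisfying the ordering assumed above, not taken from the text.

```python
# Hypothetical parameters satisfying r(1,a2) > r(1,a1) > r(2) >> r(3).
beta = 0.9
r1_a1, r1_a2 = 1.9, 2.0    # rewards at state 1 under controls a1 and a2
r2, r3 = 1.0, 0.01         # rewards of the absorbing states 2 and 3

def total_reward(first_reward, absorbing_reward, lam):
    """Play the superprocess once, then collect forever the better of the
    absorbing-state reward and the standard arm's constant reward lam."""
    return first_reward + beta * max(absorbing_reward, lam) / (1 - beta)

for lam in [0.0, 1.95]:
    v_a1 = total_reward(r1_a1, r2, lam)   # apply a1 at state 1
    v_a2 = total_reward(r1_a2, r3, lam)   # apply a2 at state 1
    print(f"lambda={lam}: a1 gives {v_a1:.2f}, a2 gives {v_a2:.2f}")
# The best inner control switches from a1 (lambda = 0) to a2 (lambda = 1.95).
```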

When multiple loosely coupled decision problems compete for early attention (and thus less discounting), the optimal strategy tends to be more myopic than the optimal policy for each individual decision problem considered in isolation. Adopting the optimal policy for each decision problem, and then using the index policy as the outer-level control for allocation, is in general suboptimal.
Nevertheless, the hope for an index-type optimal policy is not all lost. Next, we show
a necessary and sufficient condition for the index theorem and give a natural extension of the
Gittins index to a superprocess. We start with the latter.
For each Markov reward process associated with a given control policy $\pi_i$ for superprocess $i$ with initial state $s$, we can obtain its Gittins index as given in (2.20):
$$\nu_i(s; \pi_i) = \max_{\tau \geq 1} \frac{E_{\pi_i}\left[\sum_{t=1}^{\tau} \beta^{t-1} r_i\big(S_i(t), \pi_i(S_i(t))\big) \,\middle|\, S_i(1) = s\right]}{E_{\pi_i}\left[\sum_{t=1}^{\tau} \beta^{t-1} \,\middle|\, S_i(1) = s\right]}. \qquad (3.2)$$

A natural candidate for the index of superprocess $i$ at state $s$ is thus
$$\nu_i(s) = \max_{\pi_i} \nu_i(s; \pi_i), \qquad (3.3)$$
the largest Gittins index among all reward processes of superprocess $i$. Let $a_i^*(s)$ denote the optimal control at state $s$ that attains $\nu_i(s)$. The index policy is thus as follows: given the current states $(s_1, s_2, \ldots, s_N)$ of the $N$ superprocesses, the superprocess $i^*$ with the greatest index value $\nu_{i^*}(s_{i^*}) = \max_i \nu_i(s_i)$ is activated with control $a_{i^*}^*(s_{i^*})$.
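The index in (3.3) can be computed by brute force for small problems: enumerate the stationary deterministic inner policies of a superprocess and take the largest Gittins index of the induced reward processes. The sketch below does this using the restart-in-state characterization of the Gittins index (see Sections 2.5 and 3.5.1); the function names and the toy two-state superprocess are illustrative only.

```python
import itertools
import numpy as np

beta = 0.9  # illustrative discount factor

def gittins_index(r, P, s, iters=2000):
    """Gittins index of state s of a Markov reward process (r, P) via the
    restart-in-state characterization: nu(s) = (1 - beta) * value of the MDP
    in which one may either continue from the current state or restart in s."""
    V = np.zeros(len(r))
    for _ in range(iters):
        cont = r + beta * P @ V
        V = np.maximum(cont, cont[s])   # second option: act as if restarted in s
    return (1 - beta) * V[s]

def superprocess_index(r, P, s):
    """Candidate index (3.3): the largest Gittins index over all stationary
    deterministic inner policies. r[u][a] is a reward, P[u][a] a row of
    transition probabilities; feasible only for tiny state/action spaces."""
    n_states, n_actions = len(r), len(r[0])
    best = -np.inf
    for policy in itertools.product(range(n_actions), repeat=n_states):
        r_pi = np.array([r[u][policy[u]] for u in range(n_states)])
        P_pi = np.array([P[u][policy[u]] for u in range(n_states)])
        best = max(best, gittins_index(r_pi, P_pi, s))
    return best

# A toy two-state, two-action superprocess (numbers are illustrative).
r = [[1.0, 0.2], [0.5, 0.0]]
P = [[[0.0, 1.0], [1.0, 0.0]],
     [[0.3, 0.7], [0.0, 1.0]]]
print(superprocess_index(r, P, s=0))
```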
In order for any index policy, which decouples the superprocesses, to be optimal, it is necessary that whenever this index policy activates superprocess $i$ at state $s$, it chooses the same control $a$, irrespective of the states of the other superprocesses. In other words, each superprocess has a dominating reward process in its collection of Markov reward processes that should be chosen regardless of the $N-1$ reward processes that it will be interleaved with. It was shown by Whittle, 1980 [210], that this condition holds if and only if the same can be said about superprocess $i$ when it is to be interleaved with a constant reward sequence. We state this condition below.
Condition 3.5 Whittle Condition for the Superprocess Model:
A superprocess is said to satisfy the Whittle Condition if, in a bandit superprocess model consisting of this superprocess and a standard arm offering a constant reward $\lambda$, the optimal control policy for the superprocess is independent of $\lambda$. More specifically, there exists a control policy $\pi^*$ (a mapping from the state space $\mathcal{S}$ of the superprocess to the control set $\mathcal{A}$) such that, for all $s$ and $\lambda$ for which it is optimal to activate the superprocess, it is optimal to apply control $a = \pi^*(s)$.
When the Whittle condition holds for each constituent superprocess, the problem reduces to a bandit problem by considering, without loss of optimality, only the dominating reward process of each superprocess generated by its dominating policy $\pi_i^*$ (as defined in the Whittle condition). The policy associated with the index
$$\nu_i(s) = \max_{\pi_i} \nu_i(s; \pi_i) = \nu_i(s; \pi_i^*) \qquad (3.4)$$
is thus optimal.

Theorem 3.6 Consider the superprocess model defined in Definition 3.3. Suppose that every super-
process satisfies the Whittle condition. The index policy with respect to the index defined in (3.3) is
optimal.

This condition, however, is a strong one. Example 3.4 gives a glimpse into how easily it can be violated. There are, nevertheless, special classes of the superprocess model with rather general applicability that satisfy the Whittle Condition; one example is the class of stoppable bandits introduced by Glazebrook, 1979 [92].
Glazebrook, 1982 [93], showed that a general bandit superprocess model does not admit an index-type optimal policy. An upper bound on the value function of a general superprocess problem was established by Brown and Smith, 2013 [44], based on the so-called Whittle Integral constructed by Whittle, 1980 [210]. Hadfield-Menell and Russell, 2015 [96], developed efficient algorithms for computing the upper bound and $\epsilon$-optimal policies.

3.2.2 BANDITS WITH PRECEDENCE CONSTRAINTS


We now discuss relaxing modeling assumption A1.3 by allowing precedence constraints that
restrict the availability of arms. Bandits with precedence constraints often arise in stochastic
scheduling of jobs/projects. Jobs are the constituent arms in the resulting bandit model. Prece-
dence constraints, however, lead to dependencies among arms, and the problem falls within the
superprocess model. Being a special class of the superprocess model, however, it preserves the
index theorem when the precedence constraints or the job characteristics satisfy certain condi-
tions.
Jobs, precedence constraints, and the superprocess formulation: A generalized job is a bandit process with a terminating state representing the completion of the job beyond which no reward may be accrued. A job that yields a single reward at the time of completion and no rewards before or after completion is called a simple job. A simple job is thus characterized by a random processing time $X$ and a reward $r$ earned upon completion.
For a collection of $N$ jobs denoted by $\{1, 2, \ldots, N\}$, a set $\mathcal{C}$ of precedence constraints has elements of the form $\{i > j\}$ representing the restriction that the service of job $j$ can only commence after the completion of job $i$. Only scheduling strategies that comply with $\mathcal{C}$ are admissible.
A precedence constraint set $\mathcal{C}$ can be represented by a digraph $D_{\mathcal{C}}$. The vertices of the digraph are the jobs. An arc (i.e., a directed edge) exists from vertex $i$ to vertex $j$ if and only if $\mathcal{C}$ contains the precedence constraint $\{i > j\}$. Figures 3.1 and 3.2 show two examples.
Each connected component of $D_{\mathcal{C}}$ constitutes a superprocess. The state of the superprocess is the Cartesian product of the states of the jobs in this connected component. The control options are in choosing which unfinished job to serve when this superprocess is activated. In the example shown in Figure 3.2, there are two superprocesses, consisting of jobs $\{1, \ldots, 6\}$ and $\{7, \ldots, 11\}$, respectively.

Figure 3.1: Precedence constraints invalidating the index theorem.

Figure 3.2: Precedence constraints forming an out-forest (two out-trees: jobs 1–6, organized in four stages, and jobs 7–11).

Scheduling generalized jobs with preemption: We start with the most general formulation of stochastic scheduling: jobs are general bandit processes with the addition of terminating states, the precedence constraints are arbitrary, and preemption is allowed (i.e., an unfinished job can be interrupted and resumed at a later time without any loss). For this general problem, the index theorem does not hold, as shown by the following counterexample constructed based on Example 4.1 from Gittins, Glazebrook, and Weber, 2011 [90].

CounterExample 3.7 Consider three jobs $\{1, 2, 3\}$ with required service time $X_i$ and reward $r_i$ upon completion ($i = 1, 2, 3$). Suppose that $X_1$ takes values of 1 and $M$ ($M \gg 1$) with equal probability, $X_2 = X_3 = 1$, and $r_1 = 1$, $r_2 = 2$, and $r_3 = 64$. The precedence constraints are such that job 3 can only start after jobs 1 and 2 are completed (see Figure 3.1). Rewards are discounted such that completion of job $i$ at time $t$ contributes $\beta^{t-1} r_i$ to the total payoff, where the discount factor is $\beta = \frac{1}{2}$. Preemption is allowed.
The problem can be modeled as a single superprocess consisting of all three jobs. Each scheduling policy $\pi$ results in a Markov reward process with associated Gittins indices as given in (3.2). The scheduling strategy chosen under the index rule is the policy that maximizes the index in the given state (see (3.3)). We show next that the scheduling policy determined this way is suboptimal.
Since $r_2 > r_1$, it is easy to see that the optimal strategy $\pi^*$ is to schedule job 2 first, followed by job 1 and then job 3 upon the completion of job 1. The index policy, however, would schedule job 1 first, as detailed below.
Let $s_0$ denote the initial state of this superprocess at $t = 1$, when no job has received any service time. To compute the index $\nu(s_0)$, it suffices to compare two policies: the policy $\pi^*$ specified above and the policy $\pi'$ that schedules job 1 at $t = 1$, switches to job 2 at $t = 2$, switches back to job 1 at $t = 3$ if it has not been completed, and then schedules job 3. All other admissible policies would yield index values smaller than that of $\pi^*$ or $\pi'$. It can be shown that the optimal stopping times attaining the indices $\nu(s_0; \pi^*)$ and $\nu(s_0; \pi')$ are at the time when job 1 receives 1 unit of service time if it fails to terminate, and otherwise are at the completion of all jobs (i.e., $t = 3$). To simplify the calculation of the indices, we assume $M = \infty$, which does not affect the ordering of the indices for sufficiently large $M$. We thus have
$$\nu(s_0; \pi^*) = \frac{P[X_1 = 1]\left(r_2 + \beta r_1 + \beta^2 r_3\right) + P[X_1 = M]\, r_2}{P[X_1 = 1]\left(1 + \beta + \beta^2\right) + P[X_1 = M]\left(1 + \beta\right)} \approx 6.31 \qquad (3.5)$$
$$\nu(s_0; \pi') = \frac{P[X_1 = 1]\left(r_1 + \beta r_2 + \beta^2 r_3\right) + P[X_1 = M]\cdot 0}{P[X_1 = 1]\left(1 + \beta + \beta^2\right) + P[X_1 = M]\cdot 1} \approx 6.55. \qquad (3.6)$$
Since $\nu(s_0; \pi') > \nu(s_0; \pi^*)$, the index policy thus follows $\pi'$.
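The two index values in (3.5) and (3.6) are easy to check numerically; the short sketch below evaluates them with $\beta = 1/2$ and $M$ treated as infinite, as in the simplification above.

```python
# Numerical check of the indices (3.5)-(3.6) in CounterExample 3.7.
beta, r1, r2, r3 = 0.5, 1.0, 2.0, 64.0
p = 0.5  # P[X1 = 1] = P[X1 = M] = 1/2, with M treated as infinite

# pi*: job 2 first, then job 1 (stop after its first unit of service if it fails), then job 3.
idx_star = (p * (r2 + beta * r1 + beta**2 * r3) + (1 - p) * r2) / \
           (p * (1 + beta + beta**2) + (1 - p) * (1 + beta))

# pi': job 1 first (stop after its first unit of service if it fails), then jobs 2 and 3.
idx_prime = (p * (r1 + beta * r2 + beta**2 * r3) + (1 - p) * 0.0) / \
            (p * (1 + beta + beta**2) + (1 - p) * 1.0)

print(round(idx_star, 2), round(idx_prime, 2))  # 6.31 6.55
```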

The reason behind the suboptimality of the index policy in the above example is the dependency, through the common out-neighbor (i.e., job 3 in the precedence digraph depicted in Figure 3.1), between the reward streams associated with scheduling job 1 and scheduling job 2 at $t = 1$. These two alternatives thus cannot be considered in isolation by comparing their indices for choosing the optimal action.
A natural question is thus what topological structures of the precedence digraph would
preserve the index theorem. CounterExample 3.7 shows that the key is to ensure that at any
given time, all jobs ready for service should not share common out-neighbors. This is satisfied
by precedence constraints that form an out-forest as defined below.
Definition 3.8 An out-forest is a directed graph with no directed cycles and in which the
in-degree of each vertex is at most one.

The directed graph in Figure 3.2 is an out-forest with two out-trees. Since the in-degree
of each vertex is at most one, no two vertices in an out-forest share a common out-neighbor.
The index policy for scheduling generalized jobs with preemption and with precedence constraints forming an out-forest can be obtained through an interesting combination of backward induction with forward induction. We describe the policy using the example in Figure 3.2.
The two connected components in Figure 3.2 constitute the two superprocesses. The first step is to reduce each superprocess to a single Markov reward process, hence a bandit process, by fixing the inner-level control. Take the first superprocess as an example. The precedence constraints partition the decision horizon into four stages, punctuated in sequence by the completion times of jobs 1, 2, and 3. The optimal policy for this MDP can be obtained by combining a backward induction over these four stages and a forward induction (i.e., the Gittins index policy) within each stage. Specifically, the backward induction starts with the fourth stage, which begins upon the completion of job 3. The remaining decision problem is the scheduling of two sink vertices—jobs 5 and 6—that are two independent reward processes. The problem is thus a
two-armed bandit for which the index policy is optimal. Let $5 \oplus 6$ denote the reward process resulting from interleaving the reward streams of job 5 and job 6 based on the Gittins index policy. We now move backward in time to the beginning of the third stage at the completion time of job 2. The decision problem starting from this time instant is the scheduling of jobs 3 and 4. The problem is again a two-armed bandit, in which the reward process associated with scheduling job 3 is $(3, 5 \oplus 6)$, which concatenates the reward process of job 3 with $5 \oplus 6$ to account for the entire remaining horizon. The optimal strategy is the index policy, which results in a single reward process $(3, 5 \oplus 6) \oplus 4$ that optimally interleaves the reward process of $(3, 5 \oplus 6)$ and that of job 4. Continuing the backward induction to $t = 1$, we have reduced this superprocess to a single reward process given by $(1, 2, (3, 5 \oplus 6) \oplus 4)$. Repeat this procedure for the second superprocess consisting of jobs $\{7, \ldots, 11\}$. The outer-level control for scheduling these two superprocesses is
consisting of jobs f7; : : : ; 11g. The outer-level control for scheduling these two superprocesses is
thus reduced to a two-armed bandit problem with the two reward processes obtained in the first
step.
The following theorem, due to Glazebrook, 1976 [91], states the optimality of the index
policy.
Theorem 3.9 For the superprocess model for scheduling generalized jobs with preemption and with
precedence constraints forming an out-forest, the index policy is optimal.
3.2.3 OPEN BANDIT PROCESSES
The canonical bandit model is closed in the sense that the set of arms is fixed over the entire
decision horizon. Nash, 1973 [153], Whittle, 1981 [211], and Weiss, 1988 [209] considered
open bandit processes where new arms are continually appearing. Such open bandit processes
are referred to as arm-acquiring bandits by Whittle, 1981 [211], and branching bandit processes
by Weiss, 1988 [209].
Whittle provided a complete extension of the index theorem to open bandit processes
under the assumption that the arrival processes of new arms are i.i.d. over time and independent
of the states of currently present arms, all past actions, and the state transition of the activated
arm. Weiss considered a more general model where the arrivals may depend on the state of the
arm currently being activated.
It is worth pointing out that while new arms continually appear and the total number of arms grows indefinitely, the number of arm types, as characterized by state transition probabilities and reward functions, is fixed and finite. The open process also complicates the characterization of the index form. Differing from the closed model, the index associated with an arm type involves new arms of different types that may enter the system. Lai and Ying, 1988 [127], showed that under certain stability assumptions, the index rule of a much simpler closed bandit process is asymptotically optimal for the open bandit process as the discount factor $\beta$ approaches 1.

3.3 VARIATIONS IN THE SYSTEM DYNAMICS


As discussed earlier, modeling assumption A2.1 on the Markovian dynamics of active arms is
not necessary for the index theorem. A2.3 is, however, crucial to the optimality of Gittins index.
The next natural question to ask is whether a different index form might offer optimality when
A2.3 does not hold. This question was first considered by Whittle, 1988 [212], who proposed a
restless bandit model that allows passive arms to evolve. This variant is the focus of this section.

3.3.1 THE RESTLESS BANDIT MODEL

Definition 3.10 The Restless Multi-Armed Bandit Model:


Consider $N$ arms, each with state space $\mathcal{S}_i$ ($i = 1, \ldots, N$). At time $t = 1, 2, \ldots$, based on the observed states $[S_1(t), \ldots, S_N(t)]$ of all arms, $K$ ($1 \le K < N$) arms are activated. The reward and the state transition probabilities of arm $i$ at state $s$ are $\{r_i(s), p_i(s, s')\}_{s, s' \in \mathcal{S}_i}$ when it is active and $\{\mathring{r}_i(s), \mathring{p}_i(s, s')\}_{s, s' \in \mathcal{S}_i}$ when it is passive. The performance measure can be set to either the expected total discounted reward with a discount factor $\beta$ ($0 < \beta < 1$) or the average reward, both over an infinite horizon.

The above restless bandit model extends the original bandit model in three aspects: simultaneous activation of multiple arms, rewards offered by passive arms, and state evolution of passive arms. Allowing passive arms to change state is the most salient feature of the restless bandit model, hence the name. Simultaneous play of $K$ arms, while by itself sufficient to render the Gittins index policy suboptimal, is not essential to the general theory for restless bandits. The general results apply equally well for $K = 1$ and $K > 1$, with the exception that the asymptotic optimality of the Whittle index policy requires $K$ growing proportionally with $N$ as $N$ approaches infinity. Allowing passive arms to offer rewards does not change the problem in any essential way; one may, without loss of generality, assume $\mathring{r}_i(s) = 0$ for all $i$ and $s$. As we will see in Section 3.4, a simple modification to the Gittins index preserves the index theorem for a bandit model with rewards under passivity. A similar argument holds for the restless bandit model.
The restless bandit model significantly broadens the applicability of the bandit model
in diverse application domains. For instance, in communication networks, the condition of a
communication channel or routing path may change even when it is not utilized. In queueing
systems, queues continue to grow due to new arrivals even when they are not served. In target
tracking, targets continue to move when they are not monitored. These applications fall under
the restless bandit model.

3.3.2 INDEXABILITY AND WHITTLE INDEX


To introduce the concepts of indexability and the Whittle index, it suffices to consider a single arm. We thus omit the arm subscript from the notation. We consider the discounted reward criterion; the average-reward case can be treated similarly.
Definition 3.11 A Single-Armed Restless Bandit Problem:
Consider a single-armed restless bandit process with reward functions and state transition probabilities given by $\{r(s), p(s, s')\}_{s, s' \in \mathcal{S}}$ when it is active and $\{\mathring{r}(s), \mathring{p}(s, s')\}_{s, s' \in \mathcal{S}}$ when it is passive. The objective is an activation policy that maximizes the expected total discounted reward with a discount factor $\beta$ ($0 < \beta < 1$).

A single-armed restless bandit process is an MDP with two possible actions, $\{0 \text{ (passive)}, 1 \text{ (active)}\}$, corresponding to whether the arm is made active or passive. The optimal policy $\pi^*$, which is a stationary deterministic Markov policy (see Theorem 2.1), defines a partition of the state space $\mathcal{S}$ into a passive set $\{s : \pi^*(s) = 0\}$ and an active set $\{s : \pi^*(s) = 1\}$, where $\pi^*(s)$ denotes the action at state $s$ under policy $\pi^*$.
The Whittle index measures how rewarding it is to activate the arm at state $s$ based on the concept of subsidy for passivity. Specifically, we modify the single-armed restless bandit process by introducing a subsidy $\lambda \in \mathbb{R}$, which is a constant additional reward accrued each time the arm is made passive. This subsidy $\lambda$ changes the optimal partition of the passive and active sets: when $\lambda$ approaches negative infinity, the optimal passive set approaches the empty set; when $\lambda$ approaches positive infinity, the optimal passive set approaches the entire state space $\mathcal{S}$. Intuitively, a state that requires a greater subsidy $\lambda$ (i.e., a greater additional incentive) for it to join the passive set
should enjoy higher priority for activation. This is the basic idea behind the Whittle index which
we define below.
Let $V_\lambda(s)$ denote the value function representing the maximum expected total discounted reward that can be accrued from the single-armed restless bandit process with subsidy $\lambda$ when the initial state is $s$. Considering the two possible actions at the first decision time, we have
$$V_\lambda(s) = \max\{Q_\lambda(s; a = 0),\ Q_\lambda(s; a = 1)\}, \qquad (3.7)$$
where $Q_\lambda(s; a)$ denotes the expected total discounted reward obtained by taking action $a$ at $t = 1$ followed by the optimal policy for $t > 1$, and is given by
$$Q_\lambda(s; a = 0) = \left(\mathring{r}(s) + \lambda\right) + \beta \sum_{s' \in \mathcal{S}} \mathring{p}(s, s') V_\lambda(s'), \qquad (3.8)$$
$$Q_\lambda(s; a = 1) = r(s) + \beta \sum_{s' \in \mathcal{S}} p(s, s') V_\lambda(s'). \qquad (3.9)$$
The first terms in (3.8) and (3.9) are the immediate rewards obtained at $t = 1$ under the respective actions $a = 0$ and $a = 1$. The second terms are the total discounted rewards in the remaining horizon, determined by the value function due to the adoption of the optimal policy for $t > 1$. The optimal policy $\pi_\lambda^*$ under subsidy $\lambda$ is given by
$$\pi_\lambda^*(s) = \begin{cases} 1, & \text{if } Q_\lambda(s; a = 1) > Q_\lambda(s; a = 0) \\ 0, & \text{otherwise,} \end{cases} \qquad (3.10)$$
where we have chosen to make the arm passive when the two actions are equally rewarding. The passive set $\mathcal{P}(\lambda)$ under subsidy $\lambda$ is given by
$$\mathcal{P}(\lambda) = \{s : \pi_\lambda^*(s) = 0\} = \{s : Q_\lambda(s; a = 0) \ge Q_\lambda(s; a = 1)\}. \qquad (3.11)$$

It is intuitive to expect that as the subsidy $\lambda$ increases, the passive set $\mathcal{P}(\lambda)$ grows monotonically. In other words, if the arm is made passive at state $s$ under a subsidy $\lambda$, it should also be made passive at this state under a subsidy $\lambda' > \lambda$. Unfortunately, as shown in the counterexample below, this need not hold in general.

CounterExample 3.12 Consider an arm with state space $\{1, 2, 3\}$. At $s = 1$, the state transits, deterministically, to 2 and 3 under $a = 1$ and $a = 0$, respectively; states 2 and 3 are absorbing states regardless of the actions. The rewards are given by $r(1) = \mathring{r}(1) = 0$, $r(2) = 0$, $\mathring{r}(2) = 10$, $r(3) = 11$, $\mathring{r}(3) = 0$. The discount factor is $\beta = \frac{3}{4}$.
It is not difficult to see that when the subsidy $\lambda$ is set to 0, the optimal action at $s = 1$ is to be passive so that the arm enters state 3 and we collect the highest reward $r(3) = 11$ by choosing the active action at every subsequent time. When the subsidy increases to $\lambda = 2$, however, the optimal action at $s = 1$ is to be active so that the arm enters state 2 and we collect the reward plus subsidy $\mathring{r}(2) + \lambda = 12$ at every subsequent time. The passive set $\mathcal{P}(\lambda)$ actually decreases from $\{1, 2\}$ to $\{2\}$ when the subsidy increases from $\lambda = 0$ to $\lambda = 2$.
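The non-monotonicity can be verified by direct value iteration on (3.7)–(3.9). The sketch below uses the parameters of CounterExample 3.12 as reconstructed above and prints the passive set at the two subsidy levels.

```python
import numpy as np

beta = 0.75
r_active  = np.array([0.0,  0.0, 11.0])   # r(1), r(2), r(3)
r_passive = np.array([0.0, 10.0,  0.0])   # passive rewards
P_active  = np.array([[0, 1, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
P_passive = np.array([[0, 0, 1], [0, 1, 0], [0, 0, 1]], dtype=float)

def passive_set(lam, iters=3000):
    """Value iteration on (3.7)-(3.9); returns the passive set P(lambda)."""
    V = np.zeros(3)
    for _ in range(iters):
        Q0 = r_passive + lam + beta * P_passive @ V
        Q1 = r_active + beta * P_active @ V
        V = np.maximum(Q0, Q1)
    return {s + 1 for s in range(3) if Q0[s] >= Q1[s]}

print(passive_set(0.0))  # {1, 2}
print(passive_set(2.0))  # {2}: state 1 has left the passive set
```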

When the passive set $\mathcal{P}(\lambda)$ is not monotonic in $\lambda$, the subsidy for passivity does not induce a consistent ordering of the states in terms of how rewarding it is to activate the arm, and thus may not be a meaningful measure for the index. This is the rationale behind the notion of indexability.

Definition 3.13 Indexability of Restless Bandit Model:


An arm in a restless bandit model is indexable if the passive set $\mathcal{P}(\lambda)$ under subsidy $\lambda$ increases monotonically from $\emptyset$ to $\mathcal{S}$ as $\lambda$ increases from $-\infty$ to $+\infty$. A restless multi-armed bandit model is indexable if every arm is indexable.

We illustrate the concept of indexability in Figure 3.3. When an arm is indexable, there is a consistent ordering of all its states in terms of when each state enters the passive set $\mathcal{P}(\lambda)$ as the subsidy $\lambda$ increases from $-\infty$ to $+\infty$. As illustrated in Figure 3.3, imagine the states as pebbles in an urn. If we liken increasing the subsidy $\lambda$ to pouring water into the urn, the order in which the pebbles (states) get submerged in water (enter the passive set) induces a consistent and meaningful measure of the positions of the pebbles above the bottom of the urn (how rewarding it is to be active at each state). If an arm is not indexable, then there exists a state, such as state 5 in Figure 3.3, that jumps out of the "water" (the passive set) as the water level (the subsidy) increases from $\lambda$ to $\lambda'$. The value of $\lambda$ at which each state enters the passive set then does not induce a consistent ordering of the states. If it does not provide a consistent measure for comparing states, it may not be a valid candidate for an index that aims to rank arms based on their states (consider, for example, the case where all arms are stochastically identical with the same state space).

Figure 3.3: Indexability and monotonicity of the passive set of a restless bandit model (left: an indexable arm; right: a non-indexable arm in which state 5 leaves the passive set as the subsidy increases).

If an arm is indexable, its Whittle index $\nu(s)$ of state $s$ is defined as the least value of the subsidy $\lambda$ under which it is optimal to make the arm passive at $s$, or equivalently, the minimum subsidy $\lambda$ that makes the passive and active actions equally rewarding:
$$\nu(s) = \inf_{\lambda} \left\{\lambda : \pi_\lambda^*(s) = 0\right\} = \inf_{\lambda} \left\{\lambda : Q_\lambda(s; a = 0) = Q_\lambda(s; a = 1)\right\}. \qquad (3.12)$$
The definitions of indexability and the Whittle index directly lead to the following result on the optimal policy for a single-armed restless bandit problem.
Theorem 3.14 For the single-armed restless bandit problem given in Definition 3.11, if it is indexable, then the policy that activates the arm if and only if the current state has a positive Whittle index is optimal, i.e.,
$$\pi^*(s) = \begin{cases} 1 & \text{if } \nu(s) > 0 \\ 0 & \text{if } \nu(s) \le 0. \end{cases} \qquad (3.13)$$
Similarly, with a given subsidy $\lambda$, activating the arm at states whose Whittle indices exceed $\lambda$ is optimal.
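For an indexable arm with a finite state space, the Whittle index in (3.12) can be computed numerically: by Definition 3.13, the map from $\lambda$ to the indicator of $s \in \mathcal{P}(\lambda)$ is monotone, so the threshold can be located by bisection, with each evaluation solving (3.7)–(3.9) by value iteration. The sketch below is one such implementation (the function names and search bracket are arbitrary choices).

```python
import numpy as np

def whittle_index(r_active, r_passive, P_active, P_passive, beta, s,
                  lo=-1e3, hi=1e3, tol=1e-6):
    """Whittle index (3.12) of state s for an indexable arm: the smallest subsidy
    lambda at which the passive action is (weakly) preferred at s, found by bisection."""
    n = len(r_active)

    def passive_preferred(lam, iters=2000):
        # Value iteration on (3.7)-(3.9) under subsidy lam.
        V = np.zeros(n)
        for _ in range(iters):
            Q0 = r_passive + lam + beta * P_passive @ V
            Q1 = r_active + beta * P_active @ V
            V = np.maximum(Q0, Q1)
        return Q0[s] >= Q1[s]

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if passive_preferred(mid) else (mid, hi)
    return 0.5 * (lo + hi)
```

For a non-indexable arm, the set of subsidies under which $s$ is passive need not be an interval, and the value returned by the bisection is not meaningful.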

A discussion on the relation between the Whittle index and the Gittins index is in order. The bandit model is a special case of the restless bandit model with $\mathring{p}(s, s) = 1$ and $\mathring{r}(s) = 0$ for all $s$ and for all arms. When a restless bandit problem reduces to a bandit problem, the Whittle index reduces to the Gittins index. This can be seen by noticing that once the passive action is taken at a particular time $t_0$, it will be taken at every $t > t_0$ since the state of the arm is frozen. Thus, a constant subsidy $\lambda$ is to be collected at each time $t > t_0$, which can be viewed as switching to a standard arm offering a constant reward $\lambda$. The Whittle index induced by the subsidy for passivity thus reduces to the Gittins index calibrated by a standard arm with a constant reward.
A less obvious fact is that the canonical bandit model defined in Definition 2.8 is indexable under Whittle's definition given in Definition 3.13. This was established by Whittle in his original paper on restless bandits (Whittle, 1988 [212]). The basic idea is to show that a state $s$ in the passive set remains in the passive set as the subsidy increases. This is equivalent to a nonnegative slope of the gain of $a = 0$ over $a = 1$ for $\lambda > \nu(s)$, i.e.,
$$\frac{\partial}{\partial \lambda}\Big(Q_\lambda(s; a = 0) - Q_\lambda(s; a = 1)\Big) \ge 0 \qquad (3.14)$$
for $\lambda > \nu(s)$. The above can be shown based on (3.8)–(3.9) and the fact that the derivative of the value function with respect to $\lambda$ is the expected total discounted time that the arm is made passive. The latter is quite intuitive: when the subsidy for passivity $\lambda$ increases, the rate at which the total discounted reward $V_\lambda(s)$ increases is determined by how often the arm is made passive. The proof is completed by noticing that the total discounted time in passivity starting at a state $s$ in the passive set (i.e., $\lambda > \nu(s)$) is the entire discounted horizon $\frac{1}{1-\beta}$, due to the frozen state under passivity in the bandit model.
For a general restless bandit problem, establishing its indexability can be complex. General
sufficient conditions remain elusive. Nevertheless, a number of special classes of the restless ban-
dit model have been shown to be indexable. See, for example, Glazebrook, Ruiz-Hernandez, and
Kirkbride, 2006 [94], for specific restless bandit processes that are indexable and Nino-Mora,
2001, 2007b [154, 158], for a numerical approach for testing indexability and calculating Whit-
tle index for finite-state bandit problems. In Section 6.1.1, we discuss a restless bandit model
arising in communication networks for which the indexability can be established analytically
and the Whittle index obtained in closed form.

3.3.3 OPTIMALITY OF WHITTLE INDEX POLICY


We now return to the general restless bandit model with N > 1 arms defined in Definition 3.10.
The Whittle index policy for a general restless bandit problem is to activate the K arms of
currently greatest Whittle indices. It is in general suboptimal. In this section, we discuss several
scenarios in which its optimality can be established.

Optimality under a relaxed constraint: In the original restless bandit model, a hard constraint
on the number of active arms is imposed for each decision time. Consider a relaxed scenario
where the constraint is imposed on the expected total number of arms activated over the entire
horizon rather than at each given time. Specifically, suppose that a unit cost is incurred each
time an arm is activated. The constraint is on the total cost, discounted in the same way as the
rewards:
$$E\left[\sum_{i=1}^{N}\sum_{t=1}^{\infty} \beta^{t-1} I\big(a_i(t) = 1\big) \,\middle|\, \mathbf{S}(1) = \mathbf{s}\right] = \frac{K}{1 - \beta}, \qquad (3.15)$$

where $\mathbf{S}(1) = [S_1(1), \ldots, S_N(1)]$ is the state vector. Activating exactly $K$ arms each time is one way to satisfy this constraint. We thus have the following relaxed MDP problem that admits a larger set of policies:
$$\begin{aligned} \text{maximize} \quad & E\left[\sum_{i=1}^{N}\sum_{t=1}^{\infty} \beta^{t-1} r_i\big(S_i(t); a_i(t)\big) \,\middle|\, \mathbf{S}(1) = \mathbf{s}\right] \\ \text{subject to} \quad & E\left[\sum_{i=1}^{N}\sum_{t=1}^{\infty} \beta^{t-1} I\big(a_i(t) = 0\big) \,\middle|\, \mathbf{S}(1) = \mathbf{s}\right] = \frac{N - K}{1 - \beta}, \end{aligned} \qquad (3.16)$$

where we have changed the constraint to an equivalent one on the number of passive arms for
reasons that will become clear later. The theorem below gives the optimal policy for this relaxed
problem in terms of the Whittle index.

Theorem 3.15 Let $\bar{V}(\mathbf{s})$ denote the value function of the relaxed restless bandit problem given in (3.16). We have
$$\bar{V}(\mathbf{s}) = \inf_{\lambda} \left\{\sum_{i=1}^{N} V_\lambda^{(i)}(s_i) - \lambda\,\frac{N - K}{1 - \beta}\right\}, \qquad (3.17)$$
where $V_\lambda^{(i)}(s_i)$ is the value function of arm $i$ with subsidy $\lambda$ as given in (3.7). This optimal value is achieved by a policy that activates, at each time, all arms whose current Whittle indices exceed a threshold $\lambda^*$. This threshold $\lambda^*$ is the value of $\lambda$ that achieves the minimum in (3.17), which ensures that the constraint in (3.16) is met.
Proof. The proof is based on a Lagrangian relaxation of the constrained optimization given in (3.16). The resulting Lagrange function with multiplier $\lambda$ is
$$E\left[\sum_{i=1}^{N}\sum_{t=1}^{\infty} \beta^{t-1} r_i\big(S_i(t); a_i(t)\big) \,\middle|\, \mathbf{S}(1) = \mathbf{s}\right] \qquad (3.18)$$
$$+\; \lambda \left(E\left[\sum_{i=1}^{N}\sum_{t=1}^{\infty} \beta^{t-1} I\big(a_i(t) = 0\big) \,\middle|\, \mathbf{S}(1) = \mathbf{s}\right] - \frac{N - K}{1 - \beta}\right) \qquad (3.19)$$
$$=\; E\left[\sum_{i=1}^{N}\sum_{t=1}^{\infty} \beta^{t-1} \Big(r_i\big(S_i(t); a_i(t)\big) + \lambda I\big(a_i(t) = 0\big)\Big) \,\middle|\, \mathbf{S}(1) = \mathbf{s}\right] - \lambda\,\frac{N - K}{1 - \beta} \qquad (3.20)$$
$$=\; \sum_{i=1}^{N} V_{\lambda,\pi}^{(i)}(s_i) - \lambda\,\frac{N - K}{1 - \beta}, \qquad (3.21)$$
where $V_{\lambda,\pi}^{(i)}(s_i)$ denotes the expected total discounted reward obtained from arm $i$ under policy $\pi$ when a subsidy $\lambda$ is obtained whenever the arm is made passive. Note that in the above Lagrange function, the arms are decoupled except for sharing a common value $\lambda$ of the subsidy. The optimal policy is thus the one that maximizes the discounted reward obtained from each subsidized arm. From Theorem 3.14, the optimal policy is to activate arm $i$ ($i = 1, \ldots, N$) if and only if its current Whittle index exceeds the subsidy $\lambda^*$, which is the critical value of the Lagrangian multiplier. The expression for $\lambda^*$ follows from the fact that $V_\lambda^{(i)}(s_i)$, being the maximum of the linear functions $V_{\lambda,\pi}^{(i)}(s_i)$ of $\lambda$, is convex and increasing in $\lambda$.
The above proof shows clearly the interpretation of the subsidy $\lambda$ as the Lagrange multiplier for the relaxed constraint given in (3.16). Under the relaxed constraint, the Whittle index policy is optimal.
Let $V^*(\mathbf{s})$ and $V_W(\mathbf{s})$ denote, respectively, the value of the optimal policy and that of the Whittle index policy for the original restless bandit model with the strict activation constraint. It is easy to see that
$$V_W(\mathbf{s}) \le V^*(\mathbf{s}) \le \bar{V}(\mathbf{s}); \qquad (3.22)$$
i.e., $\bar{V}(\mathbf{s})$ as specified in (3.17) provides an upper bound on the optimal performance. It can be used as a benchmark for gauging the performance of policies under the original restless bandit model.
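Because each term in (3.17) is a single-arm value function, the benchmark can be evaluated numerically by solving the subsidized single-arm problems over a grid of $\lambda$ values and taking the minimum of the resulting Lagrangian. The sketch below does this; the helper names and the grid-based approximation of the infimum are implementation choices, not part of the text.

```python
import numpy as np

def subsidized_value(r_active, r_passive, P_active, P_passive, beta, lam, iters=2000):
    """Single-arm value function V_lambda of (3.7)-(3.9), by value iteration."""
    V = np.zeros(len(r_active))
    for _ in range(iters):
        Q0 = r_passive + lam + beta * P_passive @ V
        Q1 = r_active + beta * P_active @ V
        V = np.maximum(Q0, Q1)
    return V

def relaxed_upper_bound(arms, states, K, beta, lam_grid):
    """Approximate the bound (3.17): `arms` is a list of tuples
    (r_active, r_passive, P_active, P_passive); `states` gives each arm's initial state."""
    N = len(arms)
    values = []
    for lam in lam_grid:
        total = sum(subsidized_value(*arm, beta, lam)[s] for arm, s in zip(arms, states))
        values.append(total - lam * (N - K) / (1 - beta))
    return min(values)
```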
Asymptotic optimality: Under the strict constraint that exactly $K$ arms be activated each time, Whittle conjectured that the index policy is asymptotically optimal under the average reward criterion. Suppose that the $N$ arms are of $L$ different types. Arms of the same type $l$ are stochastically identical with the same Markovian dynamics $\{p_l(s, s'), \mathring{p}_l(s, s')\}$ and the same reward functions $\{r_l(s)\}$ and $\{\mathring{r}_l(s)\}$. Consider that as $N$ approaches infinity, the composition of the arms approaches a limit $\{\alpha_l\}_{l=1}^{L}$, where $\alpha_l$ is the proportion of type-$l$ arms. For such a population of arms with a limiting composition, we are interested in the average reward per arm as $N$ approaches infinity and $K$ grows proportionally, i.e., $K/N$ tends to a limit $\gamma$.
Consider first a single arm of type $l$. Let $g_l(\lambda)$ denote the optimal average reward offered by this arm with subsidy $\lambda$:
$$g_l(\lambda) = \sup_{\pi} \lim_{T \to \infty} \frac{1}{T}\, E\left[\sum_{t=1}^{T} r_l(S(t))\, I\big(\pi(S(t)) = 1\big) + \big(\lambda + \mathring{r}_l(S(t))\big)\, I\big(\pi(S(t)) = 0\big) \,\middle|\, S(1) = s\right], \qquad (3.23)$$

where the subscript $l$ denotes the arm type rather than the arm label. We have also assumed, for simplicity, that the arm states communicate well enough (more specifically, all policies result in a Markov chain with a single recurrent class) that $g_l(\lambda)$ is independent of the initial state. A result corresponding to Theorem 3.14 under the average reward criterion states that $g_l(\lambda)$ is achieved by activating the arm at states with Whittle indices exceeding $\lambda$, where the Whittle index is defined under the average reward criterion.
Let $\bar{g}(N, \gamma)$ denote the optimal average reward under the relaxed constraint that the time-averaged number of active arms equals $\gamma N$. Based on a result similar to Theorem 3.15 for the average reward criterion, we have
$$\lim_{N \to \infty} \frac{\bar{g}(N, \gamma)}{N} = \inf_{\lambda} \left[\sum_{l=1}^{L} \alpha_l\, g_l(\lambda) - \lambda(1 - \gamma)\right]. \qquad (3.24)$$

Let $g_W(N, \gamma)$ denote the average reward per arm obtained by the Whittle index policy that activates exactly the $K = \gamma N$ arms with the greatest indices. We have the following conjecture by Whittle.

Conjecture 3.16 Suppose that all arms are indexable. Then
$$\lim_{N \to \infty} g_W(N, \gamma) = \lim_{N \to \infty} \frac{\bar{g}(N, \gamma)}{N}. \qquad (3.25)$$

Since the relaxed problem defines an upper bound on the original restless bandit model, (3.25) implies the asymptotic optimality of the Whittle index policy in terms of average reward per arm. The rationale behind this conjecture is based on the law of large numbers. For sufficiently large $N$, the number of arms with indices exceeding the threshold $\lambda^*$ (the value of $\lambda$ achieving the infimum in (3.24)) deviates from $\gamma N$ only by a term of order at most $N^{\frac{1}{2}}$. The performance loss per arm of the Whittle index policy as compared to the optimal policy under the relaxed constraint thus diminishes as $N$ approaches infinity.
Based on a fluid approximation of the Markov processes resulting from the Whittle index policy, Weber and Weiss, 1990, 1991 [206, 207], have shown that the conjecture is true if the differential equations of the fluid approximation have an asymptotically stable equilibrium point.
This condition can be difficult to check for a general restless bandit model, but is shown to hold
for all restless bandits with a state space size no greater than 3.
Optimality in special cases: A number of interesting cases exist in the literature in which the
Whittle index policy was shown to be optimal or near optimal. For instance, Lott and Teneketzis, 2000 [139], cast the problem of multichannel allocation in single-hop mobile networks with
multiple service classes as a restless bandit problem and established sufficient conditions for the
optimality of a myopic-type index policy. Raghunathan et al., 2008 [166], considered multicast
scheduling in wireless broadcast systems with strict deadlines and established the indexability
and obtained the analytical form of the Whittle index for the resulting restless bandit problem.
Ehsan and Liu, 2004 [76], formulated a bandwidth allocation problem in queuing systems as
a restless bandit and established the indexability. They further obtained the Whittle index in
closed-form and established sufficient conditions for the optimality of the Whittle index pol-
icy. The restless bandit model has also been applied to economic systems for handling inventory
regulation by Veatch and Wein, 1996 [201]. In Section 6.1.1, we consider a restless bandit prob-
lem arising from communication networks for which indexability and the performance of the
Whittle index policy can be established analytically.

3.3.4 COMPUTATIONAL APPROACHES TO RESTLESS BANDITS


In a series of papers, Bertsimas and Nino-Mora, 2000 [35] and Nino-Mora, 2001,
2002, 2006 [154–156] developed algorithmic and computational approaches based on linear
programming for establishing indexability and computing priority indexes for a general rest-
less bandit model. A comprehensive survey on this line of work can be found in Nino-Mora,
2007 [158]. Glazebrook, Ruiz-Hernandez, and Kirkbride, 2006 [94] gave several examples of
indexable families of restless bandit problems.

3.4 VARIATIONS IN THE REWARD STRUCTURE


3.4.1 BANDITS WITH REWARDS UNDER PASSIVITY
The assumption that passive arms generate no reward is not necessary for the optimality of the Gittins index policy. This is welcome news in particular for applications such as queueing systems and job scheduling, where a holding cost is incurred for each unserved queue/job. This type of bandit problem, where passive arms incur costs rather than active arms generating rewards, is often referred to as the tax problem.
The Gittins index can be extended, quite straightforwardly, to an arm with a non-zero reward function $\mathring{r}(s)$ in passivity. Recall that when $\mathring{r}(s) = 0$ for all $s \in \mathcal{S}$, the Gittins index is the maximum reward per discounted time attainable by activating the arm at least once ($\tau \ge 1$) and then stopping at an optimal time $\tau^*$ (see (2.20)). With a non-zero $\mathring{r}(s)$, the index, measuring the value of activating the arm at a particular state, should be given by the maximum additional reward per discounted time attainable by activating the arm up to an optimal stopping time $\tau^* \ge 1$, as compared to being passive all through. We thus have
$$\nu(s) = \max_{\tau \ge 1} \frac{E\left[\sum_{t=1}^{\tau} \beta^{t-1} r(S(t)) \,\middle|\, S(1) = s\right] + \frac{1}{1-\beta}\, E\left[\beta^{\tau} \mathring{r}(S(\tau)) - \mathring{r}(s) \,\middle|\, S(1) = s\right]}{E\left[\sum_{t=1}^{\tau} \beta^{t-1} \,\middle|\, S(1) = s\right]}. \qquad (3.26)$$
A more intuitive interpretation of this index form is the subsidy-for-passivity concept: $\nu(s)$ is the minimum subsidy for passivity needed to make the active and passive actions equally attractive at state $s$. Specifically, under a subsidy for passivity given by the index $\nu(s)$, the total discounted reward for activating the arm until a stopping time $\tau$ is upper bounded by that of being passive all through:
$$\sum_{t=1}^{\infty} \beta^{t-1}\big(\nu(s) + \mathring{r}(s)\big) \;\ge\; E\left[\sum_{t=1}^{\tau} \beta^{t-1} r(S(t)) \,\middle|\, S(1) = s\right] + E\left[\sum_{t=\tau+1}^{\infty} \beta^{t-1}\big(\mathring{r}(S(\tau)) + \nu(s)\big) \,\middle|\, S(1) = s\right], \qquad (3.27)$$
where equality holds for the optimal stopping time $\tau^*$. We thus arrive at (3.26).

3.4.2 BANDITS WITH SWITCHING COST AND SWITCHING DELAY


Assumption A3.3 is necessary for the index theory. It is not difficult to show that bandits with
switching cost are special cases of the restless bandit model.
Suppose that a setup cost $c_i(s)$ is incurred when switching to arm $i$ in state $s$. A more general formulation includes, in addition to the setup cost, a tear-down cost dependent on the state of the arm being switched away from. Banks and Sundaram, 1994 [29], showed that the tear-down cost can be absorbed into the setup cost based on the simple fact that a setup cost $c_i(s)$ of arm $i$ in state $s$ must follow a tear-down cost of the same arm in the same state at some earlier time instant. It is thus without loss of generality to assume only setup costs.
The bandit problem with switching cost can be formulated as a restless bandit (without switching cost) by augmenting the state of arm $i$ at time $t$ with the active/passive action $a_i(t-1) \in \{0, 1\}$ at time $t-1$. The reward $r_i'(s(t), a_i(t-1))$ for activating arm $i$ is thus given by
$$r_i'(s, a_i(t-1) = 1) = r_i(s), \qquad r_i'(s, a_i(t-1) = 0) = r_i(s) - c_i(s), \qquad (3.28)$$
where $r_i(s)$ is the reward function for arm $i$ in the bandit problem with switching cost. Passive arms generate no reward. The augmented state, however, becomes restless, since arm $i$ in state $(s, a_i(t-1) = 1)$ at time $t$, when made passive, transits deterministically to state $(s, a_i(t) = 0)$ at time $t+1$.
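The state augmentation in (3.28) is mechanical and easy to carry out in code. The sketch below builds the active/passive reward vectors and transition matrices of the augmented (restless) arm from the original arm's reward vector, transition matrix, and setup-cost vector (given as NumPy arrays); the index convention for the augmented states is an arbitrary choice.

```python
import numpy as np

def augment_with_switching_cost(r, P, c):
    """Restless reformulation (3.28) of one arm with setup cost c(s).
    Augmented state (s, a_prev): indices 0..n-1 encode a_prev = 1, n..2n-1 encode a_prev = 0."""
    n = len(r)
    r_active = np.concatenate([r, r - c])        # setup cost only if the arm was just passive
    P_active = np.zeros((2 * n, 2 * n))
    P_active[:, :n] = np.vstack([P, P])          # after activation, a_prev becomes 1
    r_passive = np.zeros(2 * n)                  # passive arms generate no reward
    P_passive = np.zeros((2 * n, 2 * n))
    for s in range(n):
        P_passive[s, n + s] = 1.0                # (s, 1) -> (s, 0): the underlying state freezes
        P_passive[n + s, n + s] = 1.0            # (s, 0) stays (s, 0)
    return r_active, r_passive, P_active, P_passive
```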
This is a special restless bandit problem due to the specific and deterministic state transition in passivity. It has been shown by Glazebrook and Ruiz-Hernandez, 2005 [95], that this
restless bandit problem is indexable and the Whittle index can be efficiently computed via an
adaptive greedy algorithm. The optimality of the Whittle index, however, does not hold in gen-
eral. Sundaram, 2005 [186] has shown that no strongly decomposable index policies offer op-
timality for a general bandit with switching cost. A number of studies exist that either provide
sufficient conditions for index-type optimal policies or consider special cases of the problem.
See a comprehensive survey by Jun, 2004 [104].
Another form of switching penalty is in terms of a startup delay when re-initiating the
work on a project. The startup delay is modeled as a random variable dependent on the state of
the project being re-initiated. While imposing no direct switching cost, a delay affects the future
return through the discounting: all future rewards are discounted with an extra factor determined
by the state-dependent random delay. Similar to the case with switching cost, bandits with
switching delays can be formulated under the restless bandit model. The resulting bandit model,
however, becomes semi-Markovian. Studies on bandits with switching delays and switching
costs can be found in Asawa and Teneketzis, 1996 [18] and Nino-Mora, 2008 [159].

3.5 VARIATIONS IN PERFORMANCE MEASURE


3.5.1 STOCHASTIC SHORTEST PATH BANDIT
It is a well-established fact that maximizing the discounted reward with a discount factor $\beta$ over an infinite horizon is equivalent to maximizing the total undiscounted reward over a random horizon with a geometrically distributed length $T$ given by $P[T = n] = (1-\beta)\beta^{n-1}$ for $n = 1, 2, \ldots$ (see, for example, Puterman, 2005 [164]). The value of $T$ is independent of the policy deployed. The index theory thus directly applies to this type of random-horizon problem.
In this section, we show that the index theory extends to more complex random horizon
problems in which the horizon length depends on the policy. Quite often, the objective is to
maximize or minimize the expected horizon length. Such random horizon problems arise in
applications concerned with finding the most rewarding (or the least costly) path to a terminating
goal such as finding an object or accomplishing a task. The resulting bandit model is referred to
as the stochastic shortest path bandit, where the word “shortest” conforms to the convention that
edge/vertex weights represent costs and the word “stochastic” captures the random walk nature
of the problem in which our action only specifies probabilistically the next edge to traverse.

The stochastic shortest path bandit model: Before presenting the formal definition, we con-
sider the following illuminating example from Dumitriu, Tetali, and Winkler, 2003 [75].

Example 3.17 Consider a line graph with vertices $0, 1, \ldots, 5$ as shown in Figure 3.4. There are two tokens, initially at vertices 2 and 5, respectively. A valuable gift at vertex 3 can be collected if either token reaches 3. At any time, you may pay $1 for a random move made by a token of your choice. The chosen token moves to one of its neighbors with equal probability (this probability is 1 if it has a single neighbor). What is the optimal strategy for moving the tokens?

Figure 3.4: A stochastic shortest path bandit problem.
For this simple problem, it is not difficult to see that the optimal strategy is to move
the token at 2 first and then switch permanently to the other token if the game does not end
immediately. The resulting expected total cost can be calculated as follows. Let x denote the
expected total cost for reaching the prize by moving the token located at vertex 5. It satisfies the
following equation:
$$x = 1 + \frac{1}{2}\cdot 1 + \frac{1}{2}\cdot(1 + x), \qquad (3.29)$$
which leads to $x = 4$. The total expected cost of the strategy given above is thus $\frac{1}{2}\cdot 1 + \frac{1}{2}\cdot(1 + x) = 3$.
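The value 3 is easy to confirm by simulation. The following is a small Monte Carlo check of the strategy described above (move the token at vertex 2 once, then commit to the token at vertex 5 if the gift has not been collected).

```python
import random

def expected_cost(trials=200_000):
    """Monte Carlo estimate of the expected cost of the strategy in Example 3.17."""
    total = 0
    for _ in range(trials):
        cost = 1                              # one paid move of the token at vertex 2
        if random.choice([1, 3]) != 3:        # it moved away from the gift
            pos = 5                           # switch permanently to the token at vertex 5
            while pos != 3:
                cost += 1
                pos = 4 if pos == 5 else random.choice([pos - 1, pos + 1])
        total += cost
    return total / trials

print(expected_cost())   # approximately 3
```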

We now formally define the stochastic shortest path bandit model. For consistency and
without loss of generality, we pose the problem as maximizing reward rather than minimizing
cost.

Definition 3.18 The Stochastic Shortest Path Bandit Model:


Consider $N$ arms, each with state space $\mathcal{S}_i$ ($i = 1, \ldots, N$). At time $t = 1, 2, \ldots$, based on the observed states $[s_1(t), \ldots, s_N(t)]$ of all arms, one arm, say arm $i$, is activated. The active arm offers a reward $r_i(s_i(t))$ dependent on its current state and changes state according to a Markov transition rule $p_i(s, s')$ ($s, s' \in \mathcal{S}_i$). The states of all passive arms remain frozen. Each arm has a terminating state $\delta$ satisfying $r_i(\delta) = 0$ and $p_i(\delta, \delta) = 1$. The objective is to maximize the expected total (undiscounted) reward until one arm reaches its terminating state for the first time. We assume here that the expected number of steps to reach the terminating state from every state $s \in \mathcal{S}_i$ is finite for all arms.

This bandit model appeared in the literature in various special forms. The general formu-
lation was considered by Dumitriu, Tetali, and Winkler, 2003 [75], under a vividly descriptive
title of “Playing Golf with Two Balls.” The formulation was given in terms of minimizing cost.
The extended Gittins index: We take the restart-in-state characterization of the Gittins index and generalize it to the stochastic shortest path bandit model.
Consider a single arm and omit the subscript for the arm label. The index $\nu(s)$ for state $s$ is the maximum expected total reward accrued starting in state $s$ and with the option of restarting
in state $s$ at every decision time until the arm reaches its terminating state. In other words, whenever we end up in a state less advantageous (in terms of the total reward until termination) than the initial state $s$, we go back to $s$ and restart. The decision problem is to find the optimal set of such states for restarting. Let $Q \subseteq \mathcal{S}$ denote a subset of states in which we restart to $s$. The resulting Markov reward process essentially aggregates all states in $Q$ with state $s$. Its state space is given by $\mathcal{S}_Q = \mathcal{S} \setminus Q$, and the transition probabilities $\tilde{p}(u, u')$ are
$$\tilde{p}(u, u') = \begin{cases} p(u, u') & \text{if } u, u' \in \mathcal{S}_Q,\ u' \ne s \\ \sum_{w \in Q \cup \{s\}} p(u, w) & \text{if } u \in \mathcal{S}_Q,\ u' = s. \end{cases} \qquad (3.30)$$
The cumulative reward until absorption in $\delta$ of this Markov chain gives us the expected total reward $V_Q(s)$ until termination under the restart set $Q$. The index $\nu(s)$ is then obtained by optimizing the restart set $Q$:
$$\nu(s) = \max_{Q \subseteq \mathcal{S}} V_Q(s). \qquad (3.31)$$

Let $Q^*(s)$ denote the optimal restart set that attains the index value $\nu(s)$. Similar to Property 2.5, we have
$$\left\{s' : \nu(s') < \nu(s)\right\} \subseteq Q^*(s) \subseteq \left\{s' : \nu(s') \le \nu(s)\right\}, \qquad (3.32)$$
which states the intuitive fact that the optimal restart set for initial state $s$ (the states less advantageous than $s$) consists of states with indices smaller than that of $s$. This also leads to a recursive algorithm, similar to that described in Section 2.5, for computing the indices one state at a time, starting with the state with the greatest index value.
The following example, first posed by Bellman, 1957 [31], and later used by Kelly,
1979 [113], to illustrate the index theorem, fits perfectly under the stochastic shortest path
bandit model.
Example 3.19 The Gold-Mining Problem:
There are $N$ gold mines, each initially containing $x_i$ ($i = 1, 2, \ldots, N$) tons of gold. There is a single gold-mining machine. Each time the machine is used at mine $i$, there is a probability $q_i$ that it will mine a fraction $\rho_i$ of the remaining gold at the mine and remain in working order, and a probability $1 - q_i$ that it will mine no gold and be damaged beyond repair. The objective is to maximize the expected amount of gold mined before the machine breaks down.
The stochastic shortest path bandit formulation of the problem is quite straightforward: each mine corresponds to an arm with the state being the amount of remaining gold and a terminating state $\delta$ denoting the machine breaking down. When arm $i$ is activated in state $s_i$ (i.e., the machine is used at mine $i$), the reward and the transition probabilities are given by
$$r_i(s_i) = q_i \rho_i s_i, \qquad (3.33)$$
$$p_i(s_i, s_i') = \begin{cases} q_i & \text{if } s_i' = (1 - \rho_i)s_i \\ 1 - q_i & \text{if } s_i' = \delta \\ 0 & \text{otherwise.} \end{cases} \qquad (3.34)$$
Since the state and the associated reward decrease with each mining, it is easy to see that, in the restart-in-state problem, the optimal action is to restart in state $s_i$ right after each mining if the machine has not broken down. The index $\nu_i(s_i)$, the expected total reward obtained with restart until the machine breaks down, is given by
$$\nu_i(s_i) = r_i(s_i) + q_i \nu_i(s_i), \qquad (3.35)$$
which leads to
$$\nu_i(s_i) = \frac{q_i \rho_i s_i}{1 - q_i}. \qquad (3.36)$$
The optimal policy is to assign the machine to the mine $i^* = \arg\max_{i = 1, \ldots, N} \nu_i(s_i)$ with the greatest current index value.
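The index policy in (3.36) is simple enough to simulate directly. The sketch below estimates, by Monte Carlo, the expected gold collected when the machine is always assigned to the mine with the largest current index; the mine parameters are illustrative only.

```python
import random

def mine_gold(x, q, rho, trials=100_000):
    """Expected gold collected by the index policy (3.36), estimated by simulation."""
    total = 0.0
    for _ in range(trials):
        s = list(x)                      # remaining gold at each mine
        while True:
            # Assign the machine to the mine with the largest index q*rho*s/(1-q).
            i = max(range(len(s)), key=lambda j: q[j] * rho[j] * s[j] / (1 - q[j]))
            if random.random() < q[i]:
                total += rho[i] * s[i]   # successful mining
                s[i] *= (1 - rho[i])
            else:
                break                    # machine damaged beyond repair
    return total / trials

# Illustrative parameters: two mines.
print(mine_gold(x=[100.0, 80.0], q=[0.9, 0.5], rho=[0.2, 0.6]))
```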

3.5.2 AVERAGE-REWARD AND SENSITIVE-DISCOUNT CRITERIA


The average-reward criterion can be rather unselective. Assume that the states of each arm communicate well enough that the average reward does not depend on the initial state. If the objective is simply to maximize the average reward, also referred to as the gain and denoted by $g$, a simple policy that activates, at each $t$, the arm with the highest limiting reward rate suffices. Specifically, let $\{\omega_i(s)\}_{s \in \mathcal{S}_i}$ denote the unique stationary distribution of arm $i$. The gain $g_i$ (i.e., the limiting reward rate) of arm $i$ is given by
$$g_i = \sum_{s \in \mathcal{S}_i} r_i(s)\, \omega_i(s). \qquad (3.37)$$
The policy that plays the fixed arm $i^* = \arg\max_{i = 1, \ldots, N} g_i$ all through suffices to maximize the average reward.
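Computing the gain in (3.37) only requires the stationary distribution of the arm's transition matrix. A minimal sketch, with an illustrative two-state arm:

```python
import numpy as np

def gain(r, P):
    """Gain (3.37): the reward averaged under the stationary distribution of P."""
    n = len(r)
    # Solve omega P = omega together with sum(omega) = 1 (least-squares form).
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    omega, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(omega @ r)

P = np.array([[0.9, 0.1], [0.5, 0.5]])
r = np.array([1.0, 3.0])
print(gain(r, P))   # stationary distribution (5/6, 1/6) gives gain 4/3
```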
A stronger optimality criterion is to require both gain optimality and bias optimality, where the bias function $h(\mathbf{s})$ is the differential reward caused by the transient effect of starting at state $\mathbf{s} = [s_1, \ldots, s_N]$ rather than from the stationary distributions of the arms. The following modified Gittins index leads to an optimal policy under this criterion:
$$\nu_i(s) = \max_{\tau \ge 1} \frac{E\left[\sum_{t=1}^{\tau} r_i(S_i(t)) \,\middle|\, S_i(1) = s\right]}{E\left[\tau \mid S_i(1) = s\right]}. \qquad (3.38)$$

Other stronger optimality criteria such as Veinott optimality (Veinott, 1966 [202])
and Blackwell optimality (Blackwell, 1962 [41]) (i.e., simultaneous maximization of total-
discounted-reward under all discount factors sufficiently close to 1) were considered in the bandit
setting by Katehakis and Rothblum, 1996 [111]. It was shown that the decomposition structure
of the optimal policy is preserved under these criteria and can be computed through a Laurent
expansion of the Gittins index (as the value under restarting) around $\beta = 1$.
3.5.3 FINITE-HORIZON CRITERION: BANDITS WITH DEADLINES
A finite-horizon bandit model under the total-reward criterion (see (2.2)) can be generalized to
include cases where projects/arms have different deadlines for completion. We have discussed in
Section 3.1.4 that the index theory does not hold for a finite-horizon bandit, since the maximum
reward rate promised by the Gittins index may only be realized by a stopping time that exceeds
the remaining time horizon. This leads to a natural question on whether a modified Gittins index
resulted from searching over a constrained set of stopping times determined by the remaining
time horizon might offer an optimal solution. The answer, unfortunately, is negative as discussed
by Gittins and Jones, 1974 [87].
The finite-horizon problem can, however, be formulated as a restless bandit problem by
augmenting the state of each project with the remaining time to the deadline. This leads to a
special class of restless bandit problems where the state of a passive arm changes in a determinis-
tic way (i.e., the remaining time to the deadline is reduced by one at each time). The indexability
and priority index policies for this class of restless bandit problems were studied by Nino-Mora,
2011 [160].

CHAPTER 4

Frequentist Bandit Model


In this chapter, we introduce basic formulations and major results of bandit problems within the frequentist framework. Toward the end of this section, we draw connections between the Bayesian and the frequentist bandit models from an algorithmic perspective. In particular, we discuss how learning algorithms developed within one framework can be applied and evaluated in another. A success story is the Thompson Sampling algorithm. Originally developed from a Bayesian perspective (see Example 2.3), it has recently gained wide popularity as a solution to frequentist bandit problems and has enjoyed much fame due to its strong performance under frequentist criteria, as demonstrated both empirically and analytically.

4.1 BASIC FORMULATIONS AND REGRET MEASURES

Definition 4.1 The Frequentist Multi-Armed Bandit Model:


Consider an $N$-armed bandit and a single player. At each time $t = 1, \ldots, T$, the player chooses one arm to play. Successive plays of arm $i$ ($i = 1, \ldots, N$) generate independent and identically distributed random rewards $X_i(t)$ drawn from an unknown distribution $F_i(x)$ with a finite (unknown) mean $\mu_i$. The objective is to maximize the expected value of the total reward accrued over $T$ plays.

An arm-selection policy $\pi$ is a sequence of functions $\{\pi_1, \ldots, \pi_T\}$, where $\pi_t$ maps from the player's past actions and reward observations to the arm to be selected at time $t$. Let $V_\pi(T; \mathbf{F})$ denote the total expected reward obtained over $T$ plays under policy $\pi$ for a given configuration of the reward distributions $\mathbf{F} = (F_1, \ldots, F_N)$. With a slight abuse of notation, let $\pi_t$ denote the arm chosen at time $t$ under policy $\pi$, and $X_{\pi_t}(t)$ the random reward obtained at time $t$. We have
$$V_\pi(T; \mathbf{F}) = E\left[\sum_{t=1}^{T} X_{\pi_t}(t)\right] \qquad (4.1)$$
$$= \sum_{i=1}^{N}\sum_{t=1}^{T} E\left[X_{\pi_t}(t)\, I(\pi_t = i)\right] \qquad (4.2)$$
$$= \sum_{i=1}^{N} \mu_i\, E\left[\tau_i(T)\right], \qquad (4.3)$$
58 4. FREQUENTIST BANDIT MODEL
P
where i .T / D TtD1 I. t D i / is the total number of times that policy  chooses arm i , whose
dependence on F is omitted in the notation for simplicity. Note that (4.3) follows from Wald’s
identity.
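To make the model concrete, the following minimal Python sketch simulates the interaction in Definition 4.1 for Bernoulli arms and records the quantities $V_\pi(T;F)$ and $\tau_i(T)$ used throughout this chapter. It is an illustration under stated assumptions, not part of the book's material: the class and function names are ours, and any policy object exposing `select` and `update` methods (the later sketches in this chapter follow this interface) can be plugged in.

```python
import numpy as np

class BernoulliBandit:
    """A bandit instance with Bernoulli arms; the means are unknown to the player."""
    def __init__(self, means, rng=None):
        self.means = np.asarray(means, dtype=float)
        self.rng = rng or np.random.default_rng(0)

    def pull(self, i):
        # i.i.d. reward X_i(t) ~ Bernoulli(mu_i)
        return float(self.rng.random() < self.means[i])

def run(bandit, policy, T):
    """Play T rounds; return total reward, per-arm pull counts tau_i(T),
    and the per-path reward loss relative to always playing the best arm."""
    N = len(bandit.means)
    counts = np.zeros(N, dtype=int)
    total = 0.0
    for t in range(1, T + 1):
        i = policy.select(t)        # pi_t: depends only on past actions/rewards
        x = bandit.pull(i)
        policy.update(i, x)
        counts[i] += 1
        total += x
    loss = bandit.means.max() * T - np.dot(bandit.means, counts)
    return total, counts, loss
```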

4.1.1 UNIFORM DOMINANCE VS. MINIMAX

It is clear from (4.3) that the expected total reward $V_\pi(T;F)$ obtained under a policy $\pi$ depends on the unknown arm configuration $F$. This raises two issues if $V_\pi(T;F)$ is adopted as a measure of the performance of a policy $\pi$.

The first issue is how to compare policies and define optimality. A policy that always plays arm 1 would work perfectly, achieving the maximum value of $V_\pi(T;F)$, for arm configurations satisfying $\mu_1 = \max_i\{\mu_i\}$. This also makes characterizing the fundamental performance limit a trivial task: $\max_i\{\mu_i\}\,T$ is the achievable upper bound on $V_\pi(T;F)$ under every given $F$.

It goes without saying that such heavily biased strategies, which completely forgo learning, are of little interest. They should be either excluded from consideration or penalized when measuring their performance. The former leads to the uniform-dominance approach to comparing policies and defining optimality, the latter to the minimax approach.
In the uniform-dominance approach, the objective is a policy whose performance uniformly (over all possible arm configurations $F$ in a certain family $\mathcal{F}$ of distributions) dominates that of others. Such a policy, while it naturally fits the expectation for an optimal policy, does not exist in general without a proper restriction on the class of policies being considered. More specifically, the bandit formulation under the uniform-dominance approach involves specifying two sets: the class $\Pi$ of admissible policies (which an optimal policy needs to dominate) and the set $\mathcal{F}^N$ of all possible arm configurations (over which the dominance needs to hold uniformly). The more general these two sets are, the stronger the optimality statement.

In the minimax approach, every policy is admissible. The performance of a policy, however, is measured against the worst possible arm configuration $F$ specific to this policy and the horizon length $T$. The worst-case performance of a policy no longer depends on $F$. Meaningful comparison of all strategies can be carried out and optimality can be rigorously defined. This approach can also be viewed through the lens of a two-player zero-sum game where the arm configuration $F$ is chosen by nature, who plays the role of an opponent.

This brings us to the second issue with using the total expected reward $V_\pi(T;F)$ as the performance measure. The total expected reward is limited by the arm configuration $F$. A policy effective at learning the arm characteristics may still return low utility in terms of the total reward accrued, simply due to the given arm configuration. In particular, using reward as the payoff in the minimax approach leads to a trivial game: the action of nature is to set $\{\mu_i\}$ as small as possible, and the action of the player is to choose the arm for which the smallest possible mean is the greatest.

A modified objective is thus necessary, one that measures the efficacy of a policy in learning the unknown arm configuration $F$ rather than the absolute return of rewards. A natural candidate is the difference in the total expected reward against an oracle who knows the arm configuration $F$ and plays optimally using that knowledge. This leads to the notion of regret as discussed below.

4.1.2 PROBLEM-SPECIFIC REGRET AND WORST-CASE REGRET

Suppose that $F$ is known. The optimal action is obvious: choose, at each time, the arm with the greatest mean value:
$$\mu^* = \max\{\mu_1, \ldots, \mu_N\}. \tag{4.4}$$
The omniscient oracle enjoys the maximum expected reward $\mu^*$ per play and the maximum expected total reward $\mu^* T$ over the horizon of length $T$. When $F$ is unknown, it is inevitable that an inferior arm with a mean less than $\mu^*$ is played from time to time. The question is then how much the loss is as compared to the maximum expected total reward $\mu^* T$ in the case of known $F$. This cumulative reward loss is referred to as regret or cost of learning and is given by

$$R_\pi(T;F) = \mu^* T - V_\pi(T;F) \tag{4.5}$$
$$= \sum_{i:\,\mu_i < \mu^*} (\mu^* - \mu_i)\,\mathbb{E}[\tau_i(T)], \tag{4.6}$$

where (4.6) follows from (4.3) and the fact that $\sum_{i=1}^{N}\mathbb{E}[\tau_i(T)] = T$.
In the uniform-dominance approach, a policy $\pi$ in the class $\Pi$ of admissible policies is optimal if, for all admissible policies $\pi' \in \Pi$ and all arm configurations $F \in \mathcal{F}^N$, we have

$$R_\pi(T;F) \le R_{\pi'}(T;F). \tag{4.7}$$

We refer to $R_\pi(T;F)$ as the problem-specific regret to highlight its dependence on $F$ and differentiate it from the worst-case regret defined below.

In the minimax approach, the worst-case regret of a policy $\pi$ is defined as

$$\bar{R}_\pi(T;\mathcal{F}) = \sup_{F \in \mathcal{F}^N} R_\pi(T;F). \tag{4.8}$$

Note that the worst-case regret $\bar{R}_\pi(T;\mathcal{F})$ is no longer a function of $F$, but a function of the set $\mathcal{F}$ of possible reward distributions. We often omit this dependence in the notation when there is no ambiguity. A policy $\pi$ is minimax optimal if, for all $\pi'$, we have

$$\bar{R}_\pi(T;\mathcal{F}) \le \bar{R}_{\pi'}(T;\mathcal{F}).$$

A policy can be evaluated under both the uniform-dominance approach and the minimax approach. It holds trivially that its problem-specific regret is no greater than its worst-case regret. Specifically, for all $F \in \mathcal{F}^N$, we have

$$R_\pi(T;F) \le \bar{R}_\pi(T;\mathcal{F}).$$

As we will see in subsequent sections, for the same policy, these two regret measures may have drastically different scaling behaviors in $T$. It is also natural to expect that optimality under one regret measure need not transfer to the other, although policies that are order optimal under both measures do exist.
Of particular interest is the rate at which $R_\pi(T;F)$ and $\bar{R}_\pi(T;\mathcal{F})$ grow with $T$. A sublinear regret growth rate implies that the maximum expected reward $\mu^*$ per play can be approached as the learning horizon $T$ tends to infinity. Thus, all policies offering a sublinear regret order are optimal in terms of maximizing the long-term average reward. Regret, however, is a finer performance measure than the long-term time average. It not only indicates whether the long-term average reward converges to $\mu^*$, but also measures the rate of the convergence. For example, a policy with $R_\pi(T;F) \sim O(\log T)$, while converging to the same expected reward per play as a policy with $R_{\pi'}(T;F) \sim O(\sqrt{T})$, is far superior in terms of learning efficiency.

A policy $\pi$ is asymptotically optimal in terms of its problem-specific regret if for all $F \in \mathcal{F}^N$,
$$\liminf_{T\to\infty} \frac{R_\pi(T;F)}{\inf_{\pi' \in \Pi} R_{\pi'}(T;F)} = 1. \tag{4.9}$$
A policy $\pi$ is order optimal if for all $F \in \mathcal{F}^N$,
$$\liminf_{T\to\infty} \frac{R_\pi(T;F)}{\inf_{\pi' \in \Pi} R_{\pi'}(T;F)} < C(F) \tag{4.10}$$
for some constant $C(F)$ independent of $T$, but dependent on $F$ in general.

Asymptotic optimality under the minimax regret criterion can be similarly defined. For order optimality, however, the constant $C$ corresponding to that in (4.10) is no longer a function of the arm configuration $F$. Specifically, a policy $\pi$ is minimax order optimal if there exists a constant $C$ such that
$$\liminf_{T\to\infty} \frac{\bar{R}_\pi(T;\mathcal{F})}{\inf_{\pi'} \bar{R}_{\pi'}(T;\mathcal{F})} < C. \tag{4.11}$$

4.1.3 REWARD DISTRIBUTION FAMILIES AND ADMISSIBLE POLICY CLASSES

Under both the uniform-dominance approach and the minimax approach, the family of all possible reward distributions needs to be specified. For the uniform-dominance approach, the class of admissible policies also needs to be specified. Below we discuss commonly adopted distribution families and policy classes.

Family of reward distributions: Often, the specific application under consideration imposes natural restrictions on the reward distributions that the arms may assume. For example, in the clinical trial application with dichotomous responses, the random rewards from each arm are known to follow a Bernoulli distribution. There is no reason to require dominance under distributions outside this family, since generality in models often comes with a sacrifice in targeted performance.
The family $\mathcal{F}$ of possible reward distributions depends on the specific application and available prior knowledge. General approaches to specifying $\mathcal{F}$ fall under two categories: parametric and nonparametric. In the parametric setting, $\mathcal{F}$ is defined by a known distribution type $F(x;\theta)$ with an unknown parameter $\theta$ belonging to a known set $\Theta$:

$$\mathcal{F} = \{F(x;\theta) : \theta \in \Theta\}. \tag{4.12}$$

Under this setting, all arms assume the same type of distribution and differ only in the parameter $\theta$. An example is the clinical trial application where all arms assume a Bernoulli distribution with an unknown mean $\theta \in [0,1]$. The parameter $\theta$ can be a vector to allow distributions with multiple parameters, for example, a Gaussian distribution parameterized by its mean and variance.

A nonparametric characterization of $\mathcal{F}$ does not stipulate a specific distribution type, but rather imposes certain conditions, often on the concentration behavior of the distributions. Commonly considered in the literature are, from the most concentrated to the least concentrated, the set of distributions with bounded support, sub-Gaussian distributions, light-tailed distributions, and heavy-tailed distributions. Specifying $\mathcal{F}$ by imposing constraints on the concentration behavior is quite natural. A less concentrated distribution leads to less accurate inference of its mean from random samples, and is thus more resistant in terms of achievable regret performance and more demanding in terms of learning algorithm design.
Class of admissible policies: A natural way to define the class $\Pi$ of admissible policies is to consider strategies that offer good performance for all arm configurations in $\mathcal{F}^N$, hence excluding heavily biased policies.

Definition 4.2 A policy $\pi$ is consistent if for all $F \in \mathcal{F}^N$, it asymptotically achieves the highest average reward per play $\mu^*$:
$$\lim_{T\to\infty} \frac{\mathbb{E}[V_\pi(T;F)]}{T} = \mu^*. \tag{4.13}$$
Equivalently, a consistent policy $\pi$ experiences diminishing regret per play as $T$ tends to infinity:
$$\lim_{T\to\infty} \frac{\mathbb{E}[R_\pi(T;F)]}{T} = 0. \tag{4.14}$$
This is often referred to as Hannan consistency, originally introduced in an adversarial game setting (Hannan, 1957 [98]).

For a given $\alpha \in (0,1)$, a policy is $\alpha$-consistent if for all $F \in \mathcal{F}^N$, it satisfies
$$\lim_{T\to\infty} \frac{\mathbb{E}[R_\pi(T;F)]}{T^\alpha} = 0. \tag{4.15}$$
A policy that is $\alpha$-consistent for all $\alpha \in (0,1)$ is called uniformly good.

Let $\Pi_c$, $\Pi_\alpha$, and $\Pi_u$ denote, respectively, the classes of consistent, $\alpha$-consistent, and uniformly good policies. It is easy to see that $\Pi_u \subset \Pi_\alpha \subset \Pi_c$ for all $\alpha \in (0,1)$.
4.2 LOWER BOUNDS ON REGRET
In this section, we establish lower bounds on achievable regret under both the uniform-
dominance and minimax formulations. These lower bounds serve as benchmarks for determining
the asymptotic or order optimality of learning policies.

4.2.1 THE PROBLEM-SPECIFIC REGRET


Recall that Xi .t / denotes the random reward offered by arm i at time t . In the parametric
setting, Xi .t / is drawn from a univariate cumulative distribution function F .xI i / , where F .I /
is known and the parameter i is unknown belonging to a parameter space ‚. For notation
simplicity, we assume that F .xI i / is differentiable and work with the corresponding density
function f .xI i /.
R 1 Let  D .1 ; : : : ; N / denote the vector of the unknown parameters. Let .i / D
xD 1 xf .xI i /dx denote the expected reward of arm i . Let  D maxi D1;:::;N .i / and
 D arg maxi D1;:::;N .i /. The regret R .T I / of policy  can be written as
X
R .T I / D . .i // E Œi .T / ; (4.16)
i W.i /<

where we have replaced the notation R .T I F/ by R .T I / to make explicit the dependency on


 in the parametric setting.
Let D.jj/ WD D.f .xI /jjf .xI // denote the Kullback–Leibler (KL) divergence be-
tween f .xI / and f .xI / defined as
  Z 1
f .XI / f .xI /
D.jj/ D EX f .xI/ log D f .xI / log dx: (4.17)
f .X I / xD 1 f .xI /
Specifically, the KL divergence D. jj/ is the expected value of the log-likelihood ratio of a
random sample X with respect to the two distributions f .xI  / and f .xI / where the random
sample X is drawn from the first distribution f .xI  /. It measures the rate (in terms of the
number of samples) at which the two distributions can be distinguished based on samples drawn
from the first distribution. The smaller the KL divergence, the harder it is to distinguish the two
distributions. KL divergence is nonnegative (with zero value attained if and only if the two
distributions are identical) and generally asymmetric in the two distributions.
We assume the following regularity conditions on the univariate distributions and the
parameter space.
Assumption 4.3
A4.1 (Denseness of ‚:) For all ı > 0 and  2 ‚, there exists  0 2 ‚ such that . / < . 0 / <
. / C ı .
A4.2 (Identifiability via the mean:) The univariate distribution function F .I / is such that for all
;  2 ‚, if . / < ./, then 0 < D. jj/ < 1.
4.2. LOWER BOUNDS ON REGRET 63
A4.3 (Continuity of D. jj/ in :) For all  > 0 and ;  2 ‚ with . / < ./, there exists
ı > 0 such that jD. jj/ D. jj0 /j <  whenever ./  .0 /  ./ C ı .

Theorem 4.4 Lower Bound on Problem-Specific Regret in the Univariate Parametric Setting:

Under the regularity conditions A4.1–A4.3, for every $\alpha$-consistent policy $\pi$ and every arm configuration $\theta$ such that the $\mu(\theta_i)$ are not all equal, we have
$$\liminf_{T\to\infty}\frac{R_\pi(T;\theta)}{\log T} \ge (1-\alpha)\sum_{i:\,\mu(\theta_i)<\mu^*}\frac{\mu^* - \mu(\theta_i)}{D(\theta_i\|\theta^*)}. \tag{4.18}$$
For every uniformly good policy, we have
$$\liminf_{T\to\infty}\frac{R_\pi(T;\theta)}{\log T} \ge \sum_{i:\,\mu(\theta_i)<\mu^*}\frac{\mu^* - \mu(\theta_i)}{D(\theta_i\|\theta^*)}. \tag{4.19}$$

Theorem 4.4 implies that to achieve uniformly good performance over all arm configurations, each suboptimal arm with $\mu(\theta_i) < \mu^*$ needs to be explored, asymptotically, no fewer than $\frac{1}{D(\theta_i\|\theta^*)}\log T$ times, which grows logarithmically with $T$ and is inversely proportional to the distribution divergence $D(\theta_i\|\theta^*)$ between this suboptimal arm and the optimal arm.
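As a concrete illustration of Theorem 4.4, the following Python snippet (a sketch assuming Bernoulli arms; the function names are ours, not from the book) evaluates the asymptotic constant $\sum_{i:\mu(\theta_i)<\mu^*}(\mu^*-\mu(\theta_i))/D(\theta_i\|\theta^*)$ that multiplies $\log T$ in the lower bound (4.19).

```python
import math

def kl_bernoulli(p, q):
    """KL divergence D(p || q) between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12  # guard against log(0); assumes p, q in (0, 1)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lai_robbins_constant(means):
    """Constant in the lower bound (4.19): sum over suboptimal arms of
    (mu* - mu_i) / D(theta_i || theta*)."""
    mu_star = max(means)
    return sum((mu_star - m) / kl_bernoulli(m, mu_star)
               for m in means if m < mu_star)

# Example: for arms with means (0.5, 0.45, 0.3), any uniformly good policy
# must asymptotically incur at least lai_robbins_constant(...) * log T regret.
print(lai_robbins_constant([0.5, 0.45, 0.3]))
```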

Proof. The proof is based on the following key lemma, which establishes a high-probability lower bound on the number of times that every suboptimal arm has to be explored to ensure $\alpha$-consistency.

Lemma 4.5 Consider an arbitrary $\alpha$-consistent policy and an arbitrary $\theta \in \Theta^N$. For every suboptimal arm $i$ with $\mu(\theta_i) < \mu^*$, we have, for all $\epsilon > 0$,
$$\lim_{T\to\infty} P_\theta\left[\tau_i(T) \ge (1-\epsilon)\frac{(1-\alpha)\log T}{D(\theta_i\|\theta^*)}\right] = 1, \tag{4.20}$$
where $P_\theta$ denotes the probability measure induced by $\theta$ with the dependence on the policy omitted from the notation.

To prove the lemma, assume, without loss of generality, that $\mu^* = \mu(\theta_1)$ and $\mu(\theta_2) < \mu^*$ (i.e., arm 1 is an optimal arm, not necessarily unique, and arm 2 a suboptimal arm). It suffices to prove (4.20) for $i = 2$.

Fix $0 < \delta < \epsilon$. Based on A4.1–A4.3, there exists $\theta_2' \in \Theta$ such that $\mu(\theta_2') > \mu^*$ and
$$0 < D(\theta_2\|\theta_2') - D(\theta_2\|\theta_1) < \delta D(\theta_2\|\theta_1). \tag{4.21}$$
It thus suffices to show
$$\lim_{T\to\infty} P_\theta\left[\tau_2(T) < c_1\frac{(1-\alpha)\log T}{D(\theta_2\|\theta_2')}\right] = 0, \tag{4.22}$$
where $c_1 = (1-\epsilon)(1+\delta)$ satisfies $c_1 < 1$. In other words, to distinguish the arm configuration $\theta$, under which arm 1 is optimal, from $\theta' = [\theta_1, \theta_2', \theta_3, \ldots, \theta_N]$, under which arm 2 is optimal, the probability that an $\alpha$-consistent policy plays arm 2 fewer than $c_1\frac{(1-\alpha)\log T}{D(\theta_2\|\theta_2')}$ times has to diminish to zero.

Let $X(1), X(2), \ldots$ be successive observations from arm 2, where for notational simplicity we have omitted the arm index and labeled the reward samples from arm 2 with consecutive integers. Define the sum log-likelihood ratio of $m$ samples from arm 2 with respect to the two distributions parameterized by $\theta_2$ and $\theta_2'$, respectively:
$$L_m = \sum_{j=1}^{m}\log\frac{f(X(j);\theta_2)}{f(X(j);\theta_2')}. \tag{4.23}$$

Choose $c_2 \in (c_1, 1)$. We have
$$P_\theta\left[\tau_2(T) < c_1\frac{(1-\alpha)\log T}{D(\theta_2\|\theta_2')}\right] = P_\theta\left[\tau_2(T) < c_1\frac{(1-\alpha)\log T}{D(\theta_2\|\theta_2')},\ L_{\tau_2} > c_2(1-\alpha)\log T\right] + P_\theta\left[\tau_2(T) < c_1\frac{(1-\alpha)\log T}{D(\theta_2\|\theta_2')},\ L_{\tau_2} \le c_2(1-\alpha)\log T\right]. \tag{4.24}$$
Next, we show that each of the two terms on the right-hand side of the above equation tends to zero as $T$ approaches infinity. For the first term, we have
$$P_\theta\left[\tau_2(T) < c_1\frac{(1-\alpha)\log T}{D(\theta_2\|\theta_2')},\ L_{\tau_2} > c_2(1-\alpha)\log T\right] \le P_\theta\left[\max_{m \le c_1\frac{(1-\alpha)\log T}{D(\theta_2\|\theta_2')}} L_m > c_2(1-\alpha)\log T\right] \tag{4.25}$$
$$= P_\theta\left[\max_{m \le c_1(1-\alpha)\log T/D(\theta_2\|\theta_2')}\frac{L_m}{c_1(1-\alpha)\log T/D(\theta_2\|\theta_2')} > \frac{c_2}{c_1}D(\theta_2\|\theta_2')\right]. \tag{4.26}$$
By the strong law of large numbers, $\frac{1}{m}L_m$ converges to $D(\theta_2\|\theta_2')$ almost surely under $\theta$, which implies that $\frac{1}{m}\max_{k\le m}L_k$ converges to $D(\theta_2\|\theta_2')$ almost surely. Noticing that $c_2/c_1$ is bounded away from 1, we conclude from (4.26) that
$$\lim_{T\to\infty} P_\theta\left[\tau_2(T) < c_1\frac{(1-\alpha)\log T}{D(\theta_2\|\theta_2')},\ L_{\tau_2} > c_2(1-\alpha)\log T\right] = 0. \tag{4.27}$$
We now consider the second term on the right-hand side of (4.24):
$$P_\theta\left[\tau_2(T) < c_1\frac{(1-\alpha)\log T}{D(\theta_2\|\theta_2')},\ L_{\tau_2} \le c_2(1-\alpha)\log T\right]$$
$$= \mathbb{E}_\theta\left[\mathbb{I}\left(\tau_2(T) < c_1\frac{(1-\alpha)\log T}{D(\theta_2\|\theta_2')},\ L_{\tau_2} \le c_2(1-\alpha)\log T\right)\right] \tag{4.28}$$
$$= \mathbb{E}_{\theta'}\left[\mathbb{I}\left(\tau_2(T) < c_1\frac{(1-\alpha)\log T}{D(\theta_2\|\theta_2')},\ L_{\tau_2} \le c_2(1-\alpha)\log T\right)e^{L_{\tau_2}}\right] \quad\text{(change of measure)} \tag{4.29}$$
$$\le \mathbb{E}_{\theta'}\left[\mathbb{I}\left(\tau_2(T) < c_1\frac{(1-\alpha)\log T}{D(\theta_2\|\theta_2')},\ L_{\tau_2} \le c_2(1-\alpha)\log T\right)\right]e^{c_2(1-\alpha)\log T} \tag{4.30}$$
$$= T^{c_2(1-\alpha)}\,P_{\theta'}\left[\tau_2(T) < c_1\frac{(1-\alpha)\log T}{D(\theta_2\|\theta_2')},\ L_{\tau_2} \le c_2(1-\alpha)\log T\right] \tag{4.31}$$
$$\le T^{c_2(1-\alpha)}\,P_{\theta'}\left[\tau_2(T) < c_1\frac{(1-\alpha)\log T}{D(\theta_2\|\theta_2')}\right] \tag{4.32}$$
$$= T^{c_2(1-\alpha)}\,P_{\theta'}\left[T - \tau_2(T) \ge T - c_1\frac{(1-\alpha)\log T}{D(\theta_2\|\theta_2')}\right] \tag{4.33}$$
$$\le T^{c_2(1-\alpha)}\,\frac{\mathbb{E}_{\theta'}[T - \tau_2(T)]}{T - c_1\frac{(1-\alpha)\log T}{D(\theta_2\|\theta_2')}} \quad\text{(Markov inequality)} \tag{4.34}$$
$$\le T^{c_2(1-\alpha)}\,\frac{T^\alpha}{T - c_1\frac{(1-\alpha)\log T}{D(\theta_2\|\theta_2')}} \quad\text{($\alpha$-consistency)}, \tag{4.35}$$
where (4.35) follows from the fact that under arm configuration $\theta'$, where arm 2 is the unique optimal arm, an $\alpha$-consistent policy plays the $N-1$ suboptimal arms fewer than $T^\alpha$ times in expectation.

Since $c_2 < 1$, we have $c_2(1-\alpha) + \alpha < 1$, so the bound in (4.35) tends to zero. This leads to
$$\lim_{T\to\infty} P_\theta\left[\tau_2(T) < c_1\frac{(1-\alpha)\log T}{D(\theta_2\|\theta_2')},\ L_{\tau_2} \le c_2(1-\alpha)\log T\right] = 0. \tag{4.36}$$

We thus arrive at (4.22), concluding the proof of Lemma 4.5. Theorem 4.4 then follows based on (4.16). □

The lower bound under the parametric setting given in Theorem 4.4 can be extended to the nonparametric setting following a similar line of proof. We have the following result (Agrawal, 1995 [4]).

Theorem 4.6 Lower Bound on Problem-Specific Regret in the Nonparametric Setting:

Let $\mathcal{F}$ denote the set of possible arm distributions. Let $\pi$ be a uniformly good policy over $\mathcal{F}$. We have, for all $F \in \mathcal{F}^N$,
$$\liminf_{T\to\infty}\frac{R_\pi(T;F)}{\log T} \ge \sum_{i:\,\mu_i<\mu^*}\frac{\mu^* - \mu_i}{D(F_i\|(\mu^*)^+)}, \tag{4.37}$$
where, for a distribution $F \in \mathcal{F}$ and a real number $c$,
$$D(F\|c^+) = \inf_{\mu > c}\ \inf_{F' \in \mathcal{F}:\,\mu_{F'} = \mu} D(F\|F') \tag{4.38}$$
is the infimum of the KL divergence between $F$ and a distribution in $\mathcal{F}$ with a mean greater than $c$.

4.2.2 THE MINIMAX REGRET

We now establish a lower bound on the regret under the minimax approach. This result traces back to as early as Vogel, 1960b [204]. The proof provided here largely follows that in Bubeck and Cesa-Bianchi, 2012 [45].

Theorem 4.7 Lower Bound on Minimax Regret:

Let $\mathcal{F}_b$ denote the set of distributions with bounded support on $[0,1]$. For every policy $\pi$, we have
$$\bar{R}_\pi(T;\mathcal{F}_b) \ge \frac{1}{20}\sqrt{NT}. \tag{4.39}$$

The minimax nature of the formulation implies that the above lower bound holds for all distribution sets that include the bounded-support distributions, for example, sub-Gaussian, light-tailed, and heavy-tailed distributions. The question is in the achievability of this lower bound for these more general sets of distributions.

Techniques for establishing a lower bound on the minimax regret are quite different from those for the problem-specific regret. The intuition behind the order $O(\sqrt{NT})$ lower bound on the minimax regret is as follows. The Chernoff–Hoeffding inequality states that for any given confidence level, the sample mean calculated from $n$ independent samples concentrates around the true mean in a confidence interval with length proportional to $\sqrt{1/n}$. For an $N$-armed bandit with a horizon length $T$, at least one arm is pulled no more than $T/N$ times, resulting in a confidence interval of length at least in the order of $\sqrt{N/T}$. In other words, if the mean values of the best arm and the second best arm differ in the order of $\sqrt{N/T}$, these two arms cannot be differentiated with a diminishing probability of error. The resulting regret is in the order of $T\sqrt{N/T} = \sqrt{NT}$.
Proof. In establishing a lower bound on the minimax regret, it suffices to consider a specific distribution type in the set under consideration. The choice here is the Bernoulli distribution $B_p$, where $p$ denotes the mean. Based on the intuition stated above, we consider a bandit problem in which one arm (the best arm) is $B_{1/2+\epsilon}$ and the remaining $N-1$ arms are $B_{1/2}$. The worst-case regret of every policy is lower bounded by the regret incurred in this bandit problem. We then show that setting $\epsilon$ to be proportional to $\sqrt{N/T}$ maximizes the regret, thus leading to the tightest lower bound.

Let $F_{(i)}$ denote the arm configuration under which arm $i$ is $B_{1/2+\epsilon}$ and the remaining $N-1$ arms are $B_{1/2}$. Let $\mathbb{E}_{(i)}$ and $P_{(i)}$ denote, respectively, the expectation and the probability measure with respect to $F_{(i)}$. We have, for an arbitrary policy $\pi$,
$$\bar{R}_\pi(T) \ge \max_{i=1,\ldots,N} R_\pi(T;F_{(i)}) \tag{4.40}$$
$$\ge \frac{1}{N}\sum_{i=1}^{N} R_\pi(T;F_{(i)}) \tag{4.41}$$
$$= \frac{\epsilon}{N}\sum_{i=1}^{N}\left(T - \mathbb{E}_{(i)}[\tau_i(T)]\right), \tag{4.42}$$
where $\tau_i(T)$ denotes the total number of times that arm $i$ is pulled under $\pi$. In the following, we omit its dependence on $T$ from the notation. The inequality (4.41) follows from the fact that the maximum is no smaller than the average, and the equality (4.42) follows from (4.6).

Next, we upper bound $\mathbb{E}_{(i)}[\tau_i]$ ($i = 1, \ldots, N$). The key idea is to use an arm configuration $F_{(0)}$ with all arms following $B_{1/2}$ as the benchmark and show that $\tau_i$ under $F_{(i)}$ and $F_{(0)}$ are sufficiently close in expectation.

Let $\mathbb{E}_{(0)}$ and $P_{(0)}$ be the afore-defined notations under the benchmark configuration $F_{(0)}$. Let $\mathbf{X} = (X_{\pi_1}(1), X_{\pi_2}(2), \ldots, X_{\pi_T}(T))$ denote the random reward sequence seen by policy $\pi$. For a deterministic policy $\pi$, the first arm selection $\pi_1$ is predetermined, and the arm selection $\pi_t$ at time $t$ is determined by the past seen rewards $X_{\pi_1}(1), \ldots, X_{\pi_{t-1}}(t-1)$. Thus, a given realization $\mathbf{x} = (x(1), \ldots, x(T))$ of the reward sequence determines the sequence of arm selection actions $(\pi_1, \ldots, \pi_T)$, hence the value of $\tau_i$. The probability of $\mathbf{X} = \mathbf{x}$ under $F_{(i)}$ is given by
$$P_{(i)}[\mathbf{X}=\mathbf{x}] = P_{(i)}[X_{\pi_1}(1) = x(1)]\prod_{t=2}^{T} P_{(i)}\left[X_{\pi_t}(t) = x(t)\,\middle|\,X_{\pi_1}(1) = x(1), \ldots, X_{\pi_{t-1}}(t-1) = x(t-1)\right] \tag{4.43}$$
$$= \prod_{t=1}^{T}\left(B_{1/2}(x(t))\,\mathbb{I}(\pi_t \ne i) + B_{1/2+\epsilon}(x(t))\,\mathbb{I}(\pi_t = i)\right). \tag{4.44}$$
Note that for a fixed reward sequence $\mathbf{x}$, the sequence of actions $(\pi_1, \ldots, \pi_T)$ under $\pi$ is fixed. The indicator functions in (4.44) hence take fixed values.

Similarly, under $F_{(0)}$, we have
$$P_{(0)}(\mathbf{x}) = \prod_{t=1}^{T} B_{1/2}(x(t)). \tag{4.45}$$
We now bound the difference between $\mathbb{E}_{(i)}[\tau_i]$ and $\mathbb{E}_{(0)}[\tau_i]$ as follows. Let $\tau_i(\mathbf{x})$ denote the total number of plays on arm $i$ under a given realization $\mathbf{x}$ of the reward sequence:
$$\mathbb{E}_{(i)}[\tau_i] - \mathbb{E}_{(0)}[\tau_i] = \sum_{\mathbf{x}}\tau_i(\mathbf{x})\left(P_{(i)}(\mathbf{x}) - P_{(0)}(\mathbf{x})\right) \tag{4.46}$$
$$\le \sum_{\mathbf{x}:\,P_{(i)}(\mathbf{x}) \ge P_{(0)}(\mathbf{x})}\tau_i(\mathbf{x})\left(P_{(i)}(\mathbf{x}) - P_{(0)}(\mathbf{x})\right) \tag{4.47}$$
$$\le T\sum_{\mathbf{x}:\,P_{(i)}(\mathbf{x}) \ge P_{(0)}(\mathbf{x})}\left(P_{(i)}(\mathbf{x}) - P_{(0)}(\mathbf{x})\right) \tag{4.48}$$
$$\le T\,\|P_{(i)} - P_{(0)}\|_{TV}, \tag{4.49}$$
where $\|P_{(i)} - P_{(0)}\|_{TV}$ denotes the total variation distance between the two distributions. Pinsker's inequality states that
$$\|P_{(i)} - P_{(0)}\|_{TV} \le \sqrt{\tfrac{1}{2}D(P_{(0)}\|P_{(i)})}. \tag{4.50}$$
The KL divergence $D(P_{(0)}\|P_{(i)})$, based on (4.45) and (4.44), can be written as
$$D(P_{(0)}\|P_{(i)}) = \mathbb{E}_{(0)}\left[\log\frac{P_{(0)}(\mathbf{X})}{P_{(i)}(\mathbf{X})}\right] \tag{4.51}$$
$$= \mathbb{E}_{(0)}\left[\sum_{t=1}^{T}\left(\log\frac{B_{1/2}(X(t))}{B_{1/2}(X(t))}\,\mathbb{I}(\pi_t \ne i) + \log\frac{B_{1/2}(X(t))}{B_{1/2+\epsilon}(X(t))}\,\mathbb{I}(\pi_t = i)\right)\right] \tag{4.52}$$
$$= \mathbb{E}_{(0)}\left[\sum_{t=1}^{T}\log\frac{B_{1/2}(X(t))}{B_{1/2+\epsilon}(X(t))}\,\mathbb{I}(\pi_t = i)\right] \tag{4.53}$$
$$= \frac{1}{2}\log\frac{1}{1-4\epsilon^2}\,\mathbb{E}_{(0)}[\tau_i]. \tag{4.54}$$
This leads to
$$\mathbb{E}_{(i)}[\tau_i] \le \mathbb{E}_{(0)}[\tau_i] + \frac{T}{2}\sqrt{\log\frac{1}{1-4\epsilon^2}\,\mathbb{E}_{(0)}[\tau_i]}. \tag{4.55}$$
Substituting this upper bound on $\mathbb{E}_{(i)}[\tau_i]$ into (4.42), we have
$$\bar{R}_\pi(T) \ge \frac{\epsilon}{N}\sum_{i=1}^{N}\left(T - \mathbb{E}_{(0)}[\tau_i] - \frac{T}{2}\sqrt{\log\frac{1}{1-4\epsilon^2}\,\mathbb{E}_{(0)}[\tau_i]}\right) \tag{4.56}$$
$$= \epsilon T - \frac{\epsilon}{N}\sum_{i=1}^{N}\mathbb{E}_{(0)}[\tau_i] - \frac{\epsilon T}{2N}\sqrt{\log\frac{1}{1-4\epsilon^2}}\sum_{i=1}^{N}\sqrt{\mathbb{E}_{(0)}[\tau_i]} \tag{4.57}$$
$$\ge \epsilon T - \frac{\epsilon}{N}T - \frac{\epsilon T}{2N}\sqrt{\log\frac{1}{1-4\epsilon^2}\,NT}, \tag{4.58}$$
where the last line was obtained using the fact that $\sum_{i=1}^{N}\mathbb{E}_{(0)}[\tau_i] = T$ and $\sum_{i=1}^{N}\sqrt{\mathbb{E}_{(0)}[\tau_i]} \le \sqrt{N\sum_{i=1}^{N}\mathbb{E}_{(0)}[\tau_i]}$ (the latter can be shown by applying the Cauchy–Schwarz inequality to the vectors $(1, 1, \ldots, 1)$ and $(\sqrt{\mathbb{E}_{(0)}[\tau_1]}, \ldots, \sqrt{\mathbb{E}_{(0)}[\tau_N]})$). We then arrive at the theorem by setting $\epsilon = \frac{1}{4}\sqrt{\frac{N}{T}}$ and using the inequality $\log\frac{1}{1-y} \le 4\log(\tfrac{4}{3})\,y$ for $y \in [0, \tfrac{1}{4}]$.

For a randomized policy, the arm selection action at time $t$ is random, drawn from a distribution determined by past reward observations. It can be equivalently viewed as following a deterministic policy $\pi = (\pi_1, \ldots, \pi_T)$ with a certain probability determined by the action distributions at each time in a product form. The lower bound thus applies to a randomized policy since it holds for each realization (a deterministic policy) of the randomized policy. □


4.3 ONLINE LEARNING ALGORITHMS

In this section, we present representative online learning algorithms that are asymptotically optimal or order optimal.

4.3.1 ASYMPTOTICALLY OPTIMAL POLICIES

Lai and Robbins developed a general method for constructing asymptotically optimal policies, i.e., policies that attain the asymptotic lower bound on the problem-specific regret given in Theorem 4.4.

Under the Lai–Robbins policy, two statistics are maintained for each arm. One is a point estimate $\hat\mu_i(t)$ of the mean of each arm $i$ based on past observations. This statistic is used to ensure that an apparently superior arm is sufficiently exploited. The other statistic is the so-called upper confidence bound $U_i(t)$, which represents the potential of an arm: the less frequently an arm is played, the less confidence we have in its estimated mean, and the higher the potential of this arm. This statistic is used to ensure that each seemingly inferior arm is sufficiently explored to achieve a necessary level of confidence in the assertion of its inferiority. The upper confidence bound (UCB) of each arm depends not only on the number of plays of that arm but also on the current time $t$ in order to measure the frequency of exploration.

Based on these two statistics, the Lai–Robbins policy operates as follows. At each time $t$, the arm to be activated is chosen from two candidates: the leader $l(t)$, which is the arm with the largest estimated mean among all "well-sampled" arms, defined as arms that have been played at least $\delta t$ times for some predetermined constant $0 < \delta < \frac{1}{N}$, and the round-robin candidate $r(t)$, which rotates among all arms. The leader $l(t)$ is played if its estimated mean exceeds the UCB of the round-robin candidate $r(t)$. Otherwise, the round-robin candidate $r(t)$ is chosen. This policy is a modification of the follow-the-leader rule, which greedily exploits without exploration.

What remains is to specify the point estimate $\hat\mu_i(t)$ of the mean and the UCB $U_i(t)$ for all $t = 1, 2, \ldots$ and $i = 1, \ldots, N$.
Algorithm 4.2 Lai–Robbins Policy
Notation: $\tau_i(t)$: the number of times that arm $i$ is pulled in the first $t$ plays.
Notation: $\hat\mu_i(t)$: the point estimate of the mean of arm $i$ at time $t$.
Notation: $U_i(t)$: the upper confidence bound of arm $i$ at time $t$.
Parameter: $\delta \in (0, 1/N)$.

1: Initialization: pull each arm once in the first $N$ rounds.
2: for $t = N+1$ to $T$ do
3:   Identify the leader $l(t)$ among arms that have been pulled at least $(t-1)\delta$ times:
     $l(t) = \arg\max_{i:\,\tau_i(t-1) \ge (t-1)\delta}\hat\mu_i(t-1)$.
4:   Identify the round-robin candidate $r(t) = ((t-1) \bmod N) + 1$.
5:   if $\hat\mu_{l(t)}(t-1) > U_{r(t)}(t-1)$ then
6:     Pull arm $l(t)$.
7:   else
8:     Pull arm $r(t)$.
9:   end if
10: end for
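The following Python skeleton (a sketch; all names are ours, and the arguments `h` and `g` stand for the generic estimator and UCB functions defined next) mirrors the control flow of Algorithm 4.2. It works with any environment exposing a `pull(i)` method, such as the BernoulliBandit sketch earlier in this chapter; a concrete Bernoulli choice of `h` and `g` appears in Example 4.9 below.

```python
import numpy as np

def lai_robbins(bandit, T, N, h, g, delta):
    """Skeleton of Algorithm 4.2. `h(samples)` returns the mean estimate;
    `g(t, samples)` returns the upper confidence bound; delta in (0, 1/N)."""
    samples = [[] for _ in range(N)]          # rewards observed from each arm
    for i in range(N):                        # initialization: pull each arm once
        samples[i].append(bandit.pull(i))
    for t in range(N + 1, T + 1):
        tau = [len(s) for s in samples]
        well_sampled = [i for i in range(N) if tau[i] >= (t - 1) * delta]
        leader = max(well_sampled, key=lambda i: h(samples[i]))
        rr = (t - 1) % N                      # round-robin candidate
        arm = leader if h(samples[leader]) > g(t - 1, samples[rr]) else rr
        samples[arm].append(bandit.pull(arm))
    return samples
```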

Let $\tau_i(t)$ denote the number of times that arm $i$ has been played up to and including time $t$, and let $X_i(1), \ldots, X_i(\tau_i(t))$ denote the random rewards from these $\tau_i(t)$ plays of arm $i$. Let
$$\hat\mu_i(t) = h_{\tau_i(t)}\left(X_i(1), \ldots, X_i(\tau_i(t))\right) \tag{4.59}$$
$$U_i(t) = g_{t,\tau_i(t)}\left(X_i(1), \ldots, X_i(\tau_i(t))\right), \tag{4.60}$$
where $\{h_k\}_{k\ge 1}$ and $\{g_{t,k}\}_{t\ge 1,\,1\le k\le t}$ are measurable functions. Note that the point estimate and the UCB may take different functional forms for different sample sizes $k$ and at different times $t$.

Lai and Robbins did not give an explicit construction of $\{h_k\}_{k\ge 1}$ and $\{g_{t,k}\}_{t\ge 1,\,1\le k\le t}$ for general reward distributions, but rather provided sufficient conditions on these functions for achieving asymptotic optimality.
Condition 4.8 Lai–Robbins Conditions on the Mean Estimator and Upper Confidence Bound:

(i) Strong consistency of the mean estimator for a well-sampled arm:
$$P\left[\max_{\delta t \le k \le t}\left|h_k(X(1), \ldots, X(k)) - \mu(\theta)\right| > \epsilon\right] = o(t^{-1}) \tag{4.61}$$
for all $\theta \in \Theta$ and for all $\epsilon > 0$ and $0 < \delta < 1$.

(ii) UCB lower bounded by the mean estimate:
$$g_{t,k}(x(1), \ldots, x(k)) \ge h_k(x(1), \ldots, x(k)) \tag{4.62}$$
for all $t \ge k$ and for all $(x(1), \ldots, x(k))$.

(iii) UCB falling below the true mean with diminishing probability:
$$P\left[g_{t,k}(X(1), \ldots, X(k)) \ge r \text{ for all } k \le t\right] = 1 - o(t^{-1}) \tag{4.63}$$
for all $\theta \in \Theta$ and for all $r < \mu(\theta)$.

(iv) Monotonicity of the UCB in $t$: for each fixed $k = 1, 2, \ldots$,
$$g_{t+1,k}(x(1), \ldots, x(k)) \ge g_{t,k}(x(1), \ldots, x(k)) \tag{4.64}$$
for all $t \ge k$ and for all $(x(1), \ldots, x(k))$.

(v) Asymptotically optimal exploration of suboptimal arms:
$$\lim_{\epsilon \downarrow 0}\ \limsup_{t\to\infty}\ \frac{1}{\log t}\sum_{k=1}^{t} P\left[g_{t,k}(X(1), \ldots, X(k)) \ge \mu(\theta') - \epsilon\right] \le \frac{1}{D(\theta\|\theta')} \tag{4.65}$$
for all $\theta \in \Theta$ and for all $\theta' \in \Theta$ satisfying $\mu(\theta') > \mu(\theta)$.

The above conditions are relatively intuitive. In particular, Condition (i) ensures that the leader, determined by comparing estimated mean values among well-sampled arms, is indeed the best arm with high probability as the number of samples grows. For all distributions with a finite second moment, this condition is satisfied by the simple sample-mean estimator (see the proof of Theorem 1 of Chow and Lai, 1975 [63]):
$$h_k(X(1), \ldots, X(k)) = \frac{1}{k}\sum_{j=1}^{k} X(j). \tag{4.66}$$
Conditions (ii)–(iv) ensure that every arm is sufficiently explored by dictating that the UCB never falls below the true mean with high probability as time goes on and that the UCB of an arm kept passive grows with time. Condition (v) is crucial to ensuring that the exploration of an inferior arm $i$ with $\mu(\theta_i) < \mu(\theta^*)$ occurs no more than $(\log t)/D(\theta_i\|\theta^*)$ times, asymptotically and in expectation, as dictated by the lower bound.

While the point estimator $h_k$ of the mean can be set to the sample mean in most cases, the construction of a UCB $g_{t,k}$ satisfying Conditions (ii)–(v) can be complex. Lai and Robbins, 1985 [126], gave constructions for four distributions: Bernoulli, Poisson, Gaussian, and Laplace. For the first two distributions, the mean value determines the entire distribution. For the last two, it is assumed that all arms have the same variance and that it is known. For all four distributions, the point estimate of the mean is set to the maximum likelihood estimate (which is the sample mean for Bernoulli, Poisson, and Gaussian, and the sample median for Laplace), and the UCB is constructed based on generalized likelihood ratios. We illustrate the construction for the Bernoulli distribution in the example below.

Example 4.9 Lai–Robbins Policy for Bernoulli Arms:

Assume that arm $i$ generates i.i.d. Bernoulli rewards with unknown parameter $\theta_i$ ($i = 1, \ldots, N$). We thus have $\mu(\theta) = \theta$ and $\Theta = (0,1)$. The KL divergence $D(\theta\|\lambda)$ is given by
$$D(\theta\|\lambda) = \theta\log\frac{\theta}{\lambda} + (1-\theta)\log\frac{1-\theta}{1-\lambda}. \tag{4.67}$$
It is easy to verify that the regularity conditions in Assumption 4.3 for the lower bound in Theorem 4.4 are satisfied.

At time $t$, the mean estimate $\hat\mu_i(t)$ and the UCB $U_i(t)$ of arm $i$ are given by
$$\hat\mu_i(t) = \frac{1}{\tau_i(t)}\sum_{j=1}^{\tau_i(t)} X_i(j), \tag{4.68}$$
$$U_i(t) = \inf\left\{\lambda \in (0,1) : \lambda \ge \hat\mu_i(t) \text{ and } D(\hat\mu_i(t)\|\lambda) \ge \frac{\log t}{\tau_i(t)}\right\}. \tag{4.69}$$

The above construction of the UCB $U_i(t)$ is based on the stochastic behavior of the generalized log-likelihood ratio $\sum_{j=1}^{k}\log\frac{f(X(j);\hat\theta_{j-1})}{f(X(j);\lambda)}$, where $\hat\theta_j = \hat\theta_j(X(1), \ldots, X(j))$ is an estimate of $\theta$ based on $j$ random samples. It satisfies Conditions (ii)–(v) for the Bernoulli as well as the Gaussian and Poisson distributions, as shown by Lai and Robbins, 1985 [126]. We point out that Lai and Robbins considered a general lower-bound sequence in place of $\frac{\log t}{\tau_i(t)}$ in (4.69) that satisfies a set of conditions ensuring the asymptotic optimality of the resulting policy. To simplify the presentation and give an explicit implementation, we choose in (4.69) the sequence $\frac{\log t}{\tau_i(t)}$, which can be easily shown to satisfy the conditions for the lower-bound sequence. While this choice ensures asymptotic optimality, fine tuning may lead to better finite-time performance.

We omit the detailed derivation here. The intuition behind the above constructed UCB is that it is an overestimate of the mean as compared to the sample mean (the first condition on the right-hand side of (4.69)) and that its divergence from the sample mean grows with $\frac{\log t}{\tau_i(t)}$, a measure of the infrequency of exploration, to ensure asymptotically optimal exploration of a suboptimal arm.

The implicit characterization of $U_i(t)$ as an infimum may seem worrisome when it comes to implementation. Fortunately, its explicit value is not necessary for implementing the Lai–Robbins policy. Since $D(\hat\mu_i(t)\|\lambda)$ is a convex function of $\lambda$ with its minimum at $\lambda = \hat\mu_i(t)$, it is increasing in $\lambda$ for $\lambda \ge \hat\mu_i(t)$. As a result, the condition $\hat\mu_{l(t)}(t) \ge U_{r(t)}(t)$ between the point estimate of the leader $l(t)$ and the UCB of the round-robin candidate $r(t)$ (Line 5 of Algorithm 4.2) is equivalent to
$$\hat\mu_{l(t)}(t) \ge \hat\mu_{r(t)}(t) \quad\text{and}\quad D\left(\hat\mu_{r(t)}(t)\,\|\,\hat\mu_{l(t)}(t)\right) > \frac{\log t}{\tau_{r(t)}(t)}, \tag{4.70}$$
which can be checked using only the mean estimates $\hat\mu_{l(t)}(t)$ and $\hat\mu_{r(t)}(t)$. Note that $D(\hat\mu_{r(t)}(t)\|\hat\mu_{l(t)}(t))$ can be easily computed by plugging the mean estimates into (4.67).
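The equivalent test (4.70) makes the Bernoulli case easy to implement. Below is a small Python sketch (ours, not from [126]) of the leader-versus-round-robin decision; it relies on the closed-form Bernoulli KL divergence (4.67).

```python
import math

def kl_bernoulli(p, q):
    """Bernoulli KL divergence D(p || q) as in (4.67); assumes p, q in (0, 1)."""
    eps = 1e-12
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def play_leader(mu_leader, mu_rr, tau_rr, t):
    """Decision rule (4.70): return True to pull the leader, False to pull the
    round-robin candidate with sample mean mu_rr and pull count tau_rr."""
    return (mu_leader >= mu_rr and
            kl_bernoulli(mu_rr, mu_leader) > math.log(t) / tau_rr)
```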

The construction of the UCB in the Lai–Robbins policy as given in (4.69) recently reappeared in the KL-UCB policy for Bernoulli arms (Garivier and Cappé, 2011 [84], Maillard, Munos, and Stoltz, 2011 [140], Cappé et al., 2013 [54]). Specifically, the KL-UCB policy selects the following arm to play at time $t$:
$$a(t) = \arg\max_{i=1,\ldots,N}\ \sup\left\{\lambda \in (0,1) : D(\hat\mu_i(t)\|\lambda) \le \frac{\log t}{\tau_i(t)}\right\}. \tag{4.71}$$
The equivalence of (4.71) to the construction of $U_i(t)$ in (4.69) can be easily seen from the fact that $D(\hat\mu_i(t)\|\lambda)$ is increasing in $\lambda$ for $\lambda \ge \hat\mu_i(t)$: the infimum over $\lambda \ge \hat\mu_i(t)$ satisfying $D(\hat\mu_i(t)\|\lambda) \ge \frac{\log t}{\tau_i(t)}$ thus equals the supremum over $\lambda$ satisfying $D(\hat\mu_i(t)\|\lambda) \le \frac{\log t}{\tau_i(t)}$. The only difference between these two policies is that the KL-UCB policy does not maintain a round-robin candidate. This modification was also considered by Lai in 1987 [125], who showed that the modified rule of selecting the arm with the highest UCB achieves asymptotic optimality for distributions in the exponential family. The order optimality of the KL-UCB policy for distributions with bounded support (other than Bernoulli) was also studied in [54, 84, 140].
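Since $D(\hat\mu\|\lambda)$ is increasing in $\lambda$ on $[\hat\mu, 1)$, the supremum in (4.71) can be computed numerically by bisection. The following Python sketch (an illustration under that observation; names are ours) computes the KL-UCB index of one Bernoulli arm, reusing the kl_bernoulli function defined above.

```python
import math

def kl_ucb_index(mu_hat, tau, t, tol=1e-6):
    """Largest lambda in (0,1) with D(mu_hat || lambda) <= log(t)/tau,
    found by bisection on [mu_hat, 1); implements the sup in (4.71)."""
    level = math.log(t) / tau
    lo, hi = mu_hat, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(mu_hat, mid) <= level:   # kl_bernoulli as defined above
            lo = mid
        else:
            hi = mid
    return lo

# At each time t, KL-UCB pulls arg max_i kl_ucb_index(mu_hat[i], tau[i], t).
```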

4.3.2 ORDER-OPTIMAL POLICIES
Asymptotically optimal policies achieve not only the optimal logarithmic regret order, but also
the best leading constant as specified by the lower bounds as given in Theorems 4.4 and 4.6.
Achieving asymptotic optimality requires a fine control on how often each suboptimal arm
should be explored, which depends on the KL divergence between the reward distributions
of a suboptimal arm and an optimal arm. If one is satisfied with order optimality, however,
the problem is much more relaxed. Several simple and intuitive approaches suffice as discussed
below.
Open-loop control of exploration-exploitation tradeoff A rather immediate solution is to
exert an open-loop control on when to explore each of the N arms. Specifically, the time hori-
zon is partitioned into two interleaving sequences: an exploration sequence and an exploitation
sequence. At time instants belonging to the exploration sequence, each of the N arms has an
equal chance to be selected, independent of past reward observations. This can be done either
in a round-robin fashion or probabilistically by choosing an arm uniformly at random. At time
instants in the exploitation sequence, the arm with the greatest sample mean (or a properly cho-
sen point estimator of the mean) is played. This approach separates in time the two objectives of
exploration and exploitation, and the tradeoff between these two is reflected in the cardinality
of the exploration sequence (equivalently, the cardinality of the exploitation sequence).
The partition of the time horizon into exploration and exploitation sequences can be done
deterministically or probabilistically. In a deterministic partition, the sequence of time instants
for exploration is explicitly specified. This approach was first suggested by Robbins, 1952 [168],
when the frequentist version of the bandit problems was first posed, and later studied in detail
by Agrawal and Teneketzis, 1989 [8], Yakowitz and Lowe, 1991 [215], and Vakili, Liu, and
Zhao, 2013 [194]. This approach was referred to as Deterministic Sequencing of Exploration
and Exploitation (DSEE) in [194]. In a probabilistic partition, each time instant t is assigned
to the exploration or the exploitation sequence with probability  and 1  , respectively. This
approach is known as the  -greedy policy (see Sutton and Barto, 1998 [187] and Auer, Cesa-
Bianchi, and Fischer, 2002 [23]).
What remains to be specified is the cardinality of the exploration sequence. On the one
hand, the regret order is lower bounded by the cardinality of the exploration sequence since a
fixed fraction of the exploration sequence is spent on suboptimal arms. On the other hand, the
exploration sequence needs to be chosen sufficiently dense to ensure effective learning of the
best arm. The key issue here is to find the minimum cardinality of the exploration sequence that
ensures a reward loss in the exploitation sequence caused by an incorrectly identified best arm
having an order no greater than the cardinality of the exploration sequence. The lower bound
established by Lai and Robbins (Theorem 4.4) shows that a consistent policy must explore each
arm at least in the order of log T times. This turns out to be also sufficient for policies with an
open-loop control of the exploration-exploitation tradeoff, leading to the optimal logarithmic
regret order. For the $\epsilon$-greedy policy, setting $\epsilon$ as a function of time $t$ that diminishes to zero at a rate of $1/t$ gives a (random) exploration sequence with an (expected) cardinality in the order of $\log T$.

Compared with the Lai–Robbins policy, $\epsilon$-greedy and DSEE are much simpler and more general due to their nonparametric setting. This comparison highlights the often formidable gap between achieving asymptotic optimality and mere order optimality. To achieve asymptotic optimality, the exploration of each suboptimal arm $i$ needs to be finely controlled to ensure not only that the order of the exploration is $\log T$, but also that the leading constant converges to $1/D(\theta_i\|\theta^*)$. For order optimality, however, it is not necessary to differentiate suboptimal arms in exploration, and it suffices to explore all arms equally often, provided that the order of the exploration time is logarithmic with a sufficiently large leading constant. Setting the leading constant in the cardinality of the exploration sequence often requires a lower bound on the mean difference between the best and the second-best arms. Setting it to an arbitrarily slowly diverging sequence circumvents this issue when such prior information is unavailable (see details in [194]).
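As an illustration of the probabilistic partition, the following Python sketch implements an $\epsilon$-greedy policy with $\epsilon_t$ decaying as $c/t$; the constant `c`, the class name, and the interface are our choices for illustration (see [23] for the analyzed schedule), and the `select`/`update` interface matches the simulation sketch given earlier in this chapter.

```python
import numpy as np

class EpsilonGreedy:
    """epsilon-greedy with epsilon_t = min(1, c / t): explore uniformly with
    probability epsilon_t, otherwise exploit the arm with the best sample mean."""
    def __init__(self, N, c=5.0, rng=None):
        self.N, self.c = N, c
        self.rng = rng or np.random.default_rng(0)
        self.counts = np.zeros(N, dtype=int)
        self.sums = np.zeros(N)

    def select(self, t):
        eps_t = min(1.0, self.c / t)
        if self.rng.random() < eps_t:
            return int(self.rng.integers(self.N))     # explore
        means = np.where(self.counts > 0,
                         self.sums / np.maximum(self.counts, 1), np.inf)
        return int(np.argmax(means))                  # exploit

    def update(self, i, x):
        self.counts[i] += 1
        self.sums[i] += x
```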

Index-type policies: The computation of the Lai–Robbins policy can be complex for general reward distributions, requiring the entire sequence of past rewards received from each arm. This motivated Agrawal, 1995 [4], to develop index-type policies that use only certain simple statistics of the reward sequences. Two key statistics used to prioritize arms are the sample mean $\hat\mu_i(t)$ calculated from past observations and the number $\tau_i(t)$ of times that arm $i$ has been played in the past $t$ plays. The larger $\hat\mu_i(t)$ is or the smaller $\tau_i(t)$ is, the higher the priority given to this arm in arm selection. The tradeoff between exploration and exploitation is reflected in how these two statistics are combined for arm selection.

Agrawal, 1995 [4], developed order-optimal index policies for several distribution types, in which an index $I_i(t+1)$ is computed at time $t+1$ (i.e., after $t$ plays) for each arm and the arm with the largest index is chosen. The index has the following forms:
$$I_i(t+1) = \begin{cases}
\hat\mu_i(t) + \left(\dfrac{2\log t + 2\log\log t}{\tau_i(t)}\right)^{1/2}, & \text{Gaussian} \\[6pt]
\hat\mu_i(t) + \min\left\{\dfrac{1}{2}\sqrt{\dfrac{2\log t + 4\log\log t}{\tau_i(t)}},\ 1\right\}, & \text{Bernoulli} \\[6pt]
\hat\mu_i(t) + \min\left\{\dfrac{1}{2}\sqrt{\dfrac{2a\log t + 4a\log\log t}{\tau_i(t)}},\ a\right\}, & \text{Poisson} \\[6pt]
\hat\mu_i(t) + b\,\min\left\{\sqrt{\dfrac{2\log t + 4\log\log t}{\tau_i(t)}},\ 1\right\}, & \text{Exponential}
\end{cases} \tag{4.72}$$
where the constants $a$ and $b$ are, respectively, upper bounds on the possible parameter values in the corresponding families of Poisson and exponential reward distributions. The index for Gaussian distributions (with known variance) achieves asymptotic optimality.

These index forms for different distribution types share the same structure: the sample mean plus a UCB term determined by $\tau_i(t)$. Specifically, the second term of the index ensures that each arm is explored in the order of $\log t$ times, as dictated by the lower bound. For an arm sampled in an order smaller than $\log t$, its index, dominated by the second term, will be sufficiently large for large $t$ to ensure further exploration.
Auer, Cesa-Bianchi, and Fischer, 2002 [23], modified Agrawal's sample-mean-based index policies to offer order optimality for the family $\mathcal{F}_b$ of distributions with bounded support (normalized to $[0,1]$). This is the much celebrated UCB-$\alpha$ policy with the following index form, similar to those in (4.72):
$$I_i(t+1) = \hat\mu_i(t) + \sqrt{\frac{\alpha\log t}{\tau_i(t)}}, \tag{4.73}$$
where $\alpha > 1$ is a policy parameter, fixed to 2 in the UCB1 policy of Auer, Cesa-Bianchi, and Fischer, 2002 [23].
Algorithm 4.3 The UCB-$\alpha$ Policy
Notation: $\tau_i(t)$: the number of times that arm $i$ is pulled in the first $t$ plays.
Notation: $\hat\mu_i(t)$: the sample mean of the $\tau_i(t)$ rewards from arm $i$.
Parameter: $\alpha > 1$.

1: Initialization: pull each arm once in the first $N$ rounds.
2: for $t = N+1$ to $T$ do
3:   Compute the index $I_i(t)$ of each arm $i$ as given in (4.73).
4:   Pull the arm $a(t)$ with the greatest index and observe the reward $x_{a(t)}$.
5:   Update the sample mean and the number of plays for arm $a(t)$:
     $\tau_{a(t)}(t) = \tau_{a(t)}(t-1) + 1$
     $\hat\mu_{a(t)}(t) = \dfrac{1}{\tau_{a(t)}(t)}\left(\hat\mu_{a(t)}(t-1)\,\tau_{a(t)}(t-1) + x_{a(t)}\right)$
6: end for
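A direct Python rendering of Algorithm 4.3 is given below (a sketch; the policy interface matches the earlier simulation code, and the tie-breaking and initialization details are our choices).

```python
import numpy as np

class UCBAlpha:
    """UCB-alpha: index (4.73) = sample mean + sqrt(alpha * log t / tau_i)."""
    def __init__(self, N, alpha=2.0):
        self.N, self.alpha = N, alpha
        self.counts = np.zeros(N, dtype=int)
        self.means = np.zeros(N)

    def select(self, t):
        # initialization: pull each arm once
        if np.any(self.counts == 0):
            return int(np.argmin(self.counts))
        index = self.means + np.sqrt(self.alpha * np.log(t) / self.counts)
        return int(np.argmax(index))

    def update(self, i, x):
        self.counts[i] += 1
        # incremental sample-mean update (Line 5 of Algorithm 4.3)
        self.means[i] += (x - self.means[i]) / self.counts[i]
```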

The work by Auer, Cesa-Bianchi, and Fischer, 2002 [23], gives an easily accessible regret analysis that establishes not only the regret order but also a finite-time regret bound for all $T$. We present this result in the theorem below.

Theorem 4.10 Assume that the random reward of each arm takes values in $[0,1]$. Define
$$\Delta_i = \mu^* - \mu_i. \tag{4.74}$$
Then for every horizon length $T$, the problem-specific regret of UCB-$\alpha$ with $\alpha > 1$ is upper bounded by
$$R_{\text{UCB-}\alpha}(T;F) \le \sum_{i:\,\Delta_i>0}\left(\frac{4\alpha}{\Delta_i}\log T + \frac{1}{\alpha-1}\Delta_i\right). \tag{4.75}$$

Proof. We bound the regret by bounding $\tau_i(T)$ for each suboptimal arm $i$ (i.e., an arm $i$ with $\Delta_i > 0$). In the following, all quantities associated with an optimal arm are indicated with a subscript $*$. Let $\pi$ denote the UCB-$\alpha$ policy. By definition, we have
$$\tau_i(T) = 1 + \sum_{t=N+1}^{T}\mathbb{I}(\pi(t) = i). \tag{4.76}$$
The event $\pi(t) = i$ implies that $I_i(t) \ge I_*(t)$ (although not vice versa). To have $I_i(t) \ge I_*(t)$, at least one of the following must hold:
$$\Delta_i < 2\sqrt{\frac{\alpha\log t}{\tau_i(t)}}, \tag{4.77}$$
$$I_*(t) \le \mu^*, \tag{4.78}$$
$$I_i(t) \ge \mu_i + 2\sqrt{\frac{\alpha\log t}{\tau_i(t)}}. \tag{4.79}$$
This can be seen by noticing that if none of the above three inequalities is true, then the gap $\Delta_i$ between $\mu^*$ and $\mu_i$ is at least $2\sqrt{\frac{\alpha\log t}{\tau_i(t)}}$, the index $I_*(t)$ of the optimal arm exceeds $\mu^*$, and the index $I_i(t)$ of arm $i$ does not make up the minimum gap between $\mu^*$ and $\mu_i$, leading to $I_i(t) < I_*(t)$.

Set $t_0 = \frac{4\alpha\log T}{\Delta_i^2}$. We readily have $t_0 \ge \frac{4\alpha\log t}{\Delta_i^2}$ for $t = 1, \ldots, T$. For $\tau_i(t) > t_0$, (4.77) is false, thus at least one of (4.78) and (4.79) must be true. We then have
$$\tau_i(T) \le t_0 + \sum_{t=t_0+1}^{T}\mathbb{I}\left[\text{(4.78) or (4.79) holds}\right], \tag{4.80}$$
where the inequality follows from the simple fact that after arm $i$ is pulled $t_0$ times, the decision time $t$ is at least $t_0 + 1$. Taking the expectation and applying the union bound, we have
$$\mathbb{E}[\tau_i(T)] \le t_0 + \sum_{t=t_0+1}^{T} P\left[I_*(t) \le \mu^*\right] + \sum_{t=t_0+1}^{T} P\left[I_i(t) \ge \mu_i + 2\sqrt{\frac{\alpha\log t}{\tau_i(t)}}\right]. \tag{4.81}$$
Next, we use the Chernoff–Hoeffding inequality to bound $P[I_*(t) \le \mu^*]$ and $P\left[I_i(t) \ge \mu_i + 2\sqrt{\frac{\alpha\log t}{\tau_i(t)}}\right]$. Let $\hat\mu_{*,s}$ denote the sample mean of the optimal arm calculated from $s$ samples:
$$P\left[I_*(t) \le \mu^*\right] = P\left[\hat\mu_*(t) + \sqrt{\frac{\alpha\log t}{\tau_*(t)}} \le \mu^*\right] \tag{4.82}$$
$$\le P\left[\exists\, s \in \{1, \ldots, t\} : \hat\mu_{*,s} + \sqrt{\frac{\alpha\log t}{s}} \le \mu^*\right] \tag{4.83}$$
$$\le \sum_{s=1}^{t} P\left[\hat\mu_{*,s} + \sqrt{\frac{\alpha\log t}{s}} \le \mu^*\right] \tag{4.84}$$
$$= \sum_{s=1}^{t} P\left[\hat\mu_{*,s} \le \mu^* - \sqrt{\frac{\alpha\log t}{s}}\right] \tag{4.85}$$
$$\le \sum_{s=1}^{t}\frac{1}{t^{2\alpha}} \tag{4.86}$$
$$= \frac{1}{t^{2\alpha-1}}, \tag{4.87}$$
where (4.83) follows by considering all possible values of $\tau_*(t)$ and (4.86) follows from the Chernoff–Hoeffding inequality below.

Lemma 4.11 Chernoff–Hoeffding Bound:

Let $X(1), \ldots, X(s)$ be random variables with common support $[0,1]$. Suppose that $\mathbb{E}[X(j)\,|\,X(1), \ldots, X(j-1)] = \mu$ for $j = 1, \ldots, s$. Let $\hat\mu_s = \frac{1}{s}\sum_{j=1}^{s}X(j)$. Then for all $a \ge 0$,
$$P[\hat\mu_s \ge \mu + a] \le e^{-2a^2 s} \quad\text{and}\quad P[\hat\mu_s \le \mu - a] \le e^{-2a^2 s}. \tag{4.88}$$

Similarly,
$$P\left[I_i(t) \ge \mu_i + 2\sqrt{\frac{\alpha\log t}{\tau_i(t)}}\right] = P\left[\hat\mu_i(t) - \mu_i \ge \sqrt{\frac{\alpha\log t}{\tau_i(t)}}\right] \le \frac{1}{t^{2\alpha-1}}. \tag{4.89}$$

We then have
$$\mathbb{E}[\tau_i(T)] \le t_0 + \sum_{t=t_0+1}^{T}\frac{2}{t^{2\alpha-1}} \tag{4.90}$$
$$\le t_0 + \int_{1}^{\infty}\frac{2}{t^{2\alpha-1}}\,dt \tag{4.91}$$
$$= t_0 + \frac{1}{\alpha-1}. \tag{4.92}$$
The theorem then follows by substituting the above bound on $\mathbb{E}[\tau_i(T)]$ into (4.6). □
The worst-case regret of the UCB-$\alpha$ policy is in the order of $O(\sqrt{NT\log T})$, which does not match the lower bound on the minimax regret given in Theorem 4.7. Audibert and Bubeck, 2009 [19], proposed a modification of the UCB-$\alpha$ policy and showed that the modified policy, referred to as MOSS (Minimax Optimal Strategy in the Stochastic case), achieves the optimal $\sqrt{NT}$ minimax regret order while preserving the $\log T$ problem-specific regret order. The modified index has the following form:
$$I_i(t+1) = \hat\mu_i(t) + \sqrt{\frac{\max\left\{\log\frac{T}{N\tau_i(t)},\,0\right\}}{\tau_i(t)}}. \tag{4.93}$$
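For completeness, the MOSS index (4.93) is equally simple to compute; the sketch below (ours) returns the index vector given the horizon length $T$.

```python
import numpy as np

def moss_index(means, counts, T):
    """MOSS index (4.93): sample mean + sqrt(max(log(T / (N * tau_i)), 0) / tau_i).
    `means` and `counts` are arrays of sample means and pull counts (all > 0)."""
    N = len(means)
    bonus = np.sqrt(np.maximum(np.log(T / (N * counts)), 0.0) / counts)
    return means + bonus
```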
The UCB-$\alpha$ policy and its variants are often presented and analyzed for reward distributions with bounded support. Extensions to more general distribution families such as sub-Gaussian and locally sub-Gaussian (i.e., light-tailed) distributions are relatively straightforward; see treatments in Bubeck and Cesa-Bianchi, 2012 [45], and Lattimore and Szepesvari, 2019 [129]. Dealing with heavy-tailed distributions is more involved; see the extension of the UCB-$\alpha$ policy to heavy-tailed distributions by Bubeck, Cesa-Bianchi, and Lugosi, 2013 [50], and that of the DSEE policy in Vakili, Liu, and Zhao, 2013 [194].

4.4 CONNECTIONS BETWEEN BAYESIAN AND FREQUENTIST BANDIT MODELS

While the Bayesian and frequentist frameworks lead to different bandit models with different optimality criteria, learning algorithms developed within one framework can be applied and evaluated in the other. This also raises the question of whether there exist learning algorithms that are optimal, in some sense, from both the Bayesian and the frequentist viewpoints. We see in this section that the answer is positive.

4.4.1 FREQUENTIST APPROACHES TO BAYESIAN BANDITS

Consider first the direction of applying frequentist approaches (e.g., the UCB-type policies) to the Bayesian bandit problem. If we restrict ourselves to the original bandit problem of sampling unknown processes (referred to as the bandit sampling processes), where the arm states are informational and state transitions obey Bayes' rule, the applicability of frequentist approaches can be readily seen. We simply ignore the prior probabilistic knowledge on the arm parameters and directly apply the frequentist algorithms. Obviating the need to update posterior distributions and using simple statistics such as the sample mean and the number of plays, the resulting policies have an edge in computational efficiency. The main question is their performance measured under the Bayesian criteria.
If the objective is the total discounted reward, ignoring the prior knowledge and the discount factor is likely to incur significant performance loss. This is due to the fact that discounting effectively shortens the decision horizon (hence amplifying the impact of prior knowledge on the overall performance) and places heavier weights on decisions at the beginning of the time horizon. If, however, the objective is the average reward over an infinite horizon, prior knowledge has a diminishing effect on the overall performance. All frequentist algorithms with a sublinear regret order, not even necessarily the optimal sublinear order, offer the maximum average reward $\mu^*$. This echoes the statement in Section 3.5.2 that the average-reward criterion is unselective. The connection between asymptotic/order optimality in the frequentist regret measure and optimality under the Bayesian gain-bias and sensitive-discount criteria discussed in Section 3.5.2 is worth exploring.

A performance measure that bridges the Bayesian and frequentist formulations is the Bayesian regret introduced by Lai, 1987 [125], which is the problem-specific regret averaged over the problem instances with a given prior distribution. This corresponds to the Bayes risk commonly used by the Bayesian school. Consider the parametric setting discussed at the beginning of Section 4.2.1. The Bayesian regret of a policy $\pi$ is given by
$$\int_{\theta} R_\pi(T;\theta)\,p(\theta)\,d\theta, \tag{4.94}$$
where $R_\pi(T;\theta)$ is the problem-specific regret in (4.16) and $p(\theta)$ the prior distribution of the arm parameters $\theta$. The Bayesian regret is not a function of $\theta$, but a function of the prior $p(\theta)$.

A significant result in Lai, 1987 [125], is that a modified version of the Lai–Robbins policy that pulls the arm with the highest UCB is asymptotically optimal from both the Bayesian and frequentist viewpoints for distributions in the exponential family. Specifically, this policy achieves, with the corresponding optimal leading constants, both the optimal $\log T$ problem-specific regret order and the optimal $\log^2 T$ Bayesian regret order under sufficiently regular prior distributions.

The Bayesian regret is considerably harder to analyze than its frequentist counterparts. To bound the asymptotic Bayesian regret of a policy, we need to establish asymptotic properties of its problem-specific regret that hold uniformly over a range of $\theta$ in the parameter space so that the integral in (4.94) can be evaluated. The wider this range of uniformity, the broader the class of prior distributions under which the Bayesian regret can be bounded. An alternative approach is to consider the worst-case prior specific to the policy and the horizon length. This brings us to the minimax setting, with the same optimal order of $\sqrt{NT}$ as in the frequentist framework.

4.4.2 BAYESIAN APPROACHES TO FREQUENTIST BANDITS

Bayesian approaches can be used to solve a frequentist bandit problem by adopting a fictitious prior distribution on the arm parameters. In a nonparametric setting, the fictitious prior can be set on the unknown mean $\mu_i$ of each arm, and, in addition, a fictitious likelihood function $\tilde{f}_i(x;\mu_i)$ is needed for calculating the posterior distribution of $\mu_i$ based on observations of the reward $X_i$.
There are two considerations in choosing the fictitious prior distributions and likelihood functions: the resulting frequentist regret performance and the computation involved in the update of the posterior distributions. As has been shown, the frequentist regret performance, particularly in terms of achieving asymptotic/order optimality, depends on a certain matching between the assumed prior/likelihood function and the underlying arm reward distributions. In terms of the computation involved in posterior updates, it is desirable to choose a prior and a likelihood function that form a conjugate pair, that is, the posterior calculated from the prior and the likelihood function belongs to the same probability distribution family as the prior. Examples are the Beta prior for Bernoulli rewards and the Gaussian prior (on the mean) for Gaussian rewards, as detailed below.

We use Thompson Sampling as an example to illustrate the idea of employing Bayesian algorithms for frequentist bandit models. As detailed in Example 2.3, Thompson Sampling is a randomized policy that selects each arm with a probability equal to the posterior probability of it being the optimal one. It is a suboptimal solution to the Bayesian bandit model under both the infinite-horizon discounted-reward criterion and the finite-horizon problem considered by Thompson in his original work of 1933 [190]. If we let the horizon length $T$ tend to infinity, however, Thompson Sampling has recently been shown to be asymptotically optimal in terms of the problem-specific regret for several frequentist bandit problems.

Consider first the parametric setting where the reward distribution of arm $i$ is $f(x;\theta_i)$ (viewed as a function of $x$ parameterized by $\theta_i$). It is also the likelihood function when viewed as a function of the unknown parameter $\theta_i$ for a given observation $x$. With the true likelihood function known, we only need to assume a fictitious prior distribution $\tilde{p}_i$ for each arm. An implementation of Thompson Sampling is given in Algorithm 4.4, which generalizes Example 2.3, where a uniform prior and Bernoulli reward distributions were considered.
For a nonparametric setting, consider the class $\mathcal{F}_b$ of distributions with support over $[0,1]$. The mean $\mu_i$ of each arm is taken as the unknown parameter for which a prior will be assumed and the posterior distribution will be updated based on a fictitious likelihood function. We present below two ways of employing Thompson Sampling in this setting, developed by Agrawal and Goyal, 2012, 2013 [6, 7].

The first is to use the beta distribution $\text{Beta}(\alpha, \beta)$ as the prior and assume a Bernoulli reward distribution with mean $\mu_i$ for arm $i$ ($i = 1, \ldots, N$). A Bernoulli distributed reward observation $X$ gives the following likelihood functions of $\mu_i$:
$$f(X=1;\mu_i) = \mu_i, \qquad f(X=0;\mu_i) = 1 - \mu_i. \tag{4.95}$$
Being the conjugate prior for the likelihood functions induced by Bernoulli observations, a $\text{Beta}(\alpha,\beta)$ prior, upon observing a Bernoulli reward $X$, leads to a posterior that is simply $\text{Beta}(\alpha+1, \beta)$ if $X = 1$ and $\text{Beta}(\alpha, \beta+1)$ if $X = 0$. The posterior in each decision period thus remains in the beta family, with the two parameters $\alpha$ and $\beta$ simply counting the numbers of successes and failures, respectively.
Algorithm 4.4 Thompson Sampling for Parametric Frequentist Bandits
Input: $\tilde{p}_i^{(1)}$: a fictitious prior of $\theta_i$ ($i = 1, \ldots, N$).

1: for $t = 1$ to $T$ do
2:   Generate a random sample $\tilde\theta_i(t)$ from $\tilde{p}_i^{(t)}$ ($i = 1, \ldots, N$).
3:   Compute, for $i = 1, \ldots, N$,
     $\mu(\tilde\theta_i(t)) = \int_x x\,f\left(x;\tilde\theta_i(t)\right)dx$.
4:   Play arm $\tilde{a} = \arg\max_{i=1,\ldots,N}\mu(\tilde\theta_i(t))$ and receive reward $x_{\tilde{a}}$.
5:   Update the posterior distribution of arm $\tilde{a}$: for all $z \in \Theta$,
     $\tilde{p}_{\tilde{a}}^{(t+1)}(z) = \dfrac{\tilde{p}_{\tilde{a}}^{(t)}(z)\,f(x_{\tilde{a}};z)}{\int_y f(x_{\tilde{a}};y)\,\tilde{p}_{\tilde{a}}^{(t)}(y)\,dy}$
6: end for

The initial prior can be set to $\text{Beta}(1,1)$, which is the uniform distribution over $[0,1]$.

What remains to be specified is how to use the true reward observations with support $[0,1]$ in the likelihood functions (4.95), which assume Bernoulli rewards of 0 or 1. A method introduced in Agrawal and Goyal, 2012 [6], is, upon receiving a reward $x \in [0,1]$, to toss a coin with bias $x$ and use the outcome as the input to the (fictitious) likelihood functions. This translation of a general reward distribution over $[0,1]$ to a Bernoulli distribution preserves the regret bounds developed under Bernoulli rewards.
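A minimal Python sketch of this Beta–Bernoulli Thompson Sampling variant for rewards in $[0,1]$, including the coin-toss translation of Agrawal and Goyal, is given below (the interface and names are ours).

```python
import numpy as np

class ThompsonBetaBernoulli:
    """Thompson Sampling with Beta(1,1) priors and the Bernoulli translation:
    a reward x in [0,1] is converted to a Bernoulli(x) pseudo-observation."""
    def __init__(self, N, rng=None):
        self.rng = rng or np.random.default_rng(0)
        self.alpha = np.ones(N)   # 1 + number of pseudo-successes
        self.beta = np.ones(N)    # 1 + number of pseudo-failures

    def select(self, t):
        samples = self.rng.beta(self.alpha, self.beta)  # one draw per posterior
        return int(np.argmax(samples))

    def update(self, i, x):
        pseudo = self.rng.random() < x                  # coin toss with bias x
        if pseudo:
            self.alpha[i] += 1
        else:
            self.beta[i] += 1
```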
The second approach to using Thompson Sampling in the nonparametric frequentist setting is to adopt a Gaussian prior and assume a Gaussian reward distribution with mean $\mu_i$ and unit variance for arm $i$ ($i=1,\ldots,N$). Since the Gaussian family is conjugate to itself, the posterior distribution is Gaussian at each time. The initial prior can be set to the Gaussian distribution $\mathcal{N}(0,1)$. The posterior distribution at the beginning of decision time $t>1$ for arm $i$ is then $\mathcal{N}\!\left(\hat{\mu}_i(t-1),\,\frac{1}{\tau_i(t-1)+1}\right)$, where $\tau_i(t-1)$, as usual, is the number of plays on arm $i$ in the past $t-1$ plays and $\hat{\mu}_i(t-1) = \frac{1}{\tau_i(t-1)+1}\sum_{j=1}^{\tau_i(t-1)} x_i(j)$ with $x_i(1),\ldots,x_i(\tau_i(t-1))$ denoting the $\tau_i(t-1)$ reward observations from arm $i$. A random sample is then drawn from the posterior distribution of each arm, and the arm with the largest sample value is played at time $t$. Note that the true rewards are within the support of the assumed Gaussian distribution. The observed rewards are directly used in the posterior updates.
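A corresponding sketch of the Gaussian variant, under the same assumed `pull_arm` interface as above:

```python
import numpy as np

def thompson_sampling_gaussian(pull_arm, N, T, rng=np.random.default_rng()):
    """Thompson Sampling with N(0,1) priors and an assumed unit-variance Gaussian likelihood."""
    counts = np.zeros(N)       # number of plays of each arm so far
    reward_sums = np.zeros(N)  # running sums of observed rewards
    for t in range(T):
        post_mean = reward_sums / (counts + 1.0)
        post_std = 1.0 / np.sqrt(counts + 1.0)
        samples = rng.normal(post_mean, post_std)
        arm = int(np.argmax(samples))
        x = pull_arm(arm)      # observed reward, used directly in the update
        counts[arm] += 1
        reward_sums[arm] += x
```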
In terms of the problem-specific regret, the asymptotic optimality of Thompson Sampling with the Beta prior for Bernoulli arms has been established by Agrawal and Goyal, 2012 [6] and Kaufmann, Korda, and Munos, 2012 [110]. Its asymptotic optimality under properly chosen priors has been shown for single-parameter exponential families by Korda, Kaufmann, and Munos, 2013 [120], and for Gaussian arms by Honda and Takemura, 2014 [102]. The nearly optimal
worst-case regret of Thompson Sampling for arms with bounded support has been established
by Agrawal and Goyal, 2013 [7]. Strong empirical performance of Thompson Sampling for
frequentist bandits is also well documented.
Another Bayesian approach to frequentist bandits is the Bayes-UCB algorithm proposed and analyzed by Kaufmann, Cappé, and Garivier, 2012 [109] and Kaufmann, 2018 [108]. In Bayes-UCB, quantiles of the posterior distribution, playing the role of UCB-based indices, are used for arm selection. The asymptotic optimality of Bayes-UCB in terms of the problem-specific regret has been shown for Bernoulli arms (Kaufmann, Cappé, and Garivier, 2012 [109]) and single-parameter exponential families (Kaufmann, 2018 [108]).
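For Bernoulli arms with Beta posteriors, the Bayes-UCB index is a posterior quantile and can be computed directly; the sketch below assumes the quantile level $1-1/(t\log^c T)$ used in the Bernoulli analysis, with $c$ a tuning constant.

```python
import numpy as np
from scipy.stats import beta as beta_dist

def bayes_ucb_index(successes, failures, t, T, c=5):
    """Quantile-based index for Bernoulli arms under Beta(1,1) priors.

    Returns, for each arm, the quantile of order 1 - 1/(t * log(T)^c) of its
    Beta posterior; at each time the arm with the largest index is played and
    its success/failure counts are updated from the observed Bernoulli reward.
    """
    level = 1.0 - 1.0 / (t * np.log(T) ** c)
    return beta_dist.ppf(level, successes + 1, failures + 1)
```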
CHAPTER 5
Variants of the Frequentist Bandit Model
The set of variants of the frequentist bandit model that have been considered in the literature is
much richer than that of the Bayesian bandit model. This may be partly due to the elasticity of the
order-optimality objective widely accepted for frequentist bandit problems. Assumptions can be
relaxed, additional model complexities can be introduced, yet the problem remains analytically
tractable under the objective of order optimality. For the Bayesian bandit model, the objective of
optimizing the total discounted reward is unforgiving of any suboptimal action during the entire
course of the decision process. As we have seen in Chapter 3, the optimality of the Gittins index
policy can be fragile, as is often the case with optimal policies for MDP problems with strikingly
simple and elegant analytical structures.
We cover in this chapter major first-order variants that focus on a single aspect of the
defining features of the bandit model. Higher-order variants can be easily constructed, if such
studies are of interest both intellectually and application-wise. Given the vast landscape of the
existing literature and the fast pace of progress in this active research area, the coverage of this
chapter, in terms of its inclusion of existing studies, old and new, as well as technical details
involved, is far from comprehensive. Our focus is on categorizing various formulations of these
bandit variants, outlining general ideas of solution methods, and providing references to rep-
resentative studies. Rather than providing a detailed presentation of specific algorithms and
analysis, we hope to highlight the connections and differences between these variants and the
canonical model, and the new challenges and how they might be addressed.
5.1 VARIATIONS IN THE REWARD MODEL
In the canonical frequentist bandit model, each arm $i$ ($i=1,\ldots,N$) is associated with an i.i.d. reward process $\{X_i(t)\}_{t\ge1}$ drawn from a fixed unknown distribution $F_i(x)$. In this section, we consider bandit variants that adopt more general models for the arm reward processes.
One direction is to allow temporal dependencies in the reward processes. A natural candi-
date is the Markov process, leading to frequentist bandit models with intricate connections with
the Bayesian bandit models discussed in Chapters 2 and 3. In particular, depending on whether
the arm reward state continues to evolve when the arm is passive, two variants, referred to as the
rested frequentist bandit and the restless frequentist bandit, have been considered.
The second direction is to consider nonstationary reward processes where the unknown distribution $F_i(x)$ of each arm is time-varying. Three different temporal-variation models have been considered: the abrupt-change model, the continuous-drift model, and the total-variation model.
The third direction, which represents a much more significant departure from the canon-
ical model, is to forgo the stochastic nature of the reward models and consider deterministic
reward sequences potentially designed by an adversary that reacts to the player’s strategy. Re-
ferred to as the nonstochastic bandit or the adversarial bandit, this variant is receiving increasing
attention and demands a book of its own. We give here only a brief discussion.
5.1.1 RESTED MARKOV REWARD PROCESSES
Consider a bandit model where random rewards from successive plays of arm $i$ form a finite-state, irreducible, and aperiodic Markov chain with state space $\mathcal{S}_i$ and an unknown transition probability matrix $P_i$. When arm $i$ is not engaged, its reward state remains frozen. This gives us a bandit model with rested Markov reward processes.
A proxy for the oracle and regret decomposition: The first step is to characterize the optimal strategy of the oracle who has knowledge of the transition probability matrix $P_i$ of each arm. This is the Bayesian bandit model under the finite-horizon total-reward criterion. As discussed in Section 3.1.3, the optimal policy for this finite-horizon problem is in general nonstationary and analytically intractable. The lack of an analytical characterization of the benchmark presents a major obstacle to regret analysis and policy development.
With an objective focusing on the order of the regret as T approaches infinity, however,
it suffices to consider a proxy for the oracle that offers the same order of performance as the
oracle but better analytical tractability. This approach of finding a proxy benchmark turns out to
be quite powerful and will be called upon multiple times in this chapter.
For the bandit model with rested reward processes, the single-arm policy that always plays the arm with the maximum limiting reward rate suffices as a proxy benchmark. Specifically, let $\{\pi_i(x)\}_{x\in\mathcal{S}_i}$ denote the unique stationary distribution of arm $i$. The asymptotic average reward per play offered by arm $i$ is thus given by, for any initial state $x_0\in\mathcal{S}_i$,
$$\mu_i = \lim_{T\to\infty}\frac{1}{T}\,\mathbb{E}\!\left[\sum_{t=1}^{T} X_i(t)\,\Big|\,X_i(1)=x_0\right] = \sum_{x\in\mathcal{S}_i} x\,\pi_i(x), \tag{5.1}$$

where we have used the fact that a finite-state, irreducible, and aperiodic Markov chain is
ergodic with a limiting distribution given by the unique stationary distribution. Due to the
rested nature and the ergodicity of the reward processes, the optimal oracle strategy that ex-
ploits the initial transient behaviors of arms will eventually settle down on the optimal arm
i D arg maxi D1;:::;N i after a finite number of arm switchings (bounded by the mixing times
of the Markov reward processes). Consequently, the performance gap between the single-arm
policy that always plays arm $i^*$ and the optimal oracle strategy is upper bounded by a constant independent of the initial arm reward states and the horizon length $T$. Thus, using $\mu^* T$ as the proxy benchmark preserves the asymptotic behavior of regret.
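To make the proxy benchmark concrete, the limiting reward rate in (5.1) can be computed from an arm's transition matrix; a brief sketch, assuming the reward of a state equals its value:

```python
import numpy as np

def limiting_reward_rate(P, states):
    """Compute mu_i = sum_x x * pi_i(x) for an irreducible, aperiodic chain.

    P is the arm's transition matrix and `states` the reward value of each
    state; the stationary distribution pi_i is obtained as the left
    eigenvector of P associated with eigenvalue 1.
    """
    eigvals, eigvecs = np.linalg.eig(P.T)
    k = np.argmin(np.abs(eigvals - 1.0))   # eigenvalue closest to 1
    pi = np.real(eigvecs[:, k])
    pi = pi / pi.sum()                     # normalize to a probability vector
    return float(np.dot(states, pi))
```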
By applying Wald's identity to the regenerative cycles of each Markov reward process, the (proxy) regret of policy $\pi$ can be written as
$$R_\pi(T;\{P_i\}) = \sum_{i:\,\mu_i<\mu^*} (\mu^*-\mu_i)\,\mathbb{E}[\tau_i(T)], \tag{5.2}$$
which has the same expression as (4.6) for i.i.d. rewards, except that $\mu_i$ here is determined by the limiting distribution of the corresponding Markov chain. In analyzing the regret performance, the key quantities are again the expected times $\mathbb{E}[\tau_i(T)]$ spent on each suboptimal arm. As discussed below, the lower bound results and learning policies for the canonical frequentist bandit model can be readily extended by employing large deviation results on the empirical distribution and the empirical transition-count matrix of finite-state Markov chains.
Regret lower bound: Existing results on bandits with rested Markovian reward processes mainly focus on the uniform-dominance approach. The first result was given by Anantharam, Varaiya, and Walrand, 1987 [16], adopting a similar univariate parametric model as in the work by Lai and Robbins, 1985 [126], for i.i.d. reward processes. Specifically, all arms have a common state space $\mathcal{S}$, and the transition probability matrix $P(\theta_i)$ of arm $i$ is determined by an unknown parameter $\theta_i$ belonging to a parameter space $\Theta$. It is assumed that for all reward states $x,x'\in\mathcal{S}$ and for all parameter values $\theta,\theta'\in\Theta$, if $P_{x,x'}(\theta)>0$, then $P_{x,x'}(\theta')>0$. Furthermore, $P(\theta)$ is irreducible and aperiodic for all $\theta\in\Theta$. The initial state distribution is assumed to be positive for all states.
The problem-specific lower bound echoes that of Lai and Robbins, 1985 [126]. Specifically, it was shown that under similar regularity conditions (specifically, the continuity of $D(\theta\|\theta')$ in $\theta'>\theta$ for each fixed $\theta\in\Theta$ and the denseness of $\Theta$), for every uniformly good policy $\pi$ and every arm parameter set $\theta$ such that $\mu(\theta_i)$ are not all equal, we have
$$\liminf_{T\to\infty}\frac{R_\pi(T;\theta)}{\log T} \;\ge\; \sum_{i:\,\mu(\theta_i)<\mu^*}\frac{\mu^*-\mu(\theta_i)}{D(\theta_i\|\theta^*)}, \tag{5.3}$$
where
$$D(\theta_i\|\theta^*) = D\big(P(\theta_i)\,\|\,P(\theta^*)\big) \tag{5.4}$$
is the KL divergence between two stochastic matrices. Recall that for two irreducible and aperiodic stochastic matrices $P$ and $Q$ satisfying $P_{x,x'}>0$ if and only if $Q_{x,x'}>0$, the KL divergence is defined as
$$D(P\|Q) = \sum_{x\in\mathcal{S}}\pi(x)\sum_{x'\in\mathcal{S}} P_{x,x'}\log\frac{P_{x,x'}}{Q_{x,x'}}, \tag{5.5}$$
where $\{\pi(x)\}$ is the stationary distribution of $P$. Note that $D(P\|Q)$ is the expected KL divergence between the distributions of the next state under $P$ and $Q$ conditioned on the current state (i.e., individual rows of $P$ and $Q$), where the expectation is with respect to the stationary distribution of $P$ (i.e., the current state is drawn randomly according to the stationary distribution of $P$). The proof of the above lower bound follows a similar line of arguments as in Lai and Robbins, 1985 [126].
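The divergence in (5.5) is straightforward to compute once the stationary distribution of $P$ is available; a small illustrative sketch:

```python
import numpy as np

def kl_stochastic_matrices(P, Q):
    """KL divergence D(P||Q) of (5.5) between two stochastic matrices.

    Assumes P and Q are irreducible and aperiodic, with Q[x, x'] > 0
    wherever P[x, x'] > 0.
    """
    # Stationary distribution of P: left eigenvector for eigenvalue 1.
    eigvals, eigvecs = np.linalg.eig(P.T)
    pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
    pi = pi / pi.sum()
    # Row-wise KL terms weighted by the stationary distribution of P.
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(P > 0, np.log(P / Q), 0.0)
    return float(np.sum(pi[:, None] * P * ratio))
```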
Asymptotic and order optimal policies: The Lai-Robbins policy was extended by Anantharam, Varaiya, and Walrand, 1987 [15], and shown to preserve its asymptotic optimality. Specifically, a point estimate $\hat{\mu}_i(t)$ of the mean and UCBs $\{U_i(t)\}_{t\ge1}$ can be similarly constructed using large deviation results on the empirical distribution and the empirical transition-count matrix of finite-state Markov chains, under an additional assumption that $\log P_{x,x'}(\theta)$ is a concave function of $\theta$ for all $x,x'\in\mathcal{S}$.
The UCB-$\alpha$ policy given in Algorithm 4.3 can also be extended to handle rested reward processes and maintain its order optimality, as shown by Tekin and Liu, 2012 [189]. The only necessary change is to set the parameter $\alpha$ in the UCB-$\alpha$ index to a sufficiently large value determined by the second largest eigenvalues of the transition matrices as well as the stationary distributions. Prior knowledge of nontrivial bounds on such quantities suffices for setting this exploration constant. The DSEE and $\epsilon$-greedy policies can also be extended without much difficulty.
5.1.2 RESTLESS MARKOV REWARD PROCESSES
Similar to the Bayesian restless bandit discussed in Section 3.3, the frequentist restless bandit
model involves reward processes that continue to evolve even when the associated arms are pas-
sive. The difference here is that the Markov models governing the state transitions are unknown
under both active and passive modes of the arms. To see it another way, the Bayesian restless
bandit model discussed in Section 3.3 defines the oracle strategy for the benchmark performance
that learning algorithms for the frequentist restless bandit aim to approach.
Analytical characterizations of the optimal policy for the Bayesian restless bandit are intractable; the problem has been shown to be PSPACE-hard by Papadimitriou and Tsitsiklis, 1999 [162]. The Whittle index policy is a good candidate for a proxy. Unfortunately, analytical characterizations of the Whittle index policy are also intractable in general (in Chapter 6 we show a special case where a logarithmic regret order with respect to the Whittle index policy can be achieved).
An alternative is to consider the single-arm policy that always plays the arm with the
greatest limiting reward rate, i.e., the proxy oracle for the rested case. This proxy, however, can
no longer preserve the same order of performance as the oracle. In fact, the performance gap to
the oracle can grow linearly with T . The resulting regret is thus a much weaker measure. This
notion of weak regret was also adopted by Auer et al., 2003 [24], for bandits with non-stochastic
reward processes.
Under the notion of weak regret, the UCB and the DSEE policies have been extended to
offer a logarithmic regret order for restless reward processes. The extensions, however, are much
more involved than in the rested case. Compared to the i.i.d. and the rested Markovian reward
models, the restless nature of arm state evolution requires that each arm be played consecutively
for a period of time in order to learn its Markovian reward statistics. The length of each segment
of consecutive plays needs to be carefully controlled to avoid spending too much time on a
suboptimal arm. At the same time, we experience a transient period each time we switch out
and then back to an arm, which leads to potential reward loss with respect to the steady-state
behavior of this arm. Thus, the frequency of arm switching needs to be carefully bounded.
These factors are balanced through an epoch structure in the extension of the DSEE policy
by Liu, Liu, and Zhao, 2013 [132]. Specifically, the time horizon is partitioned into interleaving
exploration and exploitation epochs with carefully controlled epoch lengths. During an explo-
ration epoch, the player partitions the epoch into N contiguous segments, one for playing each
of the N arms to learn their reward statistics. During an exploitation epoch, the player plays the
arm with the largest sample mean (i.e., average reward per play) calculated from the observations
obtained so far. The lengths of both the exploration and the exploitation epochs grow geometrically. The number of arm switchings is thus of logarithmic order in time. The tradeoff between exploration and exploitation is balanced by choosing the cardinality of the sequence of exploration epochs.
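A schematic sketch of this interleaving epoch structure is given below; the epoch lengths and the exploration-epoch schedule are illustrative placeholders rather than the tuned choices of Liu, Liu, and Zhao, 2013 [132].

```python
import numpy as np

def dsee_restless_sketch(pull_arm, N, T):
    """Schematic DSEE-style policy with geometrically growing epochs.

    Exploration epochs are split into N contiguous segments (one per arm);
    exploitation epochs play the arm with the largest sample mean so far.
    The alternating schedule and lengths 2^k below are placeholders; the
    actual cardinality of exploration epochs is carefully tuned in [132].
    """
    sums, counts = np.zeros(N), np.zeros(N)
    t, k = 0, 0
    while t < T:
        k += 1
        epoch_len = 2 ** k
        if k % 2 == 1:                       # exploration epoch (placeholder schedule)
            seg = max(1, epoch_len // N)
            plan = [i for i in range(N) for _ in range(seg)]
        else:                                # exploitation epoch
            best = int(np.argmax(sums / np.maximum(counts, 1)))
            plan = [best] * epoch_len
        for arm in plan:
            if t >= T:
                break
            x = pull_arm(arm)
            sums[arm] += x
            counts[arm] += 1
            t += 1
    return sums / np.maximum(counts, 1)      # empirical reward rates
```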
The UCB policy was extended by Tekin and Liu, 2012 [189], through a clever use of the
i.i.d. structure of the regenerative cycles of a Markov chain. The basic idea is to play an arm
consecutively for a random number of times determined by a regenerative cycle of a particular
state and arms are selected based on the UCB index calculated from observations obtained only
inside the regenerative cycles (observations obtained outside the regenerative cycles are not used
in learning). The i.i.d. nature of the regenerative cycles reduces the problem to the canonical
bandit model with i.i.d. reward processes. A different extension of the UCB policy was developed
by Liu, Liu, and Zhao, 2013 [132], by introducing a geometrically growing epoch structure.
Compared to the extension based on regenerative cycles, this extension avoids the potentially
large number of observations not used for learning before the chosen arm enters a regenerative
cycle defined by a particular pilot state.
5.1.3 NONSTATIONARY REWARD PROCESSES
When the reward process of each arm is nonstationary with time-varying distributions, the
online learning problem adds another dimension. The problem now becomes a moving-target
problem. In addition to the tradeoff between exploitation and exploration, the learning algo-
rithm also faces the dilemma of whether to remember or to forget past observations. On the one hand,
learning efficiency hinges on fully utilizing available data. On the other hand, to track the best
arm that changes over time, the algorithm needs to be sufficiently agile by forgetting outdated
observations that might have become misleading.
Analytical treatments of the problem depend on the model for the temporal variations of the reward processes. Three different models have been considered in the literature: the abrupt-change model, the continuous-drift model, and the total-variation model, as discussed below.
The abrupt-change model: Under this model, the set of reward distributions $\{F_i\}_{i=1}^N$ experiences abrupt changes at unknown time instants. In other words, the reward process of each arm is piecewise stationary with unknown change points (note that this includes scenarios where arms change asynchronously). This model is often parameterized by the total number $\Upsilon_T$ of change points over the horizon of length $T$.
To handle the abrupt changes of the reward distributions, it is necessary to incorporate
into the learning policy a mechanism for filtering out potentially obsolete observations; a direct
application of learning algorithms designed under the assumption of time-invariant models re-
sults in a near-linear regret order. For instance, a direct application of the UCB policy to bandits experiencing two change points results in a regret order no smaller than $T/\log T$, as shown by Garivier and Moulines, 2011 [85].
There are two general approaches to the filtering mechanism: an open-loop approach that is agnostic to the specific change points and adapts only to the total number $\Upsilon_T$ of change points, and a fully adaptive approach that actively detects the change points based on observed rewards and adjusts the arm selections based on the outcomes of the change detection.
Under the open-loop approach, the filtering component employs standard techniques in model tracking: discounting, windowing, and restarting. In particular, the discounted UCB (D-UCB) algorithm uses a discounted empirical average and a corresponding confidence bound based on discounted time for computing the arm indexes. The sliding-window UCB (SW-UCB) computes the arm indexes using only the most recent observations within a window of a specific length. The discount factor in D-UCB and the window length in SW-UCB are predetermined based on the total number $\Upsilon_T$ of change points and the horizon length $T$. Both algorithms degenerate to the original UCB policy when $\Upsilon_T=0$, for which the discount factor in D-UCB becomes 1 and the window length in SW-UCB increases to $T$. The regret analysis adopts a minimax formulation of the change points, focusing on the worst case in terms of the specific change points for a given $\Upsilon_T$. It was shown that D-UCB offers a regret order of $O(\sqrt{\Upsilon_T T}\log T)$ and SW-UCB a regret order of $O(\sqrt{\Upsilon_T T\log T})$. Both are near order-optimal (up to a $\log T$ and a $\sqrt{\log T}$ factor, respectively) when the number $\Upsilon_T$ of change points is bounded. This is established through an $\Omega(\sqrt{T})$ lower bound on the regret order for the case of $\Upsilon_T=2$ by Garivier and Moulines, 2011 [85]. When $\Upsilon_T$ grows linearly with $T$ (i.e., the change points have a positive density), a linear regret order is inevitable, since each abrupt change incurs performance loss. Beyond these two extreme regimes of $\Upsilon_T$, a full characterization of the optimal regret scaling with respect to $\Upsilon_T$ is lacking. The restarting technique employed by Auer et al., 2003 [24] for non-stochastic bandits can be employed here as well, where a standard learning policy can be augmented with a periodic restart whose period is chosen a priori based on $\Upsilon_T$.
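As an illustration of the windowing technique, a minimal sliding-window UCB sketch follows; the window length `w` and the exploration constant are assumed inputs, whereas the tuned choices in Garivier and Moulines, 2011 [85] depend on $\Upsilon_T$ and $T$.

```python
import numpy as np
from collections import deque

def sw_ucb(pull_arm, N, T, w, xi=0.6):
    """Sliding-window UCB: indexes are computed from the last w observations only."""
    history = deque()  # (arm, reward) pairs inside the current window
    for t in range(1, T + 1):
        if t <= N:
            arm = t - 1                       # play each arm once to initialize
        else:
            counts, sums = np.zeros(N), np.zeros(N)
            for a, r in history:
                counts[a] += 1
                sums[a] += r
            means = sums / np.maximum(counts, 1)
            bonus = np.sqrt(xi * np.log(len(history)) / np.maximum(counts, 1))
            bonus[counts == 0] = np.inf       # re-explore arms unseen in the window
            arm = int(np.argmax(means + bonus))
        r = pull_arm(arm)
        history.append((arm, r))
        if len(history) > w:                  # discard observations outside the window
            history.popleft()
```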
The adaptive approach employs a change detector to trigger restarts of the learning algorithm. Representative results include Hartland et al., 2007 [100] and Liu, Lee, and Shroff, 2018 [133], which integrate classical change detection algorithms such as the Page-Hinckley Test (PHT) and the cumulative sum control chart (CUSUM) test with UCB policies. In particular, it was shown by Liu, Lee, and Shroff, 2018 [133] that when known lower bounds are imposed on the changes in the reward mean as well as the time lapse between two consecutive change points, the CUSUM-UCB policy offers a regret order of $O(\sqrt{\Upsilon_T T\log T})$ with the knowledge of $\Upsilon_T$. Allesiardo and Feraud, 2015 [12] proposed a confidence-bound based test for detecting a switching of the best arm (not every change in reward models results in a switching of the best arm) to restart EXP3, a randomized policy originally developed for nonstochastic bandits (see Section 5.1.4). Referred to as EXP3.R (EXP3 with restarting), this policy achieves a regret order of $O(\tilde{\Upsilon}\sqrt{T\log T})$, where $\tilde{\Upsilon}\le\Upsilon_T$ is the number of switchings of the best arm, without the knowledge of $\tilde{\Upsilon}$ or $\Upsilon_T$. Assuming a specific stochastic model for the change points, Mellor and Shapiro, 2013 [149] integrated a Bayesian change point detector into Thompson Sampling.
The continuous-drift model: The abrupt-change model focuses on temporal variations that are significant in magnitude (with the minimum shift in reward mean bounded away from zero) yet sparse in occurrence (with the number $\Upsilon_T$ of change points growing sublinearly with $T$). Slivkins and Upfal, 2008 [180] introduced a Brownian bandit model to capture continuous and gradual drift of the reward models as often seen in economic applications (e.g., stock price, supply and demand). In particular, the temporal dynamics of the reward mean $\mu_i(t)$ of each arm follow a Brownian motion with rate $\sigma_i^2$. To avoid unbounded drift, the Brownian motions are restricted to an interval with reflecting boundaries. The objective is to characterize the regret per play in terms of the volatilities $\{\sigma_i^2\}_{i=1}^N$ of the arm reward models. The constant lower bound on the regret per play established by Slivkins and Upfal, 2008 [180] implies a linear growth rate of the cumulative regret in $T$.
The total-variation model: Besbes, Gur, and Zeevi, 2014 [36] considered a temporal variation model that imposes an upper bound on the total variation of the expected rewards over the entire horizon. The total variation is defined as
$$V_T = \sum_{t=1}^{T-1}\max_{i=1,\ldots,N}\left|\mu_i(t)-\mu_i(t+1)\right|.$$
The objective is to characterize the minimax regret order over the set of reward models with a total variation no greater than a variation budget $V_T$. More specifically, the regret of a policy $\pi$ is given by
$$R_\pi(T;V_T) = \sup_{\mathcal{F}_T}\,\mathbb{E}_\pi\!\left[\sum_{t=1}^{T}\left(\mu^*(t) - \mu_{\pi(t)}(t)\right)\right],$$
where $\mathcal{F}_T$ denotes the set of distribution sequences $\{F_i(t)\}_{i=1,\ldots,N;\,t=1,\ldots,T}$ with bounded support in $[0,1]$ satisfying the variation budget $V_T$, and $\mu^*(t)$ is the maximum expected reward at time $t$. Besbes, Gur, and Zeevi, 2014 [36] established a regret lower bound of $\Omega\big((N V_T)^{1/3} T^{2/3}\big)$ and showed that EXP3 with periodic restarting achieves the optimal order up to a logarithmic term in $N$. The restarting period is predetermined based on the variation budget $V_T$. In other words, it is an open-loop policy in terms of its handling of model variations.
The total-variation model includes abrupt changes and continuous drifts as special cases, with a proper translation from the number $\Upsilon_T$ of changes and the volatilities $\{\sigma_i^2\}_{i=1}^N$ to the total variation budget $V_T$. The characterization of the minimax regret over this broader class, however, does not apply to more specific models. Results with a minimax nature, when developed under a broader model class, often sacrifice sharpness. When more information is available, tighter lower bounds and better policies can be tailor-made for the specific model class.
5.1.4 NONSTOCHASTIC REWARD PROCESSES: ADVERSARIAL BANDITS
Under an adversarial bandit model, the reward process of an arm i is an arbitrary unknown
deterministic sequence $\{x_i(t)\}_{t\ge1}$ with $x_i(t)\in[0,1]$ for all $i$ and $t$. In this case, one may ques-
tion what can actually be learned from past reward observations if the past and potential future
rewards are not bound by a common stochastic model and have no structure.
Indeed, if the objective is to approach the performance of an oracle who knows the entire
reward process of every arm a priori and plays optimally based on this knowledge, the learning
task is hopeless, and a thus-defined regret will grow linearly with T .
Weak regret, however, leads to a well-formulated problem. This corresponds to an oracle
who can only choose one fixed arm to play over the entire time horizon. Under the objective
of minimizing the weak regret, what the player is trying to learn is which arm has the largest
cumulative reward rather than trying to catch the largest reward at each time instant. Intuitively,
under the assumption of bounded reward, the former is possible as past reward observations
become increasingly informative about the largest cumulative reward as time goes on.
Besides adjusting the benchmark in the regret definition, the adversarial bandit model also
necessitates randomized strategies. A deterministic policy can be easily defeated by adversarially
chosen reward sequences, leading to a linear order of even the weak regret.
Under the minimax approach, the weak regret of an arm selection policy $\pi$ is given by
$$R_\pi(T) = \max_{\{x_i(t)\}_{t\ge1},\,i=1,\ldots,N}\left(\max_{i=1,\ldots,N}\sum_{t=1}^{T}x_i(t) - \mathbb{E}_\pi\!\left[\sum_{t=1}^{T}x_{\pi(t)}(t)\right]\right), \tag{5.6}$$
where the expectation is over the internal randomness of policy $\pi$. In other words, the performance of a policy $\pi$ is measured against the worst reward sequences $\{x_i(t)\}_{t\ge1,\,i=1,\ldots,N}$.
The notion of the weak regret in the bandit setting corresponds to the external regret com-
monly adopted in game theory, where the utility of a strategy is measured against the best single
action in hindsight. In the context of games, a popular learning algorithm for minimizing external regret is the multiplicative weights method. It maintains a weight $w_a(t)$ for each action $a$ (which corresponds to an arm in the bandit setting) as an indicator of the potential payoff of this action. Specifically, the weight is given by
$$w_a(t) = e^{\eta(t)\sum_{j=1}^{t}x_a(j)}, \tag{5.7}$$
where $x_a(j)$ is the $j$th reward sample from taking action $a$ and $\{\eta(t)\}_{t\ge1}$, referred to as the learning rate (in the role of a step size), is a positive and monotonically decreasing sequence. At time $t+1$, the algorithm chooses a random action $a$ with the following probability:
$$p_a(t+1) = \frac{w_a(t)}{\sum_{a'} w_{a'}(t)}. \tag{5.8}$$
The name of the algorithm comes from the fact that the weight $w_a(t)$ is updated multiplicatively under a constant learning rate $\eta$, i.e., $w_a(t) = w_a(t-1)\,e^{\eta\,x_a(t)}$.
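A compact sketch of the multiplicative-weights update under full-information feedback (the reward vectors of all actions are assumed to be revealed each round, and the constant learning rate is an input):

```python
import numpy as np

def multiplicative_weights(reward_rounds, N, eta=0.1, rng=np.random.default_rng()):
    """Full-information multiplicative weights (exponential weights).

    reward_rounds yields, at each round, the reward vector of all N actions
    in [0, 1]; the chosen action's reward is accrued and every weight is
    updated multiplicatively.
    """
    log_w = np.zeros(N)                      # keep log-weights for numerical stability
    total = 0.0
    for rewards in reward_rounds:
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        a = rng.choice(N, p=p)
        total += rewards[a]
        log_w += eta * np.asarray(rewards)   # w_a(t) = w_a(t-1) * exp(eta * x_a(t))
    return total
```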
The multiplicative-weights method was developed under full-information feedback, where the rewards of all actions a player could have taken are revealed at the end of each decision period. For minimizing the weak regret in the adversarial bandit setting, Auer et al., 2003 [24] modified this algorithm to handle the change in the feedback model from full information to bandit feedback. Referred to as Exploration-Exploitation with Exponential weights (EXP3), this bandit algorithm differs from the multiplicative-weights algorithm in two aspects: (i) correcting the bias (caused by the incompleteness of the bandit feedback) in the estimated reward of the chosen action by dividing by the probability of selecting it; and (ii) mixing the action distribution with a uniform distribution for exploration to ensure that sufficient information is obtained for all arms under the bandit feedback. It was later noted by Stoltz (2005) [183] that this uniform exploration is unnecessary. The implementation given in Algorithm 5.5 is without this uniform exploration. Another modification in Algorithm 5.5 is that the unbiased reward estimate is obtained through an unbiased estimate of the loss $1-x_i(t)$ to mitigate the issue of large variance of the estimates associated with arms played with a small probability (see a detailed discussion in Lattimore and Szepesvári, 2019 [129]).
The weak regret of EXP3 was shown to be $O(\sqrt{TN\log N})$. An $\Omega(\sqrt{TN})$ lower bound on the weak regret follows directly from the lower bound on the minimax regret in the stochastic setting as given in Theorem 4.7. Specifically, if the regret averaged over reward sample paths generated under a certain set of arm distributions is $\Omega(\sqrt{TN})$, there must exist a sample path for which the (weak) regret has an order no less than $\sqrt{TN}$. The $\sqrt{\log N}$ gap between the regret order of EXP3 and the lower bound was closed by Audibert and Bubeck, 2009 [19] by considering a class of functions more general than the exponential function for weights.
Other notions of regret defined with respect to a benchmark policy with a given hardness
constraint on the number of arm switches were also considered in Auer et al., 2003 [24].
Algorithm 5.5 The EXP3 Algorithm
1: Initialization: set $v_i(0)=0$ for $i=1,\ldots,N$.
2: for $t=1$ to $T$ do
3:   Set $\eta(t)=\sqrt{\frac{\log N}{tN}}$.
4:   Calculate the distribution for arm selection:
     $$p_i(t) = \frac{\exp\big(\eta(t)\,v_i(t-1)\big)}{\sum_{j=1}^{N}\exp\big(\eta(t)\,v_j(t-1)\big)}.$$
5:   Pull arm $i_t$ chosen randomly according to the probability distribution $[p_1(t),\ldots,p_N(t)]$.
6:   Receive reward $x_{i_t}(t)$ from arm $i_t$.
7:   Calculate $v_i(t)$ for all $i$:
     $$v_i(t) = v_i(t-1) + 1 - \frac{1}{p_i(t)}\big(1-x_{i_t}(t)\big)\,\mathbb{I}[i=i_t].$$
8: end for
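For concreteness, a direct Python transcription of Algorithm 5.5 might look as follows; the `pull_arm` interface is again an assumed placeholder returning rewards in $[0,1]$.

```python
import numpy as np

def exp3(pull_arm, N, T, rng=np.random.default_rng()):
    """EXP3 with the loss-based reward estimator of Algorithm 5.5."""
    v = np.zeros(N)                               # cumulative reward estimates v_i(t)
    for t in range(1, T + 1):
        eta = np.sqrt(np.log(N) / (t * N))        # learning rate eta(t)
        logits = eta * v
        p = np.exp(logits - logits.max())         # numerically stable softmax
        p /= p.sum()
        arm = rng.choice(N, p=p)
        x = pull_arm(arm)                         # reward in [0, 1]
        # Loss-based importance-weighted update: all arms gain 1,
        # the pulled arm is corrected by its observed loss over p.
        v += 1.0
        v[arm] -= (1.0 - x) / p[arm]
    return v
```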
5.2 VARIATIONS IN THE ACTION SPACE
5.2.1 LARGE-SCALE BANDITS WITH STRUCTURED ACTION SPACE
The canonical bandit model assumes independence across arms. In this case, reward observations
from one arm do not provide any information on the quality of other arms. A linear growth rate
with N of the problem-specific regret is expected given that every arm needs to be sufficiently
explored. The main focus of the bandit theory has thus been on the regret growth rate with T ,
which measures the learning efficiency over time.
Many emerging applications lead to bandit problems with a massive and sometimes an
uncountable number of arms. Consider, for example, adaptive routing in a network with un-
known and stochastically varying edge weights (see Section 6.1.2). Each path from the source
node to the destination node constitutes an arm, and the number of possible paths can grow
exponentially with the network size. In the application of ads display on search engines, the
collection of all ads is the set of arms with reward measured by the number of clicks. In political
campaigns or targeted marketing of a new product, arms are individuals in a massive population.
When casting stochastic optimization of a random unknown function as a bandit problem, the
arm set is given by the domain of the objective function, consisting of uncountable alternatives
in a potentially high-dimensional Euclidean space. Developed under the assumption of inde-
pendent arms and relying on exploring every arm sufficiently often, learning policies for the
canonical bandit model no longer offer appealing or even feasible solutions, especially in the
regime of N > T .
The key in handling a large number of arms is to fully exploit certain known structures
and relations among arms that naturally arise in specific applications. For instance, in network
optimization problems such as adaptive routing, the large number of arms (i.e., the paths) are
dependent through a much smaller number of unknowns (i.e., the edge weights). In many social-
economic applications, arms have natural similarity and dissimilarity relations that bound the
difference in their expected rewards. For instance, in recommendation systems and information
retrieval, products, ads, and documents in the same category (more generally, close in a certain
feature space) have similar expected rewards. At the same time, it may also be known a priori that
some arms have considerably different mean rewards, e.g., news with drastically different opin-
ions, products with opposite usage, documents associated with key words belonging to opposite
categories in the taxonomy. Such known arm relations offer the possibility of a sublinear regret
growth rate with N , indicating that the best arm can be learned by trying out only a diminishing
fraction of arms as N approaches infinity.
A graphical random field representation of the action space: The structure of the action space of a bandit model can be represented by a graphical random field, which is a generalization of graphical models. A random field is a collection of random variables indexed by elements in a topological space. It generalizes the concept of a stochastic process, in which the random variables are indexed by real or discrete values referred to as time. A graphical random field $G=(\mathcal{V},\mathcal{E})$ consists of a set $\mathcal{V}$ of vertices representing random variables and a set $\mathcal{E}$ of potentially directed edges representing the presence of a certain relation between connected random variables. The vertex set $\mathcal{V}$ can be partitioned into two subsets: one consists of $N$ random variables $\{X_i\}_{i=1}^N$ representing the random reward of each of the $N$ arms,1 the other consists of $M$ latent random variables $\{Y_j\}_{j=1}^M$ that determine or influence the first set of random variables. The distributions of both sets of random variables are unknown. A realization of $X_i$ (i.e., a reward) is observed once this arm is engaged. The latent variables are in general (but not always) unobservable. The edges represent certain relations across the random variables associated with the vertices. The relations represented by edges can be categorized into two classes: realization-based relations and ensemble-based relations, as detailed below.
Realization-based arm relations: Realization-based relations capture how the value (i.e., a
realization) of a random variable is affected by the values of its neighbors in the graphical ran-
dom field. Most existing reward structures considered in the bandit literature are special cases of
this class. Representative models include the combinatorial bandits considered by Gai, Krish-
namachari, and Jain, 2011 [82], Liu and Zhao, 2012 [137], Chen, Wang, and Yuan, 2013 [60],
and Kveton et al., 2015 [124], linearly parameterized bandits by Dani, Hayes, and Kakade,
2008 [72], Rusmevichientong and Tsitsiklis, 2010 [173], Abbasi-Yadkori, Pal, and Szepesvari, 2011 [1], and spectral bandits for smooth graph functions by Valko et al., 2014 [199] and Hanawal and Saligrama, 2015 [97]. Specifically, the resulting graph is a bipartite graph with the latent variables $\{Y_j\}_{j=1}^M$ and the arm variables $\{X_i\}_{i=1}^N$ as the partite sets (see Figure 5.1).
1 This model also applies to bandits with a continuum set of arms due to the general definition of random fields.
The value of $X_i$ is a known deterministic function of its latent neighbors:
$$X_i = h_i\big(Y_j : (Y_j,X_i)\in\mathcal{E}\big). \tag{5.9}$$
In particular, for combinatorial bandits, the deterministic functions $h_i$ are often the sum2 of the input:
$$X_i = h_i\big(Y_j : (Y_j,X_i)\in\mathcal{E}\big) = \sum_{j:\,(Y_j,X_i)\in\mathcal{E}} Y_j. \tag{5.10}$$

Consider the specific application of adaptive routing which leads to a combinatorial bandit. The
latent variables fYj gjMD1 are the random weights of the M edges in the network, and the arm
variables fXi gN
i D1 are the path weights from the source to the destination. The path weight (i.e.,
the reward/cost for playing an arm) is the sum of the edge weights along this path. In the work
by Liu and Zhao, 2012 [137], the latent variables (i.e., the individual edge weights) are not
observable. In the work by Gai, Krishnamachari, and Jain, 2011 [82], Chen, Wang, and Yuan,
2013 [60], and Kveton et al., 2015 [124], it is assumed that all latent variables controlling the
chosen arm are also observed. The latter setting is referred to as the combinatorial semi-bandit
to differentiate it from the bandit setting where only the reward of the chosen arm, not the
constituent latent variables, is observed.
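A small sketch of the reward structure in (5.10) for the routing example, contrasting bandit and semi-bandit feedback; the edge-weight sampler is a hypothetical placeholder.

```python
import numpy as np

def play_path(path_edges, sample_edge_weights, semi_bandit=False):
    """Reward structure of (5.10) for adaptive routing.

    path_edges: indices of the edges (latent variables Y_j) forming the chosen path.
    sample_edge_weights(): assumed to return one realization of all M edge weights.
    Returns the path weight X_i; under semi-bandit feedback the constituent
    edge weights are revealed as well.
    """
    y = sample_edge_weights()              # realization of the latent variables
    x = float(np.sum(y[path_edges]))       # X_i = sum of Y_j along the path
    return (x, y[path_edges]) if semi_bandit else (x, None)
```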
Figure 5.1: A graph representation of a structured action space (a bipartite graph with latent variables $Y_1, Y_2, Y_3, \ldots, Y_M$ and arm variables $X_1 = h_1(Y_1,Y_3)$, $X_2 = h_2(Y_1,Y_2)$, $\ldots$, $X_N = h_N(Y_2,Y_3,Y_M)$).
For linear bandits considered by Dani, Hayes, and Kakade, 2008 [72], Rusmevichientong and Tsitsiklis, 2010 [173], and Abbasi-Yadkori, Pal, and Szepesvari, 2011 [1], the bipartite graph is complete, i.e., there is an edge between every pair $(Y_j, X_i)$. Each deterministic function $h_i$ is specified by a known $M$-dimensional real vector $w_i$, and the reward of this arm/action is given by a weighted sum of all latent variables3: $X_i = \sum_{j=1}^{M} w_i(j)\,Y_j$.
2 In the work by Chen, Wang, and Yuan, 2013 [60], nonlinear functions are allowed provided that the nonlinear function commutes with the expectation operation.
3 Some linear bandit models considered in the literature also allow a zero-mean random noise term added to each arm
variable. This, however, does not change the reward structure given that only the mean values of the arms matter under the
objective of expected total reward.
The spectral bandit considered by Valko et al., 2014 [199] and Hanawal and Saligrama, 2015 [97], is a special case of the linear bandits with $M=N$ and the weight vectors $w_i$ given by the $i$th eigenvector of a known fixed graph.
More general realization-based relations (for example, $h_i$ can be a random function or can specify a conditional distribution) can be modeled under the graphical random field representation for studying large-scale bandits with a structured action space and how various aspects of the structure affect the learning efficiency both temporally (in $T$) and spatially (in $N$).
Ensemble-based arm relations: Ensemble-based relations capture arm similarities in their
ensemble statistics—in particular, the mean—rather than probabilistic dependencies in their
realizations.
A first such example is the continuum-armed bandit problem first formulated by Agrawal,
1995b [5] under the minimax approach.
Definition 5.1 The Continuum-Armed Bandit Model:
Let $S\subset\mathbb{R}$ be a bounded set, representing a continuum set of arms. Playing arm $s\in S$ generates a random reward drawn from an unknown distribution $f_s$, i.i.d. over time. Let $\mu(s)$ denote the expected reward of arm $s$. It is assumed that $\mu(s): S\to\mathbb{R}$ is uniformly locally Lipschitz with constant $L$ ($0\le L<\infty$), exponent $\alpha$ ($0<\alpha\le 1$), and restriction $\delta$ ($\delta>0$), i.e., for all $s,s'\in S$ satisfying $|s-s'|\le\delta$, we have
$$|\mu(s)-\mu(s')| \le L|s-s'|^{\alpha}. \tag{5.11}$$
Let $\mathcal{L}(\alpha,L,\delta)$ denote the class of all such functions. The objective is an arm selection policy that minimizes the regret in the minimax sense over $\mathcal{L}(\alpha,L,\delta)$. Specifically, the worst-case regret of policy $\pi$ is defined as
$$R_\pi(T) = \sup_{\mu(s)\in\mathcal{L}(\alpha,L,\delta)}\;\sum_{t=1}^{T}\left(\mu^* - \mu(s_\pi(t))\right),$$
where $\mu^* = \sup_{s\in S}\mu(s)$ and $s_\pi(t)$ is the arm chosen by policy $\pi$ at time $t$.
The smoothness of the mean reward function $\mu(s)$ imposed by the uniformly locally Lipschitz condition is the key to handling the continuum set of arms, making it possible to approach the optimal arm by a sparse sampling of the continuum set. This bandit model is also known as the Lipschitz bandit.
An algorithm (UniformMesh) was developed by Kleinberg, 2004 [115], based on an open-loop approximation of the mean reward function $\mu(s)$. Intuitively, the smoothness of the mean reward makes it possible to approximate the continuum-armed bandit problem with a finite-armed bandit problem by a discretization of the arm set. It was shown by Kleinberg, 2004 [115] that this strategy based on an open-loop uniform discretization of the arm set offers a regret order of $O(T^{2/3}\log^{1/3}(T))$, matching the lower bound of $\Omega(T^{2/3})$ up to a sublogarithmic factor.
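A rough sketch of the open-loop strategy: the unit interval is cut into $K$ uniformly spaced arms, with $K$ scaling as $T^{1/(2\alpha+1)}$ (the usual balance between discretization error and finite-armed regret), and a standard UCB policy is run on the discretized set. The interface `pull_continuum_arm` is an assumed placeholder.

```python
import numpy as np

def uniform_mesh_ucb(pull_continuum_arm, T, alpha=1.0):
    """Open-loop discretization of a continuum arm set S = [0, 1] followed by UCB.

    pull_continuum_arm(s) is assumed to return a reward in [0, 1] for s in [0, 1].
    The mesh size K ~ T^(1/(2*alpha + 1)) is a rough scaling choice, not the
    tuned constant of the original analysis.
    """
    K = max(2, int(np.ceil(T ** (1.0 / (2 * alpha + 1)))))
    grid = np.linspace(0.0, 1.0, K)
    counts, sums = np.zeros(K), np.zeros(K)
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1                         # initialize: play each grid arm once
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            k = int(np.argmax(ucb))
        r = pull_continuum_arm(grid[k])
        counts[k] += 1
        sums[k] += r
```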
It might appear quite surprising that such a simple open-loop reduction of the continuum-
armed bandit to a finite-armed bandit suffices for near-optimal regret performance. The reason
behind this is the minimax criterion that focuses on the performance under the worst possible
mean-reward functions in L.˛; L; ı/. Such worst possible payoff functions are those that are
the least smooth (i.e., achieving the upper bound in the Lipschitz condition (5.11)) around the
maxima. Thus, a uniform discretization of the arm set with a predetermined level of refinement
(given by the Lipschitz exponent $\alpha$) suffices.
Intuition suggests that the hardness of a continuum-armed bandit is determined by the
smoothness of the payoff function around the maxima, not the global parameters that bound the
smoothness over the entire domain. It is thus desirable to have learning strategies that can adapt
automatically to the given payoff function and take advantage of its higher-order smoothness
around the maxima if present to achieve better problem-specific performance. The same intuition
also suggests that the global Lipschitz condition can be relaxed to local smoothness properties
around the maxima of the payoff function.
These objectives call for adaptive strategies that successively refine arm discretization based
on past observations. The resulting discretization of the arm set is nonuniform, with more ob-
servation points around the maxima and the refinement adapting automatically to the local
smoothness around the maxima. Representative studies along this line include Auer, Ortner, and
Szepesvari, 2007 [25], Bubeck, Munos, Stoltz, and Szepesvari, 2011 [48], Cope, 2009 [69], and
Kleinberg, Slivkins, and Upfal, 2015 [117], under various assumptions on the local smoothness of the payoff function around the maxima, leading to different regret orders.
The continuum-armed bandit over a compact set of the real line has been extended to high-dimensional metric spaces by Kleinberg, Slivkins, and Upfal, 2015 [117], and generic measurable spaces by Bubeck, Munos, Stoltz, and Szepesvari, 2011 [48]. Another direction is to
consider more restrictive payoff functions, including linear and convex functions for the ap-
plication of online stochastic optimization (see Dani, Hayes, and Kakade, 2008 [72], Agarwal
et al., 2013 [2], and references therein). The issue of unknown parameters in the smoothness
properties of the payoff function has been addressed by Bubeck, Stoltz, and Yu, 2011 [49],
Slivkins, 2011 [177], Minsker, 2013 [150], Valko, Carpentier, and Munos, 2013 [198], and
Bull, 2015 [53].
Other bandit models exploiting ensemble-based arm structures include the taxonomy
bandit (Slivkins, 2011 [177]), the unimodal bandit (Combes and Proutiere, 2014 [66]), and
the unit-interval-graph (UIG) bandit (Xu et al., 2019 [214]). In the taxonomy bandit model,
statistical similarities across arms are captured by a tree structure that encodes the following en-
semble relation: arms in the same subtree are close in their mean rewards. In unimodal bandits,
the ensemble-based arm relation is represented by a graph with the following property: from
every sub-optimal arm, there exists a path to the optimal arm along which the mean rewards
increase. The UIG bandit model exploits not only statistical similarities, but also statistical dis-
similarities across arms, with the similarity-dissimilarity relations represented by a unit interval
graph that may be only partially revealed to the player. A general formulation of structured ban-
dits was proposed in Combes, Magureanu, and Proutiere, 2017 [68], which includes a number
of bandit models (e.g., linear bandits, Lipschitz bandits, unimodal bandits, and UIG bandits)
as special cases. The learning policy developed there, however, was given only implicitly in the
form of a linear program that needs to be solved at every time step. For certain bandit models
(e.g., the UIG bandit), the linear program does not admit polynomial-time solutions (unless
P=NP). We see again here the tension between the generality of the problem model and the
efficiency of the resulting solutions.
5.2.2 CONSTRAINED ACTION SPACE
There are several studies on bandit models where not all arms are available for activation at all
times. The availability can be determined either by a given exogenous process or through self-control to satisfy a certain budget constraint.
An example of the first type is the so-called sleeping bandit studied by Kleinberg,
Miculescu-Mizil, and Sharma, 2008 [116]. Specifically, the set of available arms is time-varying,
resulting in time varying identities of the best arm depending on the set of available arms. This
variant can be quite easily dealt with. It is shown that a simple modification of the UCB policy
that chooses the arm with the greatest index among the currently available ones achieves the
optimal regret order. This problem was also studied under the non-stochastic setting with ad-
versarial reward processes by Kleinberg, Miculescu-Mizil, and Sharma, 2008 [116] and Kanade,
McMahan, and Bryan, 2009 [107].
The second type is the so-called Knapsack bandit or bandit with a budget (see Tran-
Thanh et al., 2010 [191]; Tran-Thanh et al., 2012 [192]; Badanidiyuru, Kleinberg, and Slivkins,
2013 [28]; Jiang and Srikant, 2003 [103]; Combes, Jiang, and Srikant, 2015 [67]). Specifically,
the horizon length T (which can be viewed as a budget constraint) is replaced with a more general
form of budget B for arm activation. Each activation of an arm, in addition to generating random
rewards, consumes a certain amount of budget that can be arm-dependent. The objective is to
minimize the cumulative regret over the course of play until the total budget is depleted. The focus
is similarly on order optimality as the budget B tends to infinity. The learning algorithms often
employ UCB-type indices with the quality of an arm measured by the reward-to-cost ratio.
5.3 VARIATIONS IN THE OBSERVATION MODEL
5.3.1 FULL-INFORMATION FEEDBACK: THE EXPERT SETTING
The bandit feedback reveals only the reward of the chosen arm. It is this absence of observations
from unselected arms that dictates the need for exploration. In certain applications, however,
the rewards of unselected arms, although not realized, can be observed. Consider a stylized
portfolio selection problem that decides periodically (e.g., every month) which financial advisor
to follow. Although only the gain from the selected portfolio is accrued, the performance of
the portfolios suggested by other advisors is also observed and can be used in choosing the next
action. The objective is to minimize the cumulative loss (i.e., regret) against the performance of
the best advisor in the given set. This sequential learning problem with full-information feedback
is often referred to as prediction with expert advice, or the expert setting in short.
Since all arms are observed at all times, exploration is no longer a consideration when
deciding which arm to pull. Consequently, under i.i.d. reward processes, the current action has
no effect on the future. The optimal action at each time is to exploit fully for maximum instanta-
neous gain. Since the best arm can be identified with probability 1 within finite time, a bounded
regret can be achieved, and a simple follow-the-leader strategy suffices.
While the stochastic version of the expert setting admits simple solutions, its nonstochas-
tic counterpart where the reward sequence of each expert/arm is deterministic and adversarially
designed presents sufficient complexity. It has been studied extensively within a game-theoretic
framework since the 1950s with the pioneering work of Blackwell, 1956 [40] and Hannan,
1957 [98]. A comprehensive coverage of this subject can be found in the book by Cesa-Bianchi
and Lugosi, 2006 [58].
5.3.2 GRAPH-STRUCTURED FEEDBACK: BANDITS WITH SIDE OBSERVATIONS
The bandit feedback and the full-information feedback are at the two ends of the spectrum in terms of the information available for learning after each chosen action. A general observation model that includes these two as special cases is as follows. At each time $t$, after pulling arm $i$, the rewards of arms in a set $\Psi(t,i)$ are observed. This set may depend on not only the action $i$, but also the time index $t$. The bandit feedback and the expert feedback correspond to, respectively, $\Psi(t,i)=\{i\}$ and $\Psi(t,i)=\{1,2,\ldots,N\}$ for all $t$ and $i$.
This observation model can be represented by a directed graph $G_t$ for each $t$. Specifically, each vertex of $G_t$ represents an arm. A directed edge from vertex $i$ to vertex $j$ exists if and only if pulling arm $i$ at time $t$ leads to an observation of arm $j$, i.e., if and only if $j\in\Psi(t,i)$. It is easy to see that under the bandit feedback, $G_t$ consists of only self-loops and is time invariant. The expert setting corresponds to a time-invariant graph with all $N^2$ directed edges.
Referred to as bandits with side observations, this online learning problem with graph-
structured feedback was first introduced by Mannor and Shamir, 2011 [144], within a non-
stochastic framework. The stochastic version of the problem was studied by Caron et al.,
2012 [55] and Buccapatnam, Eryilmaz, and Shroff, 2014 [52].
The central issue under study is how the graph-structured feedback affects the scaling of regret with respect to the number N of arms. It has been shown that under the nonstochastic
formulation, it is the independence number of the observation graph that determines the regret
scaling with N . Under the stochastic formulation, regret can be made to scale with the clique
partition number or the size of the minimum dominating set of the observation graph. The
issues of learnability when certain self-loops are missing and when the observation graph is
time varying and never fully revealed to the player are studied by Alon et al., 2015 [13] and
Cohen, Hazan, and Koren, 2016 [65].
5.3.3 CONSTRAINED AND CONTROLLED FEEDBACK: LABEL-EFFICIENT BANDITS
In certain applications, obtaining feedback can be costly. This can be captured by imposing a
constraint on the total number of observations that can be obtained over the entire horizon of
length T . Specifically, at each time t , after choosing an arm to pull, the player decides whether
to query the reward obtained from the current pull of the chosen arm, knowing that the total
number of such queries cannot exceed a given value. The resulting formulation is referred to
as label-efficient bandits, following the terminology in learning from labeled data. Such con-
strained and controlled feedback can also be incorporated into the expert setting, with each
query producing reward observations of all arms. A fully controlled feedback model would be to
allow the player to choose, at each time, which subset of arms to observe, while complying with
a constraint on the total number of labels summed over time and across all arms. In other words,
the player designs the sequence of feedback graphs $\{G_t\}_{t\ge1}$ that provides the most information
under the given constraint. An interesting question is whether this feedback model may result
in a decoupling of exploration and exploitation, with the feedback graphs designed purely for
exploration and the arm pulling actions chosen for pure exploitation. It appears that this model
has not been studied in the literature.
Existing studies on constrained and controlled feedback (both the bandit and the expert
settings) all adopt the nonstochastic/adversarial formulation (see, for example, Helmbold and
Panizza, 1997 [101]; Cesa-Bianchi, Lugosi, and Stoltz, 2005 [56]; Allenberg et al., 2006 [11];
Audibert and Bubeck, 2010 [20]). The deterministic nature of the reward sequences and the
focus on the worst-case performance (i.e., minimax regret) render sophisticated query strategies
unnecessary. A simple open-loop control of the queries suffices to achieve the optimal minimax
regret order: simply toss a coin to determine whether to query at each time with the bias of the
coin predetermined by the feedback constraint and the horizon length. Whether the stochas-
tic setting under the uniform dominance objective would demand a more sophisticated query
strategy appears to be an unexplored direction.
5.3.4 COMPARATIVE FEEDBACK: DUELING BANDITS
The traditionally adopted observation model and its variants discussed above all assume that an
absolute quantitative measure of arm payoffs exists and can be observed on demand. In certain
applications, however, this assumption may not hold. For instance, in determining the quality/relevance of an image or a ranked list of documents for a given query, it is difficult to obtain an unbiased response from users in terms of a quantitative value. However, the preference of the
user when comparing two options can be reliably obtained. This motivates the formulation of
dueling bandits, in which observations are binary preference outcomes for comparing a pair of
arms.
One way to formulate a dueling bandit problem is to preserve the reward model of the
canonical bandit problem and change only the observation model. Specifically, a quantitative
measure for the rewards offered by each arm exists, although it cannot be observed. At each
time $t$, the player chooses a pair of arms $i$ and $j$ ($i=j$ is allowed) and obtains a binary feedback on whether arm $i$ is better than arm $j$, which is determined by whether a random reward $X_i$ from arm $i$ drawn from an unknown distribution $F_i$ is greater than that of $X_j$ from arm $j$. The binary feedback is thus given by a Bernoulli random variable with parameter $\mathbb{P}_{F_i F_j}[X_i > X_j]$.
Depending on the applications, the actual reward accrued by the player from choosing a pair
of arms can be defined as the maximum, the minimum, or the average of the random rewards
offered by the chosen pair. Regret can then be defined in the same way as in the canonical bandit
model.
Another approach is to adopt directly a probabilistic model for pairwise comparisons of arms without binding it to or assuming the existence of a generative reward model characterized by $\{F_i\}_{i=1}^N$. The underlying probabilistic model for the comparative feedback is given by an $N\times N$ preference matrix $P$ whose $(i,j)$th entry $p_{i,j}$ indicates the probability that arm $i$ is preferred over arm $j$. It is natural to set $p_{i,i}=\frac{1}{2}$ and $p_{i,j}=1-p_{j,i}$. We say arm $i$ beats arm $j$ when $p_{i,j}>\frac{1}{2}$.
Under this formulation, it is not immediately obvious how to define regret. The first issue
is how to define the best arm when a linear ordering may not exist (e.g., consider sports teams
and the following scenario: A beats B, B beats C, and C beats A). The second issue is how to
define the performance loss of the chosen pair of arms with respect to the best arm when a
quantitative measure of arm payoffs does not exist.
Defining the winner based on (pairwise) preferences is a rich subject with a long history in social science, political science, and psychology (see, for example, Arrow, 1951 [17]). Two notable definitions are the Condorcet winner (one that beats all others, which may not always exist) and the Copeland winner (one that beats the largest number of candidates). Once the best arm $i^*$ is defined, the performance loss of a chosen pair of arms $(i,j)$ with respect to the best arm $i^*$ is often defined via various functions of $(p_{i^*,i}, p_{i^*,j})$ or the normalized Copeland scores (i.e., the fraction of candidates inferior to the one in question) of $i^*$, $i$, and $j$. The choice of the loss measure bears little significance for algorithm design and regret analysis, since all such measures have similar dependency (with bounded difference) on the number of times a suboptimal arm is chosen.
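To make these definitions concrete, the following sketch checks for a Condorcet winner and computes normalized Copeland scores from an assumed preference matrix.

```python
import numpy as np

def condorcet_and_copeland(P):
    """Winners of a preference matrix P, where P[i, j] = Pr(arm i beats arm j).

    Returns the Condorcet winner (None if it does not exist) and the normalized
    Copeland scores, i.e., the fraction of other arms each arm beats.
    """
    N = P.shape[0]
    beats = P > 0.5
    np.fill_diagonal(beats, False)
    copeland_scores = beats.sum(axis=1) / (N - 1)
    condorcet = None
    for i in range(N):
        if beats[i].sum() == N - 1:     # arm i beats every other arm
            condorcet = i
            break
    return condorcet, copeland_scores
```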
Existing studies on dueling bandits all take the second formulation that directly assumes
a preference matrix P. Various assumptions on P (e.g., strong stochastic transitivity, stochastic
triangle inequality) and different definitions of the best arm and loss measures have been con-
sidered. We refer the reader to a few representative studies: Yue et al., 2012 [216], Zoghi
et al., 2014 [220], Komiyama et al., 2015 [119], Zoghi et al., 2015 [221], and Wu and Liu,
2016 [213].
The first formulation that builds upon a generative reward model becomes a special case
of the second formulation when the preference probability $p_{i,j}$ is given by $p_{i,j} = \mathbb{P}_{F_i F_j}[X_i > X_j]$. An unexplored question is whether the generative reward model behind the comparative
feedback offers additional structure for improved learning efficiency.
5.4 VARIATIONS IN THE PERFORMANCE MEASURE
5.4.1 RISK-AVERSE BANDITS
The canonical bandit model aims at maximizing the expected return. In many applications such
as clinical trials and financial investment, the risk and uncertainty associated with the chosen
actions need to be balanced with the objective of high returns in expectation. This calls for risk-
averse learning.

Risk measures: The notions of risk and uncertainty have been widely studied, especially in
economics and mathematical finance. There is no consensus on risk measures, and likely there
is no one-size-fits-all. Different applications demand different measures, depending on which
type of uncertainty is deemed as risk.
A widely adopted risk measure is mean-variance, introduced by Nobel laureate economist Harry Markowitz in 1952 [147]. Specifically, the mean-variance $\text{MV}(X)$ of a random variable $X$ is given by
$$\text{MV}(X) = \text{Var}(X) - \rho\,\mu(X), \tag{5.12}$$
where $\text{Var}(X)$ and $\mu(X)$ are, respectively, the variance and the mean of $X$, and the coefficient $\rho > 0$ is the risk tolerance factor that balances the two objectives of high return and low risk. The definition of mean-variance can be interpreted as the Lagrangian relaxation of the constrained optimization problem of minimizing the risk (measured by the variance) for a given expected return, or maximizing the expected return for a given level of risk. Its quadratic scaling captures the natural inclination of human decision makers to favor less risky options when the stakes are high.
Mean-variance is defined with respect to a random variable. To adopt this risk measure in bandit models, the first question is how to define the mean-variance of a random sequence $\{X_{\pi(t)}(t)\}_{t=1}^T$ of rewards obtained under policy $\pi$. Three definitions exist in the literature, which we refer to as the Global Risk Constraint (GRC), Local Risk Constraint (LRC), and Empirical Risk Constraint (ERC).
Under GRC, the risk constraint is imposed on the total reward seen at the end of the time horizon. The mean-variance of the random reward sequence is thus defined as the mean-variance of the sum of the rewards:
$$\text{MV}_G\!\left(\{X_{\pi(t)}(t)\}_{t=1}^T\right) = \text{MV}\!\left(\sum_{t=1}^T X_{\pi(t)}(t)\right) \tag{5.13}$$
$$= \text{Var}\!\left(\sum_{t=1}^T X_{\pi(t)}(t)\right) - \rho\,\mathbb{E}\!\left[\sum_{t=1}^T X_{\pi(t)}(t)\right]. \tag{5.14}$$
This risk measure is more suitable in applications such as retirement investment where the de-
cision maker is more concerned with the final return but less sensitive to the fluctuations in the
intermediate returns.
Under LRC, the risk constraint is imposed on the random reward obtained at each time. The mean-variance of the random reward sequence is defined as the sum of the mean-variances over time:
$$\text{MV}_L\!\left(\{X_{\pi(t)}(t)\}_{t=1}^T\right) = \sum_{t=1}^T \text{MV}\!\left(X_{\pi(t)}(t)\right) \tag{5.15}$$
$$= \sum_{t=1}^T \text{Var}\!\left(X_{\pi(t)}(t)\right) - \rho\sum_{t=1}^T \mathbb{E}\!\left[X_{\pi(t)}(t)\right]. \tag{5.16}$$
This risk measure is more suitable in applications such as clinical trials where the risk of each chosen action needs to be constrained.
Under ERC, risk manifests in the inter-temporal fluctuation of the realized reward sample path and is measured by the empirical variance. For a given reward sample path $\{x(t)\}_{t=1}^T$, its empirical mean-variance is defined as
$$\text{MV}_E\!\left(\{x(t)\}_{t=1}^T\right) = \text{Var}\!\left(\{x(t)\}_{t=1}^T\right) - \rho\,\mu\!\left(\{x(t)\}_{t=1}^T\right) \tag{5.17}$$
$$= \sum_{t=1}^T \left(x(t) - \frac{1}{T}\sum_{t=1}^T x(t)\right)^{\!2} - \rho\sum_{t=1}^T x(t), \tag{5.18}$$
where the first term is the empirical variance and the second term the empirical mean, both without the normalization factor of $1/T$. Not normalizing by the horizon length allows us to preserve the same relation of the performance measure with the horizon length $T$ as in the canonical bandit models as well as in the risk-averse bandits under GRC and LRC. We can then compare the regret scaling behaviors in $T$ on an equal footing.
Averaging over all sample paths gives the empirical mean-variance of the random reward sequence:
$$\text{MV}_E\!\left(\{X_{\pi(t)}(t)\}_{t=1}^T\right) = \mathbb{E}_\pi\!\left[\text{MV}_E\!\left(\{x(t)\}_{t=1}^T\right)\right], \tag{5.19}$$
where $\mathbb{E}_\pi$, as usual, denotes the expectation over the stochastic process $\{X_{\pi(t)}(t)\}_{t=1}^T$ induced by the policy $\pi$. This risk measure, first introduced by Sani, Lazaric, and Munos, 2012 [175], directly targets the inter-temporal fluctuations in each realized return process. Such inter-temporal variations are commonly referred to as volatility in portfolio selection and investment (see French, Schwert, and Stambaugh, 1987 [80]) or risk for a financial security (Bradfield, 2007 [42]).
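To make the ERC definition concrete, here is a minimal Python sketch (not from the original text) computing the empirical mean-variance (5.18) of a realized reward path; the example paths and the value of $\rho$ are illustrative:

```python
import numpy as np

def empirical_mean_variance(x, rho):
    """Empirical mean-variance of a reward sample path x = (x(1), ..., x(T)) as in (5.18):
    sum_t (x(t) - mean(x))^2 - rho * sum_t x(t), i.e., without the 1/T normalization."""
    x = np.asarray(x, dtype=float)
    return np.sum((x - x.mean()) ** 2) - rho * np.sum(x)

# Example: a steadier path has a smaller (better) mean-variance than a more
# volatile path with the same total reward.
steady = [0.5] * 10
volatile = [1.0, 0.0] * 5
print(empirical_mean_variance(steady, rho=1.0), empirical_mean_variance(volatile, rho=1.0))
```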
Regret decomposition: The above three mean-variance measures are much more complex objective functions than the expected total reward in the risk-neutral canonical bandit model. The first challenge we face is that finding the optimal policy under a known model is no longer straightforward, or even tractable.
Consider first the ERC model. It can be shown that playing the arm $i^*$ that has the smallest mean-variance may not be optimal (see a counterexample in Vakili and Zhao, 2016 [196]). To see this, the key is to notice that the variance term (i.e., the first term on the right-hand side of (5.18)) in the empirical mean-variance measure is with respect to the sample mean calculated from rewards obtained from all arms. When the remaining time horizon is short and the current sample mean is sufficiently close to the mean value of a suboptimal arm $j \ne i^*$, it may be more rewarding (in terms of minimizing the mean-variance) to play arm $j$ rather than arm $i^*$. In general, the oracle policy $\pi^*$ has complex dependencies on the reward distributions $\mathcal{F}$ and the horizon length $T$, and a general analytical characterization appears to be intractable.
This difficulty can be circumvented again by finding a proxy for the oracle. In the case of ERC, the proxy for the oracle is the single-arm policy that always plays the arm $i^*$ with the minimum mean-variance. The performance loss of this optimal single-arm policy $\widehat{\pi}^*$ as compared to $\pi^*$ is bounded by an $O(N\log T)$ factor (Vakili and Zhao, 2016 [196]):
$$\text{MV}_E(\widehat{\pi}^*) - \text{MV}_E(\pi^*) \;\le\; \min\left\{\max_{i \ne i^*} 2\left(\frac{\Gamma_i^2}{\Delta_i} + 1\right),\; N\right\}\log T, \tag{5.20}$$
where $\Delta_i = \text{MV}_i - \text{MV}_{i^*}$ and $\Gamma_i = \mu_{i^*} - \mu_i$. The proxy regret has the following form (Vakili and Zhao, 2016 [196]):
$$\widehat{R}_\pi(T;\mathcal{F}) = \sum_{i=1}^N \left(\Delta_i + \rho\,\Gamma_i^2\right)\mathbb{E}[\tau_i(T)] - \frac{1}{T}\,\mathbb{E}\!\left[\left(\sum_{i=1}^N \left(\widehat{\mu}_i(T)-\mu_i\right)\tau_i(T)\right)^{\!2}\right] + \sigma_{i^*}^2, \tag{5.21}$$
where $\sigma_i^2$ is the variance of arm $i$ and $\widehat{\mu}_i(T) = \frac{1}{\tau_i(T)}\sum_{s=1}^{\tau_i(T)} X_i(t_s)$ is the empirical mean of the entire realized reward sequence from arm $i$ ($t_s$ denotes the time instant of the $s$th play of arm $i$).
Note that under the measure of mean-variance, regret can no longer be written as the sum of properly defined immediate performance losses at each time instant. More specifically, under ERC, the contribution to the overall regret from playing a suboptimal arm at a given time $t$ cannot be determined without knowing the entire sequence of decisions and observations. Furthermore, regret in mean-variance involves higher-order statistics of the random times $\{\tau_i(T)\}_{i=1}^N$ spent on each arm. Specifically, while the regret in terms of the total expected reward can be written as a weighted sum of the expected values of $\tau_i(T)$ (see (4.6)), regret in terms of mean-variance depends not only on the expected value of $\tau_i(T)$, but also on the second moment of $\tau_i(T)$ and the cross correlation between $\tau_i(T)$ and $\tau_j(T)$. These fundamental differences in the behavior of regret are what render the problem difficult and call for techniques different from those used in risk-neutral bandit problems.
Similarly, in GRC, the single-arm policy $\widehat{\pi}^*$ is not the optimal policy under known models but can serve as a proxy for the oracle $\pi^*$ with a similar performance gap (see Vakili and Zhao, 2015 [195]). In LRC, intuitively appealing but not immediately obvious, the oracle $\pi^*$ is actually the single-arm policy that plays the arm $i^*$ with the minimum mean-variance (Vakili, Boukouvalas, and Zhao, 2019 [193]). Regret expressions under GRC and LRC can be found in Vakili and Zhao, 2015 [195] and Vakili, Boukouvalas, and Zhao, 2019 [193], respectively, which show dependencies on the second-order moments of $\{\tau_i(T)\}_{i=1}^N$ in the case of GRC and on the cumulative variance in the action sequence in the case of LRC. The dependency of regret on the variance of the chosen action sequence under LRC is quite unique. It shows that regret depends not only on statistics of the total time spent on suboptimal arms; the uncertainty (i.e., the cumulative variance) in the action sequence itself is also penalized. This indicates that the LRC criterion in a certain sense captures the learner's interest in robust decisions and outcomes in a risk-sensitive environment.
Regret lower bounds and order-optimal policies: Establishing regret lower bounds under the mean-variance measures relies on carefully bounding each term in the regret decomposition. Within the uniform-dominance framework, by following a similar line of arguments as in Lai and Robbins, 1985 [126], on the risk-neutral bandit model, the same $\Omega(\log T)$ regret lower bound can be established under each of the three mean-variance measures. The impact of the risk constraint on regret is absorbed into the leading constants of the $\log T$ order under all three risk measures. In particular, the leading constants have a much stronger dependency on the distribution parameters $\{\Delta_i, \Gamma_i\}_{i=1}^N$ and increase at a much faster rate as $\min_{i=1,\ldots,N}\Delta_i$ diminishes (i.e., as the optimal arm $i^*$ in terms of mean-variance becomes harder to identify) while keeping $\{\Gamma_i\}_{i=1}^N$ bounded away from 0.
Such strong dependencies of the leading constants on the arm configuration translate into a significantly higher minimax regret order, which demands a leading constant that holds uniformly for all arm configurations and horizon lengths $T$. As a result, the lower bounds on the minimax regret have significantly higher orders than their risk-neutral counterparts. Specifically, the minimax regrets under ERC and LRC are lower bounded by $\Omega(T^{2/3})$ and $\Omega(T)$, respectively, as shown in Vakili and Zhao, 2016 [196] and Vakili, Boukouvalas, and Zhao, 2019 [193]. Of particular significance is that a sublinear regret order is no longer possible under LRC in the minimax setting. The lower bound on the minimax regret under GRC remains open.
We now consider risk-averse policies. We present modifications of UCB and DSEE, two representative policies from, respectively, the adaptive and open-loop control families. We focus on the case of ERC; similar variants of UCB and DSEE can be constructed for GRC and LRC.
Referred to as MV-UCB, this variant of UCB-$\alpha$ is similar to that for the risk-neutral bandit (see Algorithm 4.3). The main difference in the index form is that the sample mean is replaced by the sample mean-variance $\overline{\text{MV}}_i(t)$, and the policy parameter $\alpha$ needs to be set to different values depending on the risk tolerance parameter $\rho$. Specifically, the index of arm $i$ at time $t+1$ (i.e., after $t$ plays) is
$$I_i(t+1) = \overline{\text{MV}}_i(t) - \sqrt{\frac{\alpha \log t}{\tau_i(t)}}, \tag{5.22}$$
and the arm with the smallest index is chosen for activation. Note also the minus sign in front of the second term of the index. This is due to the minimization rather than maximization nature of the mean-variance criterion. Consequently, a lower confidence bound is needed.
MV-UCB was first introduced and analyzed by Sani, Lazaric, and Munos, 2012 [175], who showed a $\sqrt{T}$ upper bound on the problem-specific regret order of MV-UCB. This bound was later tightened to $\log T$ in Vakili and Zhao, 2016 [196], which, together with the lower bound, demonstrates the order optimality of MV-UCB in the uniform-dominance setting (under the condition that $\Gamma_i$ is positive for all $i$). The minimax regret of MV-UCB, however, scales linearly with $T$.
MV-DSEE is a variation of DSEE obtained by replacing, in the exploitation sequence, the arm with the largest sample mean by the arm with the smallest sample mean-variance. It achieves the optimal regret orders in both the uniform-dominance setting and the minimax setting by choosing a proper cardinality of the exploration sequence.
Risk neutral vs. risk averse: Table 5.1 compares the regret performance under risk-neutral and risk-averse performance measures. This comparison shows that $R_\pi(T;\mathcal{F})$ has a much stronger dependency on $\mathcal{F}$ under risk measures and that the achievable minimax regret has a significantly higher order. The comparison between adaptive policies such as MV-UCB and open-loop control of exploration and exploitation such as MV-DSEE shows that under risk measures, which penalize variations, open-loop policies with less randomness in actions may have an edge over adaptive policies. In particular, under ERC, MV-DSEE achieves the optimal minimax regret order, while MV-UCB incurs linear regret. Under GRC, MV-DSEE preserves its logarithmic problem-specific regret order, while MV-UCB has an order of $\log^2 T$. This can be seen from the dependence of the regret under ERC (see (5.21)) and GRC on the second moments of the random times $\{\tau_i(T)\}_{i=1}^N$. The deterministic nature of DSEE results in deterministic values of $\{\tau_i(T)\}_{i=1}^N$ with zero variance and thus a favorable regret order.
Table 5.1: Regret performance under risk-neutral and risk-averse measures
|              | Problem-Specific Regret |                                        | Minimax Regret |                                     |
|              | Lower Bound             | Policies                               | Lower Bound    | Policies                            |
| Risk Neutral | Ω(log T)                | MV-UCB: O(log T); MV-DSEE: O(log T)    | Ω(√T)          | MV-UCB: O(√T); MV-DSEE: O(√T)       |
| ERC          | Ω(log T)                | MV-UCB: O(log T); MV-DSEE: O(log T)    | Ω(T^{2/3})     | MV-UCB: O(T); MV-DSEE: O(T^{2/3})   |
| LRC          | Ω(log T)                | MV-UCB: O(log T); MV-DSEE: O(log T)    | Ω(T)           | O(T)                                |
| GRC          | Ω(log T)                | MV-UCB: O(log^2 T); MV-DSEE: O(log T)  | ?              | ?                                   |

Other risk measures: There are also a couple of results on MAB under the measure of value
at risk (see Galichet, Sebag, and Teytaud, 2013 [83]; Vakili and Zhao, 2015 [195]) and a risk
measure based on the logarithm of moment generating function (Maillard, 2013 [141]).
A corresponding expert setting (i.e., non-stochastic formulation with full-information
feedback) was studied by Even-Dar, Kearns, and Wortman, 2006 [77], where a negative result
was established, showing the infeasibility of sublinear regret under the mean-variance measure.
A related line of studies concerns with the deviation of regret (in terms of cumulative
rewards) from its expected value, and high probability bounds have been established by Audibert,
Munos, and Szepesvari, 2009 [21] and Salomon and Audibert, 2011 [174].
5.4.2 PURE-EXPLORATION BANDITS: ACTIVE INFERENCE
There is also an inference version of the bandit model, with the objective of identifying the top $K$ arms in terms of their mean rewards (without loss of generality, we assume $K = 1$). Under this objective, actions taken during the process are purely for exploration, and rewards generated during the process are merely observations. For this reason, this bandit model is often referred to as pure-exploration bandits.
The problem is one of active composite hypothesis testing. It is composite since observations under a given hypothesis (e.g., arm 1 is the best arm) may be drawn from a set of distributions, as compared to a single distribution under a simple hypothesis. Specifically, the probabilistic model of the observations under the hypothesis that arm 1 is the best can be any of the arm configurations $\mathcal{F}^N$ satisfying $\mu_1 = \max_{i=1,\ldots,N}\mu_i$. The problem is an active hypothesis testing problem due to the decision maker's control over where to draw the observations for the inference task by choosing judiciously which arm to pull at each time.
As in classical hypothesis testing, there are two ways to pose the problem: the fixed-sample-size setting and the sequential setting. In the context of pure-exploration bandits, they are often referred to as the fixed-budget and the fixed-confidence settings.
In the fixed-budget setting, the total number of observations/samples is fixed in advance at $T$. The objective is to optimize the accuracy of the inference at the end of the time horizon. The inference accuracy can be measured by the probability that the identified best arm $\hat{i}$ is indeed the one with the highest mean, or by the so-called simple regret given by the gap $\mu_{i^*} - \mu_{\hat{i}}$ in their mean values. An inference strategy (also referred to as a test) consists of an arm selection rule governing which arm to pull at each time $t = 1, \ldots, T$ and a decision rule governing which arm to declare as the best at the end of the detection process.
In the fixed-confidence setting, the goal is to reach a given level of detection accuracy (e.g., $\Pr[\hat{i} \ne i^*] \le \epsilon$) with a minimum number of samples. The decision horizon is thus random and controlled by the decision maker. An inference strategy consists of, in addition to an arm selection rule and a decision rule as in the fixed-budget setting, a stopping rule that determines when the required detection accuracy is met and no more samples are necessary. Note that the stopping rule is given by a stopping time, not a last-passage time. The latter refers to the time at which the required detection accuracy is met, but the decision maker may not be able to conclude that from past observations. In other words, a sequential inference strategy needs to be able to self-terminate.
The general problem of active hypothesis testing was pioneered by Chernoff, 1959 [62],
under the term “sequential design of experiments,” which was also used in reference to the
bandit problem as in Robbins, 1952 [168], Bradt, Johnson, and Karlin, 1956 [43], and Bellman, 1956 [30] in earlier days of the bandit literature. Indeed, to motivate the sequential design
problem for testing hypotheses, Chernoff considered the same clinical trial problem posed by
Thompson, 1933 [190]. We present in the example below this two-armed bandit problem and
give the basic idea of the test developed by Chernoff.
Example 5.2 A Two-Armed Pure-Exploration Bandit and the Chernoff Test:

There are two drugs with unknown success probabilities $\theta_1$ and $\theta_2$, respectively. The objective is to identify, with sufficient accuracy, the drug with the higher efficacy using a minimum number of experiments (i.e., the fixed-confidence setting).
Let the two hypotheses be denoted as $H_1: \theta_1 > \theta_2$ and $H_2: \theta_1 \le \theta_2$. Let
$$\Theta_{H_1} = \{\theta = (\theta_1, \theta_2) : \theta_1 > \theta_2\}, \qquad \Theta_{H_2} = \{\theta = (\theta_1, \theta_2) : \theta_1 \le \theta_2\} \tag{5.23}$$
be the parameter sets corresponding to the two hypotheses.
Suppose that we have conducted $t$ experiments, consisting of $t_1$ trials of drug 1 with $m_1$ successes and $t_2$ trials of drug 2 with $m_2$ successes. We now need to decide which drug to test next; in other words, given what we have observed, whether testing drug 1 or drug 2 would give us more information on which hypothesis is true.
The basic idea proposed by Chernoff for selecting the next experiment is as follows. Given the past $t$ observations, the maximum likelihood estimate of $\theta$ is given by
$$\hat{\theta} = \left(\frac{m_1}{t_1}, \frac{m_2}{t_2}\right). \tag{5.24}$$
Without loss of generality, suppose that $\frac{m_1}{t_1} > \frac{m_2}{t_2}$. In other words, the maximum likelihood estimate says $H_1$ is true. An appropriate experiment to take next is one that provides more information for testing the hypothesis $\theta = \hat{\theta}$ against the simple alternative $\theta = \tilde{\theta}$, where $\tilde{\theta} \in \Theta_{H_2}$ is the parameter point under the alternative hypothesis $H_2$ that is the most "difficult" to be differentiated from the maximum likelihood estimate $\hat{\theta}$. To make this criterion for selecting the next experiment precise, we need to specify a measure for the amount of information provided by the outcome of an experiment and a way of selecting the most "difficult" alternative $\tilde{\theta} \in \Theta_{H_2}$. For the former, a natural candidate is the KL divergence between the two distributions of observations induced by $\hat{\theta}$ and $\tilde{\theta}$, which determines the rate at which the hypothesis $\theta = \hat{\theta}$ (assumed to be the ground truth) can be differentiated from the simple alternative $\theta = \tilde{\theta}$. Specifically, let $D(\hat{\theta}\|\tilde{\theta}; E)$ denote the corresponding KL divergence under experiment $E$. For the two available experiments of testing drug 1 ($E = 1$) and testing drug 2 ($E = 2$), we have
$$D(\hat{\theta}\|\tilde{\theta}; E=1) = D(\hat{\theta}_1\|\tilde{\theta}_1) = \hat{\theta}_1 \log\frac{\hat{\theta}_1}{\tilde{\theta}_1} + (1-\hat{\theta}_1)\log\frac{1-\hat{\theta}_1}{1-\tilde{\theta}_1}, \tag{5.25}$$
$$D(\hat{\theta}\|\tilde{\theta}; E=2) = D(\hat{\theta}_2\|\tilde{\theta}_2) = \hat{\theta}_2 \log\frac{\hat{\theta}_2}{\tilde{\theta}_2} + (1-\hat{\theta}_2)\log\frac{1-\hat{\theta}_2}{1-\tilde{\theta}_2}, \tag{5.26}$$
which follow from the expression of the KL divergence between two Bernoulli distributions. For given $\hat{\theta}$ and $\tilde{\theta}$, the experiment that maximizes $D(\hat{\theta}\|\tilde{\theta}; E)$ should be selected.
For specifying the most "difficult" parameter point $\tilde{\theta} \in \Theta_{H_2}$, we cast the problem as a zero-sum game with the payoff determined by $D(\hat{\theta}\|\tilde{\theta}; E)$. The decision maker chooses the experiment $E$ to maximize $D(\hat{\theta}\|\tilde{\theta}; E)$, while nature chooses $\tilde{\theta} \in \Theta_{H_2}$ to minimize $D(\hat{\theta}\|\tilde{\theta}; E)$. The solution for the decision maker to this zero-sum game is a mixed strategy that chooses $E = 1$ with probability $p^*$ and $E = 2$ with probability $1 - p^*$, where $p^*$ is determined by the following maximin problem:
$$p^* = \arg\max_{p\in[0,1]}\; \min_{\tilde{\theta}\in\Theta_{H_2}}\; \Big(p\,D(\hat{\theta}\|\tilde{\theta}; E=1) + (1-p)\,D(\hat{\theta}\|\tilde{\theta}; E=2)\Big) \tag{5.27}$$
$$= \arg\max_{p\in[0,1]}\; \min_{\tilde{\theta}\in\Theta_{H_2}}\; \Big(p\,D(\hat{\theta}_1\|\tilde{\theta}_1) + (1-p)\,D(\hat{\theta}_2\|\tilde{\theta}_2)\Big). \tag{5.28}$$
The above specifies the selection rule for choosing experiments. The stopping rule is as follows. Suppose again that the current maximum likelihood estimate $\hat{\theta} = (\frac{m_1}{t_1}, \frac{m_2}{t_2})$ satisfies $\frac{m_1}{t_1} > \frac{m_2}{t_2}$. The maximum likelihood estimate $\check{\theta}$ of $\theta$ under the alternative hypothesis $H_2$ is given
by
$$\check{\theta} = \left(\frac{m_1 + m_2}{t_1 + t_2}, \frac{m_1 + m_2}{t_1 + t_2}\right). \tag{5.29}$$
Note that the maximum likelihood estimate of $\theta$ under $H_2$ is the parameter point in $\Theta_{H_2}$ that is most likely to generate the given observations of $m_1$ successes from $t_1$ trials of drug 1 and $m_2$ successes from $t_2$ trials of drug 2, which is the point with $\theta_1 = \theta_2$ for observations satisfying $\frac{m_1}{t_1} > \frac{m_2}{t_2}$. The stopping rule is then to terminate the test when the observations give sufficient evidence to differentiate the most likely point $\hat{\theta}$ under $H_1$ from the most likely point $\check{\theta}$ under $H_2$. Specifically, the termination condition is that the sum of the log-likelihood ratios of $\hat{\theta}$ to $\check{\theta}$ exceeds a threshold determined by the required inference accuracy:
$$m_1\log\frac{\hat{\theta}_1}{\check{\theta}_1} + (t_1 - m_1)\log\frac{1-\hat{\theta}_1}{1-\check{\theta}_1} + m_2\log\frac{\hat{\theta}_2}{\check{\theta}_2} + (t_2 - m_2)\log\frac{1-\hat{\theta}_2}{1-\check{\theta}_2} > -\log c, \tag{5.30}$$
where $c$ is the cost of each sample/experiment, which can be interpreted as the inverse of the Lagrange multiplier for the constraint on the detection error.
The decision rule at the time of stopping is to simply declare the hypothesis corresponding to the maximum likelihood estimate $\hat{\theta}$ (i.e., declare $H_1$ if $\frac{m_1}{t_1} > \frac{m_2}{t_2}$ and $H_2$ otherwise). An implementation of the above Chernoff test is given in Algorithm 5.6.
The above Chernoff test applies to general active composite hypothesis testing problems with general distributions and an arbitrary finite number of experiments, but involving only two hypotheses and a finite number of states of nature. Chernoff, 1959 [62] showed that this test is asymptotically optimal in terms of its Bayes risk as $c$—equivalently, the maximum allowed probability of detection error $\epsilon$—tends to zero.
Bessler, 1960 [38] extended Chernoff's result to cases with an arbitrary finite number of hypotheses and a potentially infinite number of experiments. Albert, 1961 [10] considered a more complex composite model for the hypotheses, allowing an infinite number of states of nature. Recent results on variations and extensions include Nitinawarat, Atia, and Veeravalli, 2013 [161] (where the problem was referred to as controlled sensing for hypothesis testing) and Naghshvar and Javidi, 2013 [151].
The pure-exploration bandit problem is a special case of active composite hypothesis testing. In particular, in a general active hypothesis testing problem, the number of hypotheses can be independent of the number of experiments, while in pure-exploration bandits these two are equal and are given by the number of arms. More importantly, the observation distributions can be arbitrary across hypotheses and available experiments in a general active hypothesis testing problem. In pure-exploration bandits, however, the observation distributions for the $N^2$ hypothesis-action pairs are coupled through a fixed set of $N$ reward distributions. Representative studies on pure-exploration bandits include Mannor and Tsitsiklis, 2004 [145], Even-Dar, Mannor, and Mansour, 2006 [78], Audibert, Bubeck, and Munos, 2010 [22], and Bubeck, Munos, and Stoltz, 2011 [47].
Algorithm 5.6 The Chernoff Test for a Pure-Exploration Two-Arm Bernoulli Bandit
Notation: $(m_i, t_i)$: the number of successes and the number of plays of arm $i$ up to the current time.
Input: $c$: the cost of each observation/experiment.

1: Initialization:
2: Pull each arm once in the first two plays.
3: Set TERMINATE = 0.
4: while TERMINATE = 0 do
5:   Obtain the maximum likelihood estimate $\hat{\theta} = (\frac{m_1}{t_1}, \frac{m_2}{t_2})$.
6:   Obtain the maximum likelihood estimate under the alternative hypothesis: $\check{\theta} = (\frac{m_1+m_2}{t_1+t_2}, \frac{m_1+m_2}{t_1+t_2})$.
7:   if the termination condition in (5.30) is met then
8:     TERMINATE = 1.
9:     Declare the hypothesis corresponding to $\hat{\theta}$.
10:  else
11:    Compute the probability $p^*$ as given in (5.28) (replacing $\Theta_{H_2}$ with $\Theta_{H_1}$ if $\frac{m_1}{t_1} \le \frac{m_2}{t_2}$).
12:    Play arm 1 with probability $p^*$ and arm 2 with probability $1 - p^*$.
13:    Update $(m_i, t_i)$ based on the chosen action and the resulting outcome.
14:  end if
15: end while
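Below is a minimal Python sketch of Algorithm 5.6 for the Bernoulli model above. The grid search over the mixing probability $p$, the sample-cost value in the example, and the numerical clipping constants are illustrative choices; nature's minimization in (5.28) is restricted here to the boundary $\tilde{\theta}_1 = \tilde{\theta}_2$, which is where the constrained minimum lies when $\hat{\theta}$ violates the alternative hypothesis.

```python
import math
import random

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence D(Ber(p) || Ber(q)), with clipping for numerical safety."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def chernoff_test(pull, c, grid=100):
    """Sequentially identify the better of two Bernoulli arms (sketch of Algorithm 5.6).
    pull(i) returns a 0/1 reward from arm i in {0, 1}; c is the per-sample cost.
    Returns (declared best arm, total number of samples)."""
    m, t = [0, 0], [0, 0]                       # successes and plays per arm
    for i in (0, 1):                            # initialization: pull each arm once
        m[i] += pull(i); t[i] += 1
    while True:
        th = [m[0] / t[0], m[1] / t[1]]         # MLE under the full model
        pooled = (m[0] + m[1]) / (t[0] + t[1])  # MLE under the alternative (theta_1 = theta_2)
        # Termination condition (5.30): generalized log-likelihood ratio exceeds -log c.
        glr = sum(m[i] * (math.log(max(th[i], 1e-12)) - math.log(max(pooled, 1e-12)))
                  + (t[i] - m[i]) * (math.log(max(1 - th[i], 1e-12))
                                     - math.log(max(1 - pooled, 1e-12)))
                  for i in (0, 1))
        if glr > -math.log(c):
            return (0 if th[0] > th[1] else 1), sum(t)
        # Selection rule (5.28): maximin over a grid of mixing probabilities p;
        # nature's least-favorable alternative is taken on the boundary theta_1 = theta_2 = q.
        qs = [j / grid for j in range(1, grid)]
        best_p, best_val = 0.5, -1.0
        for k in range(grid + 1):
            p = k / grid
            val = min(p * kl_bernoulli(th[0], q) + (1 - p) * kl_bernoulli(th[1], q) for q in qs)
            if val > best_val:
                best_p, best_val = p, val
        arm = 0 if random.random() < best_p else 1
        m[arm] += pull(arm); t[arm] += 1

# Example: arm 0 succeeds with probability 0.7, arm 1 with probability 0.4.
best, samples = chernoff_test(lambda i: int(random.random() < (0.7, 0.4)[i]), c=1e-3)
print(best, samples)
```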
A variation of this model is the thresholding bandit, where the objective is to identify arms with mean values above a given threshold (Locatelli, Gutzeit, and Carpentier, 2016 [138]). We see in Section 6.1.3 a variant of the thresholding bandit that allows arm aggregation in the application of heavy hitter detection in Internet traffic engineering and security.
5.5 LEARNING IN CONTEXT: BANDITS WITH SIDE INFORMATION
Consider again the example of clinical trials. Suppose that the patient population consists of two types. For the first type of patients, which makes up 80% of the population, the efficacy of the two drugs is given by $\mu_1^{(1)} = 0.7$ and $\mu_2^{(1)} = 0.4$. For the second type, $\mu_1^{(2)} = 0.2$ and $\mu_2^{(2)} = 0.7$. At each time $t$, we face a patient chosen uniformly at random from the population. If the type of the patient is not observable, the problem is the classic bandit model we have discussed, with the unknown mean rewards of the two arms given by $\mu_1 = 0.8\mu_1^{(1)} + 0.2\mu_1^{(2)} = 0.6$ and $\mu_2 = 0.8\mu_2^{(1)} + 0.2\mu_2^{(2)} = 0.46$. The optimal arm here is arm 1, and the online learning algorithms discussed in Chapter 4 would converge to the optimal average performance of arm 1 and cure 60% of patients by prescribing the first drug to all but a $\log T$ order of patients. Type 2 patients, however, are prescribed the inferior drug most of the time and experience a success rate of only 40%.
It is easy to see that if the type of each patient is known, better performance can be achieved. The problem can be simply decomposed into two independent bandit problems interleaved in time, one for each type of patient. For each type, the better drug (drug 1 for type 1 and drug 2 for type 2) would be prescribed most of the time, achieving the optimal average performance of 0.7 for each type.
This is the rationale behind the contextual bandit model, also referred to as bandits with side information or bandits with covariates. Motivating applications include personalized medicine, personalized advertisement, and product/content recommendations, where certain user features (referred to as the context) may be known.
Definition 5.3 The Contextual Bandit Model:

At each time $t$, a context $Y$ is drawn, i.i.d. over time, from an unknown distribution $F_Y(y)$ and revealed to the player. In context $Y = y$, rewards from arm $i$ ($i = 1, \ldots, N$) obey an unknown distribution $F_i(x|y)$. Rewards are independent over time and across arms. Let $\mu_i(y)$ denote the expected reward of arm $i$ in context $y$. Define
$$\mu^*(y) = \max_{i=1,\ldots,N}\mu_i(y), \qquad i^*(y) = \arg\max_{i=1,\ldots,N}\mu_i(y) \tag{5.31}$$
as, respectively, the maximum expected reward and the optimal arm in context $y$. The regret of policy $\pi$ over $T$ plays is given by
$$R_\pi(T) = \mathbb{E}\left[\sum_{t=1}^T \big(\mu^*(Y) - \mu_{\pi(t)}(Y)\big)\right], \tag{5.32}$$
where $\pi(t)$ denotes the arm chosen by $\pi$ at time $t$.
While the gain in performance from such contextual information can be significant, the problem does not offer much intellectually if it simply decomposes into independent bandit problems based on the context. Interesting formulations arise when certain coupling across these bandit problems in different contexts is present and can be leveraged in learning.
Three coupling models have been considered in the literature. The first model is parametric, assuming that the reward distributions of arm $i$ in different contexts are coupled through a common unknown parameter $\theta_i$. Specifically, pulling arm $i$ in context $y \in \mathcal{Y}$ generates random rewards according to a known distribution $F_i(x; \theta_i \mid y)$ with an unknown parameter $\theta_i$ (see Wang, Kulkarni, and Poor, 2005 [205]). For instance, for a binary context space $\mathcal{Y} = \{-1, 1\}$, the reward distribution of arm $i$ under each context $y \in \{-1, 1\}$ is Gaussian with mean $y\theta_i$ and variance 1. Another example is a linear parametric coupling in the mean rewards of arms in different contexts (see Li et al., 2010 [131] and Chu et al., 2011 [64]). Specifically, the context is assumed to be a $d$-dimensional feature vector $\mathbf{y}$, and the expected reward of arm $i$ in context $\mathbf{y}$ is given by $\mu_i(\mathbf{y}) = \mathbf{y}^{\mathsf{T}}\theta_i$ for an unknown parameter vector $\theta_i$ of arm $i$. Under this model, observations obtained in all contexts can be used collectively to infer the underlying common parameters, which then guide the arm selection tailored specifically to each context.
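For the linear parametric model, the following is a simplified LinUCB-style sketch in the spirit of Li et al., 2010 [131] (not a verbatim reproduction of that algorithm): each arm maintains a ridge-regression estimate of $\theta_i$, and the arm with the largest optimistic estimate $\mathbf{y}^{\mathsf{T}}\hat{\theta}_i$ plus an exploration bonus is selected. The bonus coefficient beta and the regularization lam are illustrative choices.

```python
import numpy as np

class LinUCBArm:
    """Per-arm ridge-regression statistics for the linear contextual model mu_i(y) = y^T theta_i."""
    def __init__(self, d, lam=1.0):
        self.A = lam * np.eye(d)   # regularized Gram matrix
        self.b = np.zeros(d)       # sum of reward-weighted contexts

    def ucb(self, y, beta=1.0):
        A_inv = np.linalg.inv(self.A)
        theta_hat = A_inv @ self.b
        bonus = beta * np.sqrt(y @ A_inv @ y)   # exploration bonus
        return y @ theta_hat + bonus

    def update(self, y, reward):
        self.A += np.outer(y, y)
        self.b += reward * y

def select_arm(arms, y, beta=1.0):
    """Pull the arm with the largest upper confidence bound in context y."""
    return max(range(len(arms)), key=lambda i: arms[i].ucb(y, beta))

# Example: 3 arms, 4-dimensional context.
arms = [LinUCBArm(d=4) for _ in range(3)]
y = np.array([1.0, 0.0, 0.5, -0.2])
i = select_arm(arms, y)
arms[i].update(y, reward=1.0)
```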
The second model stems from viewing the problem as an online reward-sensitive multi-class classification, where the context $y$ at each time is the instance and the arm-pulling action equates to assigning a label $i \in \{1, \ldots, N\}$ to $y$. The most rewarding label of $y$ is $i^*(y)$. A commonly adopted formulation in such supervised learning problems is to specify a hypothesis set consisting of all classifiers under consideration, and the objective is to learn the best classifier in this set. In the context of contextual bandits, the hypothesis set specifies mappings from contexts to arms. Regret is then defined with respect to the best mapping in this set (Langford and Zhang, 2007 [128] and Seldin et al., 2011 [176]). If the hypothesis set includes the optimal mapping $y \to i^*(y)$, the corresponding regret is the same as defined in (5.32). The inherent structure of the hypothesis set couples the different bandit problems across contexts. Consider a rather extreme example: if all mappings in the hypothesis set map contexts $y_1$ and $y_2$ to an identical arm, then the bandit problems corresponding to these two contexts can be aggregated into a single learning problem. On the other hand, if the hypothesis set contains all possible mappings from $\mathcal{Y}$ to $\{1, \ldots, N\}$, then no structure exists, and the contextual bandit problem decomposes into independent bandit problems across contexts.
In the third model, the context is assumed to take values in a continuum (or in a general metric space). The mean reward $\mu_i(y)$ of arm $i$ is assumed to be a Lipschitz continuous function of the context $y$. In other words, each arm yields similar rewards (in expectation) in similar contexts. This naturally leads to an approach that discretizes the context space $\mathcal{Y}$ and aggregates contexts in the same discrete bin into a single bandit problem while decoupling contexts across different bins as independent bandit problems (see Rigollet and Zeevi, 2010 [167]). The zooming algorithm developed by Kleinberg, Slivkins, and Upfal, 2015 [117] for Lipschitz bandits has been extended to contextual bandit problems where the Cartesian product of the context space and the arm space belongs to a general metric space and the mean reward $\mu(a, y)$ as a function of the arm $a$ and the context $y$ is Lipschitz (Slivkins, 2014 [178]). As discussed in Section 5.2.1, the problem becomes more interesting if, instead of an open-loop discretization tailored to the worst-case smoothness condition, the objective is set to a discretization strategy that automatically adapts to the local smoothness of the unknown mean reward functions $\{\mu_i(y)\}_{i=1}^N$.
5.6 LEARNING UNDER COMPETITION: BANDITS WITH
MULTIPLE PLAYERS
The canonical bandit model and the variants discussed above all assume a single player. Many
applications give rise to a bandit problem with multiple players. What couples these players
together is the contention among players who pull the same arm. When multiple players choose
the same arm, the reward offered by the arm is distributed arbitrarily among the players, not
necessarily with conservation. Such an event is referred to as a collision.
Depending on whether each player has access to other players’ past actions and observa-
tions, the problem has been addressed under two settings: the centralized and the distributed
settings.
5.6.1 CENTRALIZED LEARNING
Consider a bandit model with N arms and K players with K < N . When multiple players pull
the same arm, no reward is accrued from this arm by any player. Extensions to other collision
models where one player captures the reward or players share the reward in a certain way are often
straightforward. We assume that the expected reward is nonnegative for all arms. Collisions are
thus undesired events. Regret is defined in terms of the system-level reward—the total reward
summed over all players.
In the centralized setting, each player has access to all other players’ past actions and obser-
vations, and can thus coordinate their actions and act collectively as a single player. The problem
can thus be treated as a bandit model with a single player who pulls K arms simultaneously at
each time. Referred to as bandits with multiple plays, this model was studied by Anantharam,
Varaiya, and Walrand, 1987a [15] under the uniform-dominance approach, where the technical
treatment follows that of Lai and Robbins, 1985 [126].
A more complex model is when each arm offers different rewards when pulled by different
players. This scenario arises in applications such as multichannel dynamic access in wireless
networks where users experience different channel qualities due to, for example, differences in
their geographical locations. Another application example is multi-commodity bidding where a
commodity may have different values to different bidders.
Let $\mu_{ij}$ denote the expected reward offered by arm $i$ when pulled by player $j$. Under the known model, the oracle policy is given by a maximum bipartite matching. Specifically, consider a bipartite graph with two sets of vertices representing, respectively, the set of $N$ arms and the set of $K$ players. An edge exists from arm $i$ to player $j$ with weight $\mu_{ij}$. A maximum matching on this bipartite graph gives the optimal matching of the $K$ players to the $N$ arms and leads to the maximum sum reward that defines the benchmark. When the model is unknown, the problem is a special case of the combinatorial semi-bandit model discussed in Section 5.2.1. Specifically, using the graphical random field model as illustrated in Figure 5.1, each matching $\sigma$ of the $K$ players to the $N$ arms constitutes a super-arm $X_\sigma$. The latent variables are the random rewards $Y_{i,j}$ from arm $i$ pulled by player $j$. Let $\sigma(j)$ denote the arm assigned to player $j$ under matching $\sigma$.
116 5. VARIANTS OF THE FREQUENTIST BANDIT MODEL
Then
K
X
X D Y.j /;j ; (5.33)
j D1

where each Y.j /;j is observed when matching  is engaged (i.e., the semi-bandit feedback).
A UCB-based learning policy was developed by Gai, Krishnamachari, and Jain, 2011 [82], in which a bipartite matching is carried out at each time. The unknown edge weights $\{\mu_{i,j}\}_{i=1,\ldots,N;\,j=1,\ldots,K}$ of the bipartite graph of arms and players are replaced by the UCB-$\alpha$ indices calculated from past observations. A matching algorithm (e.g., the Hungarian algorithm [122]) is then carried out on this bipartite graph to find the best matching, which determines the action of each player at the next time instant. This algorithm was shown to achieve the optimal logarithmic problem-specific regret order.
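A minimal sketch (under the matching model described above) of the per-slot decision step: the unknown weights $\mu_{ij}$ are replaced by UCB-$\alpha$ indices, and a maximum-weight bipartite matching assigns players to arms. SciPy's Hungarian-algorithm routine is used here purely for illustration, and the value of alpha is arbitrary.

```python
import math
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_step(sample_means, counts, t, alpha=2.0):
    """One decision step of a centralized multi-player UCB-with-matching policy.
    sample_means[i, j], counts[i, j]: statistics of arm i when pulled by player j.
    Returns, for each player j, the arm it should pull next (N arms >= K players)."""
    ucb = sample_means + np.sqrt(alpha * math.log(t) / np.maximum(counts, 1))
    # Maximum-weight matching of players to arms (the Hungarian algorithm minimizes,
    # so negate the index matrix; rows = arms, columns = players).
    arm_idx, player_idx = linear_sum_assignment(-ucb)
    assignment = {int(j): int(i) for i, j in zip(arm_idx, player_idx)}
    return [assignment[j] for j in range(ucb.shape[1])]
```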
5.6.2 DISTRIBUTED LEARNING
In the distributed setting, each player, not able to observe other players' actions or rewards, decides which arm to pull based on its own local observations. A decentralized learning policy $\pi$ is given by the concatenation of the local policies of all players: $\pi = (\pi_1, \ldots, \pi_K)$. The performance benchmark is set to be the same as in the centralized setting: the oracle knows the reward model and carries out a centralized optimal matching of players to arms that eliminates collisions among players.
Under a distributed learning algorithm, however, collisions are inevitable. Collisions reduce not only the system-level reward, but also the available observations for learning. If collisions are unobservable, then observations are contaminated: players do not know whether the observed reward truly reflects the quality of the chosen arm. Nevertheless, when the reward mean of each arm is identical across players, the optimal logarithmic regret order can be achieved as in the centralized setting. Several distributed learning policies with the optimal regret order have been developed in various settings regarding fairness among players and the observability of collision events (see, for example, Liu and Zhao, 2010 [136]; Anandkumar et al., 2011 [14]; Vakili et al., 2013 [194]; Gai and Krishnamachari, 2014 [81]; Rosenski, Shamir, and Szlak, 2016 [169]; and references therein).
In the case where each arm offers different rewards when pulled by different players, the problem is more complex. Kalathil, Nayyar, and Jain, 2014 [105], developed a learning policy that integrates a distributed bipartite matching algorithm (e.g., the auction algorithm by Bertsekas, 1992 [34]) with the UCB-$\alpha$ policy and showed that it achieves an $O(\log^2 T)$ problem-specific regret order. Recently, Bistritz and Leshem, 2018 [39] closed the gap and developed a policy with a near-$O(\log T)$ regret order. This policy leverages recent results on a payoff-based decentralized learning rule for generic repeated games by Pradelski and Young, 2012 [163] and Marden, Young, and Pao, 2014 [146]. The problem has also recently been studied under the non-stochastic setting by Bubeck et al., 2019 [51].
CHAPTER 6

Application Examples
We consider in this chapter a couple of representative application examples in communication networks and social-economic systems. We hope to illustrate not only the general applicability of the various bandit models, but also the connections and differences between the Bayesian and the frequentist approaches. Our focus will be on the bandit formulations of these application examples. Key ideas are highlighted, with details of the specific solutions left to the provided references.
6.1 COMMUNICATION AND COMPUTER NETWORKS
6.1.1 DYNAMIC MULTICHANNEL ACCESS
Consider the problem of probing $N$ independent Markov chains. Each chain has two states—good (1) and bad (0)—with transition probabilities $\{p_{01}, p_{11}\}$. At each time, a player chooses $K$ ($1 \le K < N$) chains to probe and receives a reward for each probed chain that is in the good state. The objective is to design an optimal policy that governs the selection of $K$ chains at each time to maximize the long-run reward.
The above general problem is archetypal of dynamic multichannel access in communi-
cation systems, including cognitive radio networks, downlink scheduling in cellular systems,
opportunistic transmission over fading channels, and resource-constrained jamming and anti-
jamming. In the communications context, the N independent Markov chains correspond to N
communication channels under the Gilbert-Elliot channel model (Gilbert, 1960 [86]), which
has been commonly used to abstract physical channels with memory. The state of a channel
models the communication quality and determines the reward of accessing this channel. For
example, in cognitive radio networks where secondary users search in the spectrum for idle
channels temporarily unused by primary users (Zhao and Sadler, 2007 [219]), the state of a
channel models the occupancy of the channel by primary users. For downlink scheduling in cel-
lular systems, the user is a base station, and each channel is associated with a downlink mobile
receiver. Downlink receiver scheduling is thus equivalent to channel selection. The application
of this problem also goes beyond communication systems. For example, it has applications in
target tracking as considered by Le Ny, Dahleh, and Feron, 2008 [130], where $K$ unmanned aerial vehicles track the states of $N$ ($N > K$) targets in each slot.
Known Markovian dynamics—a restless bandit problem: Assume first that the Markov transition probabilities $\{p_{01}, p_{11}\}$ are known. We show that in this case, the problem is a restless multi-armed bandit problem as discussed in Section 3.3. Without loss of generality, we focus on the case of $K = 1$.
Let $[S_1(t), \ldots, S_N(t)] \in \{0,1\}^N$ denote the channel states in slot $t$. They are not directly observable before the sensing action is made. The user can, however, infer the channel states from its decision and observation history. A sufficient statistic for optimal decision making is given by the conditional probability that each channel is in state 1 given all past decisions and observations (see, e.g., Smallwood and Sondik, 1971 [181]). Referred to as the belief vector or information state, this sufficient statistic is denoted by $\omega(t) = [\omega_1(t), \ldots, \omega_N(t)]$, where $\omega_i(t)$ is the conditional probability that $S_i(t) = 1$. Given the sensing action $a$ and the observation in slot $t$, the belief state in slot $t+1$ can be obtained recursively as follows:
$$\omega_i(t+1) = \begin{cases} p_{11}, & \text{if } i = a \text{ and } S_i(t) = 1, \\ p_{01}, & \text{if } i = a \text{ and } S_i(t) = 0, \\ \Phi(\omega_i(t)), & \text{if } i \ne a, \end{cases} \tag{6.1}$$
where
$$\Phi(\omega_i(t)) \triangleq \omega_i(t)\,p_{11} + (1 - \omega_i(t))\,p_{01}$$
denotes the operator for the one-step belief update of unobserved channels. The initial belief vector $\omega(1)$ can be set to the stationary distribution of the underlying Markov chain if no information on the initial system state is available.
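A small Python sketch of the belief update (6.1), assuming the two-state channel model above (the example values are arbitrary):

```python
def update_beliefs(omega, a, observed_state, p01, p11):
    """One-step belief update (6.1) for N two-state (Gilbert-Elliot) channels.
    omega: list of current beliefs omega_i(t); a: index of the sensed channel;
    observed_state: 0 or 1, the state observed on channel a."""
    new_omega = []
    for i, w in enumerate(omega):
        if i == a:
            new_omega.append(p11 if observed_state == 1 else p01)
        else:
            # unobserved channels evolve through the one-step operator Phi
            new_omega.append(w * p11 + (1 - w) * p01)
    return new_omega

# Example: 3 channels, channel 1 sensed and found in the bad state (0).
print(update_beliefs([0.6, 0.4, 0.5], a=1, observed_state=0, p01=0.2, p11=0.8))
```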
It is now easy to see that the problem is a restless bandit problem, where each channel constitutes an arm and the state of arm $i$ in slot $t$ is the belief state $\omega_i(t)$. The user chooses an arm $a$ to activate (sense), while the other arms are made passive (unobserved). The active arm offers an expected reward of $R_a(\omega_a(t)) = \omega_a(t)$, depending on the state $\omega_a(t)$ of the chosen arm. The states of active and passive arms evolve as given in (6.1). The performance measure can be set to either the total discounted reward or the average reward. The former, besides capturing delay-sensitive scenarios, also applies when the horizon length is a geometrically distributed random variable. For example, a communication session may end at a random time, and the user aims to maximize the number of packets delivered before the session ends. The latter is the common measure of throughput in the context of communications.
For this special case of restless bandits, the underlying two-state Markov chain that gov-
erns the state transition of each arm brings rich structures into the problem, leading to positive
outcomes on the indexability and optimality of the Whittle index policy. Specifically, the index-
ability and the closed-form expression of the Whittle index under the total-discounted-reward
criterion were established independently and using different techniques by Le Ny, Dahleh, and
Feron, 2008 [130] and Liu and Zhao, 2008 [134]. The semi-universal structure and the op-
timality of the Whittle index policy and the analysis under the average-reward criterion were
given by Liu and Zhao, 2010 [135].
In particular, for this restless bandit problem, the Whittle index policy is optimal for all $N$ and $K$ in the case of positively correlated channels. For negatively correlated channels, the Whittle index policy is optimal for $K = 2$ and $K = N - 1$. For a general $K$, bounds on the
approximation ratio of the Whittle index policy were given by Liu and Zhao, 2010 [135]. The
optimality of the Whittle index policy follows directly from its equivalence to the myopic policy
and the optimality of the myopic policy established by Ahmad et al., 2009 [9].
The semi-universal structure of the Whittle index policy renders explicit calculation of
the Whittle index unnecessary when implementing the policy (note that it is the ranking, not the specific values, of the arm indices that determines the actions). Actions can be determined by maintaining a simple queue with no computation. Specifically, all $N$ channels are ordered in a queue, with the initial ordering $\mathcal{K}(1)$ given by a decreasing order of their initial belief values. In each slot, the $K$ channels at the head of the queue are sensed. Based on the sensing
outcomes, channels are reordered at the end of each slot according to the following simple rules
(see Figure 6.1).
Figure 6.1: The semi-universal structure of the Whittle index policy (the left sub-figure is for $p_{11} \ge p_{01}$, the right for $p_{11} < p_{01}$; G indicates state 1 (good) and B state 0 (bad)) [135]. Used with permission.
In the case of $p_{11} \ge p_{01}$, the channels observed in state 1 stay at the head of the queue while the channels observed in state 0 are moved to the end of the queue.
In the case of $p_{11} < p_{01}$, the channels observed in state 0 stay at the head of the queue while the channels observed in state 1 are moved to the end of the queue. The order of the unobserved channels is reversed.
The above structure of the Whittle index policy was established based on the monotonicity of the index in the belief value $\omega$, which implies an equivalence between the Whittle index policy and the myopic policy that activates the arm with the greatest belief value (hence the greatest immediate reward). The semi-universal structure thus follows from that of the myopic policy established by Zhao, Krishnamachari, and Liu, 2008 [218]. This structure is intuitively appealing. In the case of $p_{11} > p_{01}$, the channel states in two consecutive slots are positively correlated; a good state is more likely to be followed by a good state than by a transition to the bad state. The action is thus to follow the winner. In the case of $p_{11} < p_{01}$, the states in two consecutive slots are negatively correlated. The policy thus stays with the channels currently in the bad state and flips the order of the unobserved channels.
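A sketch of the queue-based implementation described above (a simplified rendering of the structure in [135]); the function name and interface are hypothetical:

```python
def reorder_queue(queue, observations, positively_correlated):
    """One slot of the semi-universal structure: reorder the channel queue based on
    the observed states of the K head-of-queue channels.
    queue: current channel ordering (head first); observations: dict mapping each
    sensed channel (the first K entries of queue) to its observed state (1 good, 0 bad)."""
    K = len(observations)
    sensed, unobserved = queue[:K], queue[K:]
    good = [ch for ch in sensed if observations[ch] == 1]
    bad = [ch for ch in sensed if observations[ch] == 0]
    if positively_correlated:   # p11 >= p01: channels seen in the good state stay at the head
        return good + unobserved + bad
    else:                       # p11 < p01: channels seen in the bad state stay at the head,
        return bad + unobserved[::-1] + good   # and the unobserved channels are reversed

# Example: 5 channels, K = 2 sensed; channel 1 observed good, channel 2 observed bad.
print(reorder_queue([1, 2, 3, 4, 5], {1: 1, 2: 0}, positively_correlated=True))
```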
In addition to computation and memory efficiency, a more significant implication of this semi-universal structure is that it obviates the need for knowing the channel transition probabilities except for the order of $p_{11}$ and $p_{01}$. As a result, the Whittle index policy is robust against model mismatch and automatically tracks variations in the channel model, provided that the order of $p_{11}$ and $p_{01}$ remains unchanged. In the case when there is no knowledge of $p_{11}$ and $p_{01}$, the resulting frequentist bandit model with restless Markov reward processes has a much smaller learning space to explore as compared to the general case discussed in Section 5.1.2. We explore this next.
Unknown Markovian dynamics—frequentist bandit with restless Markov reward processes: When the channel transition probabilities are unknown, the problem can be formulated within the frequentist framework with restless Markovian reward processes as discussed in Section 5.1.2.
The semi-universal structure and the strong performance of the Whittle index policy make it a perfect proxy for the oracle. Specifically, the Whittle index policy follows one of two possible strategies depending on the order of $p_{11}$ and $p_{01}$. These two strategies can thus be considered as two meta-arms in a bandit model where the objective is to learn which meta-arm is more rewarding for the underlying unknown Markov dynamics.
A key question in this meta-learning problem is how long to operate each meta-arm at each step. Since each meta-arm is an arm selection strategy in the original $N$-arm bandit model, it needs to be employed for a sufficiently long period in order to gather sufficient observations on the mean reward it offers. On the other hand, staying with one meta-arm—which can be the suboptimal one—for too long may lead to poor regret performance. A solution given in Dai et al., 2011 [71] is to increase the duration of each activation of the meta-arms at an arbitrarily slow rate. The total reward obtained during each activation (of multiple consecutive time instants) is used to update a UCB-type index for each meta-arm, which guides the selection of the next meta-arm. It was shown that a regret order arbitrarily close to the optimal logarithmic order can be achieved, with the optimality gap determined by the rate of the diverging sequence of meta-arm activation durations, which can be chosen based on the performance requirement of the application at hand.
6.1.2 ADAPTIVE ROUTING UNDER UNKNOWN LINK STATES
Consider a communication network represented by a directed graph consisting of all simple
paths between a given source-destination pair. Let N and M denote, respectively, the number
of paths from the source to the destination and the number of edges involved.
At each time $t$, a random weight/cost $W_i(t)$ drawn from an unknown distribution is assigned to each edge $i$ ($i = 1, \ldots, M$). We assume that $W_i(t)$ is i.i.d. over time for all $i$. At the beginning of each time slot $t$, a path $p$ is chosen for a message transmission. Subsequently, depending on the observation model, either the random cost of each link on the chosen path is revealed (the semi-bandit feedback) or only the total end-to-end cost, given by the sum of the
costs of all links on the path, is observed. The problem is a combinatorial bandit with N arms
dependent on M latent variables (see Figure 5.1).
Let $C(p)$ denote the total end-to-end cost of path $p$, and $p^*$ the optimal route with the minimum expected cost. The regret of an adaptive routing strategy $\pi$ is given by
$$R_\pi(T) = \mathbb{E}\left[\sum_{t=1}^T \big(C(p(t)) - C(p^*)\big)\right]. \tag{6.2}$$

The objective is a learning policy that offers good regret scaling in M , the number of unknowns,
rather than N , the number of arms, while preserving the optimal order in T .
Under the link-level observation model, a UCB-$\alpha$ index can be maintained for each link based on the link cost observations, and a shortest path algorithm using the UCB-$\alpha$ indices as the edge weights can be carried out for path selection at each time (Gai et al., 2011 [82]). The more complex path-level observation model can be handled through a dimension-reduction approach that constructs a barycentric spanner of the underlying graph. The online policy subsequently explores only paths in the barycentric spanner and uses the observed costs of these basis paths to estimate the costs of all paths for exploitation. This approach was first proposed by Awerbuch and Kleinberg, 2008 [27], under the non-stochastic/adversarial setting and later explored in the stochastic setting by Liu and Zhao, 2012 [137]. Such a dimension-reduction approach also applies to other combinatorial bandit problems under the strict bandit feedback model.
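A sketch of the link-level approach under semi-bandit feedback; networkx's shortest-path routine is used here for illustration, and since link costs are minimized, the optimistic index is a lower confidence bound (clamped at zero to keep the edge weights nonnegative for Dijkstra):

```python
import math
import networkx as nx

def select_path(G, stats, t, source, dest, alpha=2.0):
    """Choose a path using optimistic (lower-confidence-bound) link costs.
    G: directed graph; stats[e] = (sample mean cost, play count) for each edge e."""
    for u, v in G.edges():
        mean, n = stats[(u, v)]
        # optimism in the face of uncertainty: subtract the confidence radius
        G[u][v]["weight"] = max(mean - math.sqrt(alpha * math.log(t) / max(n, 1)), 0.0)
    return nx.shortest_path(G, source, dest, weight="weight")

def update_stats(stats, path, link_costs):
    """Semi-bandit feedback: the realized cost of every link on the chosen path is observed."""
    for e, c in zip(zip(path[:-1], path[1:]), link_costs):
        mean, n = stats[e]
        stats[e] = ((mean * n + c) / (n + 1), n + 1)
```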
6.1.3 HEAVY HITTER AND HIERARCHICAL HEAVY HITTER DETECTION
In the Internet and other communication and financial networks, it is a common observation that a small number of flows, referred to as heavy hitters (HH), account for most of the total traffic. Quickly identifying the heavy hitters (flows whose packet count per unit time exceeds a given threshold) is thus crucial to network stability and security. In particular, heavy hitter detection is important for a variety of network troubleshooting scenarios, including detecting denial-of-service (DoS) attacks.
The challenge of the problem is that the total number $N$ of flows seen by a router/switch is much larger than the available sampling resources. Maintaining a packet count for each individual flow is infeasible. The key to an efficient solution is to consider prefix aggregation based on the source or destination IP addresses. This naturally leads to a binary tree structure (as illustrated in Figure 6.2) with each node representing an aggregated flow with a specific IP prefix. The packet count of each node on the tree equals the sum of the packet counts of its children.
Figure 6.2: An IP-prefix tree for heavy hitter detection.

A more complex version of the problem is hierarchical heavy hitter (HHH) detection, in which the search for flows with abnormal volume extends to aggregated flows in the upper levels of the IP-prefix tree. Specifically, an upper-level node is an HHH if its mean remains
above a given threshold after excluding all its abnormal descendants (if any). HHH detection is of particular interest in detecting distributed denial-of-service attacks that split the total traffic over multiple routes.
The problem of detecting heavy hitters and hierarchical heavy hitters can be viewed as a variant of the thresholding bandit with the option of aggregating subsets of arms conforming to a given tree structure. An active inference strategy determines, sequentially, which node on the tree to probe and when to terminate the search in order to minimize the sample complexity for a given level of detection reliability. A question of particular interest is how to achieve an optimal sublinear scaling of the sample complexity with respect to the number of traffic flows.
Vakili et al., 2018 [197], proposed an active inference strategy that induces a biased random walk on the tree based on confidence bounds of sample statistics. Specifically, the algorithm induces a biased random walk that initiates at the root of the tree and eventually arrives and terminates at a target with the required reliability. Each move of the random walk is guided by the output of a local confidence-bound based sequential test carried out on each child of the node currently being visited by the random walk. The sequential test draws samples from a child sequentially until either a properly designed upper confidence bound drops below a threshold (which results in a negative test outcome indicating that this node is unlikely to be an ancestor of a target) or a lower confidence bound exceeds a threshold (resulting in a positive test outcome indicating that this node is likely to be an ancestor of a target). The next move of the random walk is then determined by the test outcomes: move to the (first) child tested positive and, if both children test negative, move back to the parent. The confidence level of the local sequential test is set to ensure that the random walk is more likely to move toward a target than away from it and that the random walk terminates at a true target with a sufficiently high probability. The sample complexity of this algorithm was shown to achieve the optimal logarithmic order in the number of flows and the optimal order in the required detection accuracy (Vakili et al., 2018 [197]).
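A highly simplified skeleton of the confidence-bound guided random walk; the per-child test below caps the number of samples and uses a generic Hoeffding-style radius, both illustrative simplifications of the sequential test in [197], and the function and parameter names are hypothetical:

```python
import math

def test_child(sample_node, node, threshold, delta, max_samples=200):
    """Confidence-bound test of whether `node` is likely an ancestor of a heavy hitter.
    sample_node(node) returns one packet-count sample of the aggregated flow at `node`."""
    total, n = 0.0, 0
    while n < max_samples:
        total += sample_node(node); n += 1
        mean = total / n
        radius = math.sqrt(math.log(2 * max_samples / delta) / (2 * n))
        if mean + radius < threshold:   # upper confidence bound below threshold: negative
            return False
        if mean - radius > threshold:   # lower confidence bound above threshold: positive
            return True
    return mean > threshold

def biased_random_walk(root, children, sample_node, threshold, delta):
    """Walk the IP-prefix tree from the root toward a heavy hitter.
    children(node) returns the two children of `node` (empty list at a leaf)."""
    node, parent = root, {root: None}
    while True:
        kids = children(node)
        if not kids:                    # reached a leaf: declare it a heavy hitter
            return node
        positives = [c for c in kids if test_child(sample_node, c, threshold, delta)]
        if positives:                   # move to the first child tested positive
            parent[positives[0]] = node
            node = positives[0]
        else:                           # both children negative: move back to the parent
            up = parent.get(node)
            node = up if up is not None else node
```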
6.2 SOCIAL-ECONOMIC NETWORKS

6.2.1 DYNAMIC PRICING AND THE PURSUIT OF COMPLETE LEARNING
The connection between dynamic pricing and the multi-armed bandit model was first explored by Rothschild in 1974 [172], where he constructed an archetype of rational and optimizing sellers facing unknown demand functions of their customers. Rothschild suggested that "A firm which does not know the consequences of charging a particular price has an obvious way of finding out. It may charge that price and observe the result." That is the basic premise of bandit problems.

Dynamic pricing as a Bayesian bandit problem: In a highly stylized model, Rothschild as-
sumed that a firm, trying to price a particular good, sequentially chooses one of two possible
prices p1 and p2 to a stream of potential customers and observes either success or failure in each
sale attempt. The characteristic of each customer is assumed to be probabilistically identical.
More specifically, from the point of view of the firm, each potential customer is a Bernoulli ran-
dom variable with parameter 1 if price p1 is offered and 2 if price p2 is offered. The expected
profit when price pi (i D 1; 2) is offered is i .pi c/, where c is the marginal cost of one unit
of the good. The objective of the firm is a policy for dynamically setting the price to maximize
its long-run cumulative profit.
This is the same two-armed bandit problem as posed by Thompson, 1933 [190] by liken-
ing the two prices as the two treatments in a clinical trail. Rothschild, too, took the Bayesian
approach, assuming prior knowledge on 1 and 2 in the form of their prior distributions over
Œ0; 1. He, however, adopted the total-discounted-reward criterion as in the canonical Bayesian
bandit model defined in Definition 2.2. The main message of Rothschild’s work is that incomplete
learning occurs with positive probability. Specifically, with probability one, the seller charges
only one price infinitely often after a finite duration of price experimentation, and which price
the seller eventually settles on depends on the finite sequence of random sale outcomes during
the price experimentation. Since the same finite sequence of outcomes can occur with posi-
tive probability under both scenarios regarding which of the two arms is more profitable, with
positive probability the seller settles on a suboptimal price and terminates learning.
The result is perhaps not surprising for the following reasons. First, the geometric dis-
counting quickly diminishes the future benefit of an accurate learning of the best arm. Second,
the objective under the Bayesian framework is the expected performance averaged over all pos-
sible values of θ1 and θ2 under the known prior distributions rather than insisting on good perfor-
mance under every realization of θ1 and θ2. The optimal strategy is thus to accept the possibility
of settling on the suboptimal arm caused by small probability events in order to latch on the
optimal arm quickly in most cases. Rothschild suggested that such an outcome of incomplete
learning under the optimal policy might explain the pervasive phenomenon of price dispersion
in a market where the same good is being sold at a variety of prices: sellers, each behaving op-
timally, settle on different prices due to differences in their realized sale outcomes during price
experimentation.
A major follow-up work was given by McLennan, 1984 [148], where he showed that
incomplete learning can occur even when the seller has a continuum of price options. The con-
tinuum of prices is, however, bound by a specific demand function that is known to take one of
two possible forms. Specifically, the underlying demand model is either θ1(p) or θ2(p), which
characterizes the probability of a successful sale at each possible price p. The two-price model
considered by Rothschild can be viewed as a special case of McLennan's model by restricting the
offered prices to the two optimal prices under each possible demand model:

$$ p_i^* = \arg\max_{p} \; \theta_i(p)\,(p - c). \tag{6.3} $$

The model by Rothschild, however, does not assume knowledge of the functional form of
the underlying demand curve. Even though this does not alter the conclusion on incomplete
learning within the Bayesian framework, it leads to drastically different outcomes within the
frequentist framework as discussed below.
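As a numerical illustration of (6.3), the sketch below computes the optimal price under each of two hypothetical demand curves on a price grid; the logistic forms of θ1(p) and θ2(p), the cost c, and the grid are assumptions made purely for illustration.

```python
import math

def theta1(p):   # hypothetical demand curve 1: sale probability at price p
    return 1.0 / (1.0 + math.exp(2.0 * (p - 2.0)))

def theta2(p):   # hypothetical demand curve 2
    return 1.0 / (1.0 + math.exp(1.0 * (p - 3.0)))

c = 1.0                                       # marginal cost
grid = [1.0 + 0.01 * k for k in range(401)]   # candidate prices in [1, 5]

# Equation (6.3): the optimal price under each demand curve.
p1_star = max(grid, key=lambda p: theta1(p) * (p - c))
p2_star = max(grid, key=lambda p: theta2(p) * (p - c))
print("p1* =", round(p1_star, 2), " p2* =", round(p2_star, 2))
```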

Dynamic pricing as a frequentist bandit problem: The fact that incomplete learning occurs
with positive probability even under the optimal pricing strategy can be unsettling from the
point of view of an individual seller. Seeing only one realization of the demand model and
caring about only its own revenue rather than the ensemble average over a large population of
probabilistically identical sellers, an individual seller might be more interested in a policy that
guarantees complete learning under every possible demand model, even though it may not offer
the best ensemble-average performance. This brings us to the frequentist formulation of the
problem.
Take the market model considered by Rothschild. Under the frequentist formulation, the
regret grows unboundedly, either at a logarithmic rate under the uniform-dominance setting or
a √T rate under the minimax setting. This implies that the optimal strategy, guarding against
every possibility of settling on the suboptimal price, never settles on either price and experi-
ments with each price infinitely often (although with vanishing frequency). If we define complete
learning as a probability-one convergence to the optimal price, learning within the frequentist
framework is incomplete with probability one, not just positive probability as in the Bayesian
framework.
If we take the model by McLennan, however, the conclusion is drastically different. Com-
plete learning can be achieved with probability one under the frequentist framework as shown in
Zhai et al., 2011 [217] and Tehrani, Zhai, and Zhao, 2012 [188]. Even if we restrict our actions
to the two optimal prices p1* and p2* under the two possible demand curves, the knowledge of
the specific functional forms of the demand curves results in a completely different two-arm ban-
dit problem from that under the Rothschild model. Specifically, since the two arms p1* and p2*
share the same underlying demand curve, observations from one arm also provide information
on the quality of the other arm by revealing the underlying demand curve. Recognizing the de-
tection component of this bandit problem, Zhai et al. proposed a policy based on the likelihood
ratio test (LRT): at each time t, the likelihood ratio of the past t − 1 sale outcomes under
demand curve θ1 to that under θ2 is computed, and the price p1* or p2* is offered at time t de-
pending on whether the likelihood ratio is greater or smaller than 1. It was shown in Zhai et al.,
2011 [217] and Tehrani, Zhai, and Zhao, 2012 [188] that this strategy offers bounded regret. In
other words, with probability one, this dynamic pricing strategy converges to the optimal price
within a finite time, thus achieving complete learning. By offering an exploration price—the
price that best distinguishes the two demand curves under a distance measure such as the Cher-
noff distance—when the likelihood ratio is close to 1, regret can be further reduced. It is worth
noting that the LRT policy does not require the complete functional forms of θ1 or θ2. It only
needs the values of θ1 and θ2 at the optimal prices p1* (of θ1) and p2* (of θ2). Extensions to an arbi-
trary but finite number of possible demand curves are straightforward with minor modifications
to the policy to preserve a bounded regret.
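A minimal sketch of the LRT pricing policy described above is given below. The two demand curves, their optimal prices p1* and p2*, and the four values θj(pi*) are hypothetical; ties in the likelihood ratio are broken toward p1*, and the exploration-price refinement is omitted.

```python
import math
import random

random.seed(1)
p_star = [2.0, 3.0]            # p1*, p2* (hypothetical optimal prices)
# theta[j][i]: sale probability under demand curve j+1 at price p_{i+1}*;
# the values are chosen so that each p_i* maximizes profit under curve i, as in (6.3).
theta = [[0.70, 0.30],         # curve 1 evaluated at p1*, p2*
         [0.80, 0.45]]         # curve 2 evaluated at p1*, p2*
true_curve = 1                 # nature's choice: curve 2 (unknown to the seller)
c = 1.0
llr = 0.0                      # log-likelihood ratio of curve 1 vs. curve 2
revenue = 0.0

for t in range(2000):
    i = 0 if llr >= 0 else 1   # offer p1* if curve 1 is the more likely model
    sale = random.random() < theta[true_curve][i]
    revenue += (p_star[i] - c) if sale else 0.0
    # Update the log-likelihood ratio with the new sale outcome.
    l1 = theta[0][i] if sale else 1.0 - theta[0][i]
    l2 = theta[1][i] if sale else 1.0 - theta[1][i]
    llr += math.log(l1) - math.log(l2)

print("settled on price:", p_star[0] if llr >= 0 else p_star[1])
print("average profit per customer:", revenue / 2000)
```

With probability one the log-likelihood ratio drifts toward the true demand curve, so the offered price stops switching after finitely many customers, which is the bounded-regret behavior described above.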
Bandit formulations of dynamic pricing abound in the literature, within both the Bayesian
and the frequentist frameworks and adopting various market models. Representative examples
of the former include the work by Aghion, Bolton, and Jullien, 1991 [3], Aviv and Pazgal,
2005 [26], and Harrison, Keskin, and Zeevi, 2011 [99]. In particular, Harrison, Keskin, and
Zeevi, while adopting the same Bayesian formulation as in McLennan, 1984 [148], examined
the performance of the myopic policy under a regret measure and showed that a modified myopic
policy offers bounded regret. Representative studies within the frequentist framework include
Kleinberg, 2003 [114] and Besbes and Zeevi, 2009 [37], and the references therein.

6.2.2 WEB SEARCH, ADS DISPLAY, AND RECOMMENDATION SYSTEMS: LEARNING TO RANK
Applications such as Web search, sponsored ads display, and recommendation systems give rise
to the problem of learning a list of a few top-ranked items from a large number of alternatives.
This leads to the so-called ranked bandit model.
Consider the problem of learning the top K most relevant documents for a given query.
Let D denote the set of all documents. Users have different interpretations of the query (con-
sider a query with the keyword “bandit”; users issuing this query may be looking for drastically
different things). Each user i is characterized by the set Ai ⊆ D of documents that the user
considers relevant. We refer to Ai as the type of a user.
At each time t, a user arrives with an unknown type drawn i.i.d. from an unknown dis-
tribution. The learning policy outputs a ranked list of K documents from D. The user scans
the list from the top and clicks the first relevant document, if one is present.
Otherwise, the user abandons the query. This type of user behavior in web search is referred to
as the cascade model (see Craswell et al. [70]). The objective of the online learning policy is to
minimize the rate of query abandonment in the long run. This can be modeled by assigning a
unit reward for each click and zero reward in the event of abandonment.
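The sketch below simulates one round of this cascade feedback model; the document universe, the user types Ai, and the type distribution are hypothetical values chosen only for illustration.

```python
import random

random.seed(2)
D = list(range(10))                     # document universe (hypothetical)
user_types = [{0, 3}, {5}, {2, 7, 9}]   # relevant sets A_i for three user types
type_probs = [0.5, 0.3, 0.2]            # type distribution

def cascade_feedback(ranked_list, relevant_set):
    """The user scans top-down and clicks the first relevant document;
    no click means abandonment (reward 0)."""
    for pos, doc in enumerate(ranked_list):
        if doc in relevant_set:
            return 1, pos
    return 0, None

# One interaction: a user of a random type sees a ranked list of K = 3 documents.
A = random.choices(user_types, weights=type_probs)[0]
reward, pos = cascade_feedback([2, 0, 5], A)
print("reward:", reward, " clicked position:", pos)
```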
In the special case of K = 1, this ranked bandit model reduces to the classic bandit model,
with each document constituting an independent arm. For K > 1, the ranked bandit model can
be considered as an instance of the combinatorial bandit model. A unique feature, however, is
in the feedback. In the event of a query abandonment, it is known that all K documents in the
presented list are irrelevant. When the user clicks the kth document, however, the policy receives
no feedback on the relevance of the documents displayed below the kth document. Since the
order of the chosen K documents affects only the amount of information in the feedback but
not the reward, it might be sufficient to consider an action space consisting of only size-K sets
(rather than ordered sets), coupled with an uncertainty measure that allows the policy to display
documents with high uncertainty levels (thus more exploration value) at the top.
The complexity of this bandit model can be well appreciated when we set out to identify the
oracle policy: for a user randomly sampled from a known type distribution, find a set of K
documents that maximizes the click-through probability. Note that for the oracle who does not
need to learn, ranking is irrelevant.
This problem is a variant of the hitting set problem, a classical NP-complete problem.
Specifically, consider a collection of sets {Ai}, each a subset of D and with weight pi determined
by the type distribution. The optimal oracle choice S* is a size-K subset of D that maximizes
the total weight of the sets in the collection that are hit by S*:

$$ S^* = \arg\max_{S \subseteq D:\, |S| = K} \; \sum_i p_i \, \mathbb{I}\left\{ S \cap A_i \neq \emptyset \right\}. \tag{6.4} $$
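Since the objective in (6.4) is a weighted maximum-coverage problem with a monotone submodular objective, the standard greedy heuristic comes within a (1 − 1/e) factor of S*. The sketch below contrasts that greedy rule with brute-force enumeration on the same hypothetical user types used in the earlier cascade example; the sets, weights, and K are assumptions for illustration, not part of the model in the text.

```python
from itertools import combinations

def coverage(S, types, weights):
    """Total weight of the user types hit by the document set S."""
    return sum(w for A, w in zip(types, weights) if set(S) & A)

def greedy_oracle(D, types, weights, K):
    """Standard greedy for weighted maximum coverage (monotone submodular)."""
    S = []
    for _ in range(K):
        best = max((d for d in D if d not in S),
                   key=lambda d: coverage(S + [d], types, weights))
        S.append(best)
    return S

def exact_oracle(D, types, weights, K):
    """Brute-force enumeration of all size-K subsets (exponential in general)."""
    return list(max(combinations(D, K),
                    key=lambda S: coverage(S, types, weights)))

D = list(range(10))                     # same hypothetical example as above
user_types = [{0, 3}, {5}, {2, 7, 9}]
type_probs = [0.5, 0.3, 0.2]
print("greedy S :", greedy_oracle(D, user_types, type_probs, K=2))
print("exact  S*:", exact_oracle(D, user_types, type_probs, K=2))
```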

The above ranked bandit model was studied by Radlinski, Kleinberg, and Joachims,
2008 [165] and Slivkins, Radlinski, and Gollapudi, 2013 [179]. Streeter and Golovin,
2008 [184] and Streeter, Golovin, and Krause, 2009 [185] considered the problem under gen-
eral reward functions that are monotone and submodular. Kveton et al., 2015 [123] adopted a
more tractable preference model where the user's preferences were assumed to be independent
across documents/items and over time, and referred to the resulting bandit model as the cas-
cading bandits. Like many other directions for extending the canonical bandit model, the field
awaits further exploitation and exploration.

Bibliography
[1] Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear
stochastic bandits, Proc. of Conference on Neural Information Processing Systems (NeurIPS),
pages 2312–2320. 96

[2] Agarwal, A., Foster, D. P., Hsu, D., Kakade, S. M., and Rakhlin, A. (2013). Stochastic
convex optimization with bandit feedback, SIAM Journal Optimization, 23:213–240. DOI:
10.1137/110850827. 98

[3] Aghion, P., Bolton, C. H. P., and Jullien, B. (1991). Optimal learning by experimentation,
The Review of Economic Studies, 58:621–654. DOI: 10.2307/2297825. 125

[4] Agrawal, R. (1995a). Sample mean based index policies by O(log n) regret for the
multi-armed bandit problem, Advances in Applied Probability, 27:1054–1078. DOI:
10.2307/1427934. 5, 65, 73, 75

[5] Agrawal, R. (1995b). The continuum-armed bandit problem, SIAM Journal on Control and
Optimization, 33:1926–1951. DOI: 10.1137/s0363012992237273. 97

[6] Agrawal, S. and Goyal, N. (2012). Analysis of Thompson sampling for the multi-armed
bandit problem, Proc. of Conference on Learning Theory (COLT). 81, 82, 83

[7] Agrawal, S. and Goyal, N. (2013). Further optimal regret bounds for Thompson sampling,
Proc. of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS).
81, 83

[8] Agrawal, R. and Teneketzis, D. (1989). Certainty equivalence control with forcing: Re-
visited. Systems and Control Letters, 13(5):405–412, December 1989. DOI: 10.1016/0167-
6911(89)90107-2. 74

[9] Ahmad, S. H., Liu, M., Javadi, T., Zhao, Q., and Krishnamachari, B. (2009). Optimality
of myopic sensing in multi-channel opportunistic access, IEEE Transactions on Information
Theory, 55:4040–4050. DOI: 10.1109/tit.2009.2025561. 119

[10] Albert, A. E. (1961). The sequential design of experiments for infinitely many states of
nature, The Annals of Mathematics Statistics, 32:774–799. DOI: 10.1214/aoms/1177704973.
111
[11] Allenberg, C., Auer, P., Gyorfi, L., and Ottucsák, G. (2006). Hannan consistency
in on-line learning in case of unbounded losses under partial monitoring, Proc. of
International Conference on Algorithmic Learning Theory (ALT), pages 229–243. DOI:
10.1007/11894841_20. 101

[12] Allesiardo, R. and Feraud, R. (2015). Exp3 with drift detection for the switching ban-
dit problem, Proc. of IEEE International Conference on Data Science and Advanced Analytics
(DSAA). DOI: 10.1109/dsaa.2015.7344834. 91

[13] Alon, N., Cesa-Bianchi, N., Dekel, O., and Koren, T. (2015). Online learning with
feedback graphs: Beyond bandits, Proc. of the 28th Conference on Learning Theory (COLT),
40:23–35. 101

[14] Anandkumar, A., Michael, N., Tang, A. K., and Swami, A. (2011). Distributed algorithms
for learning and cognitive medium access with logarithmic regret, IEEE Journal on Selected
Areas in Communications, 29:731–745. DOI: 10.1109/jsac.2011.110406. 116

[15] Anantharam, V., Varaiya, P., and Walrand, J. (1987a). Asymptotically efficient allocation
rules for the multi-armed bandit problem with multiple plays—Part I: I.I.D. rewards, IEEE
Transactions on Automatic Control, 32:968–975. DOI: 10.1109/TAC.1987.1104491. 88,
115

[16] Anantharam, V., Varaiya, P., and Walrand, J. (1987b). Asymptotically efficient allocation
rules for the multi-armed bandit problem with multiple plays-Part II: Markovian rewards,
IEEE Transaction on Automatic Control, 32:977–982. DOI: 10.1109/tac.1987.1104485. 87

[17] Arrow, K. (1951). Social Choice and Individual Values, John Wiley & Sons. 102

[18] Asawa, M. and Teneketzis, D. (1996). Multi-armed bandits with switching penalties,
IEEE Transactions on Automatic Control, 41:328–348. DOI: 10.1109/9.486316. 52

[19] Audibert, J. and Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits,
Proc. of the 22nd Annual Conference on Learning Theory (COLT). 79, 93

[20] Audibert, J. and Bubeck, S. (2010). Regret bounds and minimax policies under partial
monitoring, Journal of Machine Learning Research, pages 2785–2836. 101

[21] Audibert, J., Munos, R., and Szepesvári, C. (2009). Exploration-exploitation trade-off
using variance estimates in multi-armed bandits, Theoretical Computer Science, pages 1876–
1902. DOI: 10.1016/j.tcs.2009.01.016. 108

[22] Audibert, J., Bubeck, S., and Munos, R. (2010). Best arm identification in multi-armed
bandits, Proc. of the 23rd Annual Conference on Learning Theory (COLT). 111
[23] Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multi-armed
bandit problem, Machine Learning, 47:235–256. DOI: 10.1023/A:1013689704352. 5, 74,
75, 76
[24] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R.E. (2003). The non-
stochastic multi-armed bandit problem, SIAM Journal on Computing, 32:48–77. DOI:
10.1137/s0097539701398375. 88, 90, 93
[25] Auer, P., Ortner, R., and Szepesvári, C. (2007). Improved rates for the stochastic
continuum-armed bandit problem, Proc. of the 23rd Annual Conference on Learning Theory
(COLT), pages 454–468. DOI: 10.1007/978-3-540-72927-3_33. 98
[26] Aviv, Y. and Pazgal, A. (2005). A partially observed Markov decision process for dynamic
pricing, Management Science, 51:1400–1416. DOI: 10.1287/mnsc.1050.0393. 125
[27] Awerbuch, B. and Kleinberg, R. (2008). Online linear optimization and adaptive routing,
Journal of Computer and System Sciences, pages 97–114. DOI: 10.1016/j.jcss.2007.04.016.
121
[28] Badanidiyuru, A., Kleinberg, R., and Slivkins, A. (2013). Bandits with knapsacks, IEEE
54th Annual Symposium on Foundations of Computer Science (FOCS), pages 207–216. DOI:
10.1109/focs.2013.30. 99
[29] Banks, J. and Sundaram, R. (1994). Switching costs and the Gittins index, Econometrica,
62:687–694. DOI: 10.2307/2951664. 51
[30] Bellman, R. (1956). A problem in the sequential design of experiments, Sankhyā, 16:221–
229. 1, 3, 18, 109
[31] Bellman, R. (1957). Dynamic Programming, Princeton University Press. 54
[32] Ben-Israel, A. and Flåm, S. D. (1990). A bisection/successive approximation method
for computing Gittins indices, Methods and Models of Operations Research, 34:411. DOI:
10.1007/bf01421548. 24
[33] Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments,
Springer. DOI: 10.1007/978-94-015-3711-7. 2, 6, 34
[34] Bertsekas, D. P. (1992). Auction algorithms for network flow problems: A tutorial in-
troduction, Computational Optimization Applications, 1:7–66. DOI: 10.1007/bf00247653.
116
[35] Bertsimas, D. and Nino-Mora, J. (2000). Restless bandits, linear programming re-
laxations and a primal-dual index heuristic, Operations Research, 48:80–90. DOI:
10.1287/opre.48.1.80.12444. 50
[36] Besbes, O., Gur, Y., and Zeevi, A. (2014). Stochastic multi-armed-bandit problem with
non-stationary rewards, Proc. of the 27th International Conference on Neural Information Pro-
cessing Systems (NeurIPS), pages 199–207. 91, 92
[37] Besbes, O. and Zeevi, A. (2009). Dynamic pricing without knowing the demand func-
tion: Risk bounds and near-optimal algorithms, Operations Research, 57:1407–1420. DOI:
10.1287/opre.1080.0640. 125
[38] Bessler, S. (1960). Theory and applications of the sequential design of experiments, k-
actions and infinitely many experiments: Part I–Theory, Technical Report, Applied
Mathematics and Statistics Laboratories, Stanford University. 111
[39] Bistritz, I. and Leshem, A. (2018). Distributed multi-player bandits—a game of thrones
approach, Proc. of the 32nd Conference on Neural Information Processing Systems (NeurIPS).
116
[40] Blackwell, D. (1956). An analog of the minimax theorem for vector payoffs. Pacific Journal
of Mathematics, 6:1–8. DOI: 10.2140/pjm.1956.6.1. 100
[41] Blackwell, D. (1962). Discrete dynamic programming, Annals of Mathematical Statistics,
32:719–726. DOI: 10.1214/aoms/1177704593. 55
[42] Bradfield, J. (2007). Introduction to the Economics of Financial Markets, Oxford University
Press. 105
[43] Bradt, R. N., Johnson, S. M., and Karlin, S. (1956). On sequential designs for maxi-
mizing the sum of n observations, Annals of Mathematical Statistics, 27:1060–1074. DOI:
10.1214/aoms/1177728073. 1, 18, 109
[44] Brown, D. B. and Smith, J. E. (2013). Optimal sequential exploration: Bandits, clairvoy-
ants, and wildcats. Operations Research, 61(3):644–665. 38
[45] Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochas-
tic multi-armed bandit problems, Foundations and Trends in Machine Learning. DOI:
10.1561/2200000024. 66, 79
[46] Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. (2013). Bandits with heavy tail, IEEE
Transactions on Information Theory, 59:7711–7717. DOI: 10.1109/tit.2013.2277869.
[47] Bubeck, S., Munos, R., and Stoltz, G. (2011). Pure exploration in finitely-armed
and continuous-armed bandits, Theoretical Computer Science, 412:1832–1852. DOI:
10.1016/j.tcs.2010.12.059. 112
[48] Bubeck, S., Munos, R., Stoltz, G., and Szepesvari, C. (2011). Online optimization in X-
armed bandits, Journal of Machine Learning Research (JMLR), 12:1587–1627. 98
[49] Bubeck, S., Stoltz, G., and Yu, J. Y. (2011). Lipschitz bandits without the Lipschitz
constant, Proc. of the 22nd International Conference on Algorithmic Learning Theory (ALT),
pages 144–158. DOI: 10.1007/978-3-642-24412-4_14. 98

[50] Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. (2013). Bandits with heavy tail, IEEE
Transactions on Information Theory, 59:7711–7717. DOI: 10.1109/tit.2013.2277869. 79

[51] Bubeck, S., Li, Y., Peres, Y., and Sellke, M. (2019). Non-stochastic multi-player multi-
armed bandits: Optimal rate with collision information, sublinear without. ArXiv Preprint,
arXiv:1904.12233. 116

[52] Buccapatnam, S., Eryilmaz, A., and Shroff, N. B. (2014). Stochastic bandits with side ob-
servations on networks. ACM SIGMETRICS Performance Evaluation Review, 42(1):289–
300. 100

[53] Bull, A. (2015). Adaptive-treed bandits, Bernoulli Journal of Statistics, 21:2289–2307. DOI:
10.3150/14-bej644. 98

[54] Cappé, O., Garivier, A., Maillard, O., Munos, R., and Stoltz, G. (2013). Kullback–
Leibler upper confidence bounds for optimal sequential allocation, The Annals of Statistics,
41(3):1516–1541. DOI: 10.1214/13-aos1119. 73

[55] Caron, S., Kveton, B., Lelarge, M., and Bhagat, S. (2012). Leveraging side observations
in stochastic bandits, Proc. of the 28th Conference on Uncertainty in Artificial Intelligence,
pages 142–151. 100

[56] Cesa-Bianchi, N., Lugosi, G., and Stoltz, G. (2005). Minimizing regret with label-
efficient prediction, IEEE Transactions on Information Theory, 51:2152–2162. DOI:
10.1109/tit.2005.847729. 101

[57] Cesa-Bianchi, N., Gentile, C., and Zappella, G. (2013). A gang of bandits, Proc. of Ad-
vances in Neural Information Processing Systems (NeurIPS), pages 737–745.

[58] Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games, Cambridge
University Press. DOI: 10.1017/cbo9780511546921. 100

[59] Chakravorty, J. and Mahajan, A. (2014). Multi-armed bandits, Gittins index, and
its calculation, Methods and Applications of Statistics in Clinical Trials: Planning, Anal-
ysis, and Inferential Methods, N. Balakrishnan, Ed., John Wiley & Sons, Inc. DOI:
10.1002/9781118596333.ch24. 24

[60] Chen, W., Wang, Y., and Yuan, Y. (2013). Combinatorial multi-armed bandit: General
framework and applications, Proc. of the 30th International Conference on Machine Learning
(ICML), pages 151–159. 95, 96
[61] Chen, Y. and Katehakis, M. (1986). Linear programming for finite state multi-
armed bandit problems, Mathematics of Operations Research, 11:180–183. DOI:
10.1287/moor.11.1.180. 26

[62] Chernoff, H. (1959). Sequential design of experiments, The Annals of Mathematical Statis-
tics, 30:755–770. DOI: 10.1214/aoms/1177706205. 109, 111

[63] Chow, Y. and Lai, T. (1975). Some one-sided theorems on the tail distribution of sample
sums with applications to the last time and largest excess of boundary crossings, Transac-
tions of the American Mathematical Society, 208:51–72. DOI: 10.2307/1997275. 71

[64] Chu, W., Li, L., Reyzin, L., and Schapire, R. E. (2011). Contextual bandits with lin-
ear payoff functions, Proc. of International Conference on Artificial Intelligence and Statistics
(AISTATS). 114

[65] Cohen, A., Hazan, T., and Koren, T. (2016). Online learning with feedback graphs without
the graphs, Proc. of The 33rd International Conference on Machine Learning (ICML). 101

[66] Combes, R. and Proutiere, A. (2014). Unimodal bandits: Regret lower bounds and optimal
algorithms. Proc. of the International Conference on Machine Learning (ICML), pages 521–
529. 98

[67] Combes, R., Jiang, C., and Srikant, R. (2015). Bandits with budgets: Regret lower
bounds and optimal algorithms, Proc. of the ACM SIGMETRICS International Con-
ference on Measurement and Modeling of Computer Systems, pages 245–257. DOI:
10.1145/2745844.2745847. 99

[68] Combes, R., Magureanu, S., and Proutiere, A. (2017). Minimal exploration in structured
stochastic bandits, Proc. of Advances in Neural Information Processing Systems (NeurIPS),
pages 1763–1771. 99

[69] Cope, E. W. (2009). Regret and convergence bounds for a class of continuum-
armed bandit problems, IEEE Transactions on Automatic Control, 54. DOI:
10.1109/tac.2009.2019797. 98

[70] Craswell, N., Zoeter, O., Taylor, M., and Ramsey, B. (2008). An experimental comparison
of click position-bias models, Proc. of the 1st ACM International Conference on Web Search
and Data Mining, pages 87–94. 126

[71] Dai, W., Gai, Y., Krishnamachari, B., and Zhao, Q. (2011). The non-Bayesian restless
multi-armed bandit: A case of near-logarithmic regret, Proc. of the IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2940–2943. DOI:
10.1109/icassp.2011.5946273. 120
[72] Dani, V., Hayes, T. P., and Kakade, S. M. (2008). Stochastic linear optimization under
bandit feedback, Proc. of the Annual Conference on Learning Theory (COLT), pages 355–366.
96, 98

[73] Denardo, E. V., Park, H., and Rothblum, U. G., (2007). Risk-sensitive and risk-neutral
multiarmed bandits, Mathematics of Operations Research. DOI: 10.1287/moor.1060.0240.
26

[74] Denardo, E. V., Feinberg, E. A., and Rothblum, U. G. (2013). The multi-armed bandit,
with constraints, Annals of Operations Research, 1:37–62. DOI: 10.1007/s10479-012-1250-
y. 26

[75] Dumitriu, I., Tetali, P., and Winkler, P. (2003). On playing golf with two balls, SIAM
Journal of Discrete Mathematics, 4:604–615. DOI: 10.1137/s0895480102408341. 52, 53

[76] Ehsan, N. and Liu, M. (2004). On the optimality of an index policy for bandwidth allo-
cation with delayed state observation and differentiated services, IEEE International Con-
ference on Computer Communications (INFOCOM). DOI: 10.1109/infcom.2004.1354606.
50

[77] Even-Dar, E., Kearns, M., and Wortman, J. (2006). Risk-sensitive online learning, Proc.
of the 17th International Conference on Algorithmic Learning Theory (ALT), pages 199–213.
DOI: 10.1007/11894841_18. 108

[78] Even-Dar, E., Mannor, S., and Mansour, Y. (2006). Action elimination and stopping con-
ditions for the multi-armed bandit and reinforcement learning problems, Journal of Machine
Learning Research, 7:1079–1105. 111

[79] Feldman, D. (1962). Contributions to the two-armed bandit problem, Annals of Mathe-
matical Statistics, 3:847–856. DOI: 10.1214/aoms/1177704454. 1

[80] French, K. R., Schwert, G. W., and Stambaugh, R. F. (1987). Expected stock returns and
volatility, Journal of Financial Economics, 19:3–29. DOI: 10.1016/0304-405x(87)90026-2.
105

[81] Gai, Y. and Krishnamachari, B. (2014). Distributed stochastic online learning poli-
cies for opportunistic spectrum access, IEEE Transactions on Signal Processing. DOI:
10.1109/tsp.2014.2360821. 116

[82] Gai, Y., Krishnamachari, B., and Jain, R. (2011). Combinatorial network opti-
mization with unknown variables: Multi-armed bandits with linear rewards and in-
dividual observations, IEEE/ACM Transactions on Networking, 5:1466–1478. DOI:
10.1109/tnet.2011.2181864. 95, 96, 116, 121
[83] Galichet, N., Sebag, M., and Teytaud, O. (2013). Exploration vs. exploitation vs. safety:
Risk-averse multi-armed bandits, Proc. of the Asian Conference on Machine Learning. 108

[84] Garivier, A. and Cappé, O. (2011). The KL-UCB algorithm for bounded stochastic ban-
dits and beyond, Proc. of the Annual Conference on Learning Theory (COLT). 73

[85] Garivier, A. and Moulines, E. (2011). On upper-confidence bound policies for switching
bandit problems, Proc. of the International Conference on Algorithmic Learning Theory (ALT),
pages 174–188. DOI: 10.1007/978-3-642-24412-4_16. 90

[86] Gilbert, E. N. (1960). Capacity of a burst-noise channel, Bell System Technical Journal,
39:1253–1265. DOI: 10.1002/j.1538-7305.1960.tb03959.x. 117

[87] Gittins, J. C. and Jones, D. M. (1974). A dynamic allocation index for the sequential design
of experiments, Progress in Statistics, pages 241–266, read at the 1972 European Meeting
of Statisticians, Budapest. 3, 11, 13, 22, 56

[88] Gittins, J. C. (1979). Bandit processes and dynamic allocation indices, Journal of the Royal
Statistical Society, 2:148–177. DOI: 10.1111/j.2517-6161.1979.tb01068.x. 3, 13, 22

[89] Gittins, J. C. (1989). Multi-Armed Bandit Allocation Indices, John Wiley & Sons, Inc. DOI:
10.1002/9780470980033. 2

[90] Gittins, J. C., Glazebrook, K. D., and Weber, R. R. (2011). Multi-Armed Bandit Allocation
Indices, John Wiley & Sons, Inc. DOI: 10.1002/9780470980033. 4, 13, 15, 22, 29, 34, 40

[91] Glazebrook, K. D. (1976). Stochastic scheduling with order constraints, International Jour-
nal of Systems Science, 6:657–666. DOI: 10.1080/00207727608941950. 41

[92] Glazebrook, K. D. (1979). Stoppable families of alternative bandit processes, Journal of


Applied Probability, 16(4):843–854. DOI: 10.2307/3213150. 38

[93] Glazebrook, K. D. (1982). On a sufficient condition for superprocesses due to Whittle.


Journal of Applied Probability, pages 99–110. 38

[94] Glazebrook, K. D., Ruiz-Hernandez, D., and Kirkbride, C. (2006). Some indexable fam-
ilies of restless bandit problems, Advances in Applied Probability, pages 643–672. DOI:
10.1239/aap/1158684996. 46, 50

[95] Glazebrook, K. D. and Ruiz-Hernández, D. (2005). A restless bandit approach to


stochastic scheduling problems with switching costs. https://www.researchgate.net/publication/252641987_A_Restless_Bandit_Approach_to_Stochastic_Scheduling_Problems_with_Switching_Costs 52
[96] Hadfield-Menell, D. and Russell, S. (2015). Multitasking: Efficient optimal planning for
bandit superprocesses. Proc. of the 21st Conference on Uncertainty in Artificial Intelligence,
pages 345–354. 38

[97] Hanawal, M. K. and Saligrama, V. (2015). Efficient detection and localization on graph
structured data, Proc. of the IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 5590–5594. DOI: 10.1109/icassp.2015.7179041. 96, 97

[98] Hannan, J. (1957). Approximation to Bayes risk in repeated play, Contributions to the Theory
of Games, 3:97–139. DOI: 10.1515/9781400882151-006. 61, 100

[99] Harrison, J. M., Keskin, N. B., and Zeevi, A. (2011). Bayesian dynamic pricing policies:
Learning and earning under a binary prior distribution, Management Science, pages 1–17.
DOI: 10.1287/mnsc.1110.1426. 125

[100] Hartland, C., Baskiotis, N., Gelly, S., Sebag, M., and Teytaud, O. (2007). Change point
detection and meta-bandits for online learning in dynamic environments, CAp, cepadues,
pages 237–250. 91

[101] Helmbold, D. and Panizza, S. (1997). Some label efficient learning results, Proc. of the
Annual Conference on Learning Theory (COLT). DOI: 10.1145/267460.267502. 101

[102] Honda, J. and Takemura, A. (2014). Optimality of Thompson sampling for Gaussian
bandits depends on priors. Proc. of the 17th International Conference on Artificial Intelligence
and Statistics (AISTATS). 83

[103] Jiang, C. and Srikant, R. (2003). Bandits with budgets, Proc. of the IEEE Conference on
Decision and Control (CDC), pages 5345–5350. DOI: 10.1109/cdc.2013.6760730. 99

[104] Jun, T. (2004). A survey on the bandit problem with switching costs, De Economist, 4:513–
541. DOI: 10.1007/s10645-004-2477-z. 52

[105] Kalathil, D., Nayyar, N., and Jain, R. (2014). Decentralized learning for multi-
player multi-armed bandits, IEEE Transactions on Information Theory. DOI:
10.1109/cdc.2012.6426587. 116

[106] Kallenberg, L. C. M. (1986). A note on M. N. Katehakis’ and Y.-R. Chen’s com-


putation of the Gittins index. Mathematics of Operations Research, 11(1):184–186. DOI:
10.1287/moor.11.1.184. 26

[107] Kanade, V., McMahan, B., and Bryan, B. (2009). Sleeping experts and bandits with
stochastic action availability and adversarial rewards, Proc. of the 12th International Confer-
ence on Artificial Intelligence and Statistics (AISTATS). 99
[108] Kaufmann, E. (2018). On Bayesian index policies for sequential resource allocation, The
Annals of Statistics, 46(2):842–865. DOI: 10.1214/17-aos1569. 83

[109] Kaufmann, E., Cappe, O., and Garivier, A. (2012). On Bayesian upper confidence
bounds for bandit problems, Proc. of the 15th International Conference on Artificial Intel-
ligence and Statistics (AISTAT). 83

[110] Kaufmann, E., Korda, N., and Munos, R. (2012). Thompson sampling: An optimal finite
time analysis, Proc. of International Conference on Algorithmic Learning Theory (ALT). 83

[111] Katehakis, M. N. and Rothblum, U. G. (1996). Finite state multi-armed bandit prob-
lems: Sensitive-discount, average-reward and average-overtaking optimality, The Annals of
Applied Probability, 6(3):1024–1034. DOI: 10.1214/aoap/1034968239. 55

[112] Katehakis, M. and Veinott, A. (1987). The multi-armed bandit problem: Decom-
position and computation, Mathematics of Operations Research, 12(2):262–268. DOI:
10.1287/moor.12.2.262. 19

[113] Kelly, F. P. (1979). Discussion of Gittins’ paper, Journal of the Royal Statistical Society,
41(2):148–177. 54

[114] Kleinberg, R. (2003). The value of knowing a demand curve: Bounds on regret for online
posted-price auctions, Proc. of the 44th IEEE Symposium on Foundations of Computer Science
(FOCS), pages 594–605. DOI: 10.1109/sfcs.2003.1238232. 125

[115] Kleinberg, R. (2004). Nearly tight bounds for the continuum-armed bandit prob-
lem, Proc. of the 17th International Conference on Neural Information Processing Systems,
pages 697–704. 97

[116] Kleinberg, R., Niculescu-Mizil, A., and Sharma, Y. (2008). Regret bounds for sleep-
ing experts and bandits, Proc. of the Annual Conference on Learning Theory (COLT). DOI:
10.1007/s10994-010-5178-7. 99

[117] Kleinberg, R., Slivkins, A., and Upfal, E. (2015). Bandits and experts in metric spaces.
https://arxiv.org/abs/1312.1277 DOI: 10.1145/3299873. 98, 114

[118] Klimov, G. P. (1974). Time-sharing service systems, Teor. Veroyatnost. i Primenen.,


19(3):558–576, Theory of Probability and its Applications, 19(3):532–551, 1975. DOI:
10.1137/1119060.

[119] Komiyama, J., Honda, J., Kashima, H., and Nakagawa, H. (2015). Regret lower bound
and optimal algorithm in dueling bandit problem, Proc. of Conference on Learning Theory
(COLT). 103
[120] Korda, N., Kaufmann, E., and Munos, R. (2013). Thompson sampling for 1-dimensional
exponential family bandits, Proc. of Advances in Neural Information Processing Systems
(NeurIPS). 83

[121] Krishnamurthy, V. and Wahlberg, B. (2009). Partially observed Markov decision process
multiarmed bandits—structural results, Mathematics of Operations Research, 34(2):287–302.
DOI: 10.1287/moor.1080.0371. 24

[122] Kuhn, H. W. (1955). The Hungarian method for the assignment problem, Naval Res.
Logistics Quart., 2(1–2):83–97. 116

[123] Kveton, B., Szepesvari, C., Wen, Z., and Ashkan, A. (2015). Cascading bandits: learning
to rank in the cascade model, Proc. of the 32nd International Conference on Machine Learning
(ICML). 126

[124] Kveton, B., Wen, Z., Ashkan, A., and Szepesvári, C. (2015). Tight regret
bounds for stochastic combinatorial semi-bandits, Proc. of the 18th International Conference
on Artificial Intelligence and Statistics (AISTATS), pages 535–543. 95, 96

[125] Lai, T. L. (1987). Adaptive treatment allocation and the multi-armed bandit problem,
The Annals of Statistics, 15(3):1091–1114. DOI: 10.1214/aos/1176350495. 73, 80

[126] Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules,
Advances in Applied Mathematics, 6(1):4–22. DOI: 10.1016/0196-8858(85)90002-8. 5,
71, 72, 87, 88, 106, 115

[127] Lai, T. L. and Ying, Z. (1988). Open bandit processes and optimal scheduling of queueing
networks, Advances in Applied Probability, 20(2):447–472. DOI: 10.2307/1427399. 42

[128] Langford, J. and Zhang, T. (2007). The epoch-greedy algorithm for contextual multi-
armed bandits, Proc. of Advances in Neural Information Processing Systems (NeurIPS). 114

[129] Lattimore, T. and Szepesvári, C. (2019). Bandit Algorithms, To be published by Cam-


bridge University Press. https://tor-lattimore.com/downloads/book/book.pdf 79,
93

[130] Le Ny, J., Dahleh, M., and Feron, E. (2008). Multi-UAV dynamic routing with partial
observations using restless bandit allocation indices, Proc. of the American Control Confer-
ence. DOI: 10.1109/acc.2008.4587156. 117, 118

[131] Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit ap-
proach to personalized news article recommendation, WWW, pages 661–670. DOI:
10.1145/1772690.1772758. 114
[132] Liu, H., Liu, K., and Zhao, Q. (2013). Learning in a changing world: Restless multiarmed
bandit with unknown dynamics, IEEE Transactions on Information Theory, 59(3):1902–
1916. DOI: 10.1109/tit.2012.2230215. 89

[133] Liu, F., Lee, J., and Shroff, N. B. (2018). A change-detection based framework for
piecewise-stationary multi-armed bandit problem, Proc. of the AAAI Conference on Artificial
Intelligence. 91

[134] Liu, K. and Zhao, Q. (2008). A restless bandit formulation of opportunistic access: In-
dexablity and index policy, Proc. of the 5th IEEE Conference on Sensor, Mesh and Ad Hoc
Communications and Networks (SECON) Workshops. DOI: 10.1109/sahcnw.2008.12. 118

[135] Liu, K. and Zhao, Q. (2010). Indexability of restless bandit problems and optimality of
Whittle index for dynamic multichannel access, IEEE Transactions on Information Theory,
56(11):5547–5567. 118, 119

[136] Liu, K. and Zhao, Q. (2010). Distributed learning in multi-armed bandit with
multiple players, IEEE Transactions on Signal Processing, 58(11):5667–5681. DOI:
10.1109/tsp.2010.2062509. 116

[137] Liu, K. and Zhao, Q. (2012). Adaptive shortest-path routing under unknown and
stochastically varying link states, IEEE International Symposium on Modeling and Optimiza-
tion in Mobile, Ad Hoc and Wireless Networks (WiOpt), pages 232–237. 95, 96, 121

[138] Locatelli, A., Gutzeit, M., and Carpentier, A. (2016). An optimal algorithm for the
thresholding bandit problem, Proc. of the 33rd International Conference on Machine Learn-
ing (ICML), 48:1690–1698. 112

[139] Lott, C. and Teneketzis, D. (2000). On the optimality of an index rule in multi-channel
allocation for single-hop mobile networks with multiple service classes, Probability in the
Engineering and Informational Sciences, 14:259–297. DOI: 10.1017/s0269964800143013.
50

[140] Maillard, O., Munos, R., and Stoltz, G. (2011). Finite-time analysis of multi-armed ban-
dits problems with Kullback–Leibler divergences, Proc. of Conference On Learning Theory
(COLT). 73

[141] Maillard, O. (2013). Robust risk-averse stochastic multi-armed bandits, Proc. of the
International Conference on Algorithmic Learning Theory (ALT), 8139:218–233. DOI:
10.1007/978-3-642-40935-6_16. 108

[142] Mandelbaum, A. (1986). Discrete multi-armed bandits and multi-parameter processes,


Probability Theory and Related Fields, 71(1):129–147. DOI: 10.1007/bf00366276. 20, 21
[143] Mandelbaum, A. (1987). Continuous multi-armed bandits and multiparameter pro-
cesses, The Annals of Probability, 15(4):1527–1556. DOI: 10.1214/aop/1176991992.

[144] Mannor, S. and Shamir, O. (2011). From bandits to experts: On the value of side-
observations, Proc. of Advances in Neural Information Processing Systems (NeurIPS), 24:684–
692. 100

[145] Mannor, S. and Tsitsiklis, J. N. (2004). The sample complexity of exploration in the multi-
armed bandit problem, Journal of Machine Learning Research, 5:623–648. 111

[146] Marden, J. R., Young, H. P., and Pao, L. Y. (2014). Achieving Pareto optimality through
distributed learning, SIAM Journal on Control and Optimization, 52(5):2753–2770. DOI:
10.1137/110850694. 116

[147] Markowitz, H. (1952). Portfolio selection. The Journal of Finance, 7(1):77–91. 103

[148] McLennan, A. (1984). Price dispersion and incomplete learning in the long run, Journal
of Economic Dynamics and Control, 7(3):331–347. DOI: 10.1016/0165-1889(84)90023-x.
124, 125

[149] Mellor, J. and Shapiro, J. (2013). Thompson sampling in switching environments with
Bayesian online change detection, Proc. of the 16th International Conference on Artificial In-
telligence and Statistics (AISTATS), pages 442–450. 91

[150] Minsker, S. (2013). Estimation of extreme values and associated level sets of a regression
function via selective sampling, Proc. of the 26th Conference on Learning Theory (COLT),
pages 105–121. 98

[151] Naghshvar, M. and Javidi, T. (2013). Active sequential hypothesis testing, Annals of
Statistics, 41(6). DOI: 10.1214/13-aos1144. 111

[152] Nain, P., Tsoucas, P., and Walrand, J. (1989). Interchange arguments in stochastic
scheduling, Journal of Applied Probability, 26(4):815–826. DOI: 10.21236/ada454729.

[153] Nash, P. (1973). Optimal allocation of resources between research projects, Ph.D. thesis,
Cambridge University. 42

[154] Niño-Mora, J. (2001). Restless bandits, partial conservation laws and indexability, Ad-
vances in Applied Probability, 33:76–98. DOI: 10.1017/S0001867800010648. 46, 50

[155] Niño-Mora, J. (2002). Dynamic allocation indices for restless projects and queueing ad-
mission control: A polyhedral approach, Mathematical Programming, 93:361–413. DOI:
10.1007/s10107-002-0362-6.
[156] Niño-Mora, J. (2006). Restless bandit marginal productivity indices, diminishing returns
and optimal control of make-to-order/make-to-stock M/G/1 queues, Mathematics Opera-
tions Research, 31:50–84. DOI: 10.1287/moor.1050.0165. 50

[157] Niño-Mora, J. (2007a). A (2/3)n^3 fast-pivoting algorithm for the Gittins index and op-
timal stopping of a Markov chain, INFORMS Journal on Computing, 19(4):485–664. DOI:
10.1287/ijoc.1060.0206. 26

[158] Niño-Mora, J. (2007b). Dynamic priority allocation via restless bandit marginal produc-
tivity indices, TOP, 15:161–198. DOI: 10.1007/s11750-007-0025-0. 46, 50

[159] Niño-Mora, J. (2008). A faster index algorithm and a computational study for
bandits with switching costs, INFORMS Journal on Computing, 20:255–269. DOI:
10.1287/ijoc.1070.0238. 52

[160] Niño-Mora, J. (2011). Computing a classic index for finite-horizon bandits, INFORMS
Journal on Computing, 23:254–267. DOI: 10.1287/ijoc.1100.0398. 56

[161] Nitinawarat, S., Atia, G. K., and Veeravalli, V. V. (2013). Controlled sensing for mul-
tihypothesis testing, IEEE Transactions on Automatic Control, 58(10):2451–2464. DOI:
10.1109/tac.2013.2261188. 111

[162] Papadimitriou, C. H. and Tsitsiklis, J. N. (1999). The complexity of optimal


queueing network control, Mathematics of Operations Research, 24(2):293–305. DOI:
10.1109/sct.1994.315792. 88

[163] Pradelski, B. S. and Young, H. P. (2012). Learning efficient Nash equilibria in distributed
systems, Games and Economic Behavior, 75(2):882–897. DOI: 10.1016/j.geb.2012.02.017.
116

[164] Puterman, M. L. (2005). Markov Decision Processes: Discrete Stochastic Dynamic Program-
ming, Wiley. DOI: 10.1002/9780470316887. 7, 8, 10, 27, 52

[165] Radlinski, F., Kleinberg, R., and Joachims, T. (2008). Learning diverse rankings
with multiarmed bandits, Proc. of International Conference on Machine Learning (ICML),
pages 784–791. DOI: 10.1145/1390156.1390255. 126

[166] Raghunathan, V., Borkar, V., Cao, M., and Kumar, P. R. (2008). Index policies for real-
time multicast scheduling for wireless broadcast systems, Proc. of IEEE International Con-
ference on Computer Communications (INFOCOM). DOI: 10.1109/infocom.2007.217. 50

[167] Rigollet, P. and Zeevi, A. (2010). Nonparametric bandits with covariates, Proc. of the
Annual Conference on Learning Theory (COLT). 114
[168] Robbins, H. (1952). Some aspects of the sequential design of experiments, Bulletin of the
American Mathematical Society, 58(5):527–535. DOI: 10.1090/s0002-9904-1952-09620-
8. 1, 5, 74, 109

[169] Rosenski, J., Shamir, O., and Szlak, L. (2016). Multi-player bandits—a musical chairs
approach, Proc. of International Conference on Machine Learning (ICML), pages 155–163.
116

[170] Ross, S. M. (1970). Applied Probability Models with Optimization Applications, Holden
Day. 27

[171] Ross, S. M. (1995). Introduction to Stochastic Dynamic Programming, Academic Press.


DOI: 10.1016/C2013-0-11415-8. 7

[172] Rothschild, M. (1974). A two-armed bandit theory of market pricing, Journal of Economic
Theory, 9(2):185–202. DOI: 10.1016/0022-0531(74)90066-0. 123

[173] Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Linearly parameterized bandits, Math-


ematics of Operations Research, 35(2):395–411. DOI: 10.1287/moor.1100.0446. 96

[174] Salomon, A. and Audibert, J. Y. (2011). Deviations of stochastic bandit regret, Proc. of
the International Conference on Algorithmic Learning Theory (ALT). DOI: 10.1007/978-3-
642-24412-4_15. 108

[175] Sani, A., Lazaric, A., and Munos, R. (2012). Risk aversion in multi-armed bandits, Proc.
of Neural Information Processing Systems (NeurIPS). 105, 107

[176] Seldin, Y., Auer, P., Laviolette, F., Shawe-Taylor, J., and Ortner, R. (2011). PAC-
Bayesian analysis of contextual bandits, Proc. of Advances in Neural Information Processing
Systems (NeurIPS). 114

[177] Slivkins, A. (2011). Multi-armed bandits on implicit metric spaces, Proc. of Advances in
Neural Information Processing Systems (NeurIPS). 98

[178] Slivkins, A. (2014). Contextual bandits with similarity information, Journal of Machine
Learning Research, 15:2533–2568. 114

[179] Slivkins, A., Radlinski, F., and Gollapudi, S. (2013). Ranked bandits in metric spaces:
Learning diverse rankings over large document collections, Journal of Machine Learning
Research. 126

[180] Slivkins, A. and Upfal, E. (2008). Adapting to a changing environment: The Brownian
restless bandits, Proc. of the Annual Conference on Learning Theory (COLT). 91
[181] Smallwood, R. and Sondik, E. (1971). The optimal control of partially observable
Markov processes over a finite horizon, Operations Research, pages 1071–1088. DOI:
10.1287/opre.21.5.1071. 118

[182] Sonin, I. M. (2008). A generalized Gittins index for a Markov chain and its recursive cal-
culation, Statistics and Probability Letters, 78:1526–1533. DOI: 10.1016/j.spl.2008.01.049.
26

[183] Stoltz, G. (2005). Incomplete information and internal regret in prediction of individual
sequences. Ph.D. thesis, Université Paris Sud-Paris XI. 93

[184] Streeter, M. and Golovin, D. (2008). An online algorithm for maximizing submodu-
lar functions, Proc. of Advances in Neural Information Processing Systems (NeurIPS). DOI:
10.21236/ada476748. 126

[185] Streeter, M., Golovin, D., and Krause, A. (2009). Online learning of assignments, Proc.
of Advances in Neural Information Processing Systems (NeurIPS). 126

[186] Sundaram, R. K. (2005). Generalized bandit problems, Social Choice and Strategic De-
cisions: Studies in Choice and Welfare, Austen-Smith, D. and Duggan, J., Eds., Springer,
Berlin, Heidelberg. DOI: 10.1007/3-540-27295-x_6. 52

[187] Sutton, R. and Barto, A. (1998). Reinforcement Learning, an Introduction. Cambridge,


MIT Press/Bradford Books. 74

[188] Tehrani, P., Zhai, Y., and Zhao, Q. (2012). Dynamic pricing under finite space de-
mand uncertainty: A multi-armed bandit with dependent arms. https://arxiv.org/abs/1206.5345 124, 125

[189] Tekin, C. and Liu, M. (2012). Online learning of rested and restless bandits, IEEE Trans-
action on Information Theory, 58(8):5588–5611. DOI: 10.1109/tit.2012.2198613. 88, 89

[190] Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds
another in view of the evidence of two samples, Biometrika, 25(3/4):275–294. DOI:
10.1093/biomet/25.3-4.285. 1, 12, 81, 109, 123

[191] Tran-Thanh, L., Chapman, A., de Cote, E. M., Rogers, A., and Jennings, N. R. (2010).
Epsilon-first policies for budget-limited multi-armed bandits, Proc. of the AAAI Conference
on Artificial Intelligence. 99

[192] Tran-Thanh, L., Chapman, A., Rogers, A., and Jennings, N. R. (2012). Knapsack based
optimal policies for budget-limited multi-armed bandits, Proc. of the AAAI Conference on
Artificial Intelligence. 99
[193] Vakili, S., Boukouvalas, A., and Zhao, Q. (2019). Decision variance in risk-averse online
learning, Proc. of the IEEE Conference on Decision and Control (CDC). 106

[194] Vakili, S., Liu, K., and Zhao, Q. (2013). Deterministic sequencing of exploration and
exploitation for multi-armed bandit problems, IEEE Journal of Selected Topics in Signal Pro-
cessing, 7(5):759–767. DOI: 10.1109/jstsp.2013.2263494. 74, 75, 79, 116

[195] Vakili, S. and Zhao, Q. (2015). Mean-variance and value at risk in multi-armed ban-
dit problems, Proc. of the 53rd Annual Allerton Conference on Communication, Control, and
Computing. DOI: 10.1109/allerton.2015.7447162. 106, 108

[196] Vakili, S. and Zhao, Q. (2016). Risk-averse multi-armed bandit problems under mean-
variance measure, IEEE Journal of Selected Topics in Signal Processing, 10(6):1093–1111.
DOI: 10.1109/jstsp.2016.2592622. 105, 106, 107

[197] Vakili, S., Zhao, Q., Liu, C., and Chuah, C. N. (2018). Hierarchical heavy hitter detec-
tion under unknown models, Proc. of IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP). 122, 123

[198] Valko, M., Carpentier, A., and Munos, R. (2013). Stochastic simultaneous optimistic
optimization, Proc. of the 30th International Conference on Machine Learning (ICML),
pages 19–27. 98

[199] Valko, M., Munos, R., Kveton, B., and Kocák, T. (2014). Spectral bandits for smooth
graph functions, Proc. of the 31th International Conference on Machine Learning (ICML).
96, 97

[200] Varaiya, P., Walrand, J., and Buyukkoc, C. (1985). Extensions of the multiarmed bandit
problem: The discounted case, IEEE Transactions on Automatic Control, 30(5):426–439.
DOI: 10.1109/tac.1985.1103989. 21, 24

[201] Veatch, M. L. and Wein, M. (1996). Scheduling a make-to-stock queue: Index policies
and hedging points, Operations Research, 44:634–647. DOI: 10.1287/opre.44.4.634. 50

[202] Veinott, A. F., Jr. (1966). On finding optimal policies in discrete dynamic pro-
gramming with no discounting, Annals of Mathematical Statistics, 37:1284–1294. DOI:
10.1214/aoms/1177699272. 55

[203] Vogel, W. (1960a). A sequential design for the two armed bandit, The Annals of Mathe-
matical Statistics, 31(2):430–443. DOI: 10.1214/aoms/1177705906. 1

[204] Vogel, W. (1960b). An asymptotic minimax theorem for the two armed bandit problem,
Annals of Mathematical Statistics, 31(2):444–451. DOI: 10.1214/aoms/1177705907. 1, 6,
66
[205] Wang, C. C., Kulkarni, S. R., and Poor, H. V. (2005). Bandit problems with
side observations, IEEE Transactions on Automatic Control, 50(3):338–355. DOI:
10.1109/tac.2005.844079. 113

[206] Weber, R. R. and Weiss, G. (1990). On an index policy for restless bandits, Journal of
Applied Probability, 27(3):637–648. DOI: 10.2307/3214547. 4, 50

[207] Weber, R. R. and Weiss, G. (1991). Addendum to on an index policy for restless bandits,
Advances in Applied Probability, 23(2):429–430. DOI: 10.2307/1427757. 50

[208] Weber, R. R. (1992). On the Gittins index for multiarmed bandits, The Annals of Applied
Probability, 2(4):1024–1033. DOI: 10.1214/aoap/1177005588. 4, 18, 22

[209] Weiss, G. (1988). Branching bandit processes, Probability in the Engineering and Infor-
mational Sciences, 2(3):269–278. DOI: 10.1017/s0269964800000826. 42

[210] Whittle, P. (1980). Multi-armed bandits and the Gittins index, Journal of the Royal Sta-
tistical Society, series B, 42(2):143–149. DOI: 10.1111/j.2517-6161.1980.tb01111.x. 19,
37, 38

[211] Whittle, P. (1981). Arm-acquiring bandits, The Annals of Probability, 9(2):284–292. DOI:
10.1214/aop/1176994469. 42

[212] Whittle, P. (1988). Restless bandits: Activity allocation in a changing world, Journal of
Applied Probability, 25:287–298. DOI: 10.1017/s0021900200040420. 4, 34, 42, 46

[213] Wu, H. and Liu, X. (2016). Double Thompson sampling for dueling bandits, Proc. of
Advances in Neural Information Processing Systems (NeurIPS). 103

[214] Xu, X., Vakili, S., Zhao, Q., and Swami, A. (2019). Multi-armed bandits on partially
revealed unit interval graphs, IEEE Transactions on Network Science and Engineering. DOI:
10.1109/TNSE.2019.2935256. 98

[215] Yakowitz, S. and Lowe, W. (1991). Nonparametric bandit methods. Annals of Operations
Research, 28(1–4):297–312. DOI: 10.1007/BF02055587. 74

[216] Yue, Y., Broder, J., Kleinberg, R., and Joachims, T. (2012). The K-armed duel-
ing bandits problem, Journal of Computer and System Sciences, 78:1538–1556. DOI:
10.1016/j.jcss.2011.12.028. 102

[217] Zhai, Y., Tehrani, P., Li, L., Zhao, J., and Zhao, Q. (2011). Dynamic pricing under binary
demand uncertainty: A multi-armed bandit with correlated arms, Proc. of the 45th IEEE
Asilomar Conference on Signals, Systems, and Computers. DOI: 10.1109/acssc.2011.6190289.
124, 125
[218] Zhao, Q., Krishnamachari, B., and Liu, K. (2008). On myopic sensing for multi-channel
opportunistic access: Structure, optimality, and performance, IEEE Transactions on Wireless
Communications, 7(12):5431–5440. DOI: 10.1109/t-wc.2008.071349. 119
[219] Zhao, Q. and Sadler, B. (2007). A survey of dynamic spectrum access, IEEE Signal Pro-
cessing Magazine, 24(3):79–89. DOI: 10.1109/msp.2007.361604. 117
[220] Zoghi, M., Whiteson, S., Munos, R., and de Rijke, M. (2014). Relative upper confidence
bound for the k-armed dueling bandit problem, Proc. of International Conference on Machine
Learning (ICML), pages 10–18. 103
[221] Zoghi, M., Karnin, Z. S., Whiteson, S., and de Rijke, M. (2015). Copeland dueling
bandits, Proc. of Advances in Neural Information Processing Systems (NeurIPS), pages 307–
315. 103

Author’s Biography
QING ZHAO
Qing Zhao is a Joseph C. Ford Professor of Engineering at Cornell University. Prior to that,
she was a Professor in the ECE Department at the University of California, Davis. She received a
Ph.D. in Electrical Engineering from Cornell in 2001. Her research interests include sequential
decision theory, stochastic optimization, machine learning, and algorithmic theory with appli-
cations in infrastructure, communications, and social-economic networks. She is a Fellow of
IEEE, a Distinguished Lecturer of the IEEE Signal Processing Society, a Marie Skłodowska-
Curie Fellow of the European Union Research and Innovation program, and a Jubilee Chair
Professor of Chalmers University during her 2018–2019 sabbatical leave. She received the 2010
IEEE Signal Processing Magazine Best Paper Award and the 2000 Young Author Best Paper
Award from IEEE Signal Processing Society.
