

Bayesian Reinforcement Learning in Continuous POMDPs

Stéphane Ross¹, Brahim Chaib-draa² and Joelle Pineau¹

¹ School of Computer Science, McGill University, Canada
² Department of Computer Science, Laval University, Canada

May 23rd, 2008


Motivation

Robots have to make decisions under:


Imperfect Actuators
Noisy Sensors
Poor/Approximate Model

How to maximize long-term rewards?


[Rottmann]


Continuous POMDP

States: S ⊆ ℝ^m
Actions: A ⊆ ℝ^n
Observations: Z ⊆ ℝ^p
Rewards: R(s, a) ∈ ℝ

Gaussian model for Transition/Observation function:


\[
s_t = g_T(s_{t-1}, a_{t-1}, X_t), \qquad X_t \sim N(\mu_X, \Sigma_X)
\]
\[
z_t = g_O(s_t, a_{t-1}, Y_t), \qquad Y_t \sim N(\mu_Y, \Sigma_Y)
\]


Example

Simple Robot Navigation Task:

\[
\begin{pmatrix} x' \\ y' \end{pmatrix}
= \begin{pmatrix} x \\ y \end{pmatrix}
+ v \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}
\]

\[
\begin{pmatrix} z_x \\ z_y \end{pmatrix}
= \begin{pmatrix} x \\ y \end{pmatrix}
+ \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}
\]

+1 reward when ||s − s_GOAL||_2 < d
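To make this concrete, here is a minimal Python sketch (not from the slides) of one simulation step of the navigation task, instantiating g_T and g_O from the previous slide; the action components (v, θ), the goal position, and the "true" noise parameters are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" noise parameters; in the paper's setting these are unknown to the agent.
mu_X, Sigma_X = np.array([1.0, 0.0]), 0.01 * np.eye(2)
mu_Y, Sigma_Y = np.zeros(2), 0.05 * np.eye(2)

def rot(theta):
    """2-D rotation matrix R(theta)."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def transition(s, v, theta):
    """g_T: s' = s + v * R(theta) @ X, with X ~ N(mu_X, Sigma_X)."""
    X = rng.multivariate_normal(mu_X, Sigma_X)
    return s + v * rot(theta) @ X

def observe(s):
    """g_O: z = s + Y, with Y ~ N(mu_Y, Sigma_Y)."""
    return s + rng.multivariate_normal(mu_Y, Sigma_Y)

def reward(s, s_goal, d=0.1):
    """+1 when the robot is within distance d of the goal."""
    return 1.0 if np.linalg.norm(s - s_goal) < d else 0.0

s = np.zeros(2)                                  # start at the origin
s = transition(s, v=0.5, theta=np.pi / 4)        # take one noisy step
z = observe(s)                                   # receive a noisy position reading
r = reward(s, s_goal=np.array([1.0, 1.0]))
```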


Problem

In practice: μ_X, Σ_X, μ_Y, Σ_Y are unknown.
Need to trade off between:
Learning the model
Identifying the state
Gathering rewards


Bayesian Reinforcement Learning

[Diagram: the Bayesian RL loop: current prior/posterior → action → observation → new posterior]


Bayesian Reinforcement Learning

Planning problem representable as a new POMDP:

States: hyper-states (s, θ), where θ stands for the unknown model parameters (μ_X, Σ_X, μ_Y, Σ_Y)
Actions: a ∈ A
Observations: z ∈ Z
Rewards: R(s, θ, a) = R(s, a)

Joint Transition-Observation Probabilities:

\[
\Pr(s', \theta', z \mid s, \theta, a) = \Pr(s', z \mid s, a, \theta)\, I_{\theta}(\theta')
\]


Bayesian Reinforcement Learning

Belief State = Posterior


Belief update:

\[
b^{a,z}(s', \theta) \propto \int_S b(s, \theta)\, \Pr(s', z \mid s, a, \theta)\, ds
\]

Optimal policy by solving:

\[
V^*(b) = \max_{a \in A} \left[ \int_S R(s, a)\, \Pr(s \mid b)\, ds \;+\; \gamma \int_Z \Pr(z \mid b, a)\, V^*(b^{a,z})\, dz \right]
\]


Belief Update

Bayesian Learning of (μ, Σ):

Normal-Wishart prior ⇒ Normal-Wishart posterior
Parametrized by (n, μ̂, Σ̂)

Start with prior: (n_0, μ̂_0, Σ̂_0)

Posterior Update (after observing X = x):

\[
n' = n + 1, \qquad
\hat{\mu}' = \frac{n\hat{\mu} + x}{n + 1}, \qquad
\hat{\Sigma}' = \frac{n - 1}{n}\hat{\Sigma} + \frac{1}{n + 1}(x - \hat{\mu})(x - \hat{\mu})^{T}
\]
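As a sanity check, the update above translates directly into a few lines of Python; the prior values in the example call are arbitrary placeholders.

```python
import numpy as np

def nw_update(n, mu_hat, Sigma_hat, x):
    """Fold one observed noise sample x into the Normal-Wishart
    sufficient statistics (n, mu_hat, Sigma_hat), as on this slide."""
    x = np.asarray(x, dtype=float)
    diff = x - mu_hat
    n_new = n + 1
    mu_new = (n * mu_hat + x) / (n + 1)
    Sigma_new = ((n - 1) / n) * Sigma_hat + np.outer(diff, diff) / (n + 1)
    return n_new, mu_new, Sigma_new

# Example: arbitrary prior (n_0, mu_0, Sigma_0), then one observation.
n0, mu0, Sigma0 = 2, np.zeros(2), np.eye(2)
n1, mu1, Sigma1 = nw_update(n0, mu0, Sigma0, x=np.array([0.9, 0.1]))
```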


Belief Update

But X is not directly observable:

\[
\Pr(\mu, \Sigma \mid z) \propto \int \Pr(\mu, \Sigma \mid x)\, \Pr(z \mid x)\, \Pr(x)\, dx
\]

Approximate the infinite mixture by a finite mixture

Particle filter:
Use particles of the form (s, φ, ψ)
φ, ψ: Normal-Wishart posterior parameters for X, Y
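A rough Python sketch of one filter step for the navigation example, under simplifying assumptions of my own: each particle's Normal-Wishart posterior is summarized by its current point estimates rather than the full predictive distribution, weighting uses a plain Gaussian observation likelihood, and resampling is multinomial. It reuses nw_update and rot from the earlier sketches; the authors' actual update scheme may differ.

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_pdf(y, mu, Sigma):
    """Density of N(mu, Sigma) at y."""
    d = y - mu
    k = len(y)
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / \
        np.sqrt((2 * np.pi) ** k * np.linalg.det(Sigma))

def pf_step(particles, weights, v, theta, z):
    """One belief update: propagate each particle (s, phi, psi), weight it by the
    observation likelihood, update its Normal-Wishart statistics, then resample."""
    new_particles, new_weights = [], []
    for (s, phi, psi), w in zip(particles, weights):
        n_x, mu_x, Sig_x = phi
        n_y, mu_y, Sig_y = psi
        x = rng.multivariate_normal(mu_x, Sig_x)    # sample transition noise X
        s_new = s + v * rot(theta) @ x              # propagate the state
        y = z - s_new                               # implied observation noise Y
        w_new = w * gaussian_pdf(y, mu_y, Sig_y)    # weight by likelihood of z
        new_particles.append((s_new,
                              nw_update(n_x, mu_x, Sig_x, x),
                              nw_update(n_y, mu_y, Sig_y, y)))
        new_weights.append(w_new)
    new_weights = np.array(new_weights)
    new_weights /= new_weights.sum()
    idx = rng.choice(len(new_particles), size=len(new_particles), p=new_weights)
    resampled = [new_particles[i] for i in idx]
    return resampled, np.full(len(resampled), 1.0 / len(resampled))
```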


Particle Filter


Online Planning
Monte Carlo Online Planning (Receding Horizon Control):

[Figure: lookahead search tree rooted at the current belief b_0, branching over actions a_1, ..., a_n and observations o_1, ..., o_n, leading to successor beliefs b_1, b_2, b_3, ...]

Simple Robot Navigation Task


Average evolution of the return over time:

[Figure: average return (0–1) vs. training steps (0–250); curves: Prior model, Exact Model, Learning]


Simple Robot Navigation Task

Average accuracy of the model over time:

[Figure: model accuracy, measured by WL1 (y-axis, 0–1), vs. training steps (x-axis, 0–250)]

Model accuracy is measured as follows:

\[
WL1(b) = \sum_{(s, \phi, \psi)} b(s, \phi, \psi) \left[ \|\mu_{\phi} - \mu_X\|_1 + \|\Sigma_{\phi} - \Sigma_X\|_1 + \|\mu_{\psi} - \mu_Y\|_1 + \|\Sigma_{\psi} - \Sigma_Y\|_1 \right]
\]
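With the particle representation used earlier, this metric is just a weighted sum over particles. A minimal Python transcription follows, reusing the (s, φ, ψ) particle format from the filter sketch and interpreting the L1 norm of the matrix differences entrywise; that reading is an assumption on my part.

```python
import numpy as np

def wl1(particles, weights, mu_X, Sigma_X, mu_Y, Sigma_Y):
    """Weighted L1 distance between each particle's noise-parameter estimates
    and the true parameters; lower values mean a more accurate learned model."""
    total = 0.0
    for (s, phi, psi), w in zip(particles, weights):
        _, mu_phi, Sig_phi = phi
        _, mu_psi, Sig_psi = psi
        total += w * (np.abs(mu_phi - mu_X).sum() + np.abs(Sig_phi - Sigma_X).sum()
                      + np.abs(mu_psi - mu_Y).sum() + np.abs(Sig_psi - Sigma_Y).sum())
    return total
```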


Conclusion

Presented a framework for optimal control under model and state uncertainty.
Monte Carlo approximations for efficient tracking and planning.
The framework can easily be extended to unknown rewards and mixture-of-Gaussians models.


Future Work

What if g_T, g_O are unknown?
What if (μ, Σ) change over time?
More efficient planning algorithms.
Apply to a real robot.


Thank you!

Questions?

