The Need for Explicit Exploration in Model-based Reinforcement Learning

In model-based reinforcement learning, we aim to optimize the expected performance of a policy in a stochastic environment by learning a transition model that captures both epistemic uncertainty (structural uncertainty, which decays as more data is collected) and aleatoric uncertainty (inherent noise, independent of the amount of data). When optimizing the policy, current algorithms typically average over both types of uncertainty. In this blog post, we discuss problems with this approach and briefly introduce a tractable variant of optimistic exploration based on our upcoming NeurIPS paper [4].
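To make the "average over both types of uncertainty" step concrete, here is a minimal sketch (not the paper's method). The ensemble of linear dynamics, the reward, and the policy are all illustrative assumptions: disagreement between ensemble members stands in for epistemic uncertainty, and the per-step noise stands in for aleatoric uncertainty. A Monte Carlo estimate of the policy's return then averages over both.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D environment model: a small ensemble of linear dynamics.
# Each member is one epistemic sample; its noise_std is the aleatoric noise.
ensemble = [
    {"slope": s, "noise_std": 0.1}   # disagreement across slopes = epistemic
    for s in (0.90, 0.95, 1.05)
]

def step(model, state, action):
    """Sample a next state from one ensemble member (aleatoric noise added)."""
    mean = model["slope"] * state + action
    return mean + model["noise_std"] * rng.standard_normal()

def reward(state):
    return -state**2  # illustrative: reward staying near the origin

def mean_return(policy, horizon=10, n_rollouts=200):
    """Estimate expected return by averaging over BOTH uncertainties:
    epistemic (one ensemble member per rollout) and aleatoric (noise
    at every step) -- the averaging current algorithms typically do."""
    total = 0.0
    for _ in range(n_rollouts):
        model = ensemble[rng.integers(len(ensemble))]  # epistemic sample
        state = 1.0
        for _ in range(horizon):
            state = step(model, state, policy(state))  # aleatoric sample
            total += reward(state)
    return total / n_rollouts

# A simple proportional policy that pushes the state toward zero.
value = mean_return(lambda s: -0.5 * s)
```

A policy optimizer would then pick the policy maximizing this averaged estimate; the blog post's point is that this averaging ignores the distinction between the two uncertainty types, which explicit (e.g. optimistic) exploration exploits.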