RLHF
Reinforcement Learning Overview
David Silver in his introductory class on RL makes the following distinctions between reinforcement learning and supervised learning:
- "There is no supervisor, only a reward signal". I am not a fan of this distinction because in supervised learning we don't have a supervisor. Generally, we're not trying to perfectly interpolate the data. Rather, we have a reward signal of the loss function \((y - f_{\theta}(x))^2\)
- "Feedback is delayed, not instantaneous"
- "Time really matters (sequential, non i.i.d data)" and "Agent's actions affect the subsequent data it receives"
The RL Problem
Let \(\Theta\) be a parameter space and \(\theta \in \Theta\) a parameter vector that governs the dynamics of a stochastic process: \(\big(\Omega, \mathcal{F}, \mathbb{P}_{\theta}\big)\). We are interested in identifying the parameter \(\theta^*\) that maximizes the expected cumulative reward over time. Formally, the optimization problem can be stated as:
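(In the sketch below, \(R(\tau)\) denotes the cumulative reward of a trajectory \(\tau\); both are defined next.)
\[
\theta^{*} \in \arg\max_{\theta \in \Theta} \; \mathbb{E}_{\mathbb{P}_{\theta}}\big[R(\tau)\big].
\]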
We can define the following two random variables (trajectories, and rewards of trajectories):
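One conventional choice, assuming states \(s_t\), actions \(a_t\), a per-step reward function \(r\), and a horizon \(T\), is
\[
\tau = (s_0, a_0, s_1, a_1, \ldots, s_T), \qquad R(\tau) = \sum_{t=0}^{T-1} r(s_t, a_t).
\]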
Then the expected reward is denoted by:
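(Writing \(J\) for this quantity, an assumed choice of notation:)
\[
J(\theta) = \mathbb{E}_{\tau \sim \mathbb{P}_{\theta}}\big[R(\tau)\big].
\]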
Policy Search
The idea behind policy search is to approximate the above value by constructing a new probability space, which is just the \(n\)-fold product of the original probability space:
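(A sketch of the product-space notation:)
\[
\Big(\Omega^{n}, \; \mathcal{F}^{\otimes n}, \; \mathbb{P}_{\theta}^{\otimes n}\Big).
\]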
On this probability space, we can define our estimator:
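With i.i.d. coordinates \(\tau_1, \ldots, \tau_n \sim \mathbb{P}_{\theta}\), the natural Monte Carlo estimator (the \(\hat{J}_n\) notation is an assumption here) is
\[
\hat{J}_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} R(\tau_i).
\]
A minimal code sketch of this estimator, assuming a hypothetical `sample_trajectory(theta)` that draws one trajectory from \(\mathbb{P}_{\theta}\) and a `trajectory_reward(tau)` that returns its cumulative reward:

```python
import numpy as np

def estimate_expected_reward(sample_trajectory, trajectory_reward, theta, n=1000):
    """Monte Carlo estimate of J(theta) = E[R(tau)] under P_theta.

    `sample_trajectory` and `trajectory_reward` are placeholders for the
    policy/environment rollout and the cumulative-reward computation.
    """
    rewards = [trajectory_reward(sample_trajectory(theta)) for _ in range(n)]
    return float(np.mean(rewards))
```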
RLHF
- The challenge with this set-up is the effective domain of the reward model: it is only reliable on outputs close to the distribution it was trained on, and optimizing against it can push the policy outside that domain.
- We use the KL constraint to try to account for this issue, keeping the policy close to a reference model; see the sketch below.
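The KL constraint is commonly implemented as a penalty term; one common form of the resulting objective (the symbols \(r_{\phi}\) for the learned reward model, \(\pi_{\mathrm{ref}}\) for the reference policy, \(\mathcal{D}\) for the prompt distribution, and \(\beta\) for the penalty weight are assumed notation here) is
\[
\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\big[r_{\phi}(x, y)\big] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\Big[\mathrm{KL}\big(\pi_{\theta}(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)\Big].
\]
In implementations this often appears as a per-sample penalized reward, since \(\log \pi_{\theta}(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)\) is a single-sample estimate of the KL term. A minimal sketch, assuming the reward-model scores and log-probabilities have already been computed:

```python
import numpy as np

def kl_penalized_rewards(rm_scores, logprobs_policy, logprobs_ref, beta=0.1):
    """Per-sample penalized reward: r(x, y) - beta * (log pi_theta(y|x) - log pi_ref(y|x)).

    All arguments are arrays of shape (batch,); the log-prob difference is a
    single-sample estimate of KL(pi_theta || pi_ref) for each prompt.
    """
    kl_estimate = np.asarray(logprobs_policy) - np.asarray(logprobs_ref)
    return np.asarray(rm_scores) - beta * kl_estimate
```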