State space $\mathcal{S}$: a set of states of the environment.
Action space $\mathcal{A}$: a set of actions that an agent can select at a given state
Markov Decision Process (MDP) Environment:
Markov property: $P(S_{t+1}~|~S_t, A_t) = P(S_{t+1}~|~S_t, A_t, S_{t-1}, A_{t-1},\cdots)$, i.e., transitions depend only on the most recent state and action, not on any earlier history.
$\mu_0(\cdot)$ is the start-state distribution to sample the starting state $S_0$
State transition: $P(s'~|~s,a) = \text{Prob}[S_{t+1} = s'~|~S_t = s, A_t=a]$
Trajectory (aka, rollout or episode) is a sequence of states and actions $\tau = (s_0, a_0, s_1, a_1,\cdots)$
Reward function $r: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$, $r_t = r(s_t,a_t,s_{t+1})$
Discount factor $0 < \gamma < 1$: discounts future rewards to their present value
Return: cumulative rewards
$G_t = r_{t} + \gamma r_{t+1} + \cdots = \sum_{t'=t}^{\infty~\textrm{or}~T} \gamma^{t'-t}r_{t'}$, $G_t = r_{t} + \gamma G_{t+1}$
For a trajectory $\tau$, the return is $G(\tau) = G_0 = \sum_{t \geq 0}\gamma^{t} r_{t}$
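A minimal sketch (not from the text) of computing $G_t$ for every step of an episode with the recursion $G_t = r_t + \gamma G_{t+1}$:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t for every t via the backward recursion G_t = r_t + gamma * G_{t+1}."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# Example: a three-step episode with rewards 1, 0, 2 and gamma = 0.9
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]
```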
On-policy value functions: For a policy $\pi$,
state-value function: $V_\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[G_t~|~S_t = s\right]$, the expected return when starting in $s$ and following $\pi$ thereafter
action-value function: $Q_\pi(s,a) = \mathbb{E}_{\tau \sim \pi}\left[G_t~|~S_t = s, A_t = a\right]$, the expected return when starting in $s$, taking $a$, and following $\pi$ thereafter
Advantage function $A_\pi(s,a) := Q_\pi(s,a) - V_\pi(s)$, i.e., how much better taking action $a$ in state $s$ is than acting according to $\pi$ on average
Bellman equations:
$V_\pi(s) = \mathbb{E}_{a \sim \pi, s' \sim P}\left[r(s,a,s') + \gamma V_\pi(s') \right]$
$Q_\pi(s,a) = \mathbb{E}_{s' \sim P} \left[r(s,a,s') + \gamma \mathbb{E}_{a' \sim \pi} Q_\pi(s', a')\right]$
The Bellman backup (for a state, or state-action pair) refers to the right-hand side of the Bellman equation, the reward-plus-next-value.
Also, $V_\pi(s) = \mathbb{E}_{a \sim \pi} Q_\pi(s,a)$
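For illustration, a tabular sketch (not from the text) of iterative policy evaluation, which repeatedly applies the Bellman backup for $V_\pi$; the arrays `P`, `R`, and `pi` are assumed placeholders for the transition, reward, and policy tables:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """
    P[s, a, s'] : transition probabilities P(s' | s, a)
    R[s, a, s'] : rewards r(s, a, s')
    pi[s, a]    : policy probabilities pi(a | s)
    Returns V_pi as a vector over states.
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Bellman backup: V(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [r(s,a,s') + gamma V(s')]
        V_new = np.einsum("sa,sap,sap->s", pi, P, R + gamma * V[None, None, :])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```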
Optimal value functions
$V^{*}(s) = \underset{\pi}{\text{max}} V_\pi(s)$
$Q^{*}(s,a) = \underset{\pi}{\text{max}} Q_\pi(s,a)$
$V^{*}(s) = \text{max}_{a} Q^{*}(s,a)$
Exercise: Write the Bellman equations for the optimal value functions $V^{*}$ and $Q^{*}$
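As a hedged illustration of how the optimal Bellman backup is used in practice, a minimal value-iteration sketch on the same tabular `P`, `R` placeholders as above:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup until convergence."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q(s,a) = sum_s' P(s'|s,a) [r(s,a,s') + gamma V(s')]
        Q = np.einsum("sap,sap->sa", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)               # V*(s) = max_a Q*(s,a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # optimal values and a greedy policy
        V = V_new
```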
Parameterized policy $\pi_\theta(a~|~s)$ with parameters $\theta \in \mathbb{R}^d$
Objective function: $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[G(\tau)\right]$
Optimization: $$\theta^{*} = \underset{\theta \in \mathbb{R}^d}{\text{ argmax }} J(\theta),$$ which can be solved with:
Find a local maximum of the objective $J(\theta)$ by gradient ascent: $\theta^{\text{next}} \leftarrow \theta^{\text{cur}} + \alpha \nabla_\theta J(\theta)_{|\theta = \theta^{\text{cur}}},$
where $\alpha > 0$ is the step size (learning rate)
$\rightarrow$ the main problem is to compute the policy gradient!
A naive numerical approximation (finite differences): $$\frac{\partial J}{\partial \theta_i} \approx \frac{J(\theta + \epsilon u^{(i)}) - J(\theta)}{\epsilon},$$ where $u^{(i)}$ ($i=1,\cdots,d$) is the $d$-dimensional unit vector with $u^{(i)}_i = 1, u^{(i)}_j = 0$ for $j \neq i$
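A sketch of this finite-difference estimate; `estimate_J` is a hypothetical callable that returns a (possibly noisy) estimate of $J(\theta)$, e.g., by averaging returns over rollouts:

```python
import numpy as np

def finite_difference_gradient(estimate_J, theta, eps=1e-2):
    """Approximate dJ/d(theta_i) by perturbing one coordinate at a time."""
    d = theta.shape[0]
    grad = np.zeros(d)
    J0 = estimate_J(theta)
    for i in range(d):
        u = np.zeros(d)
        u[i] = 1.0                                       # unit vector u^(i)
        grad[i] = (estimate_J(theta + eps * u) - J0) / eps
    return grad
```

Note that this needs $d+1$ evaluations of $J$ per gradient, each of which may require many rollouts, which quickly becomes expensive for large $d$.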
But can we compute the gradient analytically? $\rightarrow$ Yes, Policy Gradient methods
Can we compute the policy gradient analytically?
$$J(\theta) = \mathbb{E}_{s \sim \mu_0(\cdot),~a \sim \pi_\theta(\cdot~|~s)}\left[r(s,a)\right] = \sum_s \mu_0(s) \sum_a \pi_\theta(a~|~s) r(s,a)$$
$$\nabla_\theta J(\theta) = \sum_s \mu_0(s) \sum_a \nabla_\theta \pi_\theta(a~|~s) r(s,a)$$
Applying the log-derivative trick, $\nabla_\theta \pi_\theta(a~|~s) = \pi_\theta(a~|~s)\frac{\nabla_\theta \pi_\theta(a~|~s)}{\pi_\theta(a~|~s)} = \pi_\theta(a~|~s) \nabla_\theta\log \pi_\theta(a~|~s)$,
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mu_0(\cdot),~a \sim \pi_\theta(\cdot~|~s)}[\nabla_\theta\log \pi_\theta(a~|~s) r(s,a)]$$
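A small numerical check (purely illustrative) of this identity for a one-step, bandit-like problem with three actions and a softmax policy; the reward vector and parameterization below are made-up assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
r = np.array([1.0, 2.0, 0.5])        # r(a) for 3 actions (single state)
theta = np.zeros(3)                   # softmax policy: pi(a) proportional to exp(theta_a)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

p = softmax(theta)

# Exact gradient of J = sum_a pi(a) r(a); for a softmax, grad_theta log pi(a) = e_a - p
exact = sum(p[a] * (np.eye(3)[a] - p) * r[a] for a in range(3))

# Score-function Monte Carlo estimate: E_{a~pi}[grad log pi(a) * r(a)]
samples = rng.choice(3, size=100_000, p=p)
scores = np.eye(3)[samples] - p       # rows are grad log pi(a) for each sampled a
mc = (scores * r[samples, None]).mean(axis=0)

print(exact)   # analytic gradient
print(mc)      # should agree up to sampling noise
```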
Consider the episodic setting with a finite-horizon undiscounted return; a similar derivation applies to the continuing setting with an infinite-horizon discounted return
Trajectory $\tau = (s_0, a_0, s_1, \cdots, s_{T-1}, a_{T-1}, s_T)$, which is sampled with the following process:
$s_0 \sim \mu_0(\cdot),~a_0 \sim \pi_\theta(\cdot~|~s_0),~s_1 \sim P(\cdot~|~s_0, a_0)$
$a_t \sim \pi_\theta(\cdot~|~s_t),~s_{t+1} \sim P(\cdot~|~s_{t}, a_{t})$ for $t=1,\cdots,T-1$
$s_T$ is a terminal state, ending the trajectory
Probability of a trajectory given a policy $\pi_\theta$:
$$P(\tau; \pi_\theta) = \mu_0(s_0)\prod_{t \geq 0}\pi_\theta(a_t~|~s_t)P(s_{t+1}~|~s_t, a_t)$$ $$\log P(\tau; \pi_\theta) = \log\mu_0(s_0) + \sum_{t\geq 0}\log P(s_{t+1}~|~s_t, a_t) + \sum_{t\geq 0}\log\pi_\theta(a_t~|~s_t)$$
$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[G(\tau)\right] = \int_\tau G(\tau)P(\tau; \pi_\theta) d\tau$
$$\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \int_\tau G(\tau)P(\tau; \pi_\theta)\, d\tau = \int_\tau G(\tau) P(\tau; \pi_\theta) \frac{\nabla_\theta P(\tau; \pi_\theta)}{P(\tau; \pi_\theta)}\, d\tau \\ &= \int_\tau G(\tau) P(\tau; \pi_\theta) \nabla_\theta \log P(\tau; \pi_\theta)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\left[G(\tau) \nabla_\theta \log P(\tau; \pi_\theta)\right] \end{aligned}$$
Note: $\log P(\tau; \pi_\theta) = \log\mu_0(s_0) + \sum_{t\geq 0}\log P(s_{t+1}~|~s_t, a_t) + \sum_{t\geq 0}\log\pi_\theta(a_t~|~s_t)$, where only the last sum depends on $\theta$
$\rightarrow \nabla_\theta \log P(\tau; \pi_\theta) = \sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)$
Policy Gradient: $$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[G(\tau)\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)\right]$$
cf. the maximum likelihood objective when predicting actions from states with samples $\{s^{(i)}_t, a^{(i)}_t\}_{t\geq 0,~i=1,\cdots,N}$:
$$\nabla_\theta J_{MLE}(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left[\sum_{t\geq 0}\nabla_\theta\log\pi_\theta(a_t^{(i)}~|~s_t^{(i)})\right]$$
Intuitively, the PG estimate can be seen as a weighted MLE update, in which each trajectory's log-likelihood gradient is weighted by that trajectory's total reward (see the sketch below):
$$\widehat{g} = \frac{1}{N}\sum_{i=1}^N \left[\left(\sum_{t' \geq 0}r_{t'}^{(i)}\right)\sum_{t\geq 0}\nabla_\theta\log\pi_\theta(a_t^{(i)}~|~s_t^{(i)})\right]$$
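A sketch of the sample estimate $\widehat{g}$; each episode is assumed to be stored as its reward sequence together with the per-step gradients $\nabla_\theta\log\pi_\theta(a_t~|~s_t)$ (how those gradients are computed depends on the policy parameterization and is not shown):

```python
import numpy as np

def reinforce_gradient(episodes):
    """
    episodes: list of (rewards, grad_logps) pairs, one per sampled trajectory,
      rewards    : array of shape (T,)
      grad_logps : array of shape (T, d), rows are grad_theta log pi(a_t | s_t)
    Returns g_hat = 1/N sum_i [ (sum_t r_t) * sum_t grad log pi ].
    """
    grads = []
    for rewards, grad_logps in episodes:
        total_reward = np.sum(rewards)                    # G(tau), undiscounted
        grads.append(total_reward * grad_logps.sum(axis=0))
    return np.mean(grads, axis=0)
```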
$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[G(\tau)\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)\right] = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)\left(\sum_{t'\geq 0} r_{t'}\right)~\right]$
Idea: only rewards obtained after taking an action should be used to evaluate how good the action is,
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)\left(\sum_{t'\geq t} r_{t'}\right)\right]$$
Why does it reduce variance? Rewards earned before time $t$ do not depend on $a_t$; their contribution to the gradient has zero expectation, so dropping them removes noise without changing the expected gradient
Idea: discount future rewards so that distant rewards count less when evaluating the current action $\rightarrow$ introduces a small bias to reduce variance
$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)\left(\sum_{t'\geq t} \gamma^{t'-t} r_{t'}\right)\right]$$
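A sketch extending the earlier estimator to per-step discounted reward-to-go weights; as before, the per-step gradients $\nabla_\theta\log\pi_\theta(a_t~|~s_t)$ are assumed to be given:

```python
import numpy as np

def reward_to_go(rewards, gamma=1.0):
    """Per-step weights sum_{t' >= t} gamma^(t'-t) r_{t'} (gamma=1 gives the undiscounted version)."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def reward_to_go_gradient(episodes, gamma=0.99):
    """Like the earlier estimator, but each grad log pi(a_t|s_t) is weighted by its own reward-to-go."""
    grads = []
    for rewards, grad_logps in episodes:
        w = reward_to_go(np.asarray(rewards), gamma)          # shape (T,)
        grads.append((w[:, None] * grad_logps).sum(axis=0))   # sum_t w_t * grad log pi_t
    return np.mean(grads, axis=0)
```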
Idea: is the action doing better than a baseline?
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)\left(\sum_{t'\geq t} r_{t'} - b(s_t) \right)\right]$$
The weight now measures how much the sampled return exceeds some baseline $b(s_t)$ (e.g., the average return from $s_t$)
Why does subtracting a baseline reduce variance? Since $\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta\log\pi_\theta(a~|~s)\, b(s)\right] = b(s)\nabla_\theta \sum_a \pi_\theta(a~|~s) = 0$, the baseline does not bias the gradient, while a well-chosen $b(s_t)$ cancels much of the common variation in the returns and so shrinks the estimator's variance (see the illustration below)
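A small numerical illustration (not from the text), reusing the toy softmax-bandit setup above: subtracting the average reward as a baseline leaves the estimator's mean essentially unchanged but shrinks its variance:

```python
import numpy as np

rng = np.random.default_rng(0)
r = np.array([1.0, 2.0, 0.5])                 # toy per-action rewards
p = np.ones(3) / 3                            # uniform softmax policy (theta = 0)
baseline = p @ r                              # average reward under the policy

samples = rng.choice(3, size=100_000, p=p)
scores = np.eye(3)[samples] - p               # rows are grad log pi(a)
g_plain = scores * r[samples, None]                   # weight = r(a)
g_base = scores * (r[samples] - baseline)[:, None]    # weight = r(a) - b

print(g_plain.mean(axis=0), g_base.mean(axis=0))             # ~equal means
print(g_plain.var(axis=0).sum(), g_base.var(axis=0).sum())   # total variance drops
```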
High variance, hence unstable and slow convergence
Sample-intensive
Credit assignment is difficult
How to select a reasonable step size?
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta\log \pi_\theta(a~|~s)~A_{\pi_\theta}(s,a)]$$
Can estimate the advantage function from fully sampled episodes (Monte Carlo approach)
But, do we need full episodes to evaluate the advantage function?
Estimate advantage function with TD error
$\delta_t = r_t + \gamma V_{\pi_\theta}(s_{t+1}) - V_{\pi_\theta}(s_t)$
$\mathbb{E}_{\pi_\theta}[\delta_t~|~s_t = s, a_t = a] = \mathbb{E}_{s' \sim P}\left[r(s,a,s') + \gamma V_{\pi_\theta}(s')\right] - V_{\pi_\theta}(s) = Q_{\pi_\theta}(s,a) - V_{\pi_\theta}(s) = A_{\pi_\theta}(s,a)$, i.e., the TD error is an unbiased estimate of the advantage
$\rightarrow$ Can use the TD error in the gradient: $$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta\log \pi_\theta(a_t~|~s_t)~\delta_t]$$
$\rightarrow$ Estimate the TD error with a single value function approximation $V(\cdot~; \boldsymbol{w})$,
$$\delta_t = r_t + \gamma V(s_{t+1}; \boldsymbol{w}) - V(s_t; \boldsymbol{w})$$
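A hedged sketch of how this TD error is typically used in a one-step actor-critic update; `V`, `grad_V`, and `grad_logpi` are hypothetical callables for the value approximation $V(\cdot~;\boldsymbol{w})$ and the policy, not functions defined in the text:

```python
def actor_critic_step(transition, theta, w, V, grad_V, grad_logpi,
                      gamma=0.99, alpha_pi=1e-3, alpha_v=1e-2):
    """One online update from a single transition (s, a, r, s_next, done)."""
    s, a, r, s_next, done = transition
    target = r if done else r + gamma * V(s_next, w)
    delta = target - V(s, w)                                    # TD error, estimates A(s, a)
    w = w + alpha_v * delta * grad_V(s, w)                      # critic: reduce the TD error
    theta = theta + alpha_pi * delta * grad_logpi(s, a, theta)  # actor: policy-gradient step
    return theta, w, delta
```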
Middle ground between one-step TD and Monte Carlo (≈ all-step TD): $n$-step TD errors
$\delta^{(n)}_t = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1}r_{t+n-1} + \gamma^{n}V(s_{t+n}; \boldsymbol{w}) - V(s_{t}; \boldsymbol{w})$
Advanced: combine all $n$-step TD errors into one (the $TD(\lambda)$ idea) $\rightarrow$ Generalized Advantage Estimation (GAE; Schulman et al.)
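A sketch of the Generalized Advantage Estimator, $\hat{A}_t = \sum_{l\geq 0}(\gamma\lambda)^{l}\delta_{t+l}$, computed backward over one episode; arrays of rewards and value estimates are assumed to be available:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """
    rewards : shape (T,)   rewards r_0 ... r_{T-1}
    values  : shape (T+1,) value estimates V(s_0) ... V(s_T) (V(s_T) = 0 if terminal)
    Returns advantage estimates A_0 ... A_{T-1}.
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # one-step TD error
        running = delta + gamma * lam * running                  # A_t = delta_t + gamma*lam*A_{t+1}
        adv[t] = running
    return adv
```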
Policy Gradient is basically Gradient Ascent with:
various gradient formulas, all of the form $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)~\Phi_t\right]$, where $\Phi_t$ can be the total return $G(\tau)$, the (discounted) reward-to-go, the reward-to-go minus a baseline $b(s_t)$, the advantage $A_{\pi_\theta}(s_t,a_t)$, or the TD error $\delta_t$ (see the sketch below)
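All of these estimators share the same shape; a minimal generic sketch (names are illustrative), where `phis` holds whichever per-step weights $\Phi_t$ one chooses:

```python
import numpy as np

def policy_gradient_estimate(grad_logps, phis):
    """
    grad_logps : shape (T, d), rows grad_theta log pi(a_t | s_t) for one episode
    phis       : shape (T,),   the chosen weights Phi_t
    Returns the per-episode estimate sum_t Phi_t * grad log pi(a_t | s_t);
    average this over episodes to estimate grad J(theta).
    """
    return (np.asarray(phis)[:, None] * np.asarray(grad_logps)).sum(axis=0)
```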