A Practical Guide to Policy Gradients

Duong Nguyen

2018-11-03

Background

Reinforcement Learning Concepts

MDP

  • State space $\mathcal{S}$: a set of states of the environment.

    • state vs. observation (partial description of state)
    • fully observed vs. partially observed
  • Action space $\mathcal{A}$: a set of actions that an agent can select at a given state

    • discrete action spaces, e.g., board games
    • continuous action spaces, e.g., robot
  • Policy: specifies which action $a$ the agent takes in a given state $s$ (a minimal code sketch follows this list)
    • Deterministic policy: $a = \pi(s)$
    • Stochastic policy: $a \sim \pi(\cdot~|~s)$, but why?
      • naturally incorporates exploration
      • more robust, and in some settings (e.g., partially observable environments such as card games) the optimal policy may itself be stochastic
      • can converge to a deterministic policy in the limit (informally speaking...)
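
For concreteness, here is a minimal sketch (in PyTorch, not prescribed by this post) of a stochastic policy for a discrete action space: a small network whose output defines a categorical distribution over actions; the architecture and sizes are arbitrary illustration choices.

```python
# Minimal sketch: a stochastic policy a ~ pi_theta(. | s) for a discrete action
# space, parameterized by a small neural network (sizes are arbitrary).
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions)
        )

    def forward(self, obs):
        logits = self.net(obs)  # unnormalized log-probabilities over actions
        return torch.distributions.Categorical(logits=logits)

policy = CategoricalPolicy(obs_dim=4, n_actions=2)
dist = policy(torch.randn(4))
a = dist.sample()            # stochastic policy:  a ~ pi(. | s)
a_det = dist.probs.argmax()  # a deterministic policy would pick the argmax instead
```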

Markov Decision Process (MDP) Environment:

  • Markov property: $P(S_{t+1}~|~S_t, A_t) = P(S_{t+1}~|~S_t, A_t, S_{t-1}, A_{t-1}, \cdots)$, i.e., transitions depend only on the most recent state and action, not on any prior history.

  • $\mu_0(\cdot)$ is the start-state distribution to sample the starting state $S_0$

  • State transition: $P(s'~|~s,a) = \text{Prob}[S_{t+1} = s'~|~S_t = s, A_t=a]$

  • Trajectory (aka rollout or episode): a sequence of states and actions $\tau = (s_0, a_0, s_1, a_1,\cdots)$ (see the rollout sketch after this list)

    • $s_0 \sim \mu_0(\cdot)$, start-state distribution
    • $s_{t+1} \sim P(\cdot~|~s_t, a_t)$ (stochastic environment) or $s_{t+1} = f(s_t, a_t)$ (deterministic environment)
    • $a_t \sim \pi(\cdot~|~s_t)$ (stochastic policy) or $a_t = \pi(s_t)$ (deterministic policy)
  • Reward function $r: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$, $r_t = r(s_t,a_t,s_{t+1})$

    • other definitions in the literature: $r(s_t)$, $r(s_t,a_t)$
    • scalar feedback signal to facilitate agent learning
    • reward function design is difficult, e.g., self-driving car
  • Discount factor $0 < \gamma < 1$: discounts future rewards to their present value
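
As a concrete illustration of the trajectory-sampling process above, here is a minimal rollout loop, assuming OpenAI Gym's CartPole-v0 environment and the CategoricalPolicy sketched earlier (neither is prescribed by this post):

```python
# Minimal sketch: sampling one trajectory tau = (s_0, a_0, r_0, s_1, ...) from a
# Gym environment (assumes CartPole-v0 and the CategoricalPolicy sketch above).
import gym
import torch

env = gym.make("CartPole-v0")
obs = env.reset()                                   # s_0 ~ mu_0(.)
trajectory, done = [], False
while not done:
    dist = policy(torch.as_tensor(obs, dtype=torch.float32))
    action = dist.sample().item()                   # a_t ~ pi_theta(. | s_t)
    next_obs, reward, done, _ = env.step(action)    # s_{t+1} ~ P(. | s_t, a_t), r_t
    trajectory.append((obs, action, reward))
    obs = next_obs
```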


Terminology

  • Episodic vs. Continuing
  • Return: cumulative rewards

    • $G_t = r_{t} + \gamma r_{t+1} + \cdots = \sum_{t'=t}^{\infty~\textrm{or}~T} \gamma^{t'-t}r_{t'}$, $G_t = r_{t} + \gamma G_{t+1}$

    • For a trajectory $\tau$,

      • finite horizon undiscounted return: $G(\tau) = \sum_{t=0}^T r_t$
      • infinite horizon discounted return: $G(\tau) = \sum_{t=0}^\infty \gamma^t r_t$
  • On-policy value functions: For a policy $\pi$,

    • state-value function:

      • $V_\pi (s) = \mathbb{E}[G_t~|~S_t = s; \pi] = \mathbb{E}[r_{t} + \gamma V_\pi (S_{t+1})~|~S_t = s; \pi]$
    • action-value function (aka state-action value function):

      • $Q_\pi(s, a) = \mathbb{E}[G_t~|~S_t=s, A_t=a; \pi] = \mathbb{E}[r_{t} + \gamma V_\pi (S_{t+1})~|~S_t = s, A_t = a]$
    • Advantage function $A_\pi(s,a) := Q_\pi(s,a) - V_\pi(s)$, i.e., how much better taking action $a$ is than acting according to $\pi$ on average

Bellman equations:

  • $V_\pi(s) = \mathbb{E}_{a \sim \pi, s' \sim P}\left[r(s,a,s') + \gamma V_\pi(s') \right]$

  • $Q_\pi(s,a) = \mathbb{E}_{s' \sim P} \left[r(s,a,s') + \gamma \mathbb{E}_{a' \sim \pi} Q_\pi(s', a')\right]$

The Bellman backup (for a state, or state-action pair) refers to the right-hand side of the Bellman equation, the reward-plus-next-value.

Also, $V_\pi(s) = \mathbb{E}_{a \sim \pi} Q_\pi(s,a)$

Optimal value functions

  • $V^{*}(s) = \underset{\pi}{\text{max}} V_\pi(s)$

  • $Q^{*}(s,a) = \underset{\pi}{\text{max}} Q_\pi(s,a)$

  • $V^{*}(s) = \text{max}_{a} Q^{*}(s,a)$

Exercise: Write the Bellman equations for the optimal value functions $V^{*}$ and $Q^{*}$

Policy Optimization

Big picture


Parameterized policy with parameters $\theta \in \mathbb{R}^d$,

  • $\pi_\theta(a~|~s)$ (stochastic)
  • $\pi_\theta(s)$ (deterministic)

Objective function: $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[G(\tau)\right]$

Optimization: $$\theta^{*} = \underset{\theta \in \mathbb{R}^d}{\text{ argmax }} J(\theta),$$ which can be solved with:

  • Gradient ascent
  • Gradient-free/Blackbox optimization, e.g., hill-climbing, genetic algorithms

Gradient Ascent

Find a local maximum of the objective $J(\theta)$: $\theta^{\text{next}} \leftarrow \theta^{\text{cur}} + \alpha \nabla_\theta J(\theta)\big|_{\theta = \theta^{\text{cur}}},$

where:

  • policy gradient: $\nabla_\theta J(\theta) = \left[\partial J/\partial \theta_1,\cdots, \partial J/\partial \theta_d\right]^T$,
  • step-size (learning rate): $\alpha > 0$

$\rightarrow$ the main problem is to compute the policy gradient!

Finite difference method

$$\frac{\partial J}{\partial \theta_i} \approx \frac{J(\theta + \epsilon u^{(i)}) - J(\theta)}{\epsilon},$$ where $u^{(i)}$ ($i=1,\cdots,d$) is the d-dimensional unit vector with $u^{(i)}_i = 1, u^{(i)}_j = 0$ for $j \neq i$
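
A minimal sketch of this estimator; `evaluate_J` is a hypothetical function, not defined in this post (e.g., it could return the average return of a few rollouts with the given parameters):

```python
# Minimal sketch: estimating the policy gradient by forward finite differences.
import numpy as np

def finite_difference_gradient(evaluate_J, theta, eps=1e-2):
    grad = np.zeros_like(theta)
    J0 = evaluate_J(theta)
    for i in range(theta.size):
        theta_eps = theta.copy()
        theta_eps[i] += eps                       # theta + eps * u^(i)
        grad[i] = (evaluate_J(theta_eps) - J0) / eps
    return grad
```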

But can we compute the gradient analytically? $\rightarrow$ Yes, Policy Gradient methods

Policy Gradients

Can we compute the policy gradient analytically?

  • Yes, using the log-derivative trick, assuming that the policy is differentiable w.r.t the parameters $\theta$.

Warm-up: One-step MDP

$$J(\theta) = \mathbb{E}_{s \sim \mu_0(\cdot),~a \sim \pi_\theta(\cdot~|~s)}\left[r(s,a)\right] = \sum_s \mu_0(s) \sum_a \pi_\theta(a~|~s) r(s,a)$$

$$\nabla_\theta J(\theta) = \sum_s \mu_0(s) \sum_a \nabla_\theta \pi_\theta(a~|~s) r(s,a)$$

Applying the log-derivative trick, $\nabla_\theta \pi_\theta(a~|~s) = \pi_\theta(a~|~s)\frac{\nabla_\theta \pi_\theta(a~|~s)}{\pi_\theta(a~|~s)} = \pi_\theta(a~|~s) \nabla_\theta\log \pi_\theta(a~|~s)$,

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mu_0(\cdot),~a \sim \pi_\theta(\cdot~|~s)}[\nabla_\theta\log \pi_\theta(a~|~s) r(s,a)]$$
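
A minimal toy sketch of this estimator for a single-state, one-step problem with a softmax policy over three actions; the reward table and sample size are made up purely for illustration:

```python
# Minimal sketch: Monte Carlo estimate of E[ grad log pi(a) * r(a) ] for a
# softmax policy in a one-step MDP with a single state (toy reward table).
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)                        # one logit per action
reward = np.array([1.0, 0.0, 0.5])         # hypothetical r(s, a)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

N = 10_000
grad_estimate = np.zeros(3)
for _ in range(N):
    p = softmax(theta)
    a = rng.choice(3, p=p)
    grad_log_pi = -p                       # grad_theta log softmax(theta)[a] = e_a - p
    grad_log_pi[a] += 1.0
    grad_estimate += grad_log_pi * reward[a]
grad_estimate /= N                         # approximates grad_theta J(theta)
```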

Multi-step MDP

Episodic process

Consider the episodic setting with finite-horizon undiscounted return; a similar derivation applies to the continuing setting with infinite-horizon discounted return.

Trajectory $\tau = (s_0, a_0, s_1, \cdots, s_{T-1}, a_{T-1}, s_T)$, which is sampled with the following process:

  • $s_0 \sim \mu_0(\cdot),~a_0 \sim \pi_\theta(\cdot~|~s_0),~s_1 \sim P(\cdot~|~s_0, a_0)$

  • $a_t \sim \pi_\theta(\cdot~|~s_t)$ and $s_{t+1} \sim P(\cdot~|~s_{t}, a_{t})$ for $t=0,\cdots,T-1$

  • $s_T$ is a terminal state, ending the trajectory

Probability of a trajectory given a policy $\pi_\theta$:

$$P(\tau; \pi_\theta) = \mu_0(s_0)\prod_{t \geq 0}\pi_\theta(a_t~|~s_t)P(s_{t+1}~|~s_t, a_t)$$ $$\log P(\tau; \pi_\theta) = \log\mu_0(s_0) + \sum_{t\geq 0}\log P(s_{t+1}~|~s_t, a_t) + \sum_{t\geq 0}\log\pi_\theta(a_t~|~s_t)$$

Policy Gradient Derivation

$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[G(\tau)\right] = \int_\tau G(\tau)P(\tau; \pi_\theta) d\tau$

$$\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \int_\tau G(\tau)P(\tau; \pi_\theta)\, d\tau = \int_\tau G(\tau) P(\tau; \pi_\theta) \frac{\nabla_\theta P(\tau; \pi_\theta)}{P(\tau; \pi_\theta)}\, d\tau \\ &= \int_\tau G(\tau) P(\tau; \pi_\theta) \nabla_\theta \log P(\tau; \pi_\theta)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\left[G(\tau) \nabla_\theta \log P(\tau; \pi_\theta)\right] \end{aligned}$$

Note: $\log P(\tau; \pi_\theta) = \log\mu_0(s_0) + \sum_{t\geq 0}\log P(s_{t+1}~|~s_t, a_t) + \sum_{t\geq 0}\log\pi_\theta(a_t~|~s_t)$

$\rightarrow \nabla_\theta \log P(\tau; \pi_\theta) = \sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)$, since only the policy term depends on $\theta$

Policy Gradient: $$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[G(\tau)\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)\right]$$

Gradient Estimator

  • Collect trajectories $\{\tau^{(i)}\}_{i=1}^N$, and estimate the gradient with sample mean: $$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left[G(\tau^{(i)})\sum_{t\geq 0}\nabla_\theta\log\pi_\theta(a_t^{(i)}~|~s_t^{(i)})\right] = \frac{1}{N}\sum_{i=1}^N \left[(\sum_{t \geq 0}r_{t}^{(i)})\sum_{t\geq 0}\nabla_\theta\log\pi_\theta(a_t^{(i)}~|~s_t^{(i)})\right]~~(1)$$

cf. the maximum likelihood objective when predicting actions from states with samples $\{s^{(i)}_t, a^{(i)}_t\}_{t\geq 0,~i=1,\cdots,N}$:

$$\nabla_\theta J_{MLE}(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left[\sum_{t\geq 0}\nabla_\theta\log\pi_\theta(a_t^{(i)}~|~s_t^{(i)})\right]$$

Intuitively, the PG objective can be seen as a weighted MLE (see the sketch below), in which

  • a good episode with high return $G(\tau)$ is made more likely; this increases the probabilities of the actions taken in the episode
  • a bad episode with low return is made less likely, which in turn pushes down the probabilities of the actions taken in the episode
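
A minimal sketch of this comparison in PyTorch, assuming `log_probs` holds $\log\pi_\theta(a_t~|~s_t)$ for one sampled episode (a 1-D tensor that carries gradients) and `episode_return` is its total return $G(\tau)$; neither name comes from the post:

```python
# Minimal sketch: MLE loss vs. the PG "weighted MLE" surrogate for one episode.
import torch

def mle_loss(log_probs):
    # maximum likelihood: push up the probability of every action that was taken
    return -log_probs.sum()

def pg_surrogate_loss(log_probs, episode_return):
    # policy gradient: the same term, weighted by the episode's return, so that
    # minimizing this loss performs gradient ascent on J(theta)
    return -episode_return * log_probs.sum()
```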

Gradient Estimator (2)

$$\widehat{g} = \frac{1}{N}\sum_{i=1}^N \left[\sum_{t \geq 0}r_{t}^{(i)}\sum_{t\geq 0}\nabla_\theta\log\pi_\theta(a_t^{(i)}~|~s_t^{(i)})\right]$$

  • Unbiased
  • High variance

Variance Reduction

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[G(\tau)\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)\right] = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)\left(\sum_{t\geq 0} r_t\right)~\right]$

Reward-To-Go

Idea: only rewards obtained after taking an action should be used to evaluate how good the action is,

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)\left(\sum_{t'\geq t} r_{t'}\right)\right]$$

Proof sketch: for $t' < t$, the reward $r_{t'}$ does not depend on $a_t$, and $\mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t)}\left[\nabla_\theta \log\pi_\theta(a_t~|~s_t)\right] = 0$, so those cross terms vanish in expectation.

Why does it reduce variance?

  • Intuitively, the sum of fewer random terms has smaller variance

Discounting future rewards

Idea: Discount future rewards to further reduce their values when evaluating the current action $\rightarrow$ introduce bias to reduce variance

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)\left(\sum_{t'\geq t} \gamma^{t'-t} r_{t'}\right)\right]$$
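
A minimal sketch of computing the discounted reward-to-go for every timestep of an episode, using the recursion $G_t = r_t + \gamma G_{t+1}$:

```python
# Minimal sketch: discounted reward-to-go G_t = sum_{t' >= t} gamma^(t'-t) r_{t'}
# for every timestep, computed backwards with G_t = r_t + gamma * G_{t+1}.
import numpy as np

def discounted_rewards_to_go(rewards, gamma=0.99):
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# e.g., discounted_rewards_to_go([1.0, 1.0, 1.0], gamma=0.9) -> [2.71, 1.9, 1.0]
```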

Vanilla PG (aka REINFORCE)

[Figure: Vanilla Policy Gradient (REINFORCE) pseudocode]
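
A minimal sketch of the algorithm (one episode per update), assuming CartPole-v0 and the CategoricalPolicy / discounted_rewards_to_go helpers sketched earlier; hyperparameters are arbitrary:

```python
# Minimal REINFORCE sketch with reward-to-go weights (one episode per update).
import gym
import torch

env = gym.make("CartPole-v0")
policy = CategoricalPolicy(obs_dim=4, n_actions=2)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for iteration in range(200):
    log_probs, rewards = [], []
    obs, done = env.reset(), False
    while not done:                                  # sample one trajectory
        dist = policy(torch.as_tensor(obs, dtype=torch.float32))
        action = dist.sample()
        obs, reward, done, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)

    rtg = torch.as_tensor(discounted_rewards_to_go(rewards), dtype=torch.float32)
    loss = -(torch.stack(log_probs) * rtg).sum()     # PG surrogate loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```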

Baseline Subtraction

Idea: is the sampled return better than some baseline?

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)\left(\sum_{t'\geq t} r_{t'} - b(s_t) \right)\right]$$

How much better the sample return is than some baseline $b(s_t)$ (e.g., the average return)

  • A common choice: $b(s_t) = V_\pi(s_t)$

Why does subtracting a baseline reduce variance?

  • See this paper for a rigorous proof
  • Whiteboard exercise?

REINFORCE with baseline

[Figure: Vanilla PG with baseline (pseudocode)]
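
A minimal sketch of the only change relative to the REINFORCE sketch above: the weight on each log-probability becomes (reward-to-go minus baseline); here the baseline is simply the mean reward-to-go, a crude stand-in for a learned $V_\pi(s_t)$:

```python
# Minimal sketch: PG surrogate loss with a baseline subtracted from the weights.
import torch

def pg_loss_with_baseline(log_probs, rewards_to_go):
    baseline = rewards_to_go.mean()          # crude stand-in for b(s_t) = V_pi(s_t)
    advantages = rewards_to_go - baseline
    return -(log_probs * advantages).sum()
```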

Yet another PG formula

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)~Q_\pi(s_t, a_t)\right]$$

Proof sketch: start from the reward-to-go formula and take the expectation of $\sum_{t'\geq t} r_{t'}$ conditioned on $(s_t, a_t)$, which is exactly $Q_\pi(s_t, a_t)$ (law of iterated expectations).

Problems with vanilla PGs

  • High variance, thus unstable and slow convergence

    • Combine with value function approximation, e.g., advantage actor-critic
  • Sample-intensive

  • Credit assignment is difficult

  • How to select a reasonable step-size?

Gradient Formula with Advantage Function

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta\log \pi_\theta(a~|~s)~A_{\pi_\theta}(s,a)]$$

  • Advantage function: $A_\pi(s,a) = Q_\pi(s,a) - V_\pi(s)$
    • How good the action $a$ is w.r.t the average
  • Why?
    • $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta\log \pi_\theta(a~|~s)Q_{\pi_\theta}(s,a)] = \mathbb{E}_{\pi_\theta}[\nabla_\theta\log \pi_\theta(a~|~s)(Q_{\pi_\theta}(s,a) - V_{\pi_\theta}(s))]$
    • Subtracting a state-dependent baseline $V_{\pi_\theta}(s)$ does not change the expectation, so the estimator is still unbiased

(Online) Actor-Critic

The advantage function can be estimated from fully sampled episodes (Monte Carlo approach).

But, do we need full episodes to evaluate the advantage function?

  • No, use value function approximation, e.g., $Q_\pi(s, a) \approx Q(s,a; \boldsymbol{w_Q}),~V_\pi(s) \approx V(s; \boldsymbol{w_V})$
  • The idea is to update the value function parameters $\boldsymbol{w_Q}, \boldsymbol{w_V}$ after each interaction $(s_t, a_t, s_{t+1}, r_{t})$

Intuition

  • Loosely: learning with a form of supervision provided by the critic
  • Actor: the policy $\pi_\theta$
  • Critic: the value function approximator
    • evaluates the actor's action at each timestep
    • "re-weights" the policy gradient
  • Both actor and critic are updated online after each interaction

Temporal Difference (TD) Error

Estimate advantage function with TD error

$\delta_t = r_t + \gamma V_{\pi_\theta}(s_{t+1}) - V_{\pi_\theta}(s_t)$

$\delta_{\pi_\theta}(s,a) := \mathbb{E}_{\pi_\theta}[\delta_t~|~S_t = s, A_t = a] = \mathbb{E}_{\pi_\theta}[r_t + \gamma V_{\pi_\theta}(S_{t+1})~|~S_t = s, A_t = a] - V_{\pi_\theta}(s) = Q_{\pi_\theta}(s,a) - V_{\pi_\theta}(s) = A_{\pi_\theta}(s,a)$, i.e., the TD error is an unbiased estimate of the advantage.

$\rightarrow$ Can use the TD error in place of the advantage in the gradient: $$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta\log \pi_\theta(a~|~s)~\delta_{\pi_\theta}(s,a)] = \mathbb{E}_{\pi_\theta}[\nabla_\theta\log \pi_\theta(a~|~s)~\delta_t]$$

$\rightarrow$ Estimate the TD error with a single value function approximator $V(\cdot~;~\boldsymbol{w_V})$:

$$\delta_t = r_t + \gamma V(s_{t+1}; \boldsymbol{w_V}) - V(s_t; \boldsymbol{w_V})$$

Advantage Actor-Critic PG with one-step TD

[Figure: One-step Actor-Critic algorithm (pseudocode)]

  • Works with continuing tasks since it doesn't require full episodes (see the sketch below)
  • Trades some bias for variance reduction, due to value function bootstrapping
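
A minimal sketch of one episode of online one-step actor-critic updates, assuming CartPole-v0, the CategoricalPolicy sketch from earlier, and a small value network as the critic; all sizes and learning rates are arbitrary illustration choices:

```python
# Minimal sketch: online one-step (TD(0)) advantage actor-critic updates.
import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v0")
actor = CategoricalPolicy(obs_dim=4, n_actions=2)
critic = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))  # V(s; w_V)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

obs, done = env.reset(), False
while not done:
    s = torch.as_tensor(obs, dtype=torch.float32)
    dist = actor(s)
    action = dist.sample()
    obs, reward, done, _ = env.step(action.item())

    with torch.no_grad():
        v_next = 0.0 if done else critic(torch.as_tensor(obs, dtype=torch.float32)).item()
        td_target = reward + gamma * v_next
    td_error = td_target - critic(s).squeeze()        # delta_t

    critic_loss = td_error.pow(2)                     # move V(s_t) toward the TD target
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -dist.log_prob(action) * td_error.detach()  # delta_t as the advantage
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```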

N-step TD Error

A middle ground between one-step TD and Monte Carlo (~ all-step TD)

$\delta^{(n)}_t = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1}r_{t+n-1} + \gamma^{n}V(s_{t+n}; \boldsymbol{w_V}) - V(s_{t}; \boldsymbol{w_V})$ (a sketch follows the list below)

  • small $n$: low variance, but high bias
  • large $n$: high variance, but low bias
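
A minimal sketch of computing the $n$-step TD errors for a finished segment of an episode, given the rewards and the critic's value estimates (with $V(s_T) = 0$ at a terminal state):

```python
# Minimal sketch: n-step TD errors from rewards r_0..r_{T-1} and value estimates
# V(s_0)..V(s_T) (pass values[T] = 0 if s_T is terminal).
import numpy as np

def n_step_td_errors(rewards, values, n, gamma=0.99):
    T = len(rewards)
    deltas = np.zeros(T)
    for t in range(T):
        horizon = min(t + n, T)
        g = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
        g += gamma ** (horizon - t) * values[horizon]   # bootstrap with the critic
        deltas[t] = g - values[t]
    return deltas
```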

Advanced: combine all $n$-step TD errors into one (cf. the $TD(\lambda)$ idea) $\rightarrow$ Generalized Advantage Estimation (GAE, Schulman et al.)

Advantage Actor-Critic with n-step TD

[Figure: A2C (Advantage Actor-Critic) pseudocode]

Recap

Policy Gradient is basically Gradient Ascent with:

  • Techniques to reduce variance and improve stability
  • Techniques to improve sample efficiency

Various gradient formulas, all of the form $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)~\Phi_t\right]$

  • Replace $\Phi_t$ with $G(\tau)$, the (discounted) reward-to-go $G_t$, $Q_\pi(s_t, a_t)$, $A_\pi(s_t, a_t)$, or the TD error $\delta_t$

Advanced PGs

  1. TRPO $\rightarrow$ PPO; see also, e.g., this NIPS 2016 tutorial

  2. DDPG: off-policy actor-critic for continuous actions

  3. State-action-dependent baseline methods, e.g., Q-Prop

I'll probably write short blog posts or slides about these topics in the future. Stay tuned :)