A Practical Guide to Policy Gradients

Duong Nguyen

2018-11-03

Background

Reinforcement Learning Concepts

MDP

  • State space $\mathcal{S}$: a set of states of the environment.

    • state vs. observation (partial description of state)
    • fully observed vs. partially observed
  • Action space $\mathcal{A}$: a set of actions that an agent can select at a given state

    • discrete action spaces, e.g., board games
    • continuous action spaces, e.g., robot
  • Policy: specifies which action $a$ the agent takes in a given state $s$ (a minimal code sketch follows this list)
    • Deterministic policy: $a = \pi(s)$
    • Stochastic policy: $a \sim \pi(\cdot~|~s)$, but why?
      • naturally incorporates exploration
      • more robust, and in some settings (e.g., partially observable environments such as card games) the optimal policy may itself be stochastic
      • can converge to a deterministic policy in the limit (informally speaking...)
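
For concreteness, here is a minimal sketch (in PyTorch, not prescribed by this post) of a stochastic policy for a discrete action space: a small network whose output defines a categorical distribution over actions; the architecture and sizes are arbitrary illustration choices.

```python
# Minimal sketch: a stochastic policy a ~ pi_theta(. | s) for a discrete action
# space, parameterized by a small neural network (sizes are arbitrary).
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions)
        )

    def forward(self, obs):
        logits = self.net(obs)  # unnormalized log-probabilities over actions
        return torch.distributions.Categorical(logits=logits)

policy = CategoricalPolicy(obs_dim=4, n_actions=2)
dist = policy(torch.randn(4))
a = dist.sample()            # stochastic policy:  a ~ pi(. | s)
a_det = dist.probs.argmax()  # a deterministic policy would pick the argmax instead
```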

Markov Decision Process (MDP) Environment:

  • Markov property: $P(S_{t+1}~|~S_t, A_t) = P(S_{t+1}~|~S_t, A_t, S_{t-1}, A_{t-1}, \cdots)$, i.e., transitions depend only on the most recent state and action, not on any prior history.

  • $\mu_0(\cdot)$ is the start-state distribution to sample the starting state $S_0$

  • State transition: $P(s'~|~s,a) = \text{Prob}[S_{t+1} = s'~|~S_t = s, A_t=a]$

  • Trajectory (aka rollout or episode): a sequence of states and actions $\tau = (s_0, a_0, s_1, a_1,\cdots)$ (see the rollout sketch after this list)

    • $s_0 \sim \mu_0(\cdot)$, start-state distribution
    • $s_{t+1} \sim P(\cdot~|~s_t, a_t)$ (stochastic environment) or $s_{t+1} = f(s_t, a_t)$ (deterministic environment)
    • $a_t \sim \pi(\cdot~|~s_t)$ (stochastic policy) or $a_t = \pi(s_t)$ (deterministic policy)
  • Reward function $r: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow \mathbb{R}$, $r_t = r(s_t,a_t,s_{t+1})$

    • other definitions in the literature: $r(s_t)$, $r(s_t,a_t)$
    • scalar feedback signal to facilitate agent learning
    • reward function design is difficult, e.g., self-driving car
  • Discount factor $0 < \gamma < 1$: discounts future rewards to their present value
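
As a concrete illustration of the trajectory-sampling process above, here is a minimal rollout loop, assuming OpenAI Gym's CartPole-v0 environment and the CategoricalPolicy sketched earlier (neither is prescribed by this post):

```python
# Minimal sketch: sampling one trajectory tau = (s_0, a_0, r_0, s_1, ...) from a
# Gym environment (assumes CartPole-v0 and the CategoricalPolicy sketch above).
import gym
import torch

env = gym.make("CartPole-v0")
obs = env.reset()                                   # s_0 ~ mu_0(.)
trajectory, done = [], False
while not done:
    dist = policy(torch.as_tensor(obs, dtype=torch.float32))
    action = dist.sample().item()                   # a_t ~ pi_theta(. | s_t)
    next_obs, reward, done, _ = env.step(action)    # s_{t+1} ~ P(. | s_t, a_t), r_t
    trajectory.append((obs, action, reward))
    obs = next_obs
```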


Terminology

  • Episodic vs. Continuing
  • Return: cumulative rewards

    • $G_t = r_{t} + \gamma r_{t+1} + \cdots = \sum_{t'=t}^{\infty~\textrm{or}~T} \gamma^{t'-t}r_{t'}$, $G_t = r_{t} + \gamma G_{t+1}$

    • For a trajectory $\tau$,

      • finite horizon undiscounted return: $G(\tau) = \sum_{t=0}^T r_t$
      • infinite horizon discounted return: $G(\tau) = \sum_{t=0}^\infty \gamma^t r_t$
  • On-policy value functions: For a policy $\pi$,

    • state-value function:

      • $V_\pi (s) = \mathbb{E}[G_t~|~S_t = s; \pi] = \mathbb{E}[r_{t} + \gamma V_\pi (S_{t+1})~|~S_t = s; \pi]$
    • action-value function (aka state-action value function):

      • $Q_\pi(s, a) = \mathbb{E}[G_t~|~S_t=s, A_t=a; \pi] = \mathbb{E}[r_{t} + \gamma V_\pi (S_{t+1})~|~S_t = s, A_t = a]$
    • Advantage function $A_\pi(s,a) := Q_\pi(s,a) - V_\pi(s)$, i.e., how much better taking action $a$ is than acting according to $\pi$ on average

Bellman equations:

  • $V_\pi(s) = \mathbb{E}_{a \sim \pi, s' \sim P}\left[r(s,a,s') + \gamma V_\pi(s') \right]$

  • $Q_\pi(s,a) = \mathbb{E}_{s' \sim P} \left[r(s,a,s') + \gamma \mathbb{E}_{a' \sim \pi} Q_\pi(s', a')\right]$

The Bellman backup (for a state, or state-action pair) refers to the right-hand side of the Bellman equation, the reward-plus-next-value.

Also, $V_\pi(s) = \mathbb{E}_{a \sim \pi} Q_\pi(s,a)$

Optimal value functions

  • $V^{*}(s) = \underset{\pi}{\text{max}} V_\pi(s)$

  • $Q^{*}(s,a) = \underset{\pi}{\text{max}} Q_\pi(s,a)$

  • $V^{*}(s) = \text{max}_{a} Q^{*}(s,a)$

Exercise: Write the Bellman equations for the optimal value functions $V^{*}$ and $Q^{*}$

Policy Optimization

Big picture


Parameterized policy with parameters $\theta \in \mathbb{R}^d$,

  • $\pi_\theta(a~|~s)$ (stochastic)
  • $\pi_\theta(s)$ (deterministic)

Objective function: $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[G(\tau)\right]$

Optimization: $$\theta^{*} = \underset{\theta \in \mathbb{R}^d}{\text{ argmax }} J(\theta),$$ which can be solved with:

  • Gradient ascent
  • Gradient-free/Blackbox optimization, e.g., hill-climbing, genetic algorithms

Gradient Ascent

Find a local maximum of the objective $J(\theta)$: $\theta^{\text{next}} \leftarrow \theta^{\text{cur}} + \alpha \nabla_\theta J(\theta)\big|_{\theta = \theta^{\text{cur}}},$

where:

  • policy gradient: $\nabla_\theta J(\theta) = \left[\partial J/\partial \theta_1,\cdots, \partial J/\partial \theta_d\right]^T$,
  • step-size (learning rate): $\alpha > 0$

$\rightarrow$ the main problem is to compute the policy gradient!

Finite difference method

$$\frac{\partial J}{\partial \theta_i} \approx \frac{J(\theta + \epsilon u^{(i)}) - J(\theta)}{\epsilon},$$ where $u^{(i)}$ ($i=1,\cdots,d$) is the d-dimensional unit vector with $u^{(i)}_i = 1, u^{(i)}_j = 0$ for $j \neq i$
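
A minimal sketch of this estimator; `evaluate_J` is a hypothetical function, not defined in this post (e.g., it could return the average return of a few rollouts with the given parameters):

```python
# Minimal sketch: estimating the policy gradient by forward finite differences.
import numpy as np

def finite_difference_gradient(evaluate_J, theta, eps=1e-2):
    grad = np.zeros_like(theta)
    J0 = evaluate_J(theta)
    for i in range(theta.size):
        theta_eps = theta.copy()
        theta_eps[i] += eps                       # theta + eps * u^(i)
        grad[i] = (evaluate_J(theta_eps) - J0) / eps
    return grad
```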

But can we compute the gradient analytically? $\rightarrow$ Yes, Policy Gradient methods

Policy Gradients

Can we compute the policy gradient analytically?

  • Yes, using the log-derivative trick, assuming that the policy is differentiable w.r.t the parameters $\theta$.

Warm-up: One-step MDP

$$J(\theta) = \mathbb{E}_{s \sim \mu_0(\cdot),~a \sim \pi_\theta(\cdot~|~s)}\left[r(s,a)\right] = \sum_s \mu_0(s) \sum_a \pi_\theta(a~|~s) r(s,a)$$

$$\nabla_\theta J(\theta) = \sum_s \mu_0(s) \sum_a \nabla_\theta \pi_\theta(a~|~s) r(s,a)$$

Applying the log-derivative trick, $\nabla_\theta \pi_\theta(a~|~s) = \pi_\theta(a~|~s)\frac{\nabla_\theta \pi_\theta(a~|~s)}{\pi_\theta(a~|~s)} = \pi_\theta(a~|~s) \nabla_\theta\log \pi_\theta(a~|~s)$,

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mu_0(\cdot),~a \sim \pi_\theta(\cdot~|~s)}[\nabla_\theta\log \pi_\theta(a~|~s) r(s,a)]$$
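
A minimal toy sketch of this estimator for a single-state, one-step problem with a softmax policy over three actions; the reward table and sample size are made up purely for illustration:

```python
# Minimal sketch: Monte Carlo estimate of E[ grad log pi(a) * r(a) ] for a
# softmax policy in a one-step MDP with a single state (toy reward table).
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)                        # one logit per action
reward = np.array([1.0, 0.0, 0.5])         # hypothetical r(s, a)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

N = 10_000
grad_estimate = np.zeros(3)
for _ in range(N):
    p = softmax(theta)
    a = rng.choice(3, p=p)
    grad_log_pi = -p                       # grad_theta log softmax(theta)[a] = e_a - p
    grad_log_pi[a] += 1.0
    grad_estimate += grad_log_pi * reward[a]
grad_estimate /= N                         # approximates grad_theta J(theta)
```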

Multi-step MDP

Episodic process

Consider the episodic setting with finite-horizon undiscounted return; a similar derivation applies to the continuing setting with infinite-horizon discounted return.

Trajectory $\tau = (s_0, a_0, s_1, \cdots, s_{T-1}, a_{T-1}, s_T)$, which is sampled with the following process:

  • $s_0 \sim \mu_0(\cdot),~a_0 \sim \pi_\theta(\cdot~|~s_0),~s_1 \sim P(\cdot~|~s_0, a_0)$

  • $a_t \sim \pi_\theta(\cdot~|~s_t)$ and $s_{t+1} \sim P(\cdot~|~s_{t}, a_{t})$ for $t=0,\cdots,T-1$

  • $s_T$ is a terminal state, ending the trajectory

Probability of a trajectory given a policy $\pi_\theta$:

$$P(\tau; \pi_\theta) = \mu_0(s_0)\prod_{t \geq 0}\pi_\theta(a_t~|~s_t)P(s_{t+1}~|~s_t, a_t)$$ $$\log P(\tau; \pi_\theta) = \log\mu_0(s_0) + \sum_{t\geq 0}\log P(s_{t+1}~|~s_t, a_t) + \sum_{t\geq 0}\log\pi_\theta(a_t~|~s_t)$$

Policy Gradient Derivation

$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[G(\tau)\right] = \int_\tau G(\tau)P(\tau; \pi_\theta) d\tau$

$$\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \int_\tau G(\tau)P(\tau; \pi_\theta)\, d\tau = \int_\tau G(\tau) P(\tau; \pi_\theta) \frac{\nabla_\theta P(\tau; \pi_\theta)}{P(\tau; \pi_\theta)}\, d\tau \\ &= \int_\tau G(\tau) P(\tau; \pi_\theta) \nabla_\theta \log P(\tau; \pi_\theta)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\left[G(\tau) \nabla_\theta \log P(\tau; \pi_\theta)\right] \end{aligned}$$

Note: $\log P(\tau; \pi_\theta) = \log\mu_0(s_0) + \sum_{t\geq 0}\log P(s_{t+1}~|~s_t, a_t) + \sum_{t\geq 0}\log\pi_\theta(a_t~|~s_t)$

$\rightarrow \nabla_\theta \log P(\tau; \pi_\theta) = \sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)$, since only the policy term depends on $\theta$

Policy Gradient: $$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[G(\tau)\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)\right]$$

Gradient Estimator

  • Collect trajectories $\{\tau^{(i)}\}_{i=1}^N$, and estimate the gradient with sample mean: $$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left[G(\tau^{(i)})\sum_{t\geq 0}\nabla_\theta\log\pi_\theta(a_t^{(i)}~|~s_t^{(i)})\right] = \frac{1}{N}\sum_{i=1}^N \left[(\sum_{t \geq 0}r_{t}^{(i)})\sum_{t\geq 0}\nabla_\theta\log\pi_\theta(a_t^{(i)}~|~s_t^{(i)})\right]~~(1)$$

cf. the maximum likelihood objective when predicting actions from states with samples $\{s^{(i)}_t, a^{(i)}_t\}_{t\geq 0,~i=1,\cdots,N}$:

$$\nabla_\theta J_{MLE}(\theta) \approx \frac{1}{N}\sum_{i=1}^N \left[\sum_{t\geq 0}\nabla_\theta\log\pi_\theta(a_t^{(i)}~|~s_t^{(i)})\right]$$

Intuitively, the PG objective can be seen as a weighted MLE (see the sketch below), in which

  • a good episode with high return $G(\tau)$ is made more likely; this increases the probabilities of the actions taken in the episode
  • a bad episode with low return is made less likely, which in turn pushes down the probabilities of the actions taken in the episode
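
A minimal sketch of this comparison in PyTorch, assuming `log_probs` holds $\log\pi_\theta(a_t~|~s_t)$ for one sampled episode (a 1-D tensor that carries gradients) and `episode_return` is its total return $G(\tau)$; neither name comes from the post:

```python
# Minimal sketch: MLE loss vs. the PG "weighted MLE" surrogate for one episode.
import torch

def mle_loss(log_probs):
    # maximum likelihood: push up the probability of every action that was taken
    return -log_probs.sum()

def pg_surrogate_loss(log_probs, episode_return):
    # policy gradient: the same term, weighted by the episode's return, so that
    # minimizing this loss performs gradient ascent on J(theta)
    return -episode_return * log_probs.sum()
```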

Gradient Estimator (2)

$$\widehat{g} = \frac{1}{N}\sum_{i=1}^N \left[\sum_{t \geq 0}r_{t}^{(i)}\sum_{t\geq 0}\nabla_\theta\log\pi_\theta(a_t^{(i)}~|~s_t^{(i)})\right]$$

  • Unbiased
  • High variance

Variance Reduction

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[G(\tau)\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)\right] = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)\left(\sum_{t\geq 0} r_t\right)~\right]$

Reward-To-Go

Idea: only rewards obtained after taking an action should be used to evaluate how good the action is,

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)\left(\sum_{t'\geq t} r_{t'}\right)\right]$$

Proof sketch: for $t' < t$, the reward $r_{t'}$ does not depend on $a_t$, and $\mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t)}\left[\nabla_\theta \log\pi_\theta(a_t~|~s_t)\right] = 0$, so those cross terms vanish in expectation.

Why does it reduce variance?

  • Intuitively, the sum of fewer random terms has smaller variance

Discounting future rewards

Idea: Discount future rewards to further reduce their values when evaluating the current action $\rightarrow$ introduce bias to reduce variance

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)\left(\sum_{t'\geq t} \gamma^{t'-t} r_{t'}\right)\right]$$
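
A minimal sketch of computing the discounted reward-to-go for every timestep of an episode, using the recursion $G_t = r_t + \gamma G_{t+1}$:

```python
# Minimal sketch: discounted reward-to-go G_t = sum_{t' >= t} gamma^(t'-t) r_{t'}
# for every timestep, computed backwards with G_t = r_t + gamma * G_{t+1}.
import numpy as np

def discounted_rewards_to_go(rewards, gamma=0.99):
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# e.g., discounted_rewards_to_go([1.0, 1.0, 1.0], gamma=0.9) -> [2.71, 1.9, 1.0]
```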

Vanilla PG (aka REINFORCE)

[Figure: Vanilla Policy Gradient (REINFORCE) pseudocode]
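
A minimal sketch of the algorithm (one episode per update), assuming CartPole-v0 and the CategoricalPolicy / discounted_rewards_to_go helpers sketched earlier; hyperparameters are arbitrary:

```python
# Minimal REINFORCE sketch with reward-to-go weights (one episode per update).
import gym
import torch

env = gym.make("CartPole-v0")
policy = CategoricalPolicy(obs_dim=4, n_actions=2)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for iteration in range(200):
    log_probs, rewards = [], []
    obs, done = env.reset(), False
    while not done:                                  # sample one trajectory
        dist = policy(torch.as_tensor(obs, dtype=torch.float32))
        action = dist.sample()
        obs, reward, done, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)

    rtg = torch.as_tensor(discounted_rewards_to_go(rewards), dtype=torch.float32)
    loss = -(torch.stack(log_probs) * rtg).sum()     # PG surrogate loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```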

Baseline Subtraction

Idea: is the sampled return better than some baseline?

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)\left(\sum_{t'\geq t} r_{t'} - b(s_t) \right)\right]$$

How much better the sample return is than some baseline $b(s_t)$ (e.g., the average return)

  • A common choice: $b(s_t) = V_\pi(s_t)$

Why does subtracting a baseline reduce variance?

  • See this paper for a rigorous proof
  • Whiteboard exercise?

REINFORCE with baseline

[Figure: Vanilla PG with baseline (pseudocode)]
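
A minimal sketch of the only change relative to the REINFORCE sketch above: the weight on each log-probability becomes (reward-to-go minus baseline); here the baseline is simply the mean reward-to-go, a crude stand-in for a learned $V_\pi(s_t)$:

```python
# Minimal sketch: PG surrogate loss with a baseline subtracted from the weights.
import torch

def pg_loss_with_baseline(log_probs, rewards_to_go):
    baseline = rewards_to_go.mean()          # crude stand-in for b(s_t) = V_pi(s_t)
    advantages = rewards_to_go - baseline
    return -(log_probs * advantages).sum()
```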

Yet another PG formula

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)~Q_\pi(s_t, a_t)\right]$$

Proof sketch: start from the reward-to-go formula and take the expectation of $\sum_{t'\geq t} r_{t'}$ conditioned on $(s_t, a_t)$, which is exactly $Q_\pi(s_t, a_t)$ (law of iterated expectations).

Problems with vanilla PGs

  • High variance, thus unstable and slow convergence

    • Combine with value function approximation, e.g., advantage actor-critic
  • Sample-intensive

  • Credit assignment is difficult

  • How to select a reasonable step-size?

Gradient Formula with Advantage Function

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta\log \pi_\theta(a~|~s)~A_{\pi_\theta}(s,a)]$$

  • Advantage function: $A_\pi(s,a) = Q_\pi(s,a) - V_\pi(s)$
    • How good the action $a$ is w.r.t the average
  • Why?
    • $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta\log \pi_\theta(a~|~s)Q_{\pi_\theta}(s,a)] = \mathbb{E}_{\pi_\theta}[\nabla_\theta\log \pi_\theta(a~|~s)(Q_{\pi_\theta}(s,a) - V_{\pi_\theta}(s))]$
    • Subtracting a state-dependent baseline $V_{\pi_\theta}(s)$ does not change the expectation, so the estimator is still unbiased

(Online) Actor-Critic

The advantage function can be estimated from fully sampled episodes (Monte Carlo approach).

But, do we need full episodes to evaluate the advantage function?

  • No, use value function approximation, e.g., $Q_\pi(s, a) \approx Q(s,a; \boldsymbol{w_Q}),~V_\pi(s) \approx V(s; \boldsymbol{w_V})$
  • The idea is to update the value function parameters $\boldsymbol{w_Q}, \boldsymbol{w_V}$ after each interaction $(s_t, a_t, s_{t+1}, r_{t})$

Intuition

  • Loosely: learning with a form of supervision provided by the critic
  • Actor: the policy $\pi_\theta$
  • Critic: the value function approximator
    • evaluates the actor's action at each timestep
    • "re-weights" the policy gradient
  • Both actor and critic are updated online after each interaction

Temporal Difference (TD) Error

Estimate advantage function with TD error

$\delta_t = r_t + \gamma V_{\pi_\theta}(s_{t+1}) - V_{\pi_\theta}(s_t)$

$\delta_{\pi_\theta}(s,a) := \mathbb{E}_{\pi_\theta}[\delta_t~|~S_t = s, A_t = a] = \mathbb{E}_{\pi_\theta}[r_t + \gamma V_{\pi_\theta}(S_{t+1})~|~S_t = s, A_t = a] - V_{\pi_\theta}(s) = Q_{\pi_\theta}(s,a) - V_{\pi_\theta}(s) = A_{\pi_\theta}(s,a)$, i.e., the TD error is an unbiased estimate of the advantage.

$\rightarrow$ Can use the TD error in place of the advantage in the gradient: $$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta\log \pi_\theta(a~|~s)~\delta_{\pi_\theta}(s,a)] = \mathbb{E}_{\pi_\theta}[\nabla_\theta\log \pi_\theta(a~|~s)~\delta_t]$$

$\rightarrow$ Estimate the TD error with a single value function approximator $V(\cdot~;~\boldsymbol{w_V})$:

$$\delta_t = r_t + \gamma V(s_{t+1}; \boldsymbol{w_V}) - V(s_t; \boldsymbol{w_V})$$

Advantage Actor-Critic PG with one-step TD

[Figure: One-step Actor-Critic algorithm (pseudocode)]

  • Works with continuing tasks since it doesn't require full episodes (see the sketch below)
  • Trades some bias for variance reduction, due to value function bootstrapping
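
A minimal sketch of one episode of online one-step actor-critic updates, assuming CartPole-v0, the CategoricalPolicy sketch from earlier, and a small value network as the critic; all sizes and learning rates are arbitrary illustration choices:

```python
# Minimal sketch: online one-step (TD(0)) advantage actor-critic updates.
import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v0")
actor = CategoricalPolicy(obs_dim=4, n_actions=2)
critic = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))  # V(s; w_V)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

obs, done = env.reset(), False
while not done:
    s = torch.as_tensor(obs, dtype=torch.float32)
    dist = actor(s)
    action = dist.sample()
    obs, reward, done, _ = env.step(action.item())

    with torch.no_grad():
        v_next = 0.0 if done else critic(torch.as_tensor(obs, dtype=torch.float32)).item()
        td_target = reward + gamma * v_next
    td_error = td_target - critic(s).squeeze()        # delta_t

    critic_loss = td_error.pow(2)                     # move V(s_t) toward the TD target
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    actor_loss = -dist.log_prob(action) * td_error.detach()  # delta_t as the advantage
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```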

N-step TD Error

A middle ground between one-step TD and Monte Carlo (~ all-step TD)

$\delta^{(n)}_t = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1}r_{t+n-1} + \gamma^{n}V(s_{t+n}; \boldsymbol{w_V}) - V(s_{t}; \boldsymbol{w_V})$ (a sketch follows the list below)

  • small $n$: low variance, but high bias
  • large $n$: high variance, but low bias
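
A minimal sketch of computing the $n$-step TD errors for a finished segment of an episode, given the rewards and the critic's value estimates (with $V(s_T) = 0$ at a terminal state):

```python
# Minimal sketch: n-step TD errors from rewards r_0..r_{T-1} and value estimates
# V(s_0)..V(s_T) (pass values[T] = 0 if s_T is terminal).
import numpy as np

def n_step_td_errors(rewards, values, n, gamma=0.99):
    T = len(rewards)
    deltas = np.zeros(T)
    for t in range(T):
        horizon = min(t + n, T)
        g = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
        g += gamma ** (horizon - t) * values[horizon]   # bootstrap with the critic
        deltas[t] = g - values[t]
    return deltas
```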

Advanced: combine all $n$-step TD errors into one (cf. the $TD(\lambda)$ idea) $\rightarrow$ Generalized Advantage Estimation (GAE, Schulman et al.)

Advantage Actor-Critic with n-step TD

[Figure: A2C (Advantage Actor-Critic) pseudocode]

Recap

Policy Gradient is basically Gradient Ascent with:

  • Techniques to reduce variance and improve stability
  • Techniques to improve sample efficiency

Various gradient formulas, all of the form $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t\geq 0}\nabla_\theta \log\pi_\theta(a_t~|~s_t)~\Phi_t\right]$

  • Replace $\Phi_t$ with $G(\tau)$, the (discounted) reward-to-go $G_t$, $Q_\pi(s_t, a_t)$, $A_\pi(s_t, a_t)$, or the TD error $\delta_t$

Advanced PGs

  1. TRPO $\rightarrow$ PPO; see also, e.g., this NIPS 2016 tutorial

  2. DDPG: off-policy actor-critic for continuous actions

  3. State-action-dependent baseline methods, e.g., Q-Prop

I'll probably write short blog posts or slides about these topics in the future. Stay tuned :)