Policy Optimization

This is an automatically translated post by LLM. The original post is in Chinese. If you find any translation errors, please leave a comment to help me improve the translation. Thanks!

This article mainly reviews and summarizes reinforcement learning algorithms based on policy optimization theorems and related variants.


A Markov decision process (MDP) can be defined by a five-tuple \(<\mathcal{S},\mathcal{A},\mathcal{R},\mathcal{T},\gamma,\rho_0>\), where:

  • \(\mathcal{S}\) represents the state space, a set of states
  • \(\mathcal{A}\) represents the action space, a set of actions
  • \(\mathcal{R}:\mathcal{S}\times\mathcal{A}\rightarrow \mathbb{R}\) represents the reward function, \(\mathcal{R} (s,a)\) is the reward obtained by taking action \(a\) in state \(s\)
  • \(\mathcal{T}: \mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]\) represents the state transition probability function, \(\mathcal{T}(s'|s,a)\) is the probability of transitioning from state \(s\) to state \(s'\) by taking action \(a\)
  • \(\gamma\) represents the discount factor
  • \(\rho_0:\mathcal{S}\rightarrow[0,1]\) represents the initial state distribution

The decision process of an agent is represented by a stochastic policy \(\pi:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]\), where \(\pi(a|s)\) is the probability of the agent choosing action \(a\) in state \(s\). Based on policy \(\pi\), the state value function can be defined as: \[ V_\pi(s)=\mathbb{E}\left[\sum_{t=0}^\infty \gamma^t\mathcal{R}(s_t,a_t)|\pi,s\right] \] The state-action value function is: \[ Q_\pi(s,a) = \mathcal{R}(s,a)+\gamma\mathbb{E}_{s'\sim\mathcal{T}(\cdot|s,a)}[V_\pi(s')] \]

Advantage function: \[ A_\pi(s,a) = Q_\pi(s,a)-V_\pi(s) \] The agent's optimization goal is usually to maximize the discounted return on the initial state distribution, i.e.: \[ \eta(\pi) = \mathbb{E}_{s\sim\rho_0}[V_\pi(s)] \] The discounted visit frequency of a state is represented as: \[ \rho_\pi(s)=(P(s_0=s)+\gamma P(s_1=s)+\gamma^2 P(s_2=s)+...) \]

Policy gradient theorem: \[ \nabla \eta(\pi) = \sum_{s,a}\rho_\pi(s)\nabla \pi(a|s)Q_\pi(s,a) \]

Approximately Optimal Approximate Reinforcement Learning

Kakade, S. & Langford, J. Approximately Optimal Approximate Reinforcement Learning. in Proceedings of the Nineteenth International Conference on Machine Learning 267–274 (Morgan Kaufmann Publishers Inc., 2002).

This paper addresses three questions:

  1. Is there a performance metric that guarantees improvement at each update step?
  2. How difficult is it to verify that an update improves this performance metric?
  3. What level of performance can be achieved after a reasonable number of policy updates?

Consider the following conservative policy update rule: \[ \pi_{new}(a|s)=(1-\alpha)\pi(a|s)+\alpha\pi'(a|s) \] where \(\alpha\in[0,1]\). To ensure policy improvement at \(\alpha=1\), \(\pi'\) needs to take better actions than \(\pi\) in every state. For \(0<\alpha <1\), policy improvement only requires \(\pi'\) to choose better actions in most states rather than all. Define the policy advantage \(\mathbb{A}_{\pi,\rho_0}(\pi')\) as: \[ \mathbb{A}_{\pi,\rho_0}(\pi')=\mathbb{E}_{s\sim\rho_\pi}\left[\mathbb{E}_{a\sim\pi'(\cdot|s)}[A_\pi(s,a)]\right] \] This policy advantage function measures the extent to which \(\pi'\) selects actions with greater advantage under the initial state distribution \(\rho_0\). For \(\alpha=0\), it is evident that \(\frac{\partial \eta}{\partial \alpha}\big|_{\alpha=0}=\frac{1}{1-\gamma}\mathbb{A}_{\pi,\rho_0}\).

to be continue......


Schulman, J., Levine, S., Moritz, P., Jordan, M. & Abbeel, P. Trust region policy optimization. in Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 1889–1897 (JMLR.org, 2015).

Policy advantage theorem: \[ \eta(\tilde\pi)=\eta(\pi)+\mathbb{E}_{s_0\sim\rho_0,a\sim\tilde\pi(\cdot|s),s'\sim\mathcal{T}(\cdot|s,a)}\left[\sum_{t=1}^{\infty}\gamma^tA_\pi(s_t,a_t)\right] \] Proof is as follows (for ease of writing, omit the expectation subscripts \(s_0\sim\rho_0,s'\sim\mathcal{T}(\cdot|s,a)\), \(a\sim\pi(\cdot|s)\) is abbreviated as \(a\sim\pi\)): \[ \begin{align} &\mathbb{E}_{a\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^tA_\pi(s_t,a_t)\right]\\ &=\mathbb{E}_{a\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^t\left[Q_\pi(s_t,a_t)-V_\pi(s_t)\right]\right]\\ &= \mathbb{E}_{a\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^t\left[r_t+\gamma V_\pi(s_{t+1})-V_\pi(s_t)\right]\right]\\ &= \mathbb{E}_{a\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right]+\mathbb{E}_{a\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^{t+1} V_\pi(s_{t+1})\right]-\mathbb{E}_{a\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^t V_\pi(s_t)\right]\\ &=\eta(\tilde\pi)-\mathbb{E}_{a\sim\tilde\pi}[V_\pi(s_0)]\\ &=\eta(\tilde\pi)-\eta(\pi) \end{align} \]

Based on the state visitation frequency \(\rho_\pi\), equation \(\ref{policy-improve}\) can also be written in the following form: \[ \eta(\tilde\pi)=\eta(\pi)+\sum_s\rho_{\tilde\pi}(s)\sum_a\tilde\pi(a|s)A_\pi(s,a) \] According to equation \(\ref{policy-improve}\), if for each state \(s\), \(\sum_a\tilde\pi(a|s)A_\pi(s,a)\geq 0\), then the new policy is guaranteed to be better than the old policy. However, due to the dependence of equation \(\ref{policy-improve}\) on \(\rho_{\tilde\pi}\), it is not directly computable in practical optimization. Therefore, considering a local estimate of \(\eta\): \[ L_\pi(\tilde\pi) =\eta(\pi)+\sum_s\rho_{\pi}(s)\sum_a\tilde\pi(a|s)A_\pi(s,a) \] The difference between \(L_\pi(\tilde\pi)\) and \(\eta(\tilde\pi)\) lies in replacing \(\rho_{\tilde\pi}\) with \(\rho_{\pi}\). Next, the analysis of \(L_\pi(\tilde\pi)\) is performed. If there is a parameterized policy \(\pi_\theta\) that is differentiable with respect to \(\theta\), then: \[ L_{\pi_{\theta_0}}(\pi_{\theta_0})=\eta(\pi_{\theta_0})\\ \nabla_\theta L_{\pi_{\theta_0}}(\pi_{\theta})\big|_{\theta=\theta_0}=\nabla_\theta\eta(\pi_{\theta})\big|_{\theta=\theta_0} \] Therefore, if the update of \(\pi\) is small enough, improving \(L_\pi\) will also improve \(\eta\). Considering the following policy update: \[ \tilde\pi(a|s)=(1-\alpha)\pi(a|s)+\alpha\pi'(a|s) \] where \(\pi'=\arg\max_{\pi'}L_\pi(\tilde\pi)\), it can be shown (proof in [(Kakade & Langford, 2002)][approximately_optimal]): \[ \eta(\tilde\pi)\geq L_{\pi}(\tilde\pi)-\frac{2\epsilon\gamma}{(1-\gamma(1-\alpha))(1-\gamma)}\alpha^2 \] where \(\epsilon = \max_s|\mathbb{E}_{a\sim\tilde\pi(\cdot|s)}A_\pi(s,a)|\)

Continued in the next message...