# Policy Optimization

This post was automatically translated from Chinese by an LLM. If you find any translation errors, please leave a comment to help me improve the translation. Thanks!

This article reviews and summarizes reinforcement learning algorithms based on the policy optimization theorems, along with related variants.

## Preliminaries

A Markov decision process (MDP) can be defined by a six-tuple \(\langle\mathcal{S},\mathcal{A},\mathcal{R},\mathcal{T},\gamma,\rho_0\rangle\), where:

- \(\mathcal{S}\) represents the state space, a set of states
- \(\mathcal{A}\) represents the action space, a set of actions
- \(\mathcal{R}:\mathcal{S}\times\mathcal{A}\rightarrow \mathbb{R}\) represents the reward function, \(\mathcal{R} (s,a)\) is the reward obtained by taking action \(a\) in state \(s\)
- \(\mathcal{T}: \mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow[0,1]\) represents the state transition probability function, \(\mathcal{T}(s'|s,a)\) is the probability of transitioning from state \(s\) to state \(s'\) by taking action \(a\)
- \(\gamma\) represents the discount factor
- \(\rho_0:\mathcal{S}\rightarrow[0,1]\) represents the initial state distribution

The agent's decision-making is represented by a stochastic policy \(\pi:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]\), where \(\pi(a|s)\) is the probability that the agent chooses action \(a\) in state \(s\). Given a policy \(\pi\), the state value function is defined as: \[ V_\pi(s)=\mathbb{E}\left[\sum_{t=0}^\infty \gamma^t\mathcal{R}(s_t,a_t)\,\middle|\,\pi,s_0=s\right] \] The state-action value function is: \[ Q_\pi(s,a) = \mathcal{R}(s,a)+\gamma\mathbb{E}_{s'\sim\mathcal{T}(\cdot|s,a)}[V_\pi(s')] \]

The advantage function is: \[ A_\pi(s,a) = Q_\pi(s,a)-V_\pi(s) \] The agent's optimization objective is usually to maximize the expected discounted return over the initial state distribution, i.e.: \[ \eta(\pi) = \mathbb{E}_{s\sim\rho_0}[V_\pi(s)] \] The discounted visitation frequency of a state is: \[ \rho_\pi(s)=P(s_0=s)+\gamma P(s_1=s)+\gamma^2 P(s_2=s)+\dots \]
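All of the quantities above can be computed exactly in a small tabular MDP. The sketch below (every number is hypothetical, chosen only for illustration) solves the Bellman equation \((I-\gamma P_\pi)V = r_\pi\) for \(V_\pi\), then derives \(Q_\pi\), \(A_\pi\), and \(\eta(\pi)\):

```python
import numpy as np

# A tiny 2-state, 2-action MDP; all numbers are hypothetical.
n_s, n_a, gamma = 2, 2, 0.9
R = np.array([[1.0, 0.0],                  # R[s, a]: reward for action a in state s
              [0.0, 2.0]])
T = np.array([[[0.8, 0.2], [0.1, 0.9]],    # T[s, a, s']: transition probabilities
              [[0.5, 0.5], [0.3, 0.7]]])
rho0 = np.array([0.5, 0.5])                # initial state distribution

pi = np.array([[0.6, 0.4],                 # a fixed stochastic policy pi[s, a]
               [0.3, 0.7]])

# Policy-averaged reward and transition matrix under pi
r_pi = (pi * R).sum(axis=1)                # r_pi[s] = E_{a~pi}[R(s, a)]
P_pi = np.einsum('sa,sap->sp', pi, T)      # P_pi[s, s'] = E_{a~pi}[T(s'|s, a)]

# Bellman equation in matrix form: V = r_pi + gamma * P_pi @ V
V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)
Q = R + gamma * T @ V                      # Q[s, a]
A = Q - V[:, None]                         # advantage A[s, a]
eta = rho0 @ V                             # objective eta(pi)
```

A useful consistency check: \(V_\pi(s)=\mathbb{E}_{a\sim\pi}[Q_\pi(s,a)]\), so the advantage averages to zero under \(\pi\) in every state.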

Policy gradient theorem: \[ \nabla \eta(\pi) = \sum_{s,a}\rho_\pi(s)\nabla \pi(a|s)Q_\pi(s,a) \]
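The theorem can be checked numerically on a random tabular MDP: compute the right-hand side using the exact \(\rho_\pi\) and \(Q_\pi\), and compare against a finite-difference gradient of \(\eta\). A sketch (the softmax parameterization, seed, and constants are my own choices for illustration, not from any paper):

```python
import numpy as np

np.random.seed(0)
n_s, n_a, gamma = 3, 2, 0.9                  # hypothetical sizes
R = np.random.rand(n_s, n_a)                 # random reward table R[s, a]
T = np.random.dirichlet(np.ones(n_s), size=(n_s, n_a))  # T[s, a, s']
rho0 = np.ones(n_s) / n_s
theta = np.random.randn(n_s, n_a)            # softmax policy parameters

def softmax_pi(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def solve(theta):
    """Exact eta, pi, Q, and discounted visitation rho for a tabular policy."""
    pi = softmax_pi(theta)
    r_pi = (pi * R).sum(axis=1)
    P_pi = np.einsum('sa,sap->sp', pi, T)
    V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)
    Q = R + gamma * T @ V
    rho = np.linalg.solve(np.eye(n_s) - gamma * P_pi.T, rho0)  # sum_t gamma^t P(s_t=s)
    return rho0 @ V, pi, Q, rho

# Analytic gradient from the theorem: grad eta = sum_{s,a} rho(s) * grad pi(a|s) * Q(s,a)
_, pi, Q, rho = solve(theta)
grad = np.zeros_like(theta)
for s in range(n_s):
    J = np.diag(pi[s]) - np.outer(pi[s], pi[s])   # softmax Jacobian at state s
    grad[s] = J @ (rho[s] * Q[s])

# Central finite differences on eta for comparison
eps = 1e-6
fd = np.zeros_like(theta)
for s in range(n_s):
    for a in range(n_a):
        tp, tm = theta.copy(), theta.copy()
        tp[s, a] += eps
        tm[s, a] -= eps
        fd[s, a] = (solve(tp)[0] - solve(tm)[0]) / (2 * eps)

assert np.allclose(grad, fd, atol=1e-4)
```

Note that \(\rho_\pi\) here is the unnormalized discounted visitation, matching the statement of the theorem above.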

## Approximately Optimal Approximate Reinforcement Learning

Kakade, S. & Langford, J. Approximately Optimal Approximate
Reinforcement Learning. in *Proceedings of the Nineteenth
International Conference on Machine Learning* 267–274 (Morgan
Kaufmann Publishers Inc., 2002).

This paper addresses three questions:

- Is there a performance metric that guarantees improvement at each update step?
- How difficult is it to verify that an update improves this performance metric?
- What level of performance can be achieved after a reasonable number of policy updates?

Consider the following conservative policy update rule: \[ \pi_{new}(a|s)=(1-\alpha)\pi(a|s)+\alpha\pi'(a|s) \] where \(\alpha\in[0,1]\). To guarantee policy improvement at \(\alpha=1\), \(\pi'\) would have to take better actions than \(\pi\) in every state. For \(0<\alpha<1\), policy improvement only requires \(\pi'\) to choose better actions in most states rather than in all of them. Define the policy advantage \(\mathbb{A}_{\pi,\rho_0}(\pi')\) as: \[ \mathbb{A}_{\pi,\rho_0}(\pi')=\mathbb{E}_{s\sim\rho_\pi}\left[\mathbb{E}_{a\sim\pi'(\cdot|s)}[A_\pi(s,a)]\right] \] The policy advantage measures the degree to which \(\pi'\) selects higher-advantage actions, averaged over the states visited by \(\pi\) when starting from \(\rho_0\). One can show that \(\frac{\partial \eta}{\partial \alpha}\big|_{\alpha=0}=\frac{1}{1-\gamma}\mathbb{A}_{\pi,\rho_0}(\pi')\), so for small \(\alpha\) the mixture update improves \(\eta\) whenever the policy advantage is positive.
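The derivative claim can be verified numerically: for a random tabular MDP, the finite-difference derivative of \(\eta\) with respect to the mixture coefficient \(\alpha\) at \(\alpha=0\) should equal \(\frac{1}{1-\gamma}\mathbb{A}_{\pi,\rho_0}(\pi')\), with \(\rho_\pi\) normalized to a distribution inside the expectation. A sketch (the MDP and policies are random, purely illustrative):

```python
import numpy as np

np.random.seed(1)
n_s, n_a, gamma = 3, 2, 0.9                  # hypothetical sizes
R = np.random.rand(n_s, n_a)
T = np.random.dirichlet(np.ones(n_s), size=(n_s, n_a))  # T[s, a, s']
rho0 = np.ones(n_s) / n_s

def eval_policy(pi):
    """Exact eta, advantage A_pi, and discounted visitation rho_pi."""
    r_pi = (pi * R).sum(axis=1)
    P_pi = np.einsum('sa,sap->sp', pi, T)
    V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)
    Q = R + gamma * T @ V
    rho = np.linalg.solve(np.eye(n_s) - gamma * P_pi.T, rho0)
    return rho0 @ V, Q - V[:, None], rho

pi = np.full((n_s, n_a), 1.0 / n_a)             # current policy: uniform
prime = np.random.dirichlet(np.ones(n_a), n_s)  # candidate policy pi'

_, A_pi, rho_pi = eval_policy(pi)
d = (1 - gamma) * rho_pi                        # normalized visitation distribution
pol_adv = d @ (prime * A_pi).sum(axis=1)        # policy advantage A_{pi,rho0}(pi')

def eta(alpha):
    mix = (1 - alpha) * pi + alpha * prime      # conservative mixture update
    eta_mix, _, _ = eval_policy(mix)
    return eta_mix

# Central finite difference of eta w.r.t. alpha at alpha = 0
eps = 1e-6
deriv = (eta(eps) - eta(-eps)) / (2 * eps)
assert np.isclose(deriv, pol_adv / (1 - gamma), atol=1e-4)
```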

To be continued...

## TRPO

Schulman, J., Levine, S., Moritz, P., Jordan, M. & Abbeel, P.
Trust region policy optimization. in *Proceedings of the 32nd
International Conference on International Conference on Machine Learning
- Volume 37* 1889–1897 (JMLR.org, 2015).

Policy advantage theorem (also known as the performance difference lemma): \[ \eta(\tilde\pi)=\eta(\pi)+\mathbb{E}_{s_0\sim\rho_0,a_t\sim\tilde\pi(\cdot|s_t),s_{t+1}\sim\mathcal{T}(\cdot|s_t,a_t)}\left[\sum_{t=0}^{\infty}\gamma^tA_\pi(s_t,a_t)\right] \] The proof is as follows (to lighten notation, the expectation subscripts \(s_0\sim\rho_0\) and \(s_{t+1}\sim\mathcal{T}(\cdot|s_t,a_t)\) are omitted, and \(a_t\sim\tilde\pi(\cdot|s_t)\) is abbreviated as \(a\sim\tilde\pi\)): \[ \begin{align} &\mathbb{E}_{a\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^tA_\pi(s_t,a_t)\right]\\ &=\mathbb{E}_{a\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^t\left[Q_\pi(s_t,a_t)-V_\pi(s_t)\right]\right]\\ &= \mathbb{E}_{a\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^t\left[r_t+\gamma V_\pi(s_{t+1})-V_\pi(s_t)\right]\right]\\ &= \mathbb{E}_{a\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right]+\mathbb{E}_{a\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^{t+1} V_\pi(s_{t+1})\right]-\mathbb{E}_{a\sim\tilde\pi}\left[\sum_{t=0}^{\infty}\gamma^t V_\pi(s_t)\right]\\ &=\eta(\tilde\pi)-\mathbb{E}_{a\sim\tilde\pi}[V_\pi(s_0)]\\ &=\eta(\tilde\pi)-\eta(\pi) \end{align} \] The second-to-last step holds because the two value sums telescope, leaving only \(-V_\pi(s_0)\); the last step holds because \(s_0\sim\rho_0\) does not depend on the policy, so \(\mathbb{E}[V_\pi(s_0)]=\eta(\pi)\).
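For tabular MDPs both sides of this identity can be computed in closed form, which makes for a quick sanity check. The sketch below (random MDP, all constants hypothetical) evaluates \(\eta\), \(A_\pi\), and the discounted visitation \(\rho_{\tilde\pi}\) exactly and confirms the identity in its visitation-frequency form:

```python
import numpy as np

np.random.seed(2)
n_s, n_a, gamma = 3, 2, 0.9                  # hypothetical sizes
R = np.random.rand(n_s, n_a)
T = np.random.dirichlet(np.ones(n_s), size=(n_s, n_a))  # T[s, a, s']
rho0 = np.ones(n_s) / n_s

def eval_policy(pi):
    """Exact eta, advantage A_pi, and discounted visitation rho_pi."""
    r_pi = (pi * R).sum(axis=1)
    P_pi = np.einsum('sa,sap->sp', pi, T)
    V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)
    Q = R + gamma * T @ V
    rho = np.linalg.solve(np.eye(n_s) - gamma * P_pi.T, rho0)
    return rho0 @ V, Q - V[:, None], rho

pi = np.random.dirichlet(np.ones(n_a), n_s)      # old policy
tilde = np.random.dirichlet(np.ones(n_a), n_s)   # new policy

eta_pi, A_pi, _ = eval_policy(pi)
eta_tilde, _, rho_tilde = eval_policy(tilde)

# eta(tilde) = eta(pi) + sum_s rho_tilde(s) sum_a tilde(a|s) A_pi(s, a)
rhs = eta_pi + rho_tilde @ (tilde * A_pi).sum(axis=1)
assert np.isclose(eta_tilde, rhs)
```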

Using the state visitation frequency \(\rho_{\tilde\pi}\), the identity above can also be written in the following form: \[ \eta(\tilde\pi)=\eta(\pi)+\sum_s\rho_{\tilde\pi}(s)\sum_a\tilde\pi(a|s)A_\pi(s,a) \] It follows that if \(\sum_a\tilde\pi(a|s)A_\pi(s,a)\geq 0\) for every state \(s\), the new policy is guaranteed to be at least as good as the old one. However, because this identity depends on \(\rho_{\tilde\pi}\), it cannot be computed directly during optimization: evaluating it would require rolling out the new policy \(\tilde\pi\). Therefore, consider a local approximation of \(\eta\): \[ L_\pi(\tilde\pi) =\eta(\pi)+\sum_s\rho_{\pi}(s)\sum_a\tilde\pi(a|s)A_\pi(s,a) \] The only difference between \(L_\pi(\tilde\pi)\) and \(\eta(\tilde\pi)\) is that \(\rho_{\tilde\pi}\) has been replaced by \(\rho_{\pi}\). For a parameterized policy \(\pi_\theta\) that is differentiable with respect to \(\theta\), \(L_\pi\) matches \(\eta\) to first order: \[ L_{\pi_{\theta_0}}(\pi_{\theta_0})=\eta(\pi_{\theta_0})\\ \nabla_\theta L_{\pi_{\theta_0}}(\pi_{\theta})\big|_{\theta=\theta_0}=\nabla_\theta\eta(\pi_{\theta})\big|_{\theta=\theta_0} \] Therefore, a sufficiently small update that improves \(L_\pi\) will also improve \(\eta\). Consider the following policy update: \[ \tilde\pi(a|s)=(1-\alpha)\pi(a|s)+\alpha\pi'(a|s) \] where \(\pi'=\arg\max_{\pi'}L_\pi(\pi')\). It can be shown (proof in [(Kakade & Langford, 2002)][approximately_optimal]) that: \[ \eta(\tilde\pi)\geq L_{\pi}(\tilde\pi)-\frac{2\epsilon\gamma}{(1-\gamma(1-\alpha))(1-\gamma)}\alpha^2 \] where \(\epsilon = \max_s|\mathbb{E}_{a\sim\pi'(\cdot|s)}[A_\pi(s,a)]|\).
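The surrogate and the lower bound can likewise be checked numerically for mixture policies: sweep \(\alpha\), evaluate the true \(\eta(\tilde\pi)\), the surrogate \(L_\pi(\tilde\pi)\), and the penalty term, and confirm the bound holds. A sketch with a random tabular MDP (all constants hypothetical, and \(\pi'\) an arbitrary candidate rather than the argmax, since the bound holds for any \(\pi'\)):

```python
import numpy as np

np.random.seed(3)
n_s, n_a, gamma = 3, 2, 0.9                  # hypothetical sizes
R = np.random.rand(n_s, n_a)
T = np.random.dirichlet(np.ones(n_s), size=(n_s, n_a))  # T[s, a, s']
rho0 = np.ones(n_s) / n_s

def eval_policy(pi):
    """Exact eta, advantage A_pi, and discounted visitation rho_pi."""
    r_pi = (pi * R).sum(axis=1)
    P_pi = np.einsum('sa,sap->sp', pi, T)
    V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)
    Q = R + gamma * T @ V
    rho = np.linalg.solve(np.eye(n_s) - gamma * P_pi.T, rho0)
    return rho0 @ V, Q - V[:, None], rho

pi = np.random.dirichlet(np.ones(n_a), n_s)      # old policy
prime = np.random.dirichlet(np.ones(n_a), n_s)   # candidate pi'

eta_pi, A_pi, rho_pi = eval_policy(pi)
eps_adv = np.abs((prime * A_pi).sum(axis=1)).max()   # epsilon in the bound

gaps = []
for alpha in np.linspace(0.0, 1.0, 11):
    mix = (1 - alpha) * pi + alpha * prime           # conservative mixture update
    eta_mix, _, _ = eval_policy(mix)
    L = eta_pi + rho_pi @ (mix * A_pi).sum(axis=1)   # surrogate L_pi(mix)
    penalty = 2 * eps_adv * gamma * alpha**2 / ((1 - gamma * (1 - alpha)) * (1 - gamma))
    gaps.append(eta_mix - (L - penalty))             # should be nonnegative

assert min(gaps) >= -1e-9
```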
