Policy Optimization

This post was automatically translated from Chinese by an LLM. If you find any translation errors, please leave a comment to help me improve the translation. Thanks!

This article reviews and summarizes reinforcement learning algorithms based on the policy optimization theorem and its related variants.

Preliminaries

A Markov decision process (MDP) can be defined by a six-tuple $\langle S, A, R, T, \gamma, \rho_0 \rangle$, where:

  • $S$ represents the state space, a set of states
  • $A$ represents the action space, a set of actions
  • $R: S \times A \to \mathbb{R}$ represents the reward function; $R(s,a)$ is the reward obtained by taking action $a$ in state $s$
  • $T: S \times A \times S \to [0,1]$ represents the state transition probability function; $T(s'|s,a)$ is the probability of transitioning from state $s$ to state $s'$ by taking action $a$
  • $\gamma$ represents the discount factor
  • $\rho_0: S \to [0,1]$ represents the initial state distribution

The decision process of an agent is represented by a stochastic policy $\pi: S \times A \to [0,1]$, where $\pi(a|s)$ is the probability of the agent choosing action $a$ in state $s$. Based on policy $\pi$, the state value function can be defined as:

$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, \pi, s_0 = s\right] \tag{1}$$

The state-action value function is:

$$Q^{\pi}(s,a) = R(s,a) + \gamma\, \mathbb{E}_{s' \sim T(\cdot|s,a)}\left[V^{\pi}(s')\right] \tag{2}$$

Advantage function:

$$A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s) \tag{3}$$

The agent's optimization goal is usually to maximize the discounted return on the initial state distribution, i.e.:

$$\eta(\pi) = \mathbb{E}_{s \sim \rho_0}\left[V^{\pi}(s)\right] \tag{4}$$

The discounted visit frequency of a state is represented as:

$$\rho_{\pi}(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \cdots \tag{5}$$
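To make these definitions concrete, here is a minimal NumPy sketch (my own toy example, not taken from any of the papers discussed below) that evaluates Eqs. (1)–(5) exactly on a small, randomly generated tabular MDP; all sizes, names, and numbers are arbitrary placeholders. The later sketches in this post reuse this toy MDP and the `evaluate` helper.

```python
import numpy as np

# A throwaway tabular MDP; every quantity here is made up for illustration.
nS, nA, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
R = rng.normal(size=(nS, nA))                      # reward R(s, a)
T = rng.dirichlet(np.ones(nS), size=(nS, nA))      # transition T(s' | s, a)
rho0 = np.ones(nS) / nS                            # initial state distribution rho_0

def evaluate(pi):
    """Return V^pi, Q^pi, A^pi, eta(pi), rho_pi for a tabular policy pi[s, a]."""
    P = np.einsum("sa,sax->sx", pi, T)             # P(s' | s) under pi
    r = (pi * R).sum(axis=1)                       # expected one-step reward under pi
    V = np.linalg.solve(np.eye(nS) - gamma * P, r) # Eq. (1), via the Bellman equation V = r + gamma P V
    Q = R + gamma * T @ V                          # Eq. (2)
    A = Q - V[:, None]                             # Eq. (3)
    eta = rho0 @ V                                 # Eq. (4)
    rho = rho0 @ np.linalg.inv(np.eye(nS) - gamma * P)  # Eq. (5): geometric series summed in closed form
    return V, Q, A, eta, rho

V, Q, A, eta, rho = evaluate(np.full((nS, nA), 1.0 / nA))   # evaluate a uniform policy
```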

Policy gradient theorem:

$$\nabla \eta(\pi) = \sum_{s,a} \rho_{\pi}(s)\, \nabla \pi(a|s)\, Q^{\pi}(s,a) \tag{6}$$
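As a sanity check of Eq. (6), the sketch below (again a toy illustration, reusing the MDP and `evaluate` helper defined above) computes the exact gradient for a softmax tabular policy and compares one coordinate against a finite-difference estimate of $\eta$. For softmax logits $\theta[s,a]$, the sum over actions in Eq. (6) collapses to $\rho_\pi(s)\,\pi(a|s)\,A^\pi(s,a)$.

```python
# Reuses nS, nA, evaluate() and the toy MDP from the Preliminaries sketch.
def softmax_policy(theta):
    """Tabular softmax policy: pi(a|s) proportional to exp(theta[s, a])."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

theta = np.random.default_rng(1).normal(size=(nS, nA))
pi = softmax_policy(theta)
V, Q, A, eta, rho = evaluate(pi)

# Eq. (6): grad eta = sum_{s,a} rho_pi(s) * grad pi(a|s) * Q^pi(s,a).
# With d pi(a|s) / d theta[s, b] = pi(a|s) * (1[a == b] - pi(b|s)), this reduces to
# d eta / d theta[s, b] = rho_pi(s) * pi(b|s) * A^pi(s, b).
grad = rho[:, None] * pi * A

# Finite-difference check of a single coordinate of theta.
eps, idx = 1e-5, (0, 1)
theta_eps = theta.copy()
theta_eps[idx] += eps
eta_eps = evaluate(softmax_policy(theta_eps))[3]
print(grad[idx], (eta_eps - eta) / eps)            # the two numbers should nearly agree
```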

Approximately Optimal Approximate Reinforcement Learning

Kakade, S. & Langford, J. Approximately Optimal Approximate Reinforcement Learning. in Proceedings of the Nineteenth International Conference on Machine Learning 267–274 (Morgan Kaufmann Publishers Inc., 2002).

This paper addresses three questions:

  1. Is there a performance metric that guarantees improvement at each update step?
  2. How difficult is it to verify that an update improves this performance metric?
  3. What level of performance can be achieved after a reasonable number of policy updates?

Consider the following conservative policy update rule:

$$\pi_{\text{new}}(a|s) = (1-\alpha)\pi(a|s) + \alpha\pi'(a|s) \tag{7}$$

where $\alpha \in [0,1]$. To ensure policy improvement at $\alpha = 1$, $\pi'$ needs to take better actions than $\pi$ in every state. For $0 < \alpha < 1$, policy improvement only requires $\pi'$ to choose better actions in most states rather than in all of them. Define the policy advantage $\mathbb{A}_{\pi,\rho_0}(\pi')$ as:

$$\mathbb{A}_{\pi,\rho_0}(\pi') = \mathbb{E}_{s \sim \rho_{\pi}}\left[\mathbb{E}_{a \sim \pi'(\cdot|s)}\left[A^{\pi}(s,a)\right]\right] \tag{8}$$

This policy advantage function measures the extent to which $\pi'$ selects actions with greater advantage under the initial state distribution $\rho_0$. At $\alpha = 0$, it is evident that $\left.\frac{\partial \eta_{\alpha}}{\partial \alpha}\right|_{\alpha=0} = \frac{1}{1-\gamma}\mathbb{A}_{\pi,\rho_0}(\pi')$, where $\eta_{\alpha}$ denotes $\eta(\pi_{\text{new}})$ as a function of $\alpha$.
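Below is a small sketch of this update on the toy MDP from the Preliminaries (reusing `evaluate`; here $\pi'$ is simply taken greedy with respect to $A^\pi$, which maximizes the policy advantage), including a numerical check of the derivative $\left.\partial \eta_\alpha / \partial \alpha\right|_{\alpha=0} = \frac{1}{1-\gamma}\mathbb{A}_{\pi,\rho_0}(\pi')$.

```python
# Reuses nS, nA, gamma, evaluate() and the toy MDP from the Preliminaries sketch.
pi = np.full((nS, nA), 1.0 / nA)                   # current policy
V, Q, A, eta, rho = evaluate(pi)

pi_prime = np.eye(nA)[A.argmax(axis=1)]            # pi'(a|s): greedy w.r.t. the advantage A^pi
d_pi = (1 - gamma) * rho                           # rho_pi normalized into a proper distribution
policy_adv = d_pi @ (pi_prime * A).sum(axis=1)     # Eq. (8)

# The derivative of eta w.r.t. alpha at alpha = 0 should equal policy_adv / (1 - gamma).
alpha = 1e-5
pi_new = (1 - alpha) * pi + alpha * pi_prime       # Eq. (7)
print((evaluate(pi_new)[3] - eta) / alpha, policy_adv / (1 - gamma))   # should nearly agree
```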

To be continued...

TRPO

Schulman, J., Levine, S., Moritz, P., Jordan, M. & Abbeel, P. Trust region policy optimization. in Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 1889–1897 (JMLR.org, 2015).

Policy advantage theorem:

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0 \sim \rho_0,\; a_t \sim \tilde{\pi}(\cdot|s_t),\; s_{t+1} \sim T(\cdot|s_t,a_t)}\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t, a_t)\right] \tag{9}$$

The proof is as follows (for ease of writing, the expectation subscripts $s_0 \sim \rho_0$ and $s_{t+1} \sim T(\cdot|s_t,a_t)$ are omitted, and $a_t \sim \tilde{\pi}(\cdot|s_t)$ is abbreviated as $a \sim \tilde{\pi}$):

$$\begin{align}
& \mathbb{E}_{a \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi}(s_t, a_t)\right] \tag{10} \\
={}& \mathbb{E}_{a \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t \left(Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)\right)\right] \tag{11} \\
={}& \mathbb{E}_{a \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t \left(r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t)\right)\right] \tag{12} \\
={}& \mathbb{E}_{a \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] + \mathbb{E}_{a \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^{t+1} V^{\pi}(s_{t+1})\right] - \mathbb{E}_{a \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t V^{\pi}(s_t)\right] \tag{13} \\
={}& \eta(\tilde{\pi}) - \mathbb{E}_{a \sim \tilde{\pi}}\left[V^{\pi}(s_0)\right] \tag{14} \\
={}& \eta(\tilde{\pi}) - \eta(\pi) \tag{15}
\end{align}$$
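The sketch below checks Eq. (9) numerically on the toy MDP from the Preliminaries (reusing `evaluate` and the arrays defined there): it samples trajectories under a new policy $\tilde{\pi}$ and averages the discounted advantages of the old policy $\pi$, which should reproduce $\eta(\tilde{\pi}) - \eta(\pi)$ up to Monte Carlo noise.

```python
# Reuses nS, nA, gamma, rho0, T, evaluate() from the Preliminaries sketch.
rng = np.random.default_rng(2)
pi = np.full((nS, nA), 1.0 / nA)                   # old policy
V, Q, A, eta, rho = evaluate(pi)

pi_tilde = rng.dirichlet(np.ones(nA), size=nS)     # an arbitrary new policy
eta_tilde = evaluate(pi_tilde)[3]

def discounted_advantage_sum(policy, horizon=200):
    """One rollout of `policy`; returns sum_t gamma^t A^pi(s_t, a_t)."""
    s, total = rng.choice(nS, p=rho0), 0.0
    for t in range(horizon):
        a = rng.choice(nA, p=policy[s])
        total += gamma**t * A[s, a]
        s = rng.choice(nS, p=T[s, a])
    return total

mc = np.mean([discounted_advantage_sum(pi_tilde) for _ in range(5000)])
print(eta_tilde - eta, mc)                         # Eq. (9): equal up to sampling error
```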

Based on the state visitation frequency $\rho_{\pi}$, equation (9) can also be written in the following form:

$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a|s) A^{\pi}(s,a) \tag{16}$$

According to equation (16), if $\sum_a \tilde{\pi}(a|s) A^{\pi}(s,a) \geq 0$ for every state $s$, then the new policy is guaranteed to be no worse than the old policy. However, due to the dependence of equation (16) on $\rho_{\tilde{\pi}}$, it is not directly computable in practical optimization. Therefore, consider a local approximation of $\eta$:

$$L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a|s) A^{\pi}(s,a) \tag{17}$$

The difference between $L_{\pi}(\tilde{\pi})$ and $\eta(\tilde{\pi})$ lies in replacing $\rho_{\tilde{\pi}}$ with $\rho_{\pi}$. Next, $L_{\pi}(\tilde{\pi})$ is analyzed. If there is a parameterized policy $\pi_\theta$ that is differentiable with respect to $\theta$, then:

$$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad \left.\nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\right|_{\theta=\theta_0} = \left.\nabla_\theta \eta(\pi_\theta)\right|_{\theta=\theta_0} \tag{18}$$

That is, $L_{\pi}$ matches $\eta$ to first order around the current policy; therefore, if the update to $\pi$ is small enough, improving $L_{\pi}$ will also improve $\eta$. Consider the following policy update:

$$\tilde{\pi}(a|s) = (1-\alpha)\pi(a|s) + \alpha\pi'(a|s) \tag{19}$$

where $\pi' = \arg\max_{\pi'} L_{\pi}(\pi')$. It can be shown (proof in Kakade & Langford, 2002) that:

$$\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi}) - \frac{2\epsilon\gamma}{(1-\gamma(1-\alpha))(1-\gamma)}\alpha^2 \tag{20}$$

where $\epsilon = \max_s \left|\mathbb{E}_{a \sim \pi'(\cdot|s)}\left[A^{\pi}(s,a)\right]\right|$.
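To see the difference between the exact identity (16) and the local approximation (17), the sketch below (same toy MDP and `evaluate` helper as above) evaluates both for mixture policies of the form (19) at several step sizes: (16) reproduces $\eta(\tilde{\pi})$ exactly, while the gap between (17) and $\eta(\tilde{\pi})$ shrinks like $\alpha^2$, consistent with the first-order match in (18).

```python
# Reuses nS, nA, evaluate() and the toy MDP from the Preliminaries sketch.
pi = np.full((nS, nA), 1.0 / nA)
V, Q, A, eta, rho = evaluate(pi)
pi_prime = np.eye(nA)[A.argmax(axis=1)]            # argmax of L_pi over pi': greedy on A^pi

for alpha in (0.5, 0.1, 0.02):
    pi_tilde = (1 - alpha) * pi + alpha * pi_prime               # Eq. (19)
    _, _, _, eta_t, rho_t = evaluate(pi_tilde)
    rhs_16 = eta + rho_t @ (pi_tilde * A).sum(axis=1)            # Eq. (16): exact
    L = eta + rho @ (pi_tilde * A).sum(axis=1)                   # Eq. (17): local approximation
    print(alpha, eta_t - rhs_16, eta_t - L)                      # ~0, and O(alpha^2) respectively
```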
