Policy Optimization
This post was automatically translated from Chinese by an LLM. If you find any translation errors, please leave a comment to help me improve the translation. Thanks!
This article mainly reviews and summarizes reinforcement learning algorithms based on the policy optimization theorem and its related variants.
Preliminaries
A Markov decision process (MDP) can be defined by a tuple $(\mathcal{S}, \mathcal{A}, R, P, \gamma, \rho_0)$, where:
- $\mathcal{S}$ represents the state space, a set of states;
- $\mathcal{A}$ represents the action space, a set of actions;
- $R$ represents the reward function, where $R(s, a)$ is the reward obtained by taking action $a$ in state $s$;
- $P$ represents the state transition probability function, where $P(s' \mid s, a)$ is the probability of transitioning from state $s$ to state $s'$ by taking action $a$;
- $\gamma \in [0, 1)$ represents the discount factor;
- $\rho_0$ represents the initial state distribution.
The decision process of an agent is represented by a stochastic policy $\pi(a \mid s)$, i.e. the probability of taking action $a$ in state $s$.
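To fix notation for the rest of the post, here is a minimal tabular sketch of these objects in Python (NumPy arrays for the transition and reward tables; the class and function names are only illustrative):

```python
import numpy as np

class TabularMDP:
    """Finite MDP (S, A, R, P, gamma, rho_0) stored as dense arrays."""

    def __init__(self, P, R, gamma, rho0):
        # P[s, a, s2] = probability of moving from s to s2 under action a
        # R[s, a]     = reward for taking action a in state s
        self.P, self.R, self.gamma, self.rho0 = P, R, gamma, rho0
        self.num_states, self.num_actions, _ = P.shape


def sample_action(pi, state, rng):
    """Draw an action from a stochastic policy table pi[s, a] = pi(a | s)."""
    return rng.choice(pi.shape[1], p=pi[state])
```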
Advantage function: $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$, where $Q^\pi$ and $V^\pi$ are the action-value and state-value functions of policy $\pi$.
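In the tabular setting this definition is a one-line computation, since $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$ (a sketch using the array conventions above):

```python
import numpy as np

def advantage_from_q(Q, pi):
    """A(s, a) = Q(s, a) - V(s), with V(s) = sum_a pi(a|s) * Q(s, a).

    Q  : (S, A) array of action values under policy pi
    pi : (S, A) array with pi[s, a] = pi(a | s)
    """
    V = (pi * Q).sum(axis=1, keepdims=True)  # state values under pi
    return Q - V                             # advantage of each action
```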
Policy gradient theorem: $\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$, where $d^{\pi_\theta}$ is the (discounted) state visitation distribution under $\pi_\theta$.
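As a hedged sketch of how this gradient is estimated in practice, the following is a REINFORCE-style Monte Carlo estimator for a tabular softmax policy (the trajectory format and the use of the return $G_t$ in place of $Q^{\pi_\theta}$ are illustrative assumptions):

```python
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(. | s) for a tabular softmax parameterization theta[s, a]."""
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def reinforce_gradient(theta, trajectory, gamma):
    """Monte Carlo policy gradient: sum_t grad log pi(a_t | s_t) * G_t.

    trajectory : list of (state, action, reward) tuples from one rollout.
    """
    grad = np.zeros_like(theta)
    # Discounted returns-to-go G_t, computed by scanning the rollout backwards.
    G, returns = 0.0, []
    for _, _, r in reversed(trajectory):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (s, a, _), G_t in zip(trajectory, returns):
        pi_s = softmax_policy(theta, s)
        # For a softmax policy, grad of log pi(a | s) w.r.t. theta[s, :]
        # is one_hot(a) - pi(. | s).
        g = -pi_s
        g[a] += 1.0
        grad[s] += g * G_t
    return grad
```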
Approximately Optimal Approximate Reinforcement Learning
Kakade, S. & Langford, J. Approximately Optimal Approximate Reinforcement Learning. in Proceedings of the Nineteenth International Conference on Machine Learning 267–274 (Morgan Kaufmann Publishers Inc., 2002).
This paper addresses three questions:
- Is there a performance metric that guarantees improvement at each update step?
- How difficult is it to verify that an update improves this performance metric?
- What level of performance can be achieved after a reasonable number of policy updates?
Consider the following conservative policy update rule: $\pi_{\text{new}}(a \mid s) = (1 - \alpha)\,\pi_{\text{old}}(a \mid s) + \alpha\,\pi'(a \mid s)$, where $\pi'$ is a candidate policy and $\alpha \in [0, 1]$ controls how far the update moves away from $\pi_{\text{old}}$.
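Using the tabular policy representation from the preliminaries, this mixture update is straightforward (a sketch; both arguments are (S, A) probability tables):

```python
def conservative_update(pi_old, pi_prime, alpha):
    """Mixture update: pi_new = (1 - alpha) * pi_old + alpha * pi_prime.

    Each row of the result still sums to 1, so pi_new is a valid policy
    for any alpha in [0, 1]; small alpha keeps the update conservative.
    """
    return (1.0 - alpha) * pi_old + alpha * pi_prime
```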
To be continued...
TRPO
Schulman, J., Levine, S., Moritz, P., Jordan, M. & Abbeel, P. Trust region policy optimization. in Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 1889–1897 (JMLR.org, 2015).
Policy advantage theorem: the performance of a policy $\tilde{\pi}$ can be expressed in terms of another policy $\pi$ through the accumulated advantage,
$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t A^\pi(s_t, a_t)\right].$$
Based on the discounted state visitation frequency $\rho_{\tilde{\pi}}(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \tilde{\pi})$, this can be rewritten as
$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A^\pi(s, a).$$
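For a small tabular MDP, the discounted visitation frequencies can be computed in closed form by solving $(I - \gamma P_{\tilde{\pi}}^\top)\,\rho = \rho_0$, where $P_{\tilde{\pi}}$ is the state-to-state transition matrix induced by $\tilde{\pi}$ (a sketch reusing the array conventions above):

```python
import numpy as np

def discounted_visitation(P, pi, rho0, gamma):
    """rho(s) = sum_t gamma^t * P(s_t = s | pi), computed exactly.

    P    : (S, A, S) transition tensor, P[s, a, s2]
    pi   : (S, A) policy table, pi[s, a] = pi(a | s)
    rho0 : (S,) initial state distribution
    """
    # State-to-state transitions under pi: P_pi[s, s2] = sum_a pi(a|s) * P[s, a, s2]
    P_pi = np.einsum("sa,sat->st", pi, P)
    n = P.shape[0]
    # Geometric series rho = rho0 + gamma * P_pi^T rho0 + ... solved as a linear system.
    return np.linalg.solve(np.eye(n) - gamma * P_pi.T, rho0)
```

With $\rho_{\tilde{\pi}}$ and $A^\pi$ available as arrays, the double sum above reduces to a direct tensor contraction, e.g. `(rho[:, None] * pi_new * A).sum()`.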