Recently, while systematically reproducing RL algorithms, I kept running into convergence problems with policy gradient methods. After some exploration, comparative experiments, and studying many implementations from books and the web, the cause turned out to be a small implementation detail: whether the per-action losses are averaged or summed when computing the final loss in Policy Gradient, and how that choice affects network training. This article discusses the issue in detail.
Let's start with the conclusion:
> When optimizing a policy with the policy gradient algorithm, you can backpropagate the loss of each action one at a time (using a `for` loop), or backpropagate the sum of the per-action losses (using a `sum`), before taking a gradient descent step. However, avoid backpropagating the average of the per-action losses (using a `mean`), as this can lead to slow convergence or even non-convergence.
Detailed reasoning and experimental processes are provided below.
## Introduction to Vanilla Policy Gradient
Vanilla Policy Gradient (VPG), also known as REINFORCE, is a classic algorithm for policy optimization. First, let's define some notation:
- $s_t$: the state of the environment at time $t$
- $a_t$: the action chosen by the agent at time $t$
- $\pi_\theta$: the agent's policy, represented here by a neural network with parameters $\theta$
- $\pi_\theta(a|s)$: the probability that the agent chooses action $a$ in state $s$
- $r_t$: the reward given by the environment at time $t$
- $\gamma$: the discount factor
- $g_t$: the cumulative discounted reward (return), $g_t=\sum_{t'=t}^{T}\gamma^{t'-t}r_{t'}$
VPG roughly consists of two steps: experience collection and policy optimization. The specific implementation of these two parts is as follows:
1. The agent interacts with the environment until the episode terminates, yielding a trajectory $s_1,a_1,r_1,s_2,a_2,r_2,...,s_n,a_n,r_n$.
2. For each step $t$, update $\theta$: $\theta \leftarrow \theta +\alpha\nabla_\theta \log\pi_\theta(a_t|s_t)\,g_t$ (a sketch of computing the returns $g_t$ from the collected rewards is shown below).
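For concreteness, here is a minimal sketch of how the returns $g_t$ can be computed from the rewards of one episode; the helper name `compute_returns` is hypothetical and not taken from the original code:

```python
import torch

def compute_returns(rewards, gamma=0.99):
    """Compute the discounted return g_t = sum_{t'>=t} gamma^(t'-t) * r_{t'} for each step."""
    returns = []
    g = 0.0
    # Walk backwards through the episode so that g_t = r_t + gamma * g_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    return torch.tensor(returns)

# Example: a 4-step episode with reward 1 at every step
print(compute_returns([1.0, 1.0, 1.0, 1.0], gamma=0.9))
# tensor([3.4390, 2.7100, 1.9000, 1.0000])
```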
## Why Use Sum or Mean for the Loss?
The policy gradient algorithm is a gradient ascent algorithm. To fit the gradient descent frameworks in common use, we can instead minimize the per-step loss $loss = -\log\pi_\theta(a_t|s_t)\,g_t$, whose gradient is exactly the negative of the policy gradient update term above.
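As a single-step illustration, this loss can be computed from the log-probability of the sampled action. Below is a minimal PyTorch sketch with a toy policy network; all names, sizes, and values are hypothetical:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Toy policy network: 4-dimensional state -> logits over 2 actions (hypothetical sizes)
policy_net = nn.Linear(4, 2)

state = torch.randn(4)     # dummy state s_t
action = torch.tensor(1)   # action a_t that was taken in the trajectory
g_t = 3.2                  # discounted return g_t for this step

dist = Categorical(logits=policy_net(state))
log_prob = dist.log_prob(action)  # log pi_theta(a_t | s_t)
loss_t = -log_prob * g_t          # per-step loss; minimizing it performs gradient
                                  # ascent on log pi_theta(a_t | s_t) * g_t
```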
In this VPG implementation, the neural network is trained with the `pytorch` framework, where the `loss` passed to backpropagation must be a scalar. The policy gradient algorithm, however, produces one loss term per triplet $(s_t,a_t,g_t)$, i.e., one per time step, so multiple loss terms need to be backpropagated. There are three main ways to handle this:
1. Backpropagate each triplet $(s_t,a_t,g_t)$ separately and then perform gradient descent. The implementation code is as follows:
[Python code block retained]
2. Sum all calculated losses and then backpropagate. The implementation code is as follows:
[Python code block retained]
3. Average all calculated losses and then backpropagate. The implementation code is as follows:
[Python code block retained]
## Why Sum and Mean Lead to Different Results
Assume an episode has length $n$, so there are $n$ experience triplets $(s_1,a_1,g_1),(s_2,a_2,g_2),...,(s_n,a_n,g_n)$, and let the losses computed for these triplets be $l_1,l_2,...,l_n$. Below we analyze how summing versus averaging these losses affects backpropagation.
If we sum the losses, i.e., $l_{tot}=l_1+l_2+...+l_n$, then by the linearity of differentiation: $$ \frac{\partial l_{tot}}{\partial \theta}=\frac{\partial l_1}{\partial \theta}+\frac{\partial l_2}{\partial \theta}+...+\frac{\partial l_n}{\partial \theta} $$ Therefore, computing the gradient from $l_{tot}$ and backpropagating it is equivalent to backpropagating each loss separately and letting the gradients accumulate.
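This equivalence is easy to check numerically. Below is a minimal sketch using a toy linear model (a stand-in for the policy network, not the VPG code above), comparing per-loss backpropagation against backpropagating the sum:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)          # stand-in for the policy network
xs = torch.randn(5, 3)           # five inputs, giving five "losses" l_1..l_5

# (a) backpropagate each loss separately; PyTorch accumulates the gradients
model.zero_grad()
for x in xs:
    model(x).sum().backward()
grad_loop = model.weight.grad.clone()

# (b) backpropagate the sum of all losses in one call
model.zero_grad()
model(xs).sum().backward()
grad_sum = model.weight.grad

print(torch.allclose(grad_loop, grad_sum))  # True (up to floating-point error)
```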