Discussion on Loss Mean and Sum in Vanilla Policy Gradient

This post was automatically translated by an LLM; the original is in Chinese. If you find any translation errors, please leave a comment to help me improve the translation. Thanks!

Recently, while systematically reproducing RL algorithms, I ran into convergence problems with the policy gradient algorithm. After some exploration and comparative experiments, and after studying many reference implementations online and in books, I found a small implementation detail that matters: whether the final loss in Policy Gradient is averaged or summed affects how the network trains. This article discusses it in detail.

Let's start with the conclusion:

> When optimizing a policy with policy gradient, you can either backpropagate the loss of each action one by one (using a `for` loop) and then take a gradient step, or backpropagate the sum of the per-action losses (using a `sum` operation). However, avoid backpropagating the mean of the per-action losses (using a `mean` operation), as this can lead to slow convergence or even non-convergence.

The detailed reasoning and experiments are presented below.

## Introduction to Vanilla Policy Gradient

Vanilla Policy Gradient (VPG, i.e., REINFORCE) is a classic algorithm for policy optimization. First, let's define the notation:

- $s_t$: the state of the environment at time $t$
- $a_t$: the action chosen by the agent at time $t$
- $\pi_\theta$: the agent's policy, represented by a neural network parameterized by $\theta$ here
- $\pi_\theta(a|s)$: the probability of the agent choosing action $a$ given state $s$
- $r_t$: the reward given by the environment at time $t$
- $\gamma$: the discount factor
- $g_t$: the cumulative discounted reward, $g_t=\sum_{t'=t}^{T} \gamma^{t'-t}r_{t'}$

VPG consists of two steps, experience collection and policy optimization, implemented roughly as follows:

1. The agent interacts with the environment until the episode terminates, collecting a trajectory $s_1,a_1,r_1,s_2,a_2,r_2,\dots,s_n,a_n,r_n$.
2. For each time step $t$, update $\theta$: $\theta \leftarrow \theta + \alpha\,\nabla_\theta \log \pi_\theta(a_t|s_t)\,g_t$ (a sketch of computing the returns $g_t$ used here follows below).
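The returns $g_t$ in step 2 can be computed from the collected rewards with a single backward sweep over the trajectory. A minimal sketch (not the post's original code; `gamma` denotes the discount factor):

```python
# Minimal sketch: compute the discounted returns g_t from per-step rewards.
def discounted_returns(rewards, gamma=0.99):
    returns = []
    g = 0.0
    # Iterate backwards: g_t = r_t + gamma * g_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

# Example: three steps with reward 1 each and gamma = 0.9
# returns == [2.71, 1.9, 1.0]
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.9))
```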

## Why Use Sum and Mean for Loss?

The policy gradient algorithm is a gradient *ascent* algorithm. To fit the usual gradient *descent* frameworks, we can instead minimize $loss = -\log \pi_\theta(a_t|s_t)\,g_t$, whose gradient with respect to $\theta$ is the negative of the update direction above.
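For concreteness, here is a minimal sketch of this per-step loss (not the post's original code); the `logits`, `a_t`, and `g_t` values are made-up placeholders standing in for the policy network's output, the sampled action, and the discounted return.

```python
import torch
from torch.distributions import Categorical

# Placeholder values: in a real implementation `logits` would come from the
# policy network, `a_t` from sampling, and `g_t` from the collected rewards.
logits = torch.tensor([0.2, -0.5, 1.0], requires_grad=True)
dist = Categorical(logits=logits)
a_t = torch.tensor(2)   # action taken at time t
g_t = 1.9               # discounted return from time t onward

loss = -dist.log_prob(a_t) * g_t   # per-step surrogate loss -log pi(a_t|s_t) * g_t
loss.backward()
print(logits.grad)
```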

In this VPG implementation, the neural network is trained with the `pytorch` framework, in which the `loss` used for backpropagation must be a scalar. However, the policy gradient algorithm takes a gradient step for the triplet $(s_t,a_t,g_t)$ of every time step, i.e., multiple backward passes are needed. There are mainly three ways to handle this (a consolidated sketch of all three follows the list):

1. Backpropagate each triplet $(s_t,a_t,g_t)$ separately and then perform gradient descent. The implementation code is as follows:

[Python code block retained]

2. Sum all calculated losses and then backpropagate. The implementation code is as follows:

[Python code block retained]

3. Average all calculated losses and then backpropagate. The implementation code is as follows:

[Python code block retained]
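For reference, here is a minimal consolidated sketch of the three variants (not necessarily the post's original code). It assumes a `policy` network that maps a single, unbatched state tensor to a 1-D tensor of action logits, and lists `states`, `actions`, `returns` holding one episode's $(s_t,a_t,g_t)$ triplets; only the loss reduction differs between the three functions.

```python
import torch
import torch.nn.functional as F

def update_per_step(policy, optimizer, states, actions, returns):
    # Variant 1: backpropagate each triplet separately (gradients accumulate
    # in .grad), then take a single optimizer step.
    optimizer.zero_grad()
    for s, a, g in zip(states, actions, returns):
        log_prob = F.log_softmax(policy(s), dim=-1)[a]
        loss = -log_prob * g
        loss.backward()
    optimizer.step()

def update_sum(policy, optimizer, states, actions, returns):
    # Variant 2: sum all per-step losses, then backpropagate once.
    optimizer.zero_grad()
    losses = [-(F.log_softmax(policy(s), dim=-1)[a]) * g
              for s, a, g in zip(states, actions, returns)]
    torch.stack(losses).sum().backward()
    optimizer.step()

def update_mean(policy, optimizer, states, actions, returns):
    # Variant 3: average all per-step losses, then backpropagate once.
    optimizer.zero_grad()
    losses = [-(F.log_softmax(policy(s), dim=-1)[a]) * g
              for s, a, g in zip(states, actions, returns)]
    torch.stack(losses).mean().backward()
    optimizer.step()
```

Note that in variant 1 the gradients of the per-step losses accumulate in `.grad` before the single `optimizer.step()`, which is what makes it behave like the `sum` variant, as the next section shows.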

## Why Sum and Mean of the Loss Lead to Different Results

Assume an episode has length $n$. There are then $n$ experience triplets $(s_1,a_1,g_1),(s_2,a_2,g_2),\dots,(s_n,a_n,g_n)$; let the losses computed from them be $l_1,l_2,\dots,l_n$. Below we analyze how sum and mean affect the backpropagated gradients.

If we sum the losses, i.e., $l_{tot}=l_1+l_2+\dots+l_n$, then by the linearity of differentiation we have:
$$
\frac{\partial l_{tot}}{\partial \theta}=\frac{\partial l_1}{\partial \theta}+\frac{\partial l_2}{\partial \theta}+...+\frac{\partial l_n}{\partial \theta}
$$
Therefore, backpropagating $l_{tot}$ computes exactly the same accumulated gradient as backpropagating each loss separately, i.e., the `for`-loop and `sum` variants are equivalent.
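This equivalence is easy to verify numerically. Below is a minimal sketch (not from the post) using a toy parameter vector and two toy losses:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

def make_losses(w):
    # Two toy per-step losses standing in for l_1 and l_2.
    return [w.pow(2).sum(), 3 * w.sum()]

# Variant 1: backpropagate each loss separately; gradients accumulate in w.grad.
for l in make_losses(w):
    l.backward()
grad_loop = w.grad.clone()

# Variant 2: backpropagate the summed loss once.
w.grad.zero_()
torch.stack(make_losses(w)).sum().backward()
grad_sum = w.grad.clone()

print(torch.allclose(grad_loop, grad_sum))  # True: identical gradients
```

The check confirms that the `for`-loop and `sum` variants produce identical gradients.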