Recently, while reproducing RL algorithms one by one, I kept running into convergence problems with the policy gradient algorithm. After some investigation and comparative experiments, and after studying many example implementations online and in books, I found a small issue in the code: whether the final loss in Policy Gradient is averaged or summed affects how the network trains. This article discusses the issue in detail.

Let's start with the conclusion:

> When optimizing a policy with policy gradient descent, you can backpropagate the loss of each action one at a time (a `for` loop) and then take a gradient step, or sum the per-action losses (a `sum` operation) and backpropagate once. However, try to avoid averaging the losses of all actions (a `mean` operation) before backpropagation, as this can slow convergence or even prevent convergence.

The detailed reasoning and experiments are presented below.

## Introduction to Vanilla Policy Gradient

Vanilla Policy Gradient (VPG, i.e., REINFORCE) is a classic policy optimization algorithm. First, let's define the notation:

- $s_t$: the state of the environment at time $t$
- $a_t$: the action chosen by the agent at time $t$
- $\pi_\theta$: the agent's policy, represented by a neural network parameterized by $\theta$ here
- $\pi_\theta(a|s)$: the probability of the agent choosing action $a$ given state $s$
- $r_t$: the reward given by the environment at time $t$
- $\gamma$: the discount factor
- $g_t$: the cumulative discounted reward (return), $g_t=\sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$

VPG roughly consists of two steps: experience collection and policy optimization. The specific implementation of these two parts is as follows:

1. The agent interacts with the environment until the episode terminates, yielding a trajectory $s_1,a_1,r_1,s_2,a_2,r_2,\dots,s_n,a_n,r_n$.
2. For every time step $t$, update $\theta$: $\theta \leftarrow \theta + \alpha\,\nabla_\theta \log \pi_\theta(a_t|s_t)\, g_t$
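
As a concrete illustration of how the returns used in step 2 can be obtained, here is a minimal sketch that computes every $g_t$ from the collected rewards in one backward sweep (the function name `compute_returns` is illustrative, not from the original post):

```python
import torch

def compute_returns(rewards, gamma=0.99):
    # Discounted returns g_t = r_t + gamma * g_{t+1}, computed back to front.
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    return torch.tensor(returns)
```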

## Why Use Sum or Mean for the Loss?

The policy gradient algorithm is a gradient ascent algorithm. To fit the gradient descent framework of current deep learning libraries, we can take $loss = -\log \pi_\theta(a_t|s_t)\, g_t$, so that performing gradient descent on this loss is equivalent to gradient ascent on the policy objective.

In my implementation of VPG, the network is trained with the `pytorch` framework, in which the `loss` used for backpropagation must be a scalar. The policy gradient algorithm, however, takes a gradient step contribution from the triplet $(s_t,a_t,g_t)$ of every time step, i.e., several losses have to be backpropagated for a single update. There are three main ways to handle this:

1. Backpropagate each triplet $(s_t,a_t,g_t)$ separately and then perform gradient descent. The implementation code is as follows:

[Python code block retained]
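
A minimal sketch of this variant (the retained code above may differ; `policy_net`, `optimizer`, `gamma`, and the `episode` list of `(state, action, reward)` tuples are assumed names):

```python
# Variant 1: one backward() per time step; gradients accumulate, then a single step.
optimizer.zero_grad()
g = 0.0
for state, action, reward in reversed(episode):
    g = reward + gamma * g                            # discounted return g_t
    log_prob = torch.log(policy_net(state)[action])   # log pi_theta(a_t | s_t)
    loss = -log_prob * g
    loss.backward()          # each iteration builds its own graph, so this is safe
optimizer.step()
```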

2. Sum all calculated losses and then backpropagate. The implementation code is as follows:

[Python code block retained]
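
A sketch of the sum variant under the same assumptions, where `log_probs` and `returns` hold $\log\pi_\theta(a_t|s_t)$ and $g_t$ for the whole episode:

```python
# Variant 2: sum the per-step losses and backpropagate once.
optimizer.zero_grad()
loss = torch.stack([-lp * g for lp, g in zip(log_probs, returns)]).sum()
loss.backward()
optimizer.step()
```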

3. Average all calculated losses and then backpropagate. The implementation code is as follows:

[Python code block retained]
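
And the mean variant, identical except for the final reduction:

```python
# Variant 3: average the per-step losses (the problematic case analyzed below).
optimizer.zero_grad()
loss = torch.stack([-lp * g for lp, g in zip(log_probs, returns)]).mean()
loss.backward()
optimizer.step()
```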

## Why Sum and Mean Lead to Different Results

Suppose an episode has length $n$. Then there are $n$ experience triplets $(s_1,a_1,g_1),(s_2,a_2,g_2),\dots,(s_n,a_n,g_n)$; let the loss computed from each triplet be $l_1,l_2,\dots,l_n$. The following analyzes how summing versus averaging these losses affects backpropagation.

If we sum the losses, i.e., $l_{tot}=l_1+l_2+...+l_n$, then by the linearity of differentiation we have:
$$
\frac{\partial l_{tot}}{\partial \theta}=\frac{\partial l_1}{\partial \theta}+\frac{\partial l_2}{\partial \theta}+...+\frac{\partial l_n}{\partial \theta}
$$
Therefore, using $l_{tot}$ to calculate the gradient and backpropagate is equivalent to accumulating the gradients of each loss separately.
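
If the losses are averaged instead, $l_{mean}=l_{tot}/n$, the same linearity gives $\frac{\partial l_{mean}}{\partial \theta}=\frac{1}{n}\frac{\partial l_{tot}}{\partial \theta}$: the gradient is shrunk by the episode length $n$, which effectively rescales the step size by a factor that can change from episode to episode. A tiny PyTorch check of this scaling (the constant fake losses are purely illustrative):

```python
import torch

# For the same per-step losses, the gradient of the mean equals the gradient
# of the sum divided by the episode length n.
theta = torch.ones(3, requires_grad=True)
losses = torch.stack([(i + 1) * theta.sum() for i in range(4)])  # n = 4 toy losses

g_sum = torch.autograd.grad(losses.sum(), theta, retain_graph=True)[0]
g_mean = torch.autograd.grad(losses.mean(), theta)[0]
print(g_sum, g_mean)   # g_mean == g_sum / 4
```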

Zotero is a very useful open-source reference management tool. After comparing EndNote, Mendeley, and Zotero, I finally settled on Zotero for managing my own references, mainly because it offers:

- WebDAV synchronization of reference records and PDF attachments
- a rich set of extension plugins
- support for Markdown notes

This article introduces some commonly used Zotero plugins along with their download links and shares my experience with them, ordered by how useful I personally find them.

This article presents experiments on linear and nonlinear filtering for estimating free-fall motion. The figure below is from Professor Cai Yuanli's course "Stochastic Filtering and Control" at Xi'an Jiaotong University, which explains various filtering methods from the three perspectives of estimation, smoothing, and prediction.

*Figure: Map of Control Theory*

Intelligent Agent for Target Detection in Game Battles

The manually annotated dataset and the video segmentation program are available for download at the end of the article.


Application of the Branch and Bound Method to the Knapsack Problem

Final report for the course "System Optimization and Scheduling" at Xi'an Jiaotong University
