Details of Implementing Reinforcement Learning Algorithms
This is an automatically translated post by LLM. The original post is in Chinese. If you find any translation errors, please leave a comment to help me improve the translation. Thanks!
- Implementing reinforcement learning algorithms calls for caution: attention to detail is crucial for convergence and training performance. This article documents pitfalls I have encountered and details worth noting while implementing various reinforcement learning algorithms, and will be updated continuously...
And here is my own RL algorithm library: https://github.com/KezhiAdore/RL-Algorithm
Common
- In PyTorch, the cross-entropy loss function `torch.nn.functional.cross_entropy` applies a `softmax` (more precisely, `log_softmax`) to its input internally, so it expects raw logits rather than probabilities. Keep this in mind when using it to compute policy-gradient losses [1] (see the first sketch after this list).
- In some environments, trajectories are cut off artificially rather than by failure (e.g., hitting the step limit in the Cart Pole environment). This must be distinguished from termination due to failure: at a true terminal state \(q(s,a)=r\), whereas for an artificial truncation the value should still be bootstrapped, \(q(s,a)=r+\gamma V(s')\) [2] (see the second sketch after this list).
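As a concrete illustration of the first point, here is a minimal sketch (my own example, not code from the original post) showing that `F.cross_entropy` already applies `log_softmax` internally, so the policy network should output raw logits when the loss is used to obtain \(-\ln\pi_\theta(a|s)\):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)               # raw (unnormalized) policy-network outputs
actions = torch.tensor([0, 2, 1, 1])     # sampled actions

# F.cross_entropy applies log_softmax internally, so it must receive logits.
neg_log_pi = F.cross_entropy(logits, actions, reduction="none")

# Equivalent manual computation of -log pi(a|s):
manual = -F.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
assert torch.allclose(neg_log_pi, manual)

# Pitfall: feeding probabilities (already softmax-ed) applies softmax a second time,
# which distorts the loss and its gradient.
probs = F.softmax(logits, dim=-1)
double_softmaxed = F.cross_entropy(probs, actions, reduction="none")
assert not torch.allclose(double_softmaxed, neg_log_pi)
```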
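For the second point, a minimal sketch (my own illustration with hypothetical names, using Gymnasium's convention of separate `terminated`/`truncated` flags) of why only genuine termination should stop bootstrapping:

```python
import torch

def one_step_target(reward, next_value, terminal, gamma=0.99):
    # q(s,a) = r at a true terminal state, r + gamma * V(s') otherwise
    # (artificially truncated steps still bootstrap).
    return reward + gamma * next_value * (1.0 - terminal.float())

reward = torch.tensor([1.0, 1.0])
next_value = torch.tensor([5.0, 5.0])
terminated = torch.tensor([True, False])   # first transition: genuine failure
truncated = torch.tensor([False, True])    # second transition: step-limit cutoff

# Correct: only genuine termination drops the bootstrap term.
print(one_step_target(reward, next_value, terminated))   # tensor([1.0000, 5.9500])

# Pitfall: folding truncation into a single `done` flag also kills the bootstrap
# term for the truncated step, which biases the value estimates.
done = terminated | truncated
print(one_step_target(reward, next_value, done))          # tensor([1., 1.])
```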
REINFORCE
- When computing the loss, make sure the dimensions match when multiplying \(-\ln\pi_\theta(a|s)\) by the discounted returns; a shape mismatch broadcasts silently and produces a wrong loss without raising any error.
- After computing the cross-entropy, summing with `torch.sum` works better in practice than averaging with `torch.mean` [3]. A sketch combining both points follows this list.
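Putting both points together, here is a minimal REINFORCE loss sketch (my own illustration with hypothetical variable names, not code from the linked repository) that keeps the log-probability and return tensors in the same shape and sums rather than averages:

```python
import torch
import torch.nn.functional as F

def reinforce_loss(logits, actions, returns):
    """logits: (T, num_actions); actions: (T,); returns: (T,) discounted returns G_t."""
    # cross_entropy with reduction="none" gives -log pi(a_t | s_t) per step, shape (T,).
    neg_log_pi = F.cross_entropy(logits, actions, reduction="none")
    assert neg_log_pi.shape == returns.shape, "shape mismatch would broadcast silently"
    # Sum over the trajectory (empirically more stable here than taking the mean).
    return torch.sum(neg_log_pi * returns)

# Usage with dummy data for a 5-step trajectory and 3 discrete actions:
logits = torch.randn(5, 3, requires_grad=True)
actions = torch.randint(0, 3, (5,))
returns = torch.tensor([3.0, 2.5, 2.0, 1.5, 1.0])
loss = reinforce_loss(logits, actions, returns)
loss.backward()
```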
DQN Series
- Be careful when computing `q_target`, especially when `done=1`: use `q_target = reward + self._gamma * max_next_q_value * (1 - done)` so that the bootstrap term is dropped at terminal states; otherwise rewards oscillate heavily during training.
- Synchronize the `target_network` with the online network periodically; without this synchronization, convergence is very difficult. A sketch of both points follows this list.
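A minimal sketch of both points (my own illustration with hypothetical names and an arbitrary synchronization period, not the exact code from my repository): computing the target with the `(1 - done)` mask and periodically copying the online weights into the target network.

```python
import copy
import torch
import torch.nn as nn

q_network = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_network = copy.deepcopy(q_network)
gamma = 0.99
sync_every = 100  # hypothetical synchronization period, in update steps

def compute_q_target(reward, next_state, done):
    """reward, done: float tensors of shape (B,); next_state: (B, 4)."""
    # max_a' Q(s', a') taken from the frozen target network, no gradients needed.
    with torch.no_grad():
        max_next_q_value = target_network(next_state).max(dim=1).values
    # The (1 - done) mask drops the bootstrap term at terminal states; without it,
    # targets at episode ends are inflated and training rewards oscillate heavily.
    return reward + gamma * max_next_q_value * (1 - done)

def maybe_sync(step):
    # Periodically copy the online weights into the target network;
    # without this synchronization, DQN has a hard time converging.
    if step % sync_every == 0:
        target_network.load_state_dict(q_network.state_dict())
```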