Implementation details of reinforcement learning algorithms.

This post was automatically translated from Chinese by an LLM. If you find any translation errors, please leave a comment to help me improve it. Thanks!

Implementing reinforcement learning algorithms calls for care: attention to detail is crucial for convergence and training performance. This article documents pitfalls I have encountered and details worth noting while implementing various reinforcement learning algorithms, and it will be updated continuously.

My own implementations of these RL algorithms are available here: https://github.com/KezhiAdore/RL-Algorithm


Common

  • In PyTorch, the cross-entropy loss function torch.nn.functional.cross_entropy applies log_softmax to its input internally, i.e., it expects raw logits rather than probabilities; keep this in mind when using it to compute policy-gradient losses 1.
  • In some environments a trajectory is truncated artificially before true termination (e.g., hitting the maximum step count in the Cart Pole environment). This must be distinguished from termination due to failure: on true termination \(q(s,a)=r\), whereas on truncation the target should still bootstrap, \(q(s,a)=r+\gamma V(s')\) 2. A sketch illustrating both notes follows this list.
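
A minimal sketch of both notes, assuming a Gymnasium-style step API with separate terminated and truncated flags; the function and variable names here are illustrative, not taken from the original post:

```python
import torch
import torch.nn.functional as F

# Note 1: F.cross_entropy applies log_softmax internally, so feed it raw logits.
logits = torch.randn(4, 2)            # unnormalized action scores for a batch of 4 states
actions = torch.tensor([0, 1, 1, 0])  # sampled actions
neg_log_prob = F.cross_entropy(logits, actions, reduction="none")
# neg_log_prob[i] == -log softmax(logits[i])[actions[i]], i.e. -log pi(a|s)
assert torch.allclose(
    neg_log_prob,
    -F.log_softmax(logits, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1),
)

# Note 2: bootstrap on truncation, but not on true termination.
def one_step_target(reward, next_value, terminated, truncated, gamma=0.99):
    """Target for q(s,a): r on true termination, r + gamma * V(s') otherwise."""
    if terminated:  # failure / genuine terminal state
        return reward
    # ordinary step or time-limit truncation: keep the bootstrap term
    return reward + gamma * next_value
```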

REINFORCE

  • When computing the loss, make sure \(-\ln\pi_\theta(a|s)\) and the discounted returns have the same shape before multiplying them elementwise; a silent broadcast here yields an incorrect loss.
  • After computing the cross-entropy terms, summing them with torch.sum works better in practice than averaging with torch.mean 3. A sketch of the resulting loss follows this list.
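
A minimal sketch of a REINFORCE loss that respects both points; policy_net, the tensor shapes, and the variable names are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def reinforce_loss(policy_net, states, actions, returns):
    """states: (T, obs_dim); actions: (T,) long; returns: (T,) discounted returns G_t."""
    logits = policy_net(states)  # (T, num_actions), raw logits
    # cross_entropy applies log_softmax itself, so this is -log pi_theta(a_t | s_t) per step
    neg_log_prob = F.cross_entropy(logits, actions, reduction="none")  # shape (T,)
    assert neg_log_prob.shape == returns.shape  # keep dimensions consistent before multiplying
    # sum over the trajectory instead of averaging (see footnote 3)
    return torch.sum(neg_log_prob * returns)
```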

DQN Series

  • When computing q_target, mask out the bootstrap term on terminal transitions (done = 1): q_target = reward + self._gamma * max_next_q_value * (1 - done). Omitting the (1 - done) factor causes significant oscillations in reward during training.
  • Synchronize the target_network with the online network periodically; without this, convergence is very hard to achieve. A sketch covering both points follows this list.
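
A minimal sketch of the target computation and periodic synchronization; q_net, target_net, and sync_every are placeholder names, not the post's actual code:

```python
import torch

def dqn_targets(target_net, rewards, next_states, dones, gamma=0.99):
    """rewards, dones: float tensors of shape (B,); next_states: (B, obs_dim)."""
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values  # (B,)
    # (1 - dones) zeroes the bootstrap term on terminal transitions
    return rewards + gamma * max_next_q * (1.0 - dones)

def maybe_sync(step, q_net, target_net, sync_every=1000):
    """Hard-copy the online network's weights into the target network every sync_every steps."""
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())
```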

References


  1. Inconsistency in CrossEntropyLoss and Cross-Entropy Calculation in PyTorch | Kezhi's Blog

  2. Terminated and Truncated in Reinforcement Learning | Kezhi's Blog

  3. Discussion on Loss Mean vs. Sum in Vanilla Policy Gradient | Kezhi's Blog