This page contains a short list of tips and best practices that have proven useful in our work over the last couple of years and will hopefully also make it easier for you to train your agents. Be aware, however, that not every item below will work in each and every application scenario. Nonetheless, if you are stuck, most of them are certainly worth a try!

Note

Below you will find a subjective and certainly incomplete collection of RL tips and tricks that will hopefully continue to grow over time. If you stumble upon something crucial that is missing from the list and would like to share it with us and the RL community, do not hesitate to get in touch and discuss it with us!

## Learning and Optimization

Use action masking whenever possible! This can be crucial, as it has the potential to drastically reduce the exploration space of your problem, which usually leads to shorter learning times and better overall results. In some cases action masking also mitigates the need for reward shaping: invalid actions are excluded from sampling, so there is no need to penalize them with negative rewards any more. If you want to learn more, we recommend checking out the tutorial on structured environments and action masking.
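As a rough illustration of the idea (a generic sketch, not Maze's actual masking API), masking can be implemented by pushing the logits of invalid actions to minus infinity before the softmax, so they receive zero sampling probability:

```python
import numpy as np

def masked_softmax(logits: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Exclude invalid actions (mask == False) from sampling by setting
    their logits to -inf before applying the softmax."""
    masked_logits = np.where(mask, logits, -np.inf)
    exp = np.exp(masked_logits - masked_logits.max())  # numerically stable
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])
mask = np.array([True, False, True, False])  # actions 1 and 3 are invalid
probs = masked_softmax(logits, mask)         # invalid actions get probability 0
```

The policy now only ever samples from the valid actions, which is exactly why no negative reward for invalid choices is needed.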

Reward Scaling and Shaping

Make sure that your step rewards lie in a reasonable range (e.g., [-1, 1]) and do not span several orders of magnitude. If these conditions are not fulfilled, you might want to apply reward scaling or clipping (see RewardScalingWrapper, RewardClippingWrapper) or manually shape your reward.
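The two transformations amount to the following (plain-Python stand-ins for the wrappers named above, not their actual implementation):

```python
def scale_reward(reward: float, scale: float = 0.01) -> float:
    """Rescale a raw step reward into a smaller range (cf. RewardScalingWrapper)."""
    return reward * scale

def clip_reward(reward: float, low: float = -1.0, high: float = 1.0) -> float:
    """Clip a raw step reward into [low, high] (cf. RewardClippingWrapper)."""
    return max(low, min(high, reward))

raw_rewards = [250.0, -3.0, 0.5]
scaled = [scale_reward(r) for r in raw_rewards]   # approx. [2.5, -0.03, 0.005]
clipped = [clip_reward(r) for r in raw_rewards]   # [1.0, -1.0, 0.5]
```

Note that clipping discards the relative magnitude of large rewards, while scaling preserves it; which one is appropriate depends on your reward structure.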

Reward and Key Performance Indicator (KPI) Monitoring

When optimizing multi-target objectives (e.g., a weighted sum of sub-rewards), consider monitoring the contributing rewards on an individual basis. Even though the overall reward appears to have stopped improving, the contributing sub-rewards might still change or fluctuate in the background. This indicates that the policy, and in turn the behaviour of your agent, is still changing. In such settings we recommend watching the learning progress by monitoring KPIs.
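One lightweight way to do this (a generic sketch, independent of Maze's built-in logging) is to accumulate every sub-reward separately while still returning their weighted sum as the step reward:

```python
from collections import defaultdict

class SubRewardMonitor:
    """Tracks individual reward components alongside the aggregated reward."""

    def __init__(self, weights: dict):
        self.weights = weights
        self.totals = defaultdict(float)
        self.steps = 0

    def record(self, components: dict) -> float:
        """Accumulate each sub-reward and return the weighted overall reward."""
        self.steps += 1
        for name, value in components.items():
            self.totals[name] += value
        return sum(self.weights[name] * value for name, value in components.items())

    def episode_means(self) -> dict:
        """Per-component mean over the episode -- useful as KPIs for logging."""
        return {name: total / self.steps for name, total in self.totals.items()}

monitor = SubRewardMonitor(weights={"distance": 1.0, "energy": 0.1})
r1 = monitor.record({"distance": 2.0, "energy": -4.0})  # overall: 2.0 - 0.4 = 1.6
r2 = monitor.record({"distance": 1.0, "energy": -2.0})  # overall: 1.0 - 0.2 = 0.8
means = monitor.episode_means()  # per-component means: distance 1.5, energy -3.0
```

Plotting the per-component means over training reveals shifts between sub-objectives that the aggregated reward curve hides.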

## Models and Networks

Network Design

Design use-case- and task-specific custom network architectures whenever required. In a straightforward case this might be a CNN for processing image observations, but it could also be a Graph Convolutional Network (GCN) when working with graph or grid observations. To do so, you might want to check out the Perception Module, the built-in network building blocks, and the section on how to work with custom models.

Further, you might want to consider behavioural cloning (BC) to design and tweak

• the network architectures

• the observations that are fed into these models

This requires an imitation learning dataset that fulfils the preconditions for supervised learning. If such a dataset is available, incorporating BC into the model and observation design process can save a lot of time and compute, as you are now training in a supervised learning setting. Intuition: if a network architecture, given the corresponding observations, is able to fit an offline trajectory dataset (without severe over-fitting), it might also be a good choice for actual RL training. If this is relevant to you, you can follow up on how to employ imitation learning with Maze.
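As a toy illustration of this intuition (using a hypothetical linear policy and synthetic data rather than Maze's imitation learning API), behavioural cloning reduces to ordinary supervised regression on (observation, expert action) pairs:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical offline dataset: observations and (noisy) expert actions
obs = rng.normal(size=(500, 4))
expert_weights = np.array([1.0, -2.0, 0.5, 0.0])
actions = obs @ expert_weights + rng.normal(scale=0.01, size=500)

# behavioural cloning as supervised learning (here: linear least squares)
fitted_weights, *_ = np.linalg.lstsq(obs, actions, rcond=None)

# a low training error without severe over-fitting suggests the model class
# and observation design are adequate for the task
mse = float(np.mean((obs @ fitted_weights - actions) ** 2))
```

If the chosen architecture cannot even fit the offline data in this setting, it is unlikely to perform well in the much noisier RL training loop.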

Continuous Action Spaces

When facing bounded continuous action spaces, use Squashed Gaussian or Beta probability distributions for your action heads instead of an unbounded Gaussian. This avoids action clipping and limits the space of explorable actions to valid regions. The section about distributions and action heads explains how you can easily switch between different probability distributions using the DistributionMapper.
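The squashing idea can be sketched as follows (a generic tanh-squashed Gaussian sample, not the DistributionMapper API): a Gaussian sample is pushed through tanh and rescaled, so every sampled action is guaranteed to lie inside the bounds without clipping:

```python
import numpy as np

def squashed_gaussian_sample(mean, log_std, low, high, rng):
    """Sample from N(mean, exp(log_std)) and squash into [low, high] via tanh."""
    z = mean + np.exp(log_std) * rng.normal(size=np.shape(mean))
    squashed = np.tanh(z)                        # now strictly in (-1, 1)
    return low + 0.5 * (squashed + 1.0) * (high - low)

rng = np.random.default_rng(42)
samples = [squashed_gaussian_sample(0.0, 1.0, -2.0, 2.0, rng) for _ in range(1000)]
```

Since tanh never reaches exactly ±1, every sample lies strictly inside the bounds; a Beta distribution achieves the same by having bounded support to begin with.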

If you would like to incorporate prior knowledge about the selection frequency of certain actions, consider biasing the output layers of these action heads towards the expected sampling distribution after randomly initializing the weights of your networks (e.g., compute_sigmoid_bias).
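For a sigmoid action head, for example, the idea boils down to initializing the output bias with the logit of the desired selection probability (a minimal sketch of the principle; the actual compute_sigmoid_bias helper may differ in detail):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_bias(target_prob: float) -> float:
    """Bias b such that sigmoid(b) == target_prob, i.e. the logit of the probability."""
    return math.log(target_prob / (1.0 - target_prob))

# an action we expect to be selected roughly 10% of the time
bias = sigmoid_bias(0.1)   # approx. -2.197
prob = sigmoid(bias)       # recovers the target probability of 0.1
```

With this bias and near-zero initial weights, the untrained policy already samples the action at roughly its expected frequency instead of at 50%.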

## Observations

Observation Normalization

For efficient RL training it is crucial that the inputs (e.g., observations) to our models (e.g., policy and value networks) follow a certain distribution and exhibit values within certain ranges. To ensure this precondition, consider normalizing your observations before the actual training by either:

• manually specifying normalization statistics (e.g., dividing by 255 for uint8 RGB image observations), or

• computing statistics from observations sampled by interacting with the environment.

As this is a recurring, boilerplate-heavy task, Maze already provides built-in, customizable functionality for normalizing observations.
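The second option corresponds roughly to the following sketch (a simplified stand-in for the built-in functionality, not Maze's implementation): collect observations by interacting with the environment, estimate statistics once, and normalize all subsequent observations with them:

```python
import numpy as np

class ObservationNormalizer:
    """Estimates mean/std from sampled observations and normalizes with them."""

    def __init__(self):
        self._samples = []

    def observe(self, obs) -> None:
        """Collect an observation during the initial sampling phase."""
        self._samples.append(np.asarray(obs, dtype=float))

    def fit(self) -> None:
        """Compute normalization statistics from the collected samples."""
        data = np.stack(self._samples)
        self.mean = data.mean(axis=0)
        self.std = data.std(axis=0) + 1e-8   # avoid division by zero

    def normalize(self, obs) -> np.ndarray:
        return (np.asarray(obs, dtype=float) - self.mean) / self.std

rng = np.random.default_rng(0)
normalizer = ObservationNormalizer()
for _ in range(1000):
    normalizer.observe(rng.normal(loc=5.0, scale=2.0, size=3))
normalizer.fit()
normalized = np.stack([normalizer.normalize(s) for s in normalizer._samples])
```

For the manual path, dividing uint8 image observations by 255 plays the same role with hard-coded statistics.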

Observation Pre-Processing

When feeding categorical observations to your models, consider converting them to their one-hot encoded vectorized counterparts. This representation is better suited for neural network processing and is common practice, for example, in Natural Language Processing (NLP). In Maze you can achieve this via observation pre-processing and the OneHotPreProcessor.
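The conversion itself can be sketched as follows (a generic numpy version; in Maze the OneHotPreProcessor takes care of this for you):

```python
import numpy as np

def one_hot(indices, num_classes: int) -> np.ndarray:
    """Convert integer categorical observations into one-hot vectors."""
    indices = np.asarray(indices)
    encoded = np.zeros((indices.size, num_classes))
    encoded[np.arange(indices.size), indices] = 1.0
    return encoded

# e.g. a categorical observation with 4 possible values
encoded = one_hot([2, 0, 3], num_classes=4)  # shape (3, 4), one 1.0 per row
```

Unlike the raw integer encoding, the one-hot representation does not impose a spurious ordering or distance between categories, which is why networks handle it better.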