Maze: Applied Reinforcement Learning with Python¶
Getting Started¶
Installation¶
To install Maze with pip, run:
pip install -U maze-rl
Note
Pip does not install PyTorch; you need to make sure it is available in your Python environment.
Note
Maze is compatible with Python 3.7 to 3.9. We encourage you to start with Python 3.7, as many popular environments like Atari or Box2D cannot easily be installed in newer Python versions. If you use a Python 3.9 environment, you might need to install a few additional dependencies because of this OpenAI gym issue (on Debian systems: sudo apt install libjpeg8-dev zlib1g-dev; more info on building Pillow).
If you want to use RLlib in combination with Maze, optionally install it with:
pip install ray[rllib]==1.4.1 tensorflow
(Installing RLlib is only required if you would like to use the Maze RLlib Runner)
To install the bleeding-edge development version from GitHub, first clone the repository.
git clone https://github.com/enlite-ai/maze.git
cd maze
Finally, install the project with pip in development mode and you are ready to start developing.
pip install -e .
Alternatively, you can install with pip directly from the GitHub repository:
pip install git+https://github.com/enlite-ai/maze.git
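To quickly verify the installation, you can run a minimal import check like the one below (a sketch only; it merely confirms that Maze and PyTorch are importable from the active environment):
import torch   # PyTorch has to be installed separately (see the note above)
import maze    # the maze-rl package
print("PyTorch version:", torch.__version__)
print("Maze imported from:", maze.__file__)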
A First Example¶
This example shows how to train and roll out a policy for the CartPole environment with A2C. It also gives a small glimpse into the Maze framework.
Training and Rollouts¶
To train a policy with the synchronous advantage actor-critic (A2C), run:
$ maze-run -cn conf_train env.name=CartPole-v0 algorithm=a2c algorithm.n_epochs=5
from maze.api.run_context import RunContext

rc = RunContext(algorithm="a2c", overrides={"env.name": "CartPole-v0"})
rc.train(n_epochs=5)
All training outputs, including model weights, will be stored in outputs/<exp-dir>/<time-stamp> (for example outputs/gym_env-flatten_concat-a2c-None-local/2021-02-23_23-09-25/).
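If you prefer not to copy the experiment path by hand, a small helper like the following can locate the most recent run to pass as input_dir in the rollout command below (a plain-Python sketch, not part of Maze):
from pathlib import Path

def latest_experiment_dir(base: str = "outputs") -> Path:
    """Return the most recently modified outputs/<exp-dir>/<time-stamp> directory."""
    runs = [p for p in Path(base).glob("*/*") if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime)

print(latest_experiment_dir())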
To perform rollouts for evaluating the trained policy, run:
$ maze-run env.name=CartPole-v0 policy=torch_policy input_dir=outputs/<exp-dir>/<time-stamp>
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv

env = GymMazeEnv(env="CartPole-v0")
obs = env.reset()
for i in range(10):
    action = rc.compute_action(obs)
    obs, reward, done, info = env.step(action)
This performs 50 rollouts and prints the resulting performance statistics to the command line:
step|path | value
=====|==================================================================|==================
1|rollout_stats DiscreteActionEvents action substep_0/action | [len:7900, μ:0.5]
1|rollout_stats BaseEnvEvents reward median_step_count | 157.500
1|rollout_stats BaseEnvEvents reward mean_step_count | 158.000
1|rollout_stats BaseEnvEvents reward total_step_count | 7900.000
1|rollout_stats BaseEnvEvents reward total_episode_count | 50.000
1|rollout_stats BaseEnvEvents reward episode_count | 50.000
1|rollout_stats BaseEnvEvents reward std | 31.843
1|rollout_stats BaseEnvEvents reward mean | 158.000
1|rollout_stats BaseEnvEvents reward min | 83.000
1|rollout_stats BaseEnvEvents reward max | 200.000
To see the policy directly in action you can also perform sequential rollouts with rendering:
$ maze-run env.name=CartPole-v0 policy=torch_policy input_dir=outputs/<exp-dir>/<time-stamp> \
runner=sequential runner.render=true
Note
Managed rollouts are not yet fully supported by our Python API, but will follow soon.

Tensorboard¶
To watch the training progress with Tensorboard, start it by running:
tensorboard --logdir outputs/
and view it with your browser at http://localhost:6006/.

Training Outputs¶
For easier reproducibility, Maze writes the full configuration compiled with Hydra to the command line and preserves it in the TEXT tab of Tensorboard along with the original training command.
algorithm:
critic_burn_in_epochs: 0
deterministic_eval: false
device: cpu
entropy_coef: 0.00025
epoch_length: 25
eval_repeats: 2
gae_lambda: 1.0
gamma: 0.98
lr: 0.0005
max_grad_norm: 0.0
n_epochs: 5
n_rollout_steps: 100
patience: 15
policy_loss_coef: 1.0
value_loss_coef: 0.5
env:
_target_: maze.core.wrappers.maze_gym_env_wrapper.make_gym_maze_env
name: CartPole-v0
input_dir: ''
log_base_dir: outputs
model:
...
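Each value in this dump can be overridden, either on the command line (e.g. algorithm.n_epochs=5 as above) or programmatically. The snippet below sketches the programmatic variant via RunContext with dotted Hydra override keys; the chosen values are illustrative only:
from maze.api.run_context import RunContext

rc = RunContext(algorithm="a2c",
                overrides={"env.name": "CartPole-v0",
                           "algorithm.lr": 0.0005,
                           "algorithm.gamma": 0.98})
rc.train(n_epochs=5)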
You will also find PDFs showing the inference graphs of the policy and critic networks in the experiment output directory. This turns out to be extremely useful when playing around with model architectures or when returning to experiments at a later stage.


Maze - Step by Step¶
This tutorial provides a step-by-step guide explaining how to implement your own Maze environment and get the best out of its features. We will do this based on the online version of the Guillotine 2D Cutting Stock Problem, as it is still relatively simple but exhibits the problem structure required to introduce all relevant Maze concepts.
Before diving into this tutorial we recommend reading up on the Maze Environment Hierarchy. You can of course also do this along the way, following the provided links to the required concepts as we get to them.
The remainder of this tutorial is structured as follows:
Cutting-2D Problem Specification¶
This page introduces the problem we would like to address with a Deep Reinforcement Learning agent: an online version of the Guillotine 2D Cutting Stock Problem.
Description of Problem:

In each step there is one new incoming customer order generated according to a certain demand pattern.
This customer order has to be fulfilled by cutting the exact x/y-dimensions from a set of available candidate pieces in the inventory.
A new raw piece is transferred to the inventory every time the current raw piece in inventory is used to cut and deliver a customer order.
The goal is to use as few raw pieces as possible throughout the episode, which can be achieved by following a clever cutting policy.
Agent-Environment Interaction Loop:
To make the problem more explicit from an RL perspective we formulate it according to the agent-environment interaction loop shown below.

The State contains the dimensions of the currently pending customer orders and of all pieces in inventory.
The Reward is specified to discourage the usage of raw inventory pieces.
The Action is a joint action consisting of the following components (see image below for details):
Action \(a_0\): Cutting piece selection (decides which piece from inventory to use for cutting)
Action \(a_1\): Cutting orientation selection (decides the cutting orientation of the customer order)
Action \(a_2\): Cutting order selection (decides which cut to take first; x or y)

Given this description of the problem we will now proceed with implementing a corresponding simulation.
Implementing the CoreEnv¶
The complete code for this part of the tutorial can be found here
# file structure
- cutting_2d
- main.py
- env
- core_env.py
- inventory.py
- maze_state.py
- maze_action.py
CoreEnv¶
The first component we need to implement is the Core Environment, which defines the main mechanics and functionality of the environment. For this example we will call it Cutting2DCoreEnvironment. As for any other Gym environment, we need to implement several methods according to the CoreEnv interface.
We will start with the very basic components and add more and more features (complexity) throughout this tutorial:
step(): Implements the cutting mechanics.
reset(): Resets the environment as well as the piece inventory.
seed(): Sets the random state of the environment for reproducibility.
close(): Can be used for cleanup.
get_maze_state(): Returns the current MazeState of the environment.
You can find the implementation of the basic version of the Cutting2DCoreEnvironment below.
from typing import Union, Tuple, Dict, Any
import numpy as np
from maze.core.env.core_env import CoreEnv
from maze.core.env.structured_env import ActorID
from .maze_state import Cutting2DMazeState
from .maze_action import Cutting2DMazeAction
from .inventory import Inventory
class Cutting2DCoreEnvironment(CoreEnv):
"""Environment for cutting 2D pieces based on the customer demand. Works as follows:
- Keeps inventory of 2D pieces available for cutting and fulfilling the demand.
- Produces a new demand for one piece in every step (here a static demand).
- The agent should decide which piece from inventory to cut (and how) to fulfill the given demand.
- What remains from the cut piece is put back in inventory.
- All the time, one raw (full-size) piece is available in inventory.
(If it gets cut, it is replenished in the next step.)
- Rewards are calculated to motivate the agent to consume as few raw pieces as possible.
- If inventory gets full, the oldest pieces get discarded.
:param max_pieces_in_inventory: Size of the inventory.
:param raw_piece_size: Size of a fresh raw (= full-size) piece.
:param static_demand: Order to issue in each step.
"""
def __init__(self, max_pieces_in_inventory: int, raw_piece_size: (int, int),
static_demand: (int, int)):
super().__init__()
self.max_pieces_in_inventory = max_pieces_in_inventory
self.raw_piece_size = tuple(raw_piece_size)
self.current_demand = static_demand
# setup environment
self._setup_env()
def _setup_env(self):
"""Setup environment."""
self.inventory = Inventory(self.max_pieces_in_inventory, self.raw_piece_size)
self.inventory.replenish_piece()
def step(self, maze_action: Cutting2DMazeAction) \
-> Tuple[Cutting2DMazeState, np.array, bool, Dict[Any, Any]]:
"""Summary of the step (simplified, not necessarily respecting the actual order in the code):
1. Check if the selected piece to cut is valid (i.e. in inventory, large enough etc.)
2. Attempt the cutting
3. Replenish a fresh piece if needed and return an appropriate reward
:param maze_action: Cutting MazeAction to take.
:return: maze_state, reward, done, info
"""
info, reward = {}, 0
replenishment_needed = False
# check if valid piece id was selected
if maze_action.piece_id >= self.inventory.size():
info['error'] = 'piece_id_out_of_bounds'
# perform cutting
else:
piece_to_cut = self.inventory.pieces[maze_action.piece_id]
# attempt the cut
if self.inventory.cut(maze_action, self.current_demand):
info['msg'] = "valid_cut"
replenishment_needed = piece_to_cut == self.raw_piece_size
else:
# assign a negative reward for invalid cutting attempts
info['error'] = "invalid_cut"
reward = -2
# check if replenishment is required
if replenishment_needed:
self.inventory.replenish_piece()
# assign negative reward if a piece has to be replenished
reward = -1
# compile env state
maze_state = self.get_maze_state()
return maze_state, reward, False, info
def get_maze_state(self) -> Cutting2DMazeState:
"""Returns the current Cutting2DMazeState of the environment."""
return Cutting2DMazeState(self.inventory.pieces, self.max_pieces_in_inventory,
self.current_demand, self.raw_piece_size)
def reset(self) -> Cutting2DMazeState:
"""Resets the environment to initial state."""
self._setup_env()
return self.get_maze_state()
def close(self):
"""No additional cleanup necessary."""
def seed(self, seed: int) -> None:
"""Seed random state of environment."""
# No randomness in the env at this point
pass
    # --- let's ignore everything below this line for now ---
def get_renderer(self) -> Any:
pass
def get_serializable_components(self) -> Dict[str, Any]:
pass
def is_actor_done(self) -> bool:
pass
def actor_id(self) -> ActorID:
pass
def agent_counts_dict(self) -> Dict[Union[str, int], int]:
pass
Environment Components¶
To keep the implementation of the core environment short and clean, we introduce a dedicated Inventory class providing functionality for:
maintaining the inventory of available cutting pieces
replenishing new raw inventory pieces if required
the cutting logic of the environment
from .maze_action import Cutting2DMazeAction
class Inventory:
"""Holds the inventory of 2D pieces and performs cutting.
:param max_pieces_in_inventory: Size of the inventory. If full, the oldest pieces get discarded.
:param raw_piece_size: Size of a fresh raw (= full-size) piece.
"""
def __init__(self, max_pieces_in_inventory: int, raw_piece_size: (int, int)):
self.max_pieces_in_inventory = max_pieces_in_inventory
self.raw_piece_size = raw_piece_size
self.pieces = []
# == Inventory management ==
def is_full(self) -> bool:
"""Checks weather all slots in the inventory are in use."""
return len(self.pieces) == self.max_pieces_in_inventory
def store_piece(self, piece: (int, int)) -> None:
"""Store the given piece.
:param piece: Piece to store.
"""
# If we would run out of storage space, discard the oldest piece first
if self.is_full():
self.pieces.pop(0)
self.pieces.append(piece)
def replenish_piece(self) -> None:
"""Add a fresh raw piece to inventory."""
self.store_piece(self.raw_piece_size)
# == Cutting ==
def cut(self, maze_action: Cutting2DMazeAction, ordered_piece: (int, int)) -> bool:
"""Attempt to perform the cutting. Remains of the cut piece are put back to inventory.
:param maze_action: the cutting maze_action to perform
:param ordered_piece: Dimensions of the piece that we should produce
        :return: True if the cutting was successful, False on error.
"""
if maze_action.rotate:
ordered_piece = ordered_piece[::-1]
# Check the piece ID is valid
if maze_action.piece_id >= len(self.pieces):
return False
# Check whether the cut is possible
if any([ordered_piece[dim] > available_size for dim, available_size
in enumerate(self.pieces[maze_action.piece_id])]):
return False
# Perform the cut
cutting_order = [1, 0] if maze_action.reverse_cutting_order else [0, 1]
piece_to_cut = list(self.pieces.pop(maze_action.piece_id))
for dim in cutting_order:
residual = piece_to_cut.copy()
residual[dim] = piece_to_cut[dim] - ordered_piece[dim]
piece_to_cut[dim] = ordered_piece[dim]
if residual[dim] > 0:
self.store_piece(tuple(residual))
return True
# == State representation ==
def size(self) -> int:
"""Current size of the inventory."""
return len(self.pieces)
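To make the cutting logic above concrete, the following sketch (assuming the Inventory and Cutting2DMazeAction classes are importable from the tutorial package) cuts a (30, 15) order from a full (100, 100) raw piece: the cut first shortens the x-dimension, then the y-dimension, leaving two residual pieces in inventory.
from tutorial_maze_env.part01_core_env.env.inventory import Inventory
from tutorial_maze_env.part01_core_env.env.maze_action import Cutting2DMazeAction

inventory = Inventory(max_pieces_in_inventory=10, raw_piece_size=(100, 100))
inventory.replenish_piece()                       # inventory: [(100, 100)]
maze_action = Cutting2DMazeAction(piece_id=0, rotate=False, reverse_cutting_order=False)
assert inventory.cut(maze_action, ordered_piece=(30, 15))
print(inventory.pieces)                           # [(70, 100), (30, 85)] -- the two residuals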
MazeState and MazeAction¶
As motivated and explained in more detail in our tutorial on Customizing Core and Maze Envs, CoreEnvs rely on MazeState and MazeAction objects for interacting with an agent.
For the present case this is a Cutting2DMazeState:
class Cutting2DMazeState:
"""Cutting 2D environment MazeState representation.
:param inventory: A list of pieces in inventory.
:param max_pieces_in_inventory: Max number of pieces in inventory (inventory size).
:param current_demand: Piece that should be produced in the next step.
:param raw_piece_size: Size of a raw piece.
"""
def __init__(self, inventory: [(int, int)], max_pieces_in_inventory: int,
current_demand: (int, int), raw_piece_size: (int, int)):
self.inventory = inventory.copy()
self.max_pieces_in_inventory = max_pieces_in_inventory
self.current_demand = current_demand
self.raw_piece_size = raw_piece_size
and a Cutting2DMazeAction defining which inventory piece to cut, in which cutting order and orientation:
class Cutting2DMazeAction:
"""Environment cutting MazeAction object.
:param piece_id: ID of the piece to cut.
:param rotate: Whether to rotate the ordered piece.
:param reverse_cutting_order: Whether to cut along Y axis first (not X first as normal).
"""
def __init__(self, piece_id: int, rotate: bool, reverse_cutting_order: bool):
self.piece_id = piece_id
self.rotate = rotate
self.reverse_cutting_order = reverse_cutting_order
These two classes are utilized in the CoreEnv code above.
Test Script¶
The following snippet will instantiate the environment and run it for 15 steps.
""" Test script CoreEnv """
from tutorial_maze_env.part01_core_env.env.core_env import Cutting2DCoreEnvironment
from tutorial_maze_env.part01_core_env.env.maze_action import Cutting2DMazeAction
def main():
# init and reset core environment
core_env = Cutting2DCoreEnvironment(max_pieces_in_inventory=200, raw_piece_size=[100, 100],
static_demand=(30, 15))
maze_state = core_env.reset()
# run interaction loop
for i in range(15):
# create cutting maze_action
maze_action = Cutting2DMazeAction(piece_id=0, rotate=False, reverse_cutting_order=False)
# take actual environment step
maze_state, reward, done, info = core_env.step(maze_action)
print(f"reward {reward} | done {done} | info {info}")
if __name__ == "__main__":
""" main """
main()
When running the script you should get the following command line output:
reward -1 | done False | info {'msg': 'valid_cut'}
reward 0 | done False | info {'msg': 'valid_cut'}
reward 0 | done False | info {'msg': 'valid_cut'}
...
Adding a Renderer¶
The complete code for this part of the tutorial can be found here
# file structure
- cutting_2d
- main.py # modified
- env
- core_env.py # modified
- inventory.py
- maze_state.py
- maze_action.py
- renderer.py # new
Renderer¶
To check whether our implementation of the environment works as expected, and to later on observe how trained agents behave, we add a Renderer as the next step in this tutorial.
For implementing the renderer we rely on matplotlib to ensure that it is compatible with the Maze Rollout Visualization Tools.
The Cutting2DRenderer will show the selected piece (the MazeAction) on the left, along with the current MazeState of the inventory on the right, as shown here.
from typing import Tuple, Optional
import numpy as np
import matplotlib.patches as patches
import matplotlib.pyplot as plt
from maze.core.annotations import override
from maze.core.log_events.step_event_log import StepEventLog
from maze.core.rendering.renderer import Renderer
from .maze_action import Cutting2DMazeAction
from .maze_state import Cutting2DMazeState
class Cutting2DRenderer(Renderer):
"""Rendering class for the 2D cutting env.
The ``Cutting2DRenderer`` will show the selected piece (the maze_action) on the left,
plus the current state of the inventory on the right
"""
@override(Renderer)
def render(self, maze_state: Cutting2DMazeState, maze_action: Optional[Cutting2DMazeAction], events: StepEventLog) -> None:
"""
Render maze_state and maze_action of the cutting 2D env.
:param maze_state: MazeState to render
:param maze_action: MazeAction to render
:param events: Events logged during the step (not used)
"""
plt.figure("Cutting 2D", figsize=(8, 4))
plt.clf()
# The maze_action taken
plt.subplot(121, aspect='equal')
if maze_action is not None:
self._plot_maze_action(maze_action, "MazeAction", maze_state)
else:
self._add_title("MazeAction (none)")
# The inventory state
plt.subplot(122, aspect='equal')
self._plot_inventory(maze_state, maze_action)
plt.tight_layout()
plt.draw()
plt.pause(0.1)
def _plot_maze_action(self, maze_action: Cutting2DMazeAction, title: str, maze_state: Cutting2DMazeState):
piece_to_cut = maze_state.inventory[maze_action.piece_id]
if maze_action.rotate:
piece_to_cut = piece_to_cut[::-1]
plt.xlim([0, maze_state.raw_piece_size[0]])
plt.ylim([0, maze_state.raw_piece_size[1]])
self._draw_piece(piece_to_cut)
self._draw_piece(maze_state.current_demand, highlight=True)
self._draw_cutting_lines(maze_state.current_demand, piece_to_cut, maze_action.reverse_cutting_order)
self._add_title(title)
def _plot_inventory(self, maze_state: Cutting2DMazeState, maze_action: Cutting2DMazeAction):
# plot inventory pieces
inventory_piece_dims = np.vstack(maze_state.inventory)
inventory_piece_dims = np.sort(inventory_piece_dims, axis=1)
plt.plot(inventory_piece_dims[:, 0], inventory_piece_dims[:, 1], "ko",
alpha=0.5, label="inventory pieces")
# plot current demand
current_demand = sorted(maze_state.current_demand)
plt.plot(current_demand[0], current_demand[1], "o",
color=(0.7, 0.2, 0.2), alpha=0.75, label="current demand")
# plot maze_action
piece_to_cut = inventory_piece_dims[maze_action.piece_id]
plt.plot(piece_to_cut[0], piece_to_cut[1], "bo",
alpha=0.75, label="cutting inventory piece")
plt.grid()
plt.legend()
plt.axis("equal")
self._add_title("Inventory Pieces")
@staticmethod
def _draw_piece(piece: Tuple[int, int], highlight: bool = False):
plt.gca().add_patch(patches.Rectangle((0, 0), piece[0], piece[1],
facecolor=(0.7, 0.2, 0.2) if highlight else (0.8, 0.8, 0.8)))
@staticmethod
def _add_title(title: str):
plt.title(title, fontdict=dict(fontsize=16, fontweight='bold', horizontalalignment='left'), loc='left')
@staticmethod
def _draw_cutting_lines(ordered_piece: Tuple[int, int], piece_to_cut: Tuple[int, int], reverse_cutting_order: bool):
"""Draw the cutting lines.
:param ordered_piece: Size of the ordered piece
:param piece_to_cut: Piece which we are cutting
:param reverse_cutting_order: If we should cut along Y axis first (instead of X first)
"""
if reverse_cutting_order:
h_x = (0, piece_to_cut[0])
h_y = (ordered_piece[1], ordered_piece[1])
v_x = (ordered_piece[0], ordered_piece[0])
v_y = (0, ordered_piece[1])
else:
h_x = (0, ordered_piece[0])
h_y = (ordered_piece[1], ordered_piece[1])
v_x = (ordered_piece[0], ordered_piece[0])
v_y = (0, piece_to_cut[1])
plt.plot(h_x, h_y, color='black', linestyle="--")
plt.plot(v_x, v_y, color='black', linestyle="--")
Updating the CoreEnv¶
To make use of the renderer we simply have to instantiate it in the constructor of the CoreEnv and make it accessible via the get_renderer() method.
from .renderer import Cutting2DRenderer
...
class Cutting2DCoreEnvironment(CoreEnv):
def __init__(self, max_pieces_in_inventory: int, raw_piece_size: (int, int), static_demand: (int, int)):
super().__init__()
# initialize rendering
self.renderer = Cutting2DRenderer()
...
def get_renderer(self) -> Cutting2DRenderer:
"""Cutting 2D renderer module."""
return self.renderer
Test Script¶
The following snippet will instantiate the environment and run it for 15 steps.
""" Test script CoreEnv """
from tutorial_maze_env.part02_renderer.env.core_env import Cutting2DCoreEnvironment
from tutorial_maze_env.part02_renderer.env.maze_action import Cutting2DMazeAction
def main():
# init and reset core environment
core_env = Cutting2DCoreEnvironment(max_pieces_in_inventory=200, raw_piece_size=[100, 100],
static_demand=(30, 15))
maze_state = core_env.reset()
# run interaction loop
for i in range(15):
# create cutting maze_action
maze_action = Cutting2DMazeAction(piece_id=0, rotate=False, reverse_cutting_order=False)
# render current state along with next maze_action
core_env.renderer.render(maze_state, maze_action, None)
# take actual environment step
maze_state, reward, done, info = core_env.step(maze_action)
print(f"reward {reward} | done {done} | info {info}")
if __name__ == "__main__":
""" main """
main()
When running the script you should get the following command line output:
reward -1 | done False | info {'msg': 'valid_cut'}
reward 0 | done False | info {'msg': 'valid_cut'}
reward 0 | done False | info {'msg': 'valid_cut'}
...
and a rendering of the current MazeState and MazeAction in each time step similar to the image shown below:

The dashed line represents the cutting configuration specified with the MazeAction.
Implementing the MazeEnv¶
The complete code for this part of the tutorial can be found here
# file structure
- cutting_2d
- main.py # modified
- env
- core_env.py # modified
- inventory.py
- maze_state.py
- maze_action.py
- renderer.py
- maze_env.py # new
- space_interfaces
- dict_action_conversion.py # new
- dict_observation_conversion.py # new
MazeEnv¶
The MazeEnv wraps the CoreEnv as a Gym-style environment in a reusable form by utilizing the interfaces (mappings) from the MazeState to the observation and from the MazeAction to the action. After implementing the MazeEnv we will be ready to perform our first training run. To learn more about the usability and advantages of this concept you can follow up on Customizing Core and Maze Envs.
In the remainder of this part of the tutorial we will implement the Cutting2DEnvironment (MazeEnv) as well as a corresponding set of interfaces.
from maze.core.env.core_env import CoreEnv
from maze.core.env.maze_env import MazeEnv
from maze.core.env.action_conversion import ActionConversionInterface
from maze.core.env.observation_conversion import ObservationConversionInterface
from .core_env import Cutting2DCoreEnvironment
from ..space_interfaces.dict_observation_conversion import ObservationConversion
from ..space_interfaces.dict_action_conversion import ActionConversion
class Cutting2DEnvironment(MazeEnv[Cutting2DCoreEnvironment]):
"""Maze environment for 2d cutting.
:param core_env: The underlying core environment.
    :param action_conversion: An action conversion interface.
:param observation_conversion: An observation conversion interface.
"""
def __init__(self,
core_env: CoreEnv,
action_conversion: ActionConversionInterface,
observation_conversion: ObservationConversionInterface):
super().__init__(core_env=core_env,
action_conversion_dict={0: action_conversion},
observation_conversion_dict={0: observation_conversion})
def maze_env_factory(max_pieces_in_inventory: int, raw_piece_size: (int, int),
static_demand: (int, int)) -> Cutting2DEnvironment:
"""Convenience factory function that compiles a trainable maze environment.
(for argument details see: Cutting2DCoreEnvironment)
"""
# init core environment
core_env = Cutting2DCoreEnvironment(max_pieces_in_inventory=max_pieces_in_inventory,
raw_piece_size=raw_piece_size,
static_demand=static_demand)
# init maze environment including observation and action interfaces
action_conversion = ActionConversion(max_pieces_in_inventory=max_pieces_in_inventory)
observation_conversion = ObservationConversion(raw_piece_size=raw_piece_size,
max_pieces_in_inventory=max_pieces_in_inventory)
return Cutting2DEnvironment(core_env, action_conversion, observation_conversion)
The MazeEnv is instantiated with the underlying CoreEnv and the two interfaces for MazeStates and MazeActions.
For convenience we also add a maze_env_factory to instantiate the MazeEnv from the original environment parameter set. This will be useful in the next part of the tutorial, where we will train an agent based on this environment.
ObservationConversionInterface¶
The ObservationConversionInterface converts CoreEnv MazeState objects into machine-readable Gym-style observations and defines the respective Gym observation space.
In the present case the observation is defined as a dictionary with the following structure:
inventory: 2d array representing all pieces currently in inventory
inventory_size: count of pieces currently in inventory
ordered_piece: 2d vector representing the customer order (current demand)
import numpy as np
from typing import Dict
from gym import spaces
from maze.core.annotations import override
from maze.core.env.observation_conversion import ObservationConversionInterface
from ..env.maze_state import Cutting2DMazeState
class ObservationConversion(ObservationConversionInterface):
"""Cutting 2d environment state to dictionary observation.
:param max_pieces_in_inventory: Size of the inventory. If inventory gets full, the oldest pieces get discarded.
:param raw_piece_size: Size of a fresh raw (= full-size) piece
"""
def __init__(self, raw_piece_size: (int, int), max_pieces_in_inventory: int):
self.max_pieces_in_inventory = max_pieces_in_inventory
self.raw_piece_size = raw_piece_size
@override(ObservationConversionInterface)
def maze_to_space(self, maze_state: Cutting2DMazeState) -> Dict[str, np.ndarray]:
"""Converts core environment state to a machine readable agent observation."""
# Convert inventory to numpy array and stretch it to full size (filling with zeros)
inventory_state = maze_state.inventory
inventory_state += [(0, 0)] * (self.max_pieces_in_inventory - len(maze_state.inventory))
# Compile dict space observation
return {'inventory': np.asarray(inventory_state, dtype=np.float32),
'inventory_size': np.asarray([len(maze_state.inventory)], dtype=np.float32),
'ordered_piece': np.asarray(maze_state.current_demand, dtype=np.float32)}
@override(ObservationConversionInterface)
def space_to_maze(self, observation: Dict[str, np.ndarray]) -> Cutting2DMazeState:
"""Converts agent observation to core environment state (not required for this example)."""
raise NotImplementedError
@override(ObservationConversionInterface)
def space(self) -> spaces.Dict:
"""Return the Gym dict observation space based on the given params.
:return: Gym space object
- inventory: max_pieces_in_inventory x 2 (x/y-dimensions of pieces in inventory)
- inventory_size: scalar number of pieces in inventory
- ordered_piece: 2d vector holding x/y-dimension of customer ordered piece
"""
return spaces.Dict({
'inventory': spaces.Box(low=np.zeros((self.max_pieces_in_inventory, 2), dtype=np.float32),
high=np.vstack([[self.raw_piece_size[0] + 1, self.raw_piece_size[1] + 1]] *
self.max_pieces_in_inventory).astype(np.float32),
dtype=np.float32),
'inventory_size': spaces.Box(low=np.float32(0), high=self.max_pieces_in_inventory + 1,
shape=(1,), dtype=np.float32),
'ordered_piece': spaces.Box(low=np.float32(0), high=np.float32(max(self.raw_piece_size) + 1),
shape=(2,), dtype=np.float32)
})
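As a quick usage sketch of the conversion above (module paths assumed from the tutorial file structure): the inventory is padded with (0, 0) entries up to its maximum size and everything is returned as plain numpy arrays.
from tutorial_maze_env.part03_maze_env.env.maze_state import Cutting2DMazeState
from tutorial_maze_env.part03_maze_env.space_interfaces.dict_observation_conversion import ObservationConversion

state = Cutting2DMazeState(inventory=[(100, 100), (30, 85)], max_pieces_in_inventory=10,
                           current_demand=(30, 15), raw_piece_size=(100, 100))
conversion = ObservationConversion(raw_piece_size=(100, 100), max_pieces_in_inventory=10)
obs = conversion.maze_to_space(state)
print(obs['inventory'].shape)    # (10, 2) -- two real pieces plus zero padding
print(obs['inventory_size'])     # [2.]
print(obs['ordered_piece'])      # [30. 15.]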
ActionConversionInterface¶
The ActionConversionInterface converts agent actions into CoreEnv MazeAction objects and defines the respective Gym action space.
In the present case the action is defined as a dictionary with the following structure:
piece_idx: ID of the inventory piece that should be used for cutting
cut_rotation: defines whether to rotate the piece for cutting or not
cut_order: defines the cutting order (xy vs. yx)
from typing import Dict
from gym import spaces
from maze.core.env.action_conversion import ActionConversionInterface
from ..env.maze_action import Cutting2DMazeAction
from ..env.maze_state import Cutting2DMazeState
class ActionConversion(ActionConversionInterface):
"""Converts agent actions to actual environment maze_actions.
:param max_pieces_in_inventory: Size of the inventory
"""
def __init__(self, max_pieces_in_inventory: int):
self.max_pieces_in_inventory = max_pieces_in_inventory
def space_to_maze(self, action: Dict[str, int], maze_state: Cutting2DMazeState) -> Cutting2DMazeAction:
"""Converts agent dictionary action to environment MazeAction object."""
return Cutting2DMazeAction(piece_id=action["piece_idx"],
rotate=bool(action["cut_rotation"]),
reverse_cutting_order=bool(action["cut_order"]))
def maze_to_space(self, maze_action: Cutting2DMazeAction) -> Dict[str, int]:
"""Converts environment MazeAction object to agent dictionary action."""
return {"piece_idx": maze_action.piece_id,
"cut_rotation": int(maze_action.rotate),
"cut_order": int(maze_action.reverse_cutting_order)}
def space(self) -> spaces.Dict:
"""Returns Gym dict action space."""
return spaces.Dict({
"piece_idx": spaces.Discrete(self.max_pieces_in_inventory), # Which piece should be cut
"cut_rotation": spaces.Discrete(2), # Rotate: (yes / no)
"cut_order": spaces.Discrete(2) # Cutting order: (xy / yx)
})
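A short usage sketch of the ActionConversion above (module path assumed from the tutorial file structure): a random dictionary action sampled from the space translates into a Cutting2DMazeAction and back; note that this particular implementation does not use the maze_state argument.
from tutorial_maze_env.part03_maze_env.space_interfaces.dict_action_conversion import ActionConversion

conversion = ActionConversion(max_pieces_in_inventory=10)
action = conversion.space().sample()             # e.g. {'piece_idx': 3, 'cut_rotation': 1, 'cut_order': 0}
maze_action = conversion.space_to_maze(action, maze_state=None)
print(maze_action.piece_id, maze_action.rotate, maze_action.reverse_cutting_order)
print(conversion.maze_to_space(maze_action))     # back to the dictionary action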
Updating the CoreEnv¶
For the sake of completeness we also show two more minor modifications required in the CoreEnv, which are not too important for this tutorial at the moment. In short, the StructuredEnv interface supports interaction patterns beyond standard Gym environments, to model for example hierarchical or multi-agent RL problems.
We will get back to this in our more advanced tutorials.
The code below defines that the current version of the environment requires only one actor (id 0) with a single policy (id 0) that is never done.
from maze.core.env.structured_env import ActorID
class Cutting2DCoreEnvironment(CoreEnv):
...
def is_actor_done(self) -> bool:
"""Returns True if the just stepped actor is done, which is different to the done flag of the environment."""
return False
def actor_id(self) -> ActorID:
"""Returns the currently executed actor along with the policy id. The id is unique only with
respect to the policies (every policy has its own actor 0).
Note that identities of done actors can not be reused in the same rollout.
:return: The current actor, as tuple (policy id, actor number).
"""
return ActorID(step_key=0, agent_id=0)
...
Test Script¶
The following snippet will instantiate the environment and run it for 15 steps.
Note that (compared to the previous example) we are now:
working with observations and actions instead of MazeStates and MazeActions
able to sample actions from the action_space object
""" Test script CoreEnv """
from tutorial_maze_env.part03_maze_env.env.maze_env import maze_env_factory
def main():
# init maze environment including observation and action interfaces
env = maze_env_factory(max_pieces_in_inventory=10,
raw_piece_size=[100, 100],
static_demand=(30, 15))
# reset environment
obs = env.reset()
# run interaction loop
for i in range(15):
# sample random action
action = env.action_space.sample()
# take actual environment step
obs, reward, done, info = env.step(action)
print(f"reward {reward} | done {done} | info {info}")
if __name__ == "__main__":
""" main """
main()
reward -1 | done False | info {'msg': 'valid_cut'}
reward 0 | done False | info {'msg': 'valid_cut'}
reward 0 | done False | info {'msg': 'valid_cut'}
reward 0 | done False | info {'error': 'piece_id_out_of_bounds'}
reward 0 | done False | info {'error': 'piece_id_out_of_bounds'}
...
Training the MazeEnv¶
The complete code for this part of the tutorial can be found here
# file structure
- cutting_2d
- main.py
- env ...
- space_interfaces ...
- conf
- env
- tutorial_cutting_2d_basic.yaml # new
- model
- tutorial_cutting_2d_basic.yaml # new
- wrappers
- tutorial_cutting_2d_basic.yaml # new
Note
Hydra only accepts .yaml as file extension.
Hydra Configuration¶
The entire Maze workflow is boosted by the Hydra configuration system. To be able to perform our first training run via the Maze CLI, we have to add a few more config files. Going into the details of the config structure is beyond the scope of this tutorial for now. However, we still provide some information on the parts relevant for this example.
The config file for the maze_env_factory looks as follows:
# @package env
_target_: tutorial_maze_env.part03_maze_env.env.maze_env.maze_env_factory
# parametrizes the core environment
max_pieces_in_inventory: 200
raw_piece_size: [100, 100]
static_demand: [30, 15]
Additionally, we also provide a wrapper config but refer to Customizing Environments with Wrappers for details.
# @package wrappers
# limits the maximum number of time steps of an episode
maze.core.wrappers.time_limit_wrapper.TimeLimitWrapper:
max_episode_steps: 200
# flattens the dictionary observations to work with DenseLayers
maze.core.wrappers.observation_preprocessing.preprocessing_wrapper.PreProcessingWrapper:
pre_processor_mapping:
- observation: inventory
_target_: maze.preprocessors.FlattenPreProcessor
keep_original: false
config:
num_flatten_dims: 2
# monitoring wrapper
maze.core.wrappers.monitoring_wrapper.MazeEnvMonitoringWrapper:
observation_logging: false
action_logging: true
reward_logging: false
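To see why the flattening pre-processor is configured here: the inventory observation is a 2D array of shape (max_pieces_in_inventory, 2), while dense layers expect flat vectors. The numpy lines below only mimic that effect and reflect our reading of num_flatten_dims: 2; they are not the wrapper implementation itself.
import numpy as np

inventory = np.zeros((200, 2), dtype=np.float32)   # shape of the 'inventory' observation
flat = inventory.reshape(-1)                       # the two dimensions collapsed into one
print(inventory.shape, "->", flat.shape)           # (200, 2) -> (400,)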
To learn more about the model config in conf/model/tutorial_cutting_2d_basic.yaml you can visit the introduction on how to work with template models.
Training an Agent¶
Once the config is set up, we are ready to start our first training run (with the PPO algorithm in the command below) via the CLI:
maze-run -cn conf_train env=tutorial_cutting_2d_basic wrappers=tutorial_cutting_2d_basic \
model=tutorial_cutting_2d_basic algorithm=ppo
rc = RunContext(
env="tutorial_cutting_2d_basic",
wrappers="tutorial_cutting_2d_basic",
model="tutorial_cutting_2d_basic",
algorithm="ppo"
)
rc.train()
Running the trainer should print a command line output similar to the one shown below.
step|path | value
=====|============================================================================|====================
12|train MultiStepActorCritic..time_epoch ······················| 24.333
12|train MultiStepActorCritic..time_rollout ······················| 0.754
12|train MultiStepActorCritic..learning_rate ······················| 0.000
12|train MultiStepActorCritic..policy_loss 0 | -0.016
12|train MultiStepActorCritic..policy_grad_norm 0 | 0.015
12|train MultiStepActorCritic..policy_entropy 0 | 0.686
12|train MultiStepActorCritic..critic_value 0 | -56.659
12|train MultiStepActorCritic..critic_value_loss 0 | 33.026
12|train MultiStepActorCritic..critic_grad_norm 0 | 0.500
12|train MultiStepActorCritic..time_update ······················| 1.205
12|train DiscreteActionEvents action substep_0/order | [len:8000, μ:0.5]
12|train DiscreteActionEvents action substep_0/piece_idx | [len:8000, μ:169.2]
12|train DiscreteActionEvents action substep_0/rotation | [len:8000, μ:1.0]
12|train BaseEnvEvents reward median_step_count | 200.000
12|train BaseEnvEvents reward mean_step_count | 200.000
12|train BaseEnvEvents reward total_step_count | 96000.000
12|train BaseEnvEvents reward total_episode_count | 480.000
12|train BaseEnvEvents reward episode_count | 40.000
12|train BaseEnvEvents reward std | 34.248
12|train BaseEnvEvents reward mean | -186.450
12|train BaseEnvEvents reward min | -259.000
12|train BaseEnvEvents reward max | -130.000
To get a nicer view of these numbers, we can also take a look at the stats with Tensorboard.
tensorboard --logdir outputs
You can view it with your browser at http://localhost:6006/.

For now we can only inspect standard metrics such as reward statistics or the mean_step_count per episode. Unfortunately, this is not too informative with respect to the cutting problem we are currently addressing. In the next part we will show how to make logging much more informative by introducing events and KPIs.
Adding Events and KPIs¶
The complete code for this part of the tutorial can be found here
# file structure
- cutting_2d
- main.py # modified
- env
- core_env.py # modified
- inventory.py # modified
- maze_state.py
- maze_action.py
- renderer.py
- maze_env.py
- events.py # new
- kpi_calculator.py # new
- space_interfaces
- dict_action_conversion.py
- dict_observation_conversion.py
- conf ...
Events¶
In the previous section we trained the initial version of our cutting environment and learned how to watch the training process via command line and Tensorboard logging. However, watching only standard metrics such as reward or episode step count is not always too informative with respect to the agent's behaviour and the problem at hand.
For example, we might be interested in how often an agent selects an invalid cutting piece or specifies an invalid cutting setting. To tackle this issue and to enable better inspection and logging tools, we introduce an event system that will also be reused in the reward customization section of this tutorial.
In particular, we introduce two event types related to the cutting process as well as inventory management. For each event we can define which statistics are computed at which stage of the aggregation process (event, step, epoch) via event decorators:
@define_step_stats(len): Events \(e_i\) are collected as a list of events \(\{e_i\}\). The len function counts how often such an event occurred in the current environment step: \(Stats_{Step}=|\{e_i\}|\).
@define_episode_stats(sum): Defines how the \(S\) step statistics should be aggregated to episode statistics by simply summing them up: \(Stats_{Episode}=\sum^S Stats_{Step}\).
@define_epoch_stats(np.mean, output_name="mean_episode_total"): A training epoch consists of \(N\) episodes. This decorator defines that epoch statistics should be the average of the contained episodes: \(Stats_{Epoch}=(\sum^N Stats_{Episode})/N\) (see the numeric sketch after this list).
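The following numeric sketch mimics this aggregation chain (len per step, sum per episode, np.mean per epoch) with made-up event counts; it only illustrates the arithmetic and is not Maze's internal statistics machinery.
import numpy as np

episodes = [[1, 0, 2], [0, 0, 1]]                                 # events counted per step (len) for N=2 episodes
episode_totals = [sum(step_counts) for step_counts in episodes]   # define_episode_stats(sum)  -> [3, 1]
epoch_mean = np.mean(episode_totals)                              # define_epoch_stats(np.mean) -> 2.0
print(episode_totals, epoch_mean)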
Below we will see that these statistics will now be considered by the logging system as InventoryEvents and CuttingEvents. For more details on event decorators and the underlying working principles we refer to the dedicated section on event and KPI logging.
from abc import ABC
import numpy as np
from maze.core.log_stats.event_decorators import define_step_stats, define_episode_stats, define_epoch_stats
class CuttingEvents(ABC):
"""Events related to the cutting process."""
@define_epoch_stats(np.mean, output_name="mean_episode_total")
@define_episode_stats(sum)
@define_step_stats(len)
def invalid_piece_selected(self):
"""An invalid piece is selected for cutting."""
@define_epoch_stats(np.mean, output_name="mean_episode_total")
@define_episode_stats(sum)
@define_step_stats(len)
def valid_cut(self, current_demand: (int, int), piece_to_cut: (int, int), raw_piece_size: (int, int),
cutting_area: float):
"""A valid cut was performed."""
@define_epoch_stats(np.mean, output_name="mean_episode_total")
@define_episode_stats(sum)
@define_step_stats(len)
def invalid_cut(self, current_demand: (int, int), piece_to_cut: (int, int), raw_piece_size: (int, int)):
"""Invalid cutting parameters have been specified."""
class InventoryEvents(ABC):
"""Events related to inventory management."""
@define_epoch_stats(np.mean, output_name="mean_episode_total")
@define_episode_stats(sum)
@define_step_stats(len)
def piece_discarded(self, piece: (int, int)):
"""The inventory is full and a piece has been discarded."""
@define_epoch_stats(np.mean, input_name="step_mean", output_name="step_mean")
@define_epoch_stats(max, input_name="step_max", output_name="step_max")
@define_episode_stats(np.mean, output_name="step_mean")
@define_episode_stats(max, output_name="step_max")
@define_step_stats(None)
def pieces_in_inventory(self, value: int):
"""Reports the count of pieces currently in the inventory."""
@define_epoch_stats(np.mean, output_name="mean_episode_total")
@define_episode_stats(sum)
@define_step_stats(len)
def piece_replenished(self):
"""A new raw cutting piece has been replenished."""
KPI Calculator¶
The goal of the cutting 2d environment is to learn a cutting policy that requires as few raw inventory pieces as possible for fulfilling the upcoming customer demand. This metric is exactly what we define as the KPI to watch and optimize, i.e. the raw_piece_usage_per_step.
As you will see below the logging system considers such KPIs and prints statistics of these along with the remaining BaseEnvEvents.
from typing import Dict
from maze.core.env.maze_state import MazeStateType
from maze.core.log_events.kpi_calculator import KpiCalculator
from maze.core.log_events.episode_event_log import EpisodeEventLog
from .events import InventoryEvents
class Cutting2dKpiCalculator(KpiCalculator):
"""KPIs for 2D cutting environment.
The following KPIs are available: Raw pieces used per step
"""
def calculate_kpis(self, episode_event_log: EpisodeEventLog, last_maze_state: MazeStateType) -> Dict[str, float]:
"""Calculates the KPIs at the end of episode."""
# get overall step count of episode
step_count = len(episode_event_log.step_event_logs)
# count raw inventory piece replenishment events
raw_piece_usage = 0
for _ in episode_event_log.query_events(InventoryEvents.piece_replenished):
raw_piece_usage += 1
# compute step normalized raw piece usage
return {"raw_piece_usage_per_step": raw_piece_usage / step_count}
Updating CoreEnv and Inventory¶
There are also a few changes we have to make in the CoreEnvironment:
Initialize the publisher-subscriber system and the KPI calculator.
Create the event topics for cutting and inventory events when setting up the environment.
Trigger the respective events in the step function instead of writing them into the info dictionary.
...
from maze.core.events.pubsub import Pubsub
from .events import CuttingEvents, InventoryEvents
from .kpi_calculator import Cutting2dKpiCalculator
class Cutting2DCoreEnvironment(CoreEnv):
def __init__(self, max_pieces_in_inventory: int, raw_piece_size: (int, int), static_demand: (int, int)):
super().__init__()
...
# init pubsub for event to reward routing
self.pubsub = Pubsub(self.context.event_service)
# KPIs calculation
self.kpi_calculator = Cutting2dKpiCalculator()
def _setup_env(self):
"""Setup environment."""
inventory_events = self.pubsub.create_event_topic(InventoryEvents)
self.inventory = Inventory(self.max_pieces_in_inventory, self.raw_piece_size, inventory_events)
self.inventory.replenish_piece()
self.cutting_events = self.pubsub.create_event_topic(CuttingEvents)
def step(self, maze_action: Cutting2DMazeAction) -> Tuple[Cutting2DMazeState, np.array, bool, Dict[Any, Any]]:
"""Summary of the step (simplified, not necessarily respecting the actual order in the code):
1. Check if the selected piece to cut is valid (i.e. in inventory, large enough etc.)
2. Attempt the cutting
3. Replenish a fresh piece if needed and return an appropriate reward
:param maze_action: Cutting MazeAction to take.
:return: maze_state, reward, done, info
"""
info, reward = {}, 0
replenishment_needed = False
# check if valid piece id was selected
if maze_action.piece_id >= self.inventory.size():
self.cutting_events.invalid_piece_selected()
# perform cutting
else:
piece_to_cut = self.inventory.pieces[maze_action.piece_id]
# attempt the cut
if self.inventory.cut(maze_action, self.current_demand):
self.cutting_events.valid_cut(current_demand=self.current_demand, piece_to_cut=piece_to_cut,
raw_piece_size=self.raw_piece_size)
replenishment_needed = piece_to_cut == self.raw_piece_size
else:
# assign a negative reward for invalid cutting attempts
self.cutting_events.invalid_cut(current_demand=self.current_demand, piece_to_cut=piece_to_cut,
raw_piece_size=self.raw_piece_size)
reward = -2
# check if replenishment is required
if replenishment_needed:
self.inventory.replenish_piece()
# assign negative reward if a piece has to be replenished
reward = -1
# step execution finished, write step statistics
self.inventory.log_step_statistics()
# compile env state
maze_state = self.get_maze_state()
return maze_state, reward, False, info
def get_kpi_calculator(self) -> Cutting2dKpiCalculator:
"""KPIs are supported."""
return self.kpi_calculator
For the inventory we proceed analogously and also trigger the respective events.
...
from .events import InventoryEvents
class Inventory:
"""Holds the inventory of 2D pieces and performs cutting.
:param max_pieces_in_inventory: Size of the inventory. If full, the oldest pieces get discarded.
:param raw_piece_size: Size of a fresh raw (= full-size) piece.
:param inventory_events: Inventory event dispatch proxy.
"""
def __init__(self, max_pieces_in_inventory: int, raw_piece_size: (int, int),
inventory_events: InventoryEvents):
...
self.inventory_events = inventory_events
def store_piece(self, piece: (int, int)) -> None:
"""Store the given piece.
:param piece: Piece to store.
"""
# If we would run out of storage space, discard the oldest piece first
if self.is_full():
self.pieces.pop(0)
self.inventory_events.piece_discarded(piece=piece)
self.pieces.append(piece)
def replenish_piece(self) -> None:
"""Add a fresh raw piece to inventory."""
self.store_piece(self.raw_piece_size)
self.inventory_events.piece_replenished()
def log_step_statistics(self):
"""Log inventory statistics once per step"""
self.inventory_events.pieces_in_inventory(self.size())
Test Script¶
The following snippet will instantiate the environment and run it for 15 steps.
To get access to event and KPI logging we need to wrap the environment with the LogStatsWrapper. To simplify the statistics logging setup we rely on the SimpleStatsLoggingSetup helper class.
""" Test script CoreEnv """
from maze.utils.log_stats_utils import SimpleStatsLoggingSetup
from maze.core.wrappers.log_stats_wrapper import LogStatsWrapper
from tutorial_maze_env.part04_events.env.maze_env import maze_env_factory
def main():
# init maze environment including observation and action interfaces
env = maze_env_factory(max_pieces_in_inventory=200,
raw_piece_size=[100, 100],
static_demand=(30, 15))
# wrap environment with logging wrapper
env = LogStatsWrapper(env, logging_prefix="main")
# register a console writer and connect the writer to the statistics logging system
with SimpleStatsLoggingSetup(env):
# reset environment
obs = env.reset()
# run interaction loop
for i in range(15):
# sample random action
action = env.action_space.sample()
# take actual environment step
obs, reward, done, info = env.step(action)
if __name__ == "__main__":
""" main """
main()
When running the script you will get an output as shown below. Note that statistics of both events and KPIs are printed along with the default reward and action statistics.
step|path | value
=====|==========================================================================|====================
1|main DiscreteActionEvents action substep_0/order | [len:15, μ:0.5]
1|main DiscreteActionEvents action substep_0/piece_idx | [len:15, μ:82.3]
1|main DiscreteActionEvents action substep_0/rotation | [len:15, μ:0.7]
1|main BaseEnvEvents reward median_step_count | 15.000
1|main BaseEnvEvents reward mean_step_count | 15.000
1|main BaseEnvEvents reward total_step_count | 15.000
1|main BaseEnvEvents reward total_episode_count | 1.000
1|main BaseEnvEvents reward episode_count | 1.000
1|main BaseEnvEvents reward std | 0.000
1|main BaseEnvEvents reward mean | -29.000
1|main BaseEnvEvents reward min | -29.000
1|main BaseEnvEvents reward max | -29.000
1|main InventoryEvents piece_replenished mean_episode_total | 3.000
1|main InventoryEvents pieces_in_inventory step_max | 200.000
1|main InventoryEvents pieces_in_inventory step_mean | 200.000
1|main CuttingEvents invalid_cut mean_episode_total | 14.000
1|main InventoryEvents piece_discarded mean_episode_total | 2.000
1|main CuttingEvents valid_cut mean_episode_total | 1.000
1|main BaseEnvEvents kpi max/raw_piece_usage_..| 0.000
1|main BaseEnvEvents kpi min/raw_piece_usage_..| 0.000
1|main BaseEnvEvents kpi std/raw_piece_usage_..| 0.000
1|main BaseEnvEvents kpi mean/raw_piece_usage..| 0.000
Training with Events and KPIs¶
The complete code for this part of the tutorial can be found here
# file structure
- cutting_2d
- main.py
- env ...
- space_interfaces ...
- conf
- env
- tutorial_cutting_2d_events.yaml # new
- model
- tutorial_cutting_2d_events.yaml # new
- wrappers
- tutorial_cutting_2d_events.yaml # new
Hydra Configuration¶
The entire structure of this example is identical to the one on training the MazeEnv. Everything regarding the event system was already changed in the section on adding events and KPIs, and the trainers will consider these changes implicitly.
Training an Agent¶
To retrain the agent on the environment extended with event and KPI logging, run
maze-run -cn conf_train env=tutorial_cutting_2d_events wrappers=tutorial_cutting_2d_events \
model=tutorial_cutting_2d_events algorithm=ppo
rc = RunContext(
env="tutorial_cutting_2d_events",
wrappers="tutorial_cutting_2d_events",
model="tutorial_cutting_2d_events", algorithm="ppo"
)
rc.train()
Running the trainer should print an extended command line output similar to the one shown below. In addition to base events we now also get a statistics log of CuttingEvents, InventoryEvents and KPIs.
step|path | value
=====|============================================================================|====================
6|train MultiStepActorCritic..time_epoch ······················| 24.548
6|train MultiStepActorCritic..time_rollout ······················| 0.762
6|train MultiStepActorCritic..learning_rate ······················| 0.000
6|train MultiStepActorCritic..policy_loss 0 | -0.020
6|train MultiStepActorCritic..policy_grad_norm 0 | 0.013
6|train MultiStepActorCritic..policy_entropy 0 | 0.760
6|train MultiStepActorCritic..critic_value 0 | -49.238
6|train MultiStepActorCritic..critic_value_loss 0 | 50.175
6|train MultiStepActorCritic..critic_grad_norm 0 | 0.500
6|train MultiStepActorCritic..time_update ······················| 1.210
6|train DiscreteActionEvents action substep_0/order | [len:8000, μ:0.0]
6|train DiscreteActionEvents action substep_0/piece_idx | [len:8000, μ:174.2]
6|train DiscreteActionEvents action substep_0/rotation | [len:8000, μ:1.0]
6|train BaseEnvEvents reward median_step_count | 200.000
6|train BaseEnvEvents reward mean_step_count | 200.000
6|train BaseEnvEvents reward total_step_count | 48000.000
6|train BaseEnvEvents reward total_episode_count | 240.000
6|train BaseEnvEvents reward episode_count | 40.000
6|train BaseEnvEvents reward std | 38.427
6|train BaseEnvEvents reward mean | -182.175
6|train BaseEnvEvents reward min | -323.000
6|train BaseEnvEvents reward max | -119.000
6|train InventoryEvents piece_replenished mean_episode_total | 15.325
6|train InventoryEvents piece_discarded mean_episode_total | 67.400
6|train InventoryEvents pieces_in_inventory step_max | 200.000
6|train InventoryEvents pieces_in_inventory step_mean | 200.000
6|train CuttingEvents valid_cut mean_episode_total | 116.075
6|train CuttingEvents invalid_cut mean_episode_total | 83.925
6|train BaseEnvEvents kpi max/raw_piece_usage_..| 0.135
6|train BaseEnvEvents kpi min/raw_piece_usage_..| 0.020
6|train BaseEnvEvents kpi std/raw_piece_usage_..| 0.028
6|train BaseEnvEvents kpi mean/raw_piece_usage..| 0.077
Of course these changes are also reflected in the Tensorboard log which you can again view with your browser at http://localhost:6006/.
tensorboard --logdir outputs
As you can see we now have the two additional sections train_CuttingEvents and train_InventoryEvents available.

A closer look at these events reveals that the agent actually starts to learn something meaningful, as the number of invalid cuts decreases, which of course implies that the number of valid cuts increases and we are able to fulfill the current customer demand.

Adding Reward Customization¶
The complete code for this part of the tutorial can be found here
# file structure
- cutting_2d
- main.py # modified
- env
- core_env.py # modified
- inventory.py
- maze_state.py
- maze_action.py
- renderer.py
- maze_env.py # modified
- events.py
- kpi_calculator.py
- space_interfaces
- dict_action_conversion.py
- dict_observation_conversion.py
- reward
- default_reward.py # new
Reward¶
In this part of the tutorial we introduce how to reuse the event system for reward shaping and customization via the RewardAggregatorInterface.
In Maze, reward aggregators usually calculate reward from the current environment state, events that happened during the last step, or a combination thereof. Calculating reward from state is generally simpler, but not a good fit for this environment – here, the reward is more concerned with what happened (was an invalid cut attempted? A new raw piece replenished?) than with the current state (i.e., the inventory state after the step). Hence, the reward calculation here is based on events (which is in general more flexible than using the environment state only).
The DefaultRewardAggregator does the following:
Requests the required event interfaces via get_interfaces (here CuttingEvents and InventoryEvents).
Collects rewards and penalties according to relevant events.
Aggregates the individual event rewards and penalties to a single scalar reward signal.
Note that this reward aggregator can have any form as long as it provides a scalar reward function that can be used for training. This gives a lot of flexibility in shaping rewards without the need to change the actual implementation of the environment (more on this topic).
from typing import List, Optional
from maze.core.annotations import override
from maze.core.env.maze_state import MazeStateType
from maze.core.env.reward import RewardAggregatorInterface
from ..env.events import CuttingEvents, InventoryEvents
class DefaultRewardAggregator(RewardAggregatorInterface):
"""Default reward scheme for the 2D cutting env.
:param invalid_action_penalty: Negative reward assigned for an invalid cutting specification.
:param raw_piece_usage_penalty: Negative reward assigned for starting a new raw inventory piece.
"""
def __init__(self, invalid_action_penalty: float, raw_piece_usage_penalty: float):
super().__init__()
self.invalid_action_penalty = invalid_action_penalty
self.raw_piece_usage_penalty = raw_piece_usage_penalty
@override(RewardAggregatorInterface)
def get_interfaces(self):
"""Specification of the event interfaces this subscriber wants to receive events from.
Every subscriber must implement this configuration method.
:return: A list of interface classes"""
return [CuttingEvents, InventoryEvents]
@override(RewardAggregatorInterface)
def summarize_reward(self, maze_state: Optional[MazeStateType] = None) -> float:
"""Assign rewards and penalties according to respective events.
:param maze_state: Not used by this reward aggregator.
        :return: The scalar reward summarizing all rewards and penalties of the last step.
"""
rewards: List[float] = []
# penalty for starting a new raw inventory piece
for _ in self.query_events(InventoryEvents.piece_replenished):
rewards.append(self.raw_piece_usage_penalty)
# penalty for selecting an invalid piece for cutting
for _ in self.query_events(CuttingEvents.invalid_piece_selected):
rewards.append(self.invalid_action_penalty)
# penalty for specifying invalid cutting parameters
for _ in self.query_events(CuttingEvents.invalid_cut):
rewards.append(self.invalid_action_penalty)
return sum(rewards)
Updating the Core- and MazeEnv¶
We also have to make a few modifications in the CoreEnv:
Initialize the reward aggregator in the constructor.
Instead of accumulating reward in the if-else branches of the step function, we summarize it only once at the end.
...
class Cutting2DCoreEnvironment(CoreEnv):
"""Environment for cutting 2D pieces based on the customer demand. Works as follows:
...
:param reward_aggregator: Either an instantiated aggregator or a configuration dictionary.
"""
def __init__(self, max_pieces_in_inventory: int, raw_piece_size: (int, int), static_demand: (int, int),
reward_aggregator: RewardAggregatorInterface):
super().__init__()
...
# init reward and register it with pubsub
self.reward_aggregator = reward_aggregator
self.pubsub.register_subscriber(self.reward_aggregator)
def step(self, maze_action: Cutting2DMazeAction) -> Tuple[Cutting2DMazeState, np.array, bool, Dict[Any, Any]]:
"""Summary of the step (simplified, not necessarily respecting the actual order in the code):
1. Check if the selected piece to cut is valid (i.e. in inventory, large enough etc.)
2. Attempt the cutting
3. Replenish a fresh piece if needed and return an appropriate reward
:param maze_action: Cutting maze_action to take.
:return: state, reward, done, info
"""
info = {}
replenishment_needed = False
# check if valid piece id was selected
if maze_action.piece_id >= self.inventory.size():
self.cutting_events.invalid_piece_selected()
# perform cutting
else:
piece_to_cut = self.inventory.pieces[maze_action.piece_id]
# attempt the cut
if self.inventory.cut(maze_action, self.current_demand):
self.cutting_events.valid_cut(current_demand=self.current_demand, piece_to_cut=piece_to_cut,
raw_piece_size=self.raw_piece_size)
replenishment_needed = piece_to_cut == self.raw_piece_size
else:
# assign a negative reward for invalid cutting attempts
self.cutting_events.invalid_cut(current_demand=self.current_demand, piece_to_cut=piece_to_cut,
raw_piece_size=self.raw_piece_size)
# check if replenishment is required
if replenishment_needed:
self.inventory.replenish_piece()
# no reward is assigned here anymore; the piece_replenished event triggers the penalty in the reward aggregator
# step execution finished, write step statistics
self.inventory.log_step_statistics()
# compile env state
maze_state = self.get_maze_state()
# aggregate reward from events
reward = self.reward_aggregator.summarize_reward(maze_state)
return maze_state, reward, False, info
Finally, we update the maze_env_factory
function for instantiating the trainable MazeEnv
and we are all set up for training with event-based, customized rewards.
...
def maze_env_factory(max_pieces_in_inventory: int, raw_piece_size: (int, int),
static_demand: (int, int)) -> Cutting2DEnvironment:
"""Convenience factory function that compiles a trainable maze environment.
(for argument details see: Cutting2DCoreEnvironment)
"""
# init reward aggregator
reward_aggregator = DefaultRewardAggregator(invalid_action_penalty=-2, raw_piece_usage_penalty=-1)
# init core environment
core_env = Cutting2DCoreEnvironment(max_pieces_in_inventory=max_pieces_in_inventory,
raw_piece_size=raw_piece_size,
static_demand=static_demand,
reward_aggregator=reward_aggregator)
# init maze environment including observation and action interfaces
action_conversion = ActionConversion(max_pieces_in_inventory=max_pieces_in_inventory)
observation_conversion = ObservationConversion(raw_piece_size=raw_piece_size,
max_pieces_in_inventory=max_pieces_in_inventory)
return Cutting2DEnvironment(core_env, action_conversion, observation_conversion)
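To quickly sanity-check the new reward scheme, we can interact with the assembled environment directly. The following is a minimal sketch; the constructor arguments are illustrative values only:
# Minimal sketch (illustrative argument values): build the env via the factory above
# and take a random action to see the event-based reward in action.
env = maze_env_factory(max_pieces_in_inventory=200,
                       raw_piece_size=(100, 100),
                       static_demand=(30, 15))
obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
print(reward)  # e.g. -2 if the sampled action triggered an invalid cut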
Where to Go Next¶
Since the reward is now implemented via a reward aggregator that is methodologically identical to the initial version, there is no need to retrain the model for now. However, we highly recommend proceeding with the more advanced tutorial on Structured Environments and Action Masking.
API Documentation¶
This page provides an overview of the Maze API documentation.
Environment Interfaces¶
This page contains the reference documentation for environment interfaces.
maze.core.env¶
Environment interfaces:
Interface definition for reinforcement learning environments defining the minimum required functionality for being considered an environment. |
|
Identifies an actor in the environment. |
|
Interface for environments with sub-step structure, which is generally enough to cover multi-step, hierarchical and multi-agent environments. |
|
Interface definition for core environments forming the basis for actual RL trainable environments. |
|
This interface complements the StructuredEnv by action and observation spaces. |
|
Base class for (gym style) environments wrapping a core environment and defining state and execution interfaces. |
|
Interface for rendering functionality in environments (compatible with gym env). |
|
This interface provides a standard way of exposing internal MazeState and MazeAction objects for trajectory data recording. |
|
This interface provides a standard way of exposing environment components whose state should be serialized together with the environment state object when for example recording trajectory data. |
|
This interface provides a standard way of exposing environment time to external components and wrappers. |
|
This interface provides a standard way of attaching environment events to the log statistics system. |
|
Environment interface for simulated environments. |
Interfaces for additional components:
Interface specifying the conversion of abstract environment state to the gym-compatible observation. |
|
Interface specifying the conversion of agent actions to actual environment MazeActions. |
|
Internal indicator of special typing constructs. |
|
Internal indicator of special typing constructs. |
|
Event aggregation object for reward customization and shaping. |
|
This class keeps track of services that can be employed by all objects of the agent-environment loop. |
Environment Wrappers¶
This page contains the reference documentation for environment wrappers. Here you can find a more extensive write up on how to work with these.
Interfaces and Utilities¶
These are the wrapper interfaces, base classes and interfaces:
A transparent environment Wrapper that works with any manifestation of |
Types of Wrappers:
A Wrapper with typing support modifying the environments observation. |
|
A Wrapper with typing support modifying the agents action. |
|
A Wrapper with typing support modifying the reward before passed to the agent. |
|
Handles dynamic registration of Wrapper sub-classes. |
Built-in Wrappers¶
Below you find the reference documentation for environment wrappers.
General Wrappers:
A statistics logging wrapper for |
|
A MazeEnv monitoring wrapper logging events for observations, actions and rewards. |
|
An observation visualization wrapper allows to apply custom observation visualization functions which are then shown in Tensorboard. |
|
Wrapper to limit the environment step count, equivalent to gym.wrappers.time_limit. |
|
A wrapper skipping the first few steps by taking random actions. |
|
This class wraps a given StructuredEnvSpacesMixin env to ensure that all observation- and action-spaces are sorted alphabetically. |
|
Wraps observations and actions by replacing dictionary spaces with the sole contained sub-space. |
ObservationWrappers:
Wraps a single observation into a dictionary space. |
|
A wrapper stacking the observations of multiple subsequent time steps. |
|
Wraps observations by replacing the dictionary observation space with the sole contained sub-space. |
ActionWrappers:
Wraps either a single action space or a tuple action space into dictionary space. |
|
Wraps actions by replacing the dictionary action space with the sole contained sub-space. |
|
Splits an action into separate ones. |
|
The DiscretizeActionsWrapper provides functionality for discretizing individual continuous actions into discrete |
RewardWrappers:
Scales original step reward by a multiplicative scaling factor. |
|
Clips original step reward to range [min, max]. |
|
Normalizes step reward by dividing through the standard deviation of the discounted return. |
Observation Pre-Processing Wrapper¶
Below you find the reference documentation for observation pre-processing. Here you can find a more extensive write up on how to work with the observation pre-processing package.
These are interfaces and components required for observation pre-processing:
An observation pre-processing wrapper. |
|
Interface for observation pre-processors. |
These are the available built-in maze.pre_processors compatible with the PreProcessingWrapper:
An array flattening pre-processor. |
|
A one-hot encoding pre-processor for categorical features. |
|
An image resizing pre-processor. |
|
An array transposition pre-processor. |
|
An un-squeeze pre-processor. |
|
An RGB-to-grayscale conversion pre-processor. |
Observation Normalization Wrapper¶
Below you find the reference documentation for observation normalization. Here you can find a more extensive write up on how to work with the observation normalization package.
These are interfaces and utility functions required for observation normalization:
An observation normalization wrapper. |
|
Abstract base class for normalization strategies. |
|
Obtain the normalization statistics of a given environment. |
|
Helper function estimating normalization statistics. |
|
Wrap an existing env factory to assign the passed normalization statistics. |
These are the available built-in maze.normalization_strategies compatible with the ObservationNormalizationWrapper:
Normalizes observations to have zero mean and standard deviation one. |
|
Normalizes observations to value range [0, 1]. |
Gym Environment Wrapper¶
Below you find the reference documentation for wrapping gym environments. Here you can find a more extensive write up on how to integrate Gym environments within Maze.
These are the contained components:
Wraps a Gym env into a Maze environment. |
|
Initializes a |
|
Wraps a Gym environment into a maze core environment. |
|
A Maze-style Gym renderer. |
|
A dummy conversion interface asserting that the observation is packed into a dictionary space. |
|
A dummy conversion interface asserting that the action is packed into a dictionary space. |
Event System, Logging & Statistics¶
This page contains the reference documentation for the event and logging system.
Event System¶
These are interfaces, classes and utility functions of the event system:
Event aggregation object. |
|
Implementation of a message broker (Pubsub stands for publish and subscribe). |
|
Constructs a proxy instance of the event interface, as required by EventService and LogStatsAggregator. |
|
Base class for all services that integrate with the event system and therefore use EventService as their backend. |
|
Manages the recording of event invocations and provides simple event routing functionality. |
|
A collection of EventRecord instances that can be queried by event specification. |
|
This auxiliary class is used to record calls to the event interface |
Event Logging¶
These are the components of the event system:
Logs all events dispatched by the environment during one step. |
|
Keeps logs of all events dispatched by an environment during one episode. |
|
Interface for calculating KPI metrics. |
|
Handles registration of event log writers. |
|
Interface for modules writing out the event log data. |
|
Writes event logs into TSV files. |
|
Represents one row in the output file for the |
|
Simple setup for logging of environment events with all their attributes. |
|
Event topic class with logging statistics based only on observations, therefore applicable to any valid reinforcement learning environment. |
|
Event topic class with logging statistics based only on Gym space actions, therefore applicable to any valid reinforcement learning environment. |
|
Event topic class with logging statistics based only on rewards, therefore applicable to any valid reinforcement learning environment. |
|
Checks the type of value and calls the correct plotting function accordingly. |
|
Creates simple matplotlib histogram of value. |
|
Counts the categories in value and prepares a relative bar plot of these. |
|
Creates simple matplotlib violin plot of value. |
Statistics Logging¶
These are the components of the statistics logging system (a usage sketch of the statistics decorators follows after this list):
Interface to access logging statistics generated by the environment. |
|
Log statistics writer implementation for the console, mainly for debugging purposes. |
|
Log statistics writer implementation for Tensorboard. |
|
Log statistics aggregation levels. |
|
An interface to receive log statistics. |
|
Complements the event system by providing aggregation functionality. |
|
A minimal interface concrete log statistics writers must implement. |
|
Internal class that encapsulates the global state of the logging system. |
|
Auxiliary class returned by get_stats_logger. |
|
Set the concrete writer implementation that will receive all successive statistics logging. |
|
Helper function. |
|
Notifies the logging system that the current step is finished. |
|
Creates an object that can be used to pipe LogStatAggregator instances with the logging writers. |
|
Event method decorator, defines a new step statistics calculation for this event. |
|
Event method decorator, defines a new episode statistics calculation for this event. |
|
Event method decorator, defines a new epoch statistics calculation for this event. |
|
Event method decorator, defines a grouping of all calculated statistics by an attribute. |
|
Event method decorator, defines a plot. |
|
The histogram reducer function. |
|
Basic data structure for log statistics |
|
Basic data structure for log statistics |
|
Basic data structure for log statistics |
|
Basic data structure for log statistics |
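As a small usage sketch of the statistics decorators listed above (module path as provided by Maze's logging system; the event interface below mirrors the tutorial's InventoryEvents and is illustrative):
from abc import ABC

import numpy as np

from maze.core.log_stats.event_decorators import define_epoch_stats, define_episode_stats, define_step_stats


class InventoryEvents(ABC):
    """Illustrative event topic class."""

    @define_epoch_stats(np.mean)   # average the per-episode totals over an epoch
    @define_episode_stats(sum)     # sum the per-step counts over an episode
    @define_step_stats(len)        # count event occurrences within a step
    def piece_replenished(self):
        """A new raw piece was replenished in the inventory."""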
Rendering¶
These are interfaces, classes and utility functions for the rendering system:
Interface for renderers of individual environments. |
|
Simple statistics rendering based on episode event logs. |
|
Renders customizable statistics on top of event logs. |
|
Event logs viewer for Jupyter Notebooks, built using ipython widgets. |
|
Trajectory viewer for Jupyter Notebooks, built using ipython widgets. |
|
Render trajectory data with the possibility to browse back and forward through the episode steps using keyboard. |
|
Interface for classes exposing arguments available at renderers. |
|
Represents an argument which can take on a value of integer in a particular range. |
|
Represents an argument where a single value can be chosen from an array of allowed options. |
Trajectory Recorder¶
These are interfaces, classes and utility functions for recording trajectory data:
Base class of trajectory data set for imitation learning that keeps all loaded data in memory. |
|
Data loading worker used to map states to actual observations. |
|
Base class for processing individual trajectories. |
|
Identity processing method |
|
Implementation of the dead-end-clipping preprocessor. |
|
Record of spaces (i.e., raw action, observation, and associated data) from a single sub-step. |
|
The central part of internal API. |
|
Records spaces (i.e., raw actions and observations) from a single environment step. |
|
Keeps trajectory data for one step. |
|
Common functionality of trajectory records. |
|
Holds state record data (i.e., Maze states and actions, independent of the current |
|
Holds structured spaces records (i.e., raw actions and observations recorded during a rollout). |
|
Simple setup for environment monitoring. |
|
Simple setup for trajectory data recording. |
|
Handles registration of trajectory data writers. |
|
Interface for modules serializing the trajectory data. |
|
Simple trajectory data writer. |
General and Rollout Runners¶
This page contains the reference documentation for all kinds of runners.
General Runners¶
These are the basic interfaces, classes and utility functions of runners:
Runner interface for running Maze from CLI. |
|
Run a CLI task based on the provided configuration. |
Rollout Runners¶
These are interfaces, classes and utility functions for rollout runners:
Here you can find the documentation for training runners.
General abstract class for rollout runners. |
|
Rolls out a given policy in a given environment, recording the trajectory (in the form of raw actions and observations). |
|
Runs rollout in the local process. |
|
Runs rollout in multiple processes in parallel. |
|
Class encapsulating functionality performed in worker processes. |
|
Keeps the statistics and event logs from the last episode so that it can then be shipped to the main process. |
|
Tuple for passing episode stats from workers to the main process. |
|
Tuple for passing error reports from the workers to the main process. |
Policies, Critics and Agents¶
This page contains the reference documentation for policies, critics and agents.
maze.core.agent¶
Policies:
Generic flat policy interface. |
|
Structured policy class designed to work with structured environments. |
|
Encapsulates multiple torch policies along with a distribution mapper for training and rollouts in structured environments. |
|
Dataclass for holding the output of the policy’s compute full output method |
|
A structured representation of a policy output over a full (flat) environment step. |
|
Encapsulates one or more policies identified by policy IDs. |
|
Implements a random structured policy. |
|
Dummy structured policy for the CartPole env. |
|
Structured policy used for rollouts of trained models. |
Critics:
Structured state critic class designed to work with structured environments. |
|
State Critic Step output holds the output of a critic for an individual env step. |
|
Critic output holds the output of a critic for one full flat env step. |
|
State Critic input for a single substep of the env, holding the tensor_dict and the actor_ids corresponding to where the embedding logits were retrieved if applicable, otherwise just the corresponding actor. |
|
State Critic output defined as its own type, since it has to be explicitly built to be compatible with shared embedding networks. |
|
Encapsulates multiple torch state critics for training in structured environments. |
|
One critic is shared across all sub-steps or actors (default to use for standard gym-style environments). |
|
Each sub-step or actor gets its individual critic. |
|
First sub step gets a regular critic, subsequent sub-steps predict a delta w.r.t. |
|
Structured state action critic class designed to work with structured environments. |
|
Encapsulates multiple torch state action critics for training in structured environments. |
|
One critic is shared across all sub-steps or actors (default to use for standard gym-style environments). |
|
Each sub-step or actor gets its individual critic. |
Models:
Base class for any torch model. |
|
Encapsulates a structured torch policy and critic for training actor-critic algorithms in structured environments. |
Agent Deployment¶
This page contains the reference documentation for the Maze agent deployment components.
Encapsulates an agent, space interfaces and a stack of wrappers, to make the agent’s MazeActions accessible to an external env. |
|
Executes the provided policies in an Agent Deployment setting. |
|
Action object for encapsulation of multiple action objects along with their respective probabilities. |
|
MazeAction object for encapsulation of multiple MazeAction objects along with their respective probabilities. |
|
Wrapper for action conversion interface when working with multiple candidate actions/MazeActions. |
|
Acts as a CoreEnv in the env stack in agent deployment scenario. |
Perception Module¶
This page contains the reference documentation of Maze Perception Module.
maze.perception.blocks¶
These are basic neural network building blocks and interfaces:
Interface for all perception blocks. |
|
Perception block normalizing the input and de-normalizing the output tensor dimensions. |
|
An inference block combining multiple perception blocks into one prediction module. |
|
Models a perception module inference graph. |
Feed Forward: these are built-in feed forward building blocks:
A block containing multiple subsequent dense layers. |
|
A block containing multiple subsequent vgg style convolutions. |
|
A block containing multiple subsequent strided convolution layers. |
|
A block containing multiple subsequent graph convolution stacks. |
|
A block containing multiple subsequent graph (multi-head) attention stacks. |
|
Implementation of a torch MultiHeadAttention block. |
|
PointNet block allowing to embed a variable sized set of point observations into a fixed size feature vector via the PointNet mechanics. |
Recurrent: these are built-in recurrent building blocks:
A block containing multiple subsequent LSTM layers followed by a final time-distributed dense layer with explicit non-linearity. |
General: these are built-in general-purpose building blocks:
A flattening block. |
|
A feature correlation block. |
|
A feature concatenation block. |
|
A block applying a custom callable. |
|
A global average pooling block. |
|
A block applying global pooling with optional masking. |
|
A multi-index-slicing block. |
|
A repeat-to-match block. |
|
Implementation of a self-attention block as described by reference: https://arxiv.org/abs/1805.08318 |
|
Implementation of a self-attention block as described by reference: https://arxiv.org/abs/1706.03762 |
|
A slicing block. |
|
An action masking block. |
|
A block transforming a common nn.Module to a shape-normalized Maze perception block. |
Joint: these are built-in joint building blocks combining multiple perception blocks:
A block containing a flattening stage followed by a dense layer block. |
|
A block containing multiple subsequent vgg style convolution stacks followed by flattening and a dense layer block. |
|
A block containing multiple subsequent vgg style convolution stacks followed by global average pooling. |
|
A block containing multiple subsequent strided convolutions followed by flattening and a dense layer block. |
|
A block containing a LSTM perception block followed by a Slicing Block keeping only the output of the final time step. |
maze.perception.builders¶
These are template model builders:
Base class for perception default model builders. |
|
A model builder that first processes individual observations, concatenates the resulting latent spaces and then processes this concatenated output to action and value outputs. |
maze.perception.models¶
These are model composers and components:
Abstract baseclass and interface definitions for model composers. |
|
Composes template models from configs. |
|
Composes models from explicit model definitions. |
|
Represents configuration of environment spaces (action & observation) used for model config. |
These are the maze.perception.models.policies composers:
Interface for policy (actor) network composers. |
|
Composes networks for probabilistic policies. |
These are the maze.perception.models.critics composers:
Interface for critic (value function) network composers. |
|
Interface for critic (value function) network composers. |
|
One critic is shared across all sub-steps or actors (default to use for standard gym-style environments). |
|
Each sub-step or actor gets its individual critic. |
|
First sub step gets a regular critic, subsequent sub-steps predict a delta w.r.t. |
|
alias of |
|
Interface for state action (Q) critic network composers. |
|
One critic is shared across all sub-steps or actors (default to use for standard gym-style environments). |
|
Each sub-step or actor gets its individual critic. |
|
alias of |
These are the maze.perception.models.built_in models:
Base flatten and concatenation model for policies and critics. |
|
Flatten and concatenation policy model. |
|
Flatten and concatenation state value model. |
maze.perception.perception_utils¶
These are some helper functions when working with the perception module:
Convert an observation space to the input shapes for the neural networks |
|
Merges an iterable of dictionary spaces (usually observations or actions from subsequent sub-steps) into a single dictionary containing all the items. |
|
Merges an iterable of dictionary spaces (usually observations or actions from subsequent sub-steps) into a single dictionary containing all the items. |
|
Converts any struct to torch.Tensors. |
|
Convert torch to np |
maze.perception.weight_init¶
These are some helper functions for initializing model weights:
Compiles normc weight initialization function initializing module weights with normc_initializer and biases with zeros. |
|
Compute the bias value for a sigmoid activation function such as in multi-binary action spaces (Bernoulli distributions). |
Action Spaces and Distributions Module¶
This page contains the reference documentation of Maze Action Spaces and Distributions Module.
These are interfaces, classes and utility functions:
Base class for all probability distributions. |
|
Base class for wrapping Torch probability distributions. |
|
Provides a mapping of spaces and action heads to the respective probability distributions to be used. |
|
Computes the arc-tangent hyperbolic. |
|
Clamping with tensor and broadcast support. |
These are built-in Torch probability distributions:
Categorical Torch probability distribution. |
|
Bernoulli Torch probability distribution for multi-binary action spaces. |
|
Diagonal Gaussian (Normal) Torch probability distribution. |
|
Tanh-squashed diagonal Gaussian (Normal) Torch probability distribution. |
|
Beta Torch probability distribution. |
These are combined probability distributions:
Multi-categorical probability distribution. |
|
Dictionary probability distribution. |
Core Utilities¶
These are general interfaces, classes and utility functions:
Annotation for documenting method overrides. |
|
Function to annotate unused variables. |
|
Set random seeds for numpy, torch and python random number generators. |
|
Manages the random seeding for maze. |
|
Compiles a flat gym.spaces.Dict space from a structured environment space. |
|
Flatten a dict of shape dicts to a single dict |
|
Read YAML file into a dict |
|
Convert lists to int-indexed dicts. |
|
Helper class to instantiate an environment from configuration with the help of the Registry. |
|
Create an environment instance from the hydra configuration, given the overrides. |
|
Supports the creation of instances from configuration, that can be plugged into the environments (like demand generators or reward schemes). |
|
Shorthand type for configuration corresponding to a single object. |
|
Shorthand type for a list or a dictionary of object parameters from the config files. |
|
Maintains cumulative moving mean and std of incoming numpy arrays along axis 0. |
Utilities¶
A collection of smaller auxiliary functions and classes:
maze.utils¶
Helper class to simplify the statistics logging setup. |
|
Resets the seed and global state to ensure that consecutive tests run under the same preconditions. |
|
Setup tensorboard logging, derive the logging directory from the script name. |
|
Timeout class, fires a TimeoutError after the given number of seconds elapsed. |
|
Convert the tensorboard log to a pandas DataFrame. |
|
A wrapper for multiprocessing.Process that supports exception handling and return objects. |
|
Colored command line output formatting |
maze.hydra_plugins¶
Custom Hydra launcher distributing the jobs in separate processes on the local machine. |
|
Hardcoded launcher configuration, linking the hydra/launcher=local override to the MazeLocalLauncher class |
Trainers and Training Runners¶
This page contains the reference documentation for trainers and training runners:
General¶
These are general interfaces, classes and utility functions for trainers and training runners:
Interface for trainers. |
|
Base class for training runner implementations. |
|
Top-level configuration structure. |
|
Model configuration structure. |
|
Base class for all specific algorithm configurations. |
|
Base class for model selection strategies. |
|
Best model selection strategy. |
|
Abstract interface for policy evaluation. |
|
Evaluates the given policy using multiple different evaluators (ran in sequence). |
|
Evaluates a given policy by rolling it out and collecting the mean reward. |
|
Value transformation (e.g. |
|
Scale reduction value transform according to Pohlen et al (2018). |
|
Convert support vector to scalar by probability weighted interpolation. |
|
Converts tensor of scalars into probability support vectors corresponding to the provided range. |
|
Abstract interface for all replay buffer implementations. |
|
Replay buffer for off policy learning. |
Trainers¶
These are interfaces, classes and utility functions for built-in trainers:
Actor-Critics (AC)¶
Abstract baseclass of AC runners. |
|
Runner for single-threaded training, based on SequentialVectorEnv. |
|
Runner for locally distributed training, based on SubprocVectorEnv. |
|
Base class for actor critic trainers. |
|
Event interface, defining statistics emitted by the A2CTrainer. |
|
Advantage Actor Critic. |
|
Algorithm parameters for multi-step A2C model. |
|
Proximal Policy Optimization trainer. |
|
Algorithm parameters for multi-step PPO model. |
|
Multi step advantage actor critic. |
|
Algorithm parameters for Impala. |
|
Events specific to the IMPALA algorithm, recorded in order to analyse its behaviour in more detail. |
|
Common superclass for IMPALA runners, implementing the main training controls. |
|
Runner for single-threaded training, based on SequentialVectorEnv. |
|
Runner for locally distributed training, based on SubprocVectorEnv. |
|
Computes action log-probs from policy logits, actions and action_spaces. |
|
V-trace for softmax policies. |
|
V-trace from log importance weights. |
|
With the selected log_probs for multi-discrete actions of behavior and target policies we compute the log_rhos for calculating the vtrace. |
|
Multi step soft actor critic. |
|
Algorithm parameters for SAC. |
|
Events specific to the SAC algorithm, recorded in order to analyse its behaviour in more detail. |
|
Common superclass for SAC runners, implementing the main training controls. |
|
Runner for single-threaded training, based on SequentialVectorEnv. |
Evolutionary Strategies (ES)¶
Trainer class for OpenAI Evolution Strategies. |
|
Algorithm parameters for evolution strategies model. |
|
Event interface, defining statistics emitted by the ESTrainer. |
|
Baseclass of ES training master runners (serves as basis for dev and other runners). |
|
Runner config for single-threaded training, based on ESDummyDistributedRollouts. |
|
A fixed length vector of deterministically generated pseudo-random floats. |
|
Abstract baseclass of an optimizer to be used with ES. |
|
Stochastic gradient descent with momentum |
|
Adam optimizer |
|
Result structure for distributed rollouts. |
|
Implementation of the ES distribution by running the rollouts synchronously in the same process. |
|
Abstract base class of ES rollout distribution. |
|
This exception is raised if the current rollout is intentionally aborted. |
|
The rollout generation is bound to a single worker environment by implementing it as a Wrapper class. |
|
Get the parameters of all sub-policies as a single flat vector. |
|
Overwrite the parameters of all sub-policies by a single flat vector. |
Imitation Learning (IL) and Learning from Demonstrations (LfD)¶
Event interface defining statistics emitted by the imitation learning trainers. |
|
Dev runner for imitation learning. |
|
Trainer for behavioral cloning learning. |
|
Algorithm parameters for behavioral cloning. |
|
Evaluates a given policy on validation data. |
|
Loss function for behavioral cloning. |
Utilities¶
Stack list of dictionaries holding numpy arrays as values. |
|
Inverse of |
|
Computes the cumulative gradient norm of all provided parameters. |
|
Stack list of dictionaries holding torch tensors as values. |
Parallelization¶
This page contains the reference documentation for the parallelization module.
Vectorized Environments¶
These are interfaces, classes and utility functions for vectorized environments:
Abstract base class for vectorised environments. |
|
Common superclass for the structured vectorised env implementations in Maze. |
|
Creates a simple wrapper for multiple environments, calling each environment in sequence on the current Python process. |
|
Creates a multiprocess wrapper for multiple environments, distributing each environment to its own process. |
|
Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle). |
|
Sink hole statistics consumer. |
|
Disable collection of statistics on epoch level to save memory. |
Distributed Actors¶
These are interfaces, classes and utility functions for distributed actors:
The base class for all distributed actors. |
|
Dummy implementation of distributed actors creates the actors as a list. Once the outputs are to |
|
Basic Distributed-Actors-Module using python multiprocessing.Process |
|
The base class for all distributed workers with buffer. |
|
Dummy implementation of distributed workers with buffer creates the workers as a list. |
Utilities¶
Reusable components used in multiple distribution scenarios:
Synchronizes policy updates and other information across workers on local machine. |
|
A wrapper around BaseManager, used for managing the broadcasting container in multiprocessing scenarios. |
Run Context¶
This page contains the reference documentation for Maze's high-level Python API RunContext.
This article documents in detail how to work with a RunContext as well as its benefits and limitations.
Utilities¶
Checks specified RunContext configuration for consistency and prepares it for the initialization procedure. |
|
A ConfigurationLoader loads and post-processes a particular configuration for RunContext. |
|
Available run modes for Python API, associated with the corresponding base config module names. |
|
Exception indicating Error in RunContext. |
|
Exception indicating Error due to inconsistent specification in RunContext. |
Run Context¶
RunContext offers convenient access to consistently configured training and rollout capabilities with minimal setup, yet is flexible enough to enable manipulation of every configurable aspect of Maze. |
For installing Maze just follow the installation instructions.
To see Maze in action check out a first example and our Getting Started Notebooks.
Try your own Gym env or visit our Maze step-by-step tutorial.
Clone this project template repo to start your own Maze project.
You can also find an extensive overview of Maze in the table of contents as well as the API documentation.
Spotlights¶
Below we list some of Maze's key features. The list is far from exhaustive but nonetheless a nice starting point for diving into the framework.
Get things rolling by training your environment and rolling out your policy in just a few lines of code with Maze's high-level API.
Configure your applications and experiments with the Hydra config system
.
Design and visualize your policy and value networks with the Perception Module.
Pre-process and normalize your observations without writing boilerplate code.
Stick to your favourite tools and trainers by combining Maze with other RL frameworks.
Although Maze supports more complex environment structures you can of course still integrate existing Gym environments
.
Warning
This is a preliminary, non-stable release of Maze. It is not yet complete and not all of our interfaces have settled yet. Hence, there might be some breaking changes on our way towards the first stable release.
This project is powered by |
Any questions or feedback? Just get in touch.
Documentation Overview¶
Below you find an overview of the general Maze framework documentation, which goes beyond the API documentation. The listed pages motivate and explain the underlying concepts, but most importantly also provide code snippets and minimum working examples to quickly get you started.
Training¶
Here we show how to train a policy on a standard Gym or custom environment using algorithms and models from Maze. This guide focuses on the main mechanics of Maze training runs and also gives some pointers on how to customize the training with custom environments (using the tutorial Maze 2D-cutting environment as an example), models, etc.
The figure below shows a conceptual overview of the Maze training workflow.

On this page:
The first example demonstrates training with the default settings. The main purpose is to show how the Maze training pipeline works in general.
The second example explains how you can customize training on standard Gym and Maze environments (for which configuration files are already provided by Maze).
The third example explains how you can resume previous training runs.
The following section then explains what you need to customize training for your own project, including custom components and configuration files.
Finally, the last section shows how to launch training directly from Python (avoiding the CLI).
In order to fully understand the configuration mechanisms used here, you should familiarize yourself with how Maze makes use of the Hydra configuration framework.
Example 1: Your First Training Run¶
We can train a policy from scratch on the Cartpole environment with default settings using the command:
$ maze-run -cn conf_train env=gym_env env.name=CartPole-v0
rc = RunContext(env="gym_env", overrides={"env.name": "CartPole-v0"})
rc.train()
The -cn conf_train argument specifies that we would like to use conf_train.yaml as our root config file. This is needed because, by default, the rollout configuration is used. Furthermore, we specify that the gym_env configuration should be used, with CartPole-v0 as the Gym environment name. (For more information on how to read and customize the default configuration files, see the Hydra overview.)
Such a training run consists of these main stages, loaded based on the default configuration provided by Maze:
The full configuration is assembled via Hydra based on the config files available, the defaults set in the root config, and the overrides you provide via the CLI (see the Hydra overview to understand more about this process). RunContext allows you to do this from within your Python script and offers some convenient functionality on top, e.g. passing object instances. See here for more information on the differences between the Python API (RunContext) and the CLI (maze-run) for training and rollout.
Hydra creates the output directory where all output files will be stored.
The full configuration of the job is logged: (1) to standard output, (2) as a text entry to your Tensorboard logs, and (3) as a YAML file in the output directory.
If the observation normalization wrapper is present, observation normalization statistics are collected and stored (note that no wrappers are applied by default).
Policies and critics are initialized and their graphical depictions saved.
The training starts, statistics are displayed in the console and stored to a Tensorboard file, and the current best model versions are saved (by default to a state_dict.pt file).
Once the training is done, final evaluation runs are performed and the final model versions are saved. (When the training is done depends on the training runner. Usually, this is specified using the runner.n_epochs argument, but the training can also end with early stopping if there is no more improvement.)
As the job is running, you should see the statistics from the training and evaluation runs printed to your console, as mentioned in step 6:
...
********** Iteration 3 **********
step|path | value
=====|========================================================================================|====================
4|eval DiscreteActionEvents action substep_0/action | [len:281, μ:0.5]
4|eval BaseEnvEvents reward median_step_count | 18.500
4|eval BaseEnvEvents reward mean_step_count | 28.100
4|eval BaseEnvEvents reward total_step_count | 928.000
4|eval BaseEnvEvents reward total_episode_count | 40.000
4|eval BaseEnvEvents reward episode_count | 10.000
4|eval BaseEnvEvents reward std | 16.447
4|eval BaseEnvEvents reward mean | 28.100
4|eval BaseEnvEvents reward min | 16.000
4|eval BaseEnvEvents reward max | 66.000
-> new overall best model 28.10000!
...
This main structure remains similar for all environment and training configurations.
Example 2: Customizing with Provided Components¶
When your Maze job is launched using maze-run
from the CLI, the following happens under
the hood:
A job configuration is assembled by putting available configuration files together with the overrides you specify as arguments to the run command. More on that can be found in configuration documentation page, specifically in Hydra overview.
The complete assembled configuration is handed over to the Maze runner specified in the configuration (in the
runner
group). This runner then launches and manages the training (or any other) job.
The common points for customizing the training run correspond to the configuration groups listed in the training root config file, namely:
Environment (env configuration group), configuring which environment the training runs on, as well as customizing any other inner configuration of the environment, if available (like the raw piece size in the 2D cutting environment).
Training algorithm (algorithm configuration group), specifying the algorithm used and its configuration.
Model (model configuration group), specifying how the models for policies and (optionally) critics should be assembled.
Runner (runner configuration group), specifying options for how the training is run (e.g. locally, in development mode, or using Ray on a Kubernetes cluster). The runner is also the main object responsible for administering the whole training run (and runners are thus specific to the individual algorithms used).
Maze provides a host of configuration files useful for working with standard Gym environments and environments provided by Maze (such as the 2D cutting environment). Hence, to use these, it suffices to supply appropriate overrides, without writing any additional configuration files.
By default, the gym_env
configuration is used, which allows us to specify the Gym env
that we would like to instantiate:
$ maze-run -cn conf_train env=gym_env env.name=LunarLander-v2
rc = RunContext(env="gym_env", overrides={"env.name": "LunarLander-v2"})
rc.train()
With appropriate overrides, we can also include vector observation model and wrappers (providing normalization):
$ maze-run -cn conf_train env=gym_env env.name=LunarLander-v2 wrappers=vector_obs model=vector_obs
rc = RunContext(
env="gym_env",
overrides={"env.name": "LunarLander-v2"},
wrappers="vector_obs",
model="vector_obs"
)
rc.train()
Alternatively, we could use the tutorial Cutting 2D environment:
$ maze-run -cn conf_train env=tutorial_cutting_2d_struct_masked \
wrappers=tutorial_cutting_2d model=tutorial_cutting_2d_struct_masked
rc = RunContext(
env="tutorial_cutting_2d_struct_masked",
wrappers="tutorial_cutting_2d",
model="tutorial_cutting_2d_struct_masked"
)
rc.train()
Further, by default, the algorithm used is Evolution Strategies (the implementation is provided by Maze). To use a different algorithm, e.g. PPO with a shared critic, we just need to add the appropriate overrides:
$ maze-run -cn conf_train algorithm=ppo env=tutorial_cutting_2d_struct_masked \
wrappers=tutorial_cutting_2d model=tutorial_cutting_2d_struct_masked
rc = RunContext(
env="tutorial_cutting_2d_struct_masked",
wrappers="tutorial_cutting_2d",
model="tutorial_cutting_2d_struct_masked",
algorithm="ppo"
)
rc.train()
To see all the configuration files available out-of-the-box, check out the maze/conf
package.
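If you are unsure where these bundled config files live in your installation, the following snippet (a hedged convenience, not part of the Maze API) prints the location of the maze.conf package on disk:
import os

import maze.conf

# Directory containing the bundled config groups (env, algorithm, model, runner, ...).
print(os.path.dirname(maze.conf.__file__))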
Example 3: Resuming Previous Training Runs¶
In case a training run fails (e.g. because your server goes down), there is no need to restart training entirely from scratch. You can simply pass a previous experiment as an input_dir and the Maze trainers will initialize the model weights, along with all other relevant artifacts such as normalization statistics, from the provided directory. Below you find a few examples where this might be useful.
This is the initial training run:
$ maze-run -cn conf_train env=gym_env env.name=LunarLander-v2 algorithm=ppo
Once trained, we can resume this run with:
rc = RunContext(env="gym_env", overrides={"env.name": "LunarLander-v2"}, algorithm="ppo",
                run_dir="outputs/<experiment-dir>")
rc.train()
We could also resume training with a refined learning rate:
$ maze-run -cn conf_train env=gym_env env.name=LunarLander-v2 algorithm=ppo \
algorithm.lr=0.0001 input_dir=outputs/<experiment-dir>
We could even switch to a different (compatible) trainer such as A2C by overriding algorithm=a2c. The Python equivalent of resuming with a refined learning rate passes the previous experiment directory via run_dir:
rc = RunContext(
env="gym_env",
run_dir="outputs/<experiment_dir>",
overrides={"env.name": "LunarLander-v2", "algorithm.lr": 0.0001},
algorithm="ppo"
)
rc.train()
Training in Your Custom Project¶
While the default environments and configurations are nice for getting started quickly or testing different approaches in standard scenarios, the primary focus of Maze is fully custom environments and models solving real-world problems (which are of course much more fun as well!).
The best place to start with a custom environment is the Maze step by step tutorial (mentioned already in the previous section) showing how to implement a custom Maze environment from scratch, along with respective configuration files (see also Hydra: Your Own Configuration Files).
Then, you can easily launch your environment by supplying your own configuration file (here we use one from the tutorial):
$ maze-run -cn conf_train env=tutorial_cutting_2d_struct_masked \
wrappers=tutorial_cutting_2d model=tutorial_cutting_2d_struct_masked
rc = RunContext(
env="tutorial_cutting_2d_struct_masked",
wrappers="tutorial_cutting_2d",
model="tutorial_cutting_2d_struct_masked"
)
rc.train()
For links to more customization options (like building custom models with Maze Perception Module), check out the Where to Go Next section.
While customizing other configuration groups listed in the previous section
(e.g., algorithm
, runner
) is not needed as often, all of these can be customized
in an analogous way (i.e., implement your own components that plug into the framework
instead of the default ones, and then add your own config
to be able to configure them from the command line). When using the Python API with RunContext
, you can also bypass configuration files and plug in your instantiated components directly.
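As a small illustration (hedged; it assumes RunContext accepts an environment factory callable), you can hand over an instantiated, Gym-wrapped environment instead of referencing an env config group:
from maze.api.run_context import RunContext
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv

# Plug in an environment factory directly instead of an env config file override.
rc = RunContext(env=lambda: GymMazeEnv("CartPole-v0"), algorithm="a2c")
rc.train(n_epochs=1)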
Plain Python Training¶
Maze also offers training from within your Python script. This can be achieved manually, by generating all necessary components yourself, or managed, by utilizing RunContext, which provides managed training and rollout capabilities.
Managed Setup
RunContext
initializes a context for running training and rollout with a shared configuration. Its functionality and interfaces are mostly congruent with the CLI's; however, there are some significant differences (e.g. being able to pass instantiated Python objects instead of relying exclusively on configuration dictionaries or files). See here for a more thorough introduction.
Manual Setup
In most use cases, it will probably be more convenient to launch training directly from the CLI and just implement your custom components (wrappers, environments, models, etc.) as needed. However, the inner architecture of Maze should be sufficiently modular to allow you to modify just the parts that you want.
Because each of the algorithms included in Maze has slightly different needs, the usage will likely differ slightly. However, regardless of which algorithm you intend to use, the TrainingRunner subclasses offer good examples of which components you will need for launching training directly from Python.
Specifically, you'll need to concentrate on the setup and run methods; setup takes as an argument the fully assembled Hydra configuration (which is printed to the command line every time you launch a job).
Usually, these methods do roughly the following:
Instantiate the environment and policy components (some of this functionality is provided by the shared TrainingRunner superclass, as a large part of it is common to all training runners).
Assemble the policy and critics into a structured policy.
Instantiate the trainer and any other components needed for training.
Launch the training.
In short, setup initializes the runner and run runs the training.
For example, these are the setup
and run
methods taken directly from the evolution strategies runner:
@override(TrainingRunner)
def setup(self, cfg: DictConfig) -> None:
"""
Setup the training master node.
"""
super().setup(cfg)
# --- init the shared noise table ---
print("********** Init Shared Noise Table **********")
self.shared_noise = SharedNoiseTable(count=self.shared_noise_table_size)
# --- initialize policies ---
torch_policy = TorchPolicy(networks=self._model_composer.policy.networks,
distribution_mapper=self._model_composer.distribution_mapper, device="cpu")
torch_policy.seed(self.maze_seeding.agent_global_seed)
# support policy wrapping
if self._cfg.algorithm.policy_wrapper:
policy = Factory(Policy).instantiate(
self._cfg.algorithm.policy_wrapper, torch_policy=torch_policy)
assert isinstance(policy, Policy) and isinstance(policy, TorchModel)
torch_policy = policy
print("********** Trainer Setup **********")
self._trainer = ESTrainer(
algorithm_config=cfg.algorithm,
torch_policy=torch_policy,
shared_noise=self.shared_noise,
normalization_stats=self._normalization_statistics
)
# initialize model from input_dir
self._init_trainer_from_input_dir(trainer=self._trainer, state_dict_dump_file=self.state_dict_dump_file,
input_dir=cfg.input_dir)
self._model_selection = BestModelSelection(dump_file=self.state_dict_dump_file, model=torch_policy,
dump_interval=self.dump_interval)
@override(TrainingRunner)
def run(
self,
n_epochs: Optional[int] = None,
distributed_rollouts: Optional[ESDistributedRollouts] = None,
model_selection: Optional[ModelSelectionBase] = None
) -> None:
"""
See :py:meth:`~maze.train.trainers.common.training_runner.TrainingRunner.run`.
:param distributed_rollouts: The distribution interface for experience collection.
:param n_epochs: Number of epochs to train.
:param model_selection: Optional model selection class, receives model evaluation results.
"""
print("********** Run Trainer **********")
env = self.env_factory()
env.seed(self.maze_seeding.generate_env_instance_seed())
# run with pseudo-distribution, without worker processes
self._trainer.train(
n_epochs=self._cfg.algorithm.n_epochs if n_epochs is None else n_epochs,
distributed_rollouts=self.create_distributed_rollouts(
env=env, shared_noise=self.shared_noise,
agent_instance_seed=self.maze_seeding.generate_agent_instance_seed()
) if distributed_rollouts is None else distributed_rollouts,
model_selection=self._model_selection if model_selection is None else model_selection
)
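For completeness, here is a hedged sketch of driving such a runner yourself: assemble the training config with read_hydra_config (used again in the Deployment section below), instantiate the configured runner via the Maze Factory, and call its setup and run methods. The exact override handling may differ in your setup, so treat this as an illustration rather than a recipe:
from maze.core.utils.config_utils import read_hydra_config
from maze.core.utils.factory import Factory
from maze.train.trainers.common.training_runner import TrainingRunner

# Assemble a full training configuration (overrides passed as keyword arguments).
cfg = read_hydra_config(config_module="maze.conf", config_name="conf_train",
                        env="gym_env", algorithm="es")

# Instantiate the runner configured in the `runner` group and hand it the full config.
runner = Factory(TrainingRunner).instantiate(cfg.runner)
runner.setup(cfg)
runner.run()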
Where to Go Next¶
After training, you might want to roll out the trained policy to further evaluate it or record the actions taken.
To create a custom Maze environment, you might want to review Maze environment hierarchy and creating a Maze environment from scratch.
To build custom Maze models, have a look at the Maze Perception Module.
To better understand how to configure custom environments and other components of your project, you might want to review the more advanced parts of configuration with Hydra.
For an introduction into training (and rolling out) conveniently with
RunContext
, head here.
Rollouts¶
During rollouts, the agent interacts with a given environment, issuing actions obtained from a given policy (be it a heuristic or a trained policy).
Usually, the purpose of rollouts is either evaluation (or even deployment) of a given policy in a given environment, or collection of trajectory data. Collected trajectory data can later be used for further learning (e.g. imitation learning) or for inspecting the policy behavior more closely using trajectory viewers.

On this page:
The First Rollout demonstrates the main mechanics of running a rollout.
Rollout Runner Configuration explains how to configure the rollout runners.
Environment and Agent Configuration shows how to configure different environments and agents.
Finally, Plain Python Configuration shows how to run rollouts without the CLI.
The First Rollout¶
Rollouts can be run from the command line, using the maze-run
command.
Rollout configuration (conf_rollout
) is used by default. Hence, to run
your first rollout, it suffices to execute:
$ maze-run env=gym_env env.name=CartPole-v0
This runs a rollout of a random policy on the CartPole environment. Statistics from the rollout are printed to the console, and trajectory data with event logs are stored in the output directory automatically configured by Hydra.
Alternatively, we might configure the rollouts to run just one episode in sequential mode and render the env (but more on that and other configuration options below):
$ maze-run env=gym_env env.name=CartPole-v0 runner=sequential runner.n_episodes=1 runner.render=true
Rollout Runner Configuration¶
Rollouts are run by rollout runners, which are agent- and environment-agnostic (for configuring environments and agents, see the following section).
By default, rollouts are run in multiple processes in parallel (as can be
seen in the rollout configuration file, which lists runner: parallel
in the defaults), and are handled by the ParallelRolloutRunner.
Alternatively, rollouts can be run sequentially in a single process by
opting for the sequential
runner configuration:
$ maze-run env=gym_env env.name=CartPole-v0 runner=sequential
This is mainly useful when running a single episode only or for debugging, as sequential rollouts are much slower.
The available configuration options for both scenarios are listed in
the Hydra runner package (conf/runner/
).
These are the parameters for the parallel rollout runner:
# @package _group_
_target_: maze.core.rollout.parallel_rollout_runner.ParallelRolloutRunner
# Number of processes to run the rollouts in concurrently
n_processes: 5
# Total number of episodes to run
n_episodes: 50
# Max steps per episode to perform
max_episode_steps: 200
# If true, trajectory data will be recorded and stored in :code:`trajectory_data` directory
record_trajectory: true
# If true, event logs will be recorded and stored in the `event_logs` directory
record_event_logs: true
# (Note that the default output directory is handled by Hydra)
Using these parameters, we can modify the rollout to e.g. run in only 3 processes and comprise 100 episodes, each with at most 10 steps:
$ maze-run env=gym_env env.name=CartPole-v0 runner.n_processes=3 \
runner.n_episodes=100 runner.max_episode_steps=10
(Alternatively, you can create your own configuration file that you will then
supply to the maze-run
command as described in Hydra primer section).
Environment and Policy Configuration¶
Environment and policy are configured using the env and policy Hydra packages, respectively. Rollout runners are environment- and agent-agnostic, and will attempt to instantiate the type specified in the config files using the Maze Factory. The environment is expected to conform to the StructuredEnv interface and the agent to the StructuredPolicy interface.
For agents, there are the following example config files:
policy/random_policy.yaml for instantiating a class that conforms to the StructuredPolicy interface directly
policy/cutting_2d_greedy_policy (in maze-envs/logistics) for wrapping (potentially multiple) flat policies into a structured policy
policy/torch_policy (in maze/train) for loading and rolling out a policy trained using the Maze framework
Hence, after training a policy on the tutorial Cutting 2D environment:
$ maze-run -cn conf_train env=tutorial_cutting_2d_struct_masked \
wrappers=tutorial_cutting_2d model=tutorial_cutting_2d_struct_masked
We can roll it out using:
$ maze-run policy=torch_policy env=tutorial_cutting_2d_struct_masked wrappers=tutorial_cutting_2d \
model=tutorial_cutting_2d_struct_masked input_dir=outputs/[training-output-dir]
Note that for this to work, the input_dir parameter must be set to the output directory of the training run (the model state dict and other configuration will be loaded from there).
Plain Python Configuration¶
Rollout runners are primarily designed to support running through Hydra from command line. That being said, you can of course instantiate and use the runners directly in Python if you have some special needs.
from maze.core.agent.dummy_cartpole_policy import DummyCartPolePolicy
from maze.core.rollout.sequential_rollout_runner import SequentialRolloutRunner
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
# Instantiate an example environment and agent
env = GymMazeEnv("CartPole-v0")
agent = DummyCartPolePolicy()
# Run a sequential rollout with rendering
# (including an example wrapper the environment will be wrapped in)
sequential = SequentialRolloutRunner(
n_episodes=10,
max_episode_steps=100,
record_trajectory=True,
record_event_logs=True,
render=True)
sequential.run_with(
env=env,
wrappers={"maze.core.wrappers.reward_scaling_wrapper.RewardScalingWrapper": {"scale": 0.1}},
agent=agent)
Using the snippet above, you can run a rollout on any agent and environment directly from Python (parallel rollouts can be run similarly).
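As an illustration of the parallel case, here is a minimal sketch mirroring the Hydra config of the ParallelRolloutRunner shown earlier; the constructor parameters are taken from that config, while the run_with call is assumed to behave as for the sequential runner.
from maze.core.agent.dummy_cartpole_policy import DummyCartPolePolicy
from maze.core.rollout.parallel_rollout_runner import ParallelRolloutRunner
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv

# Instantiate an example environment and agent
env = GymMazeEnv("CartPole-v0")
agent = DummyCartPolePolicy()

# Run a parallel rollout across multiple worker processes
# (parameter names mirror the Hydra runner config shown above)
parallel = ParallelRolloutRunner(
    n_processes=3,
    n_episodes=100,
    max_episode_steps=200,
    record_trajectory=True,
    record_event_logs=True)
parallel.run_with(env=env, wrappers={}, agent=agent)  # no additional wrappers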
However, note that the rollout runners are currently designed to be run only once (which is their main use case for runs initiated from the command line). Running them repeatedly might cause issues, especially with statistics and event logging: the runners initiate new writers every time (so you might get duplicate outputs), and some of these operations are order-sensitive (especially for parallel rollouts, where some state might be carried over to child processes).
Where to Go Next¶
If you collected trajectory data during the rollout, you might want to:
Visualize the collected rollout data in a trajectory viewer notebook
Use the collected data for imitation learning
Deployment¶
In an experimental setting, deploying an agent means running a rollout on a given test environment and evaluating the results. However, in a real-world scenario, when we are dealing with a production environment, running a rollout is usually not so easily feasible.
The main difference between experimental rollouts and real-world environments is that production environments often do not follow the Gym model and cannot be easily stepped. Instead, the control flow is inverted, with the environment querying the agent for an action, whenever it is ready.

The catch here is that in the Gym environment model, the wrappers that modify the environment behavior are considered to be a part of the environment (i.e., they are stepped together with it). However, during deployment, the production environment expects the wrapper stack to be maintained by the agent (after all, the production environment should not concern itself with the likes of observation post-processing and step-skipping).
The AgentDeployment component in Maze deals with exactly this: it packages the policy together with the wrapper stack and other components, so that in a production setting you only need to call act and get a processed action back, while things like statistics logging and observation frame stacking stay intact.
Building a Deployment Agent¶
There are two ways to build a deployment agent. Either from an already-instantiated policy and environment (which may include a stack of wrappers):
from maze.core.agent_deployment.agent_deployment import AgentDeployment
from maze.test.shared_test_utils.dummy_env.agents.dummy_policy import DummyGreedyPolicy
from maze.test.shared_test_utils.helper_functions import build_dummy_maze_env
agent_deployment = AgentDeployment(
policy=DummyGreedyPolicy(),
env=build_dummy_maze_env()
)
Or, by providing a configuration dictionary for the policy and environment (and, optionally, for wrappers), obtained from Hydra or elsewhere:
from maze.core.agent_deployment.agent_deployment import AgentDeployment
from maze.core.utils.config_utils import read_hydra_config
# Note: Needs to be a rollout config (`conf_rollout`), so the policy config is present as well
cfg = read_hydra_config(config_module="maze.conf", config_name="conf_rollout", env="gym_env")
agent_deployment = AgentDeployment(
policy=cfg.policy,
env=cfg.env,
wrappers=cfg.wrappers
)
(The configuration structure here is shared with rollouts. To better understand it, see Rollouts.)
Alternatively, you can mix and match these approaches, providing an already-instantiated policy and an environment config, or vice versa.
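For instance, a minimal sketch of the mix-and-match case, combining an already-instantiated policy with the environment and wrapper configuration loaded above (the DummyGreedyPolicy is reused here purely for illustration):
from maze.core.agent_deployment.agent_deployment import AgentDeployment
from maze.test.shared_test_utils.dummy_env.agents.dummy_policy import DummyGreedyPolicy

# Instantiated policy combined with env and wrapper configuration from Hydra
agent_deployment = AgentDeployment(
    policy=DummyGreedyPolicy(),
    env=cfg.env,
    wrappers=cfg.wrappers
)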
After that, you can already start querying the agent for actions using
the act
method:
maze_action = agent_deployment.act(maze_state, reward, done, info)
When the episode is done, you should close the agent deployment. At this point, the agent deployment resets the env to write out statistics and ensure all wrappers finish the episode properly.
Note
Ensure that you query the agent for actions from a single episode only, in the order the states are encountered. Otherwise, parts of the wrapper stack (like stats logging or observation frame stacking) might become inconsistent, leading to wrong observations being passed to the policy.
Note
Currently, the Agent Deployment supports a single episode only. Once the episode is done, close the deployment and initialize a new instance. Support for continued resets will likely be added in the future.
The full working example below demonstrates agent deployment on the Gym CartPole environment: we initialize the agent deployment and then use external_env to simulate an external production environment from which states are obtained:
import gym
from maze.core.agent.random_policy import RandomPolicy
from maze.core.agent_deployment.agent_deployment import AgentDeployment
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
env = GymMazeEnv("CartPole-v0")
policy = RandomPolicy(action_spaces_dict=env.action_spaces_dict)
agent_deployment = AgentDeployment(
policy=policy,
env=env
)
# Simulate an external production environment that does not use Maze
external_env = gym.make("CartPole-v0")
maze_state = external_env.reset()
reward, done, info = 0, False, {}
for i in range(10):
# Query the agent deployment for maze action, then step the environment with it
maze_action = agent_deployment.act(maze_state, reward, done, info)
maze_state, reward, done, info = external_env.step(maze_action)
agent_deployment.close(maze_state, reward, done, info)
Notice that above, we are dealing with Maze states and Maze actions, i.e., in the format they come directly from the environment. The translation to policy-friendly format of actions and observations is handled as part of the wrapper stack (where they are passed through action/observation conversion interfaces and the individual wrappers).
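Since agent deployment currently supports a single episode only (see the note above), handling several episodes means creating a fresh AgentDeployment per episode. A minimal sketch, reusing the CartPole setup from the example above (re-creating the Maze env per episode is done here purely for simplicity):
import gym
from maze.core.agent.random_policy import RandomPolicy
from maze.core.agent_deployment.agent_deployment import AgentDeployment
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv

# Simulate an external production environment that does not use Maze
external_env = gym.make("CartPole-v0")

for episode in range(3):
    # A fresh deployment (and Maze env) per episode
    env = GymMazeEnv("CartPole-v0")
    agent_deployment = AgentDeployment(
        policy=RandomPolicy(action_spaces_dict=env.action_spaces_dict),
        env=env
    )

    maze_state = external_env.reset()
    reward, done, info = 0, False, {}
    while not done:
        maze_action = agent_deployment.act(maze_state, reward, done, info)
        maze_state, reward, done, info = external_env.step(maze_action)

    # Close the deployment to flush statistics and let all wrappers finish the episode
    agent_deployment.close(maze_state, reward, done, info)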
How does this work under the hood?¶
When initializing the AgentDeployment, an existing environment including the wrapper stack is taken (or first initialized, if environment configuration was passed in). Then, the core environment (the component just under the maze environment, providing the core functionality; see Maze Environment Hierarchy) is swapped out for the so-called ExternalCoreEnv, which is then executed in a run loop with the provided policy on a separate thread. The external core env hijacks execution of the step function and pauses the thread, waiting until a new maze_state object (with associated reward, etc.) is passed in from the AgentDeployment, which runs on the main thread.
Despite what this looks like at first glance, this is not a concurrent setup. The threads are used only for hijacking the execution in the step function, and they never run concurrently during env stepping.
Either the main thread or the second thread is paused at any given time. First, the environment on the second thread waits to obtain the next maze state from the agent deployment on the main thread. Then, the agent deployment waits for the environment run loop to iterate to the next step function and return the processed action back.
Where to Go Next¶
To better understand the hierarchy of Core Env, Maze Env and Wrappers and how they affect the execution flow, see Maze Environment Hierarchy
Before deployment, you might want to train your own policy and then evaluate it using rollouts.
Collecting and Visualizing Rollouts¶
While the Event System provides an overview of notable events happening during an episode through statistics and event logs, it is often needed to dig deeper and visualize the full environment state at a given time step.
With the Maze Trajectory Viewer, it is possible to replay past episodes from collected trajectory data in a Jupyter Notebook.

Requirements¶
Note
Rollout visualization in a notebook is not currently available for standard Gym environments.
The trajectory viewer notebook requires the environment to implement a Maze-compatible Renderer based on matplotlib. The tutorial 2D cutting environment serves as a perfect example; see the Adding a Renderer section to understand how to implement one.
Unfortunately, Maze does not yet support rendering from trajectory data for standard Gym environments. For such environments, you can render only during the rollout itself by setting the corresponding option on the sequential runner (i.e., provide the following overrides for rollouts: runner=sequential runner.render=true).
Trajectory Data Collection¶
When using a compliant environment, past trajectories can be rendered directly from the trajectory data. These are usually collected using the rollout runners via CLI.
To simply collect trajectory data of a heuristic policy on the tutorial Cutting 2D environment, run:
$ maze-run env=tutorial_cutting_2d_flat policy=tutorial_cutting_2d_greedy_policy
Alternatively (and closer to a real training setting), you might want to first train an RL policy on the tutorial 2D cutting environment:
$ maze-run -cn conf_train env=tutorial_cutting_2d_struct_masked \
wrappers=tutorial_cutting_2d model=tutorial_cutting_2d_struct_masked
and then roll it out to collect the trajectory data (make sure to set the input_dir value to your actual training output directory):
$ maze-run policy=torch_policy env=tutorial_cutting_2d_struct_masked wrappers=tutorial_cutting_2d \
model=tutorial_cutting_2d_struct_masked input_dir=outputs/[training-output-dir]
Once the rollout has run, take note of the outputs directory created by Hydra, where the trajectory data will be logged: by default inside the trajectory_data subdirectory, one pickle file per episode (identified by a UUID generated for each episode).
(Whether trajectory data is recorded during a rollout is set using the runner.record_trajectory flag, which is on by default.)
Trajectory Visualization¶
Maze includes a Jupyter Notebook in evaluation/viewer.ipynb that will guide you through the process. You only need to supply a path to the outputs directory where your trajectory data reside. The renderer will be automatically built from the trajectory data.
(Note that the notebook also lists example trajectory data in case you do not have any on hand.)
Once an episode is selected and loaded, it is possible to skim back and forth in time using the notebook widget slider (controllable by mouse or keyboard).

Where to Go Next¶
To understand in more detail how to train a policy and then roll it out to collect trajectory data, check out Training and Rollouts.
Rendering and reviewing each time step in detail comes with a lot of overhead. In case you just want to see and easily compare notable events that happened across different episodes, you might want to review the Event system and how it is used to log statistics, KPIs, and raw events.
Imitation Learning and Fine-Tuning¶
Imitation learning refers to the task of learning a policy by imitating the behaviour of an existing teacher policy, usually represented as a fixed set of example trajectories. In some scenarios we might even have direct access to the actual teacher policy itself, allowing us to generate as many training trajectories as required. Imitation learning is especially useful for initializing a policy to kick-start an actual training-by-interaction run, or for settings where no training environment is available at all (e.g., offline RL).
Since imitation learning involves rollouts, it is not yet supported by RunContext. A guide for managed pure-Python imitation learning will be provided together with rollout support.

Overview:
Collect Training Trajectory Data¶
This section explains how to roll out a policy for collecting example trajectories. As the training trajectories might already be available (e.g., collected in practice), this step is optional.
As an example environment we pick the discrete version of the LunarLander environment, as it already provides a heuristic policy which we can use to collect our training trajectories for imitation learning.

But first, let's check if the policy actually does something meaningful by running a few rendered rollouts:
maze-run env.name=LunarLander-v2 policy=lunar_lander_heuristics \
runner=sequential runner.render=true runner.n_episodes=3
Hopefully this looks good and we can continue with actually collecting example trajectories for imitation learning.
The command below performs 3 rollouts of the heuristic policy and records them to the output directory.
maze-run env.name=LunarLander-v2 policy=lunar_lander_heuristics runner.n_episodes=3
You will get the following output, summarizing the statistics of the rollouts:
step|path | value
=====|======================================================================|================
1|rollout_stats DiscreteActionEvents action| substep_0/action |[len:583, μ:1.2]
1|rollout_stats BaseEnvEvents reward| median_step_count | 200.000
1|rollout_stats BaseEnvEvents reward| mean_step_count | 194.333
1|rollout_stats BaseEnvEvents reward| total_step_count | 583.000
1|rollout_stats BaseEnvEvents reward| total_episode_count | 3.000
1|rollout_stats BaseEnvEvents reward| episode_count | 3.000
1|rollout_stats BaseEnvEvents reward| std | 51.350
1|rollout_stats BaseEnvEvents reward| mean | 190.116
1|rollout_stats BaseEnvEvents reward| min | 121.352
1|rollout_stats BaseEnvEvents reward| max | 244.720
The trajectories will be dumped in a file structure similar to the one shown below:
- outputs/<experiment_path>
- maze_cli.log
- event_logs
- trajectory_data
- 00653455-d7e2-4737-a82b-d6d1bfce12f7.pkl
- ...
The pickle files contain the distinct episodes recorded as StateTrajectoryRecord objects, each containing a sequence of StateRecord objects, which keep the trajectory data for one step (state, action, reward, …).
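If you want to inspect the recorded data programmatically, the episode files can simply be un-pickled. Below is a minimal sketch; it assumes the trajectory record exposes its per-step records via a step_records attribute (attribute names may differ between Maze versions), and the path is illustrative.
import glob
import pickle

# Load all recorded episodes from the rollout output directory (path is illustrative)
for path in glob.glob("outputs/<experiment_path>/trajectory_data/*.pkl"):
    with open(path, "rb") as f:
        trajectory_record = pickle.load(f)  # a StateTrajectoryRecord
    # Iterate over the per-step records (each holds state, action, reward, ... for one step)
    for step_record in trajectory_record.step_records:  # assumed attribute name
        print(step_record)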
Learn from Example Trajectories¶
Given the trajectories recorded in the previous step we now train a policy with behavioral cloning, a simple version of imitation learning.
To do so we simply provide the trajectory data as an argument and run:
maze-run -cn conf_train env.name=LunarLander-v2 model=vector_obs wrappers=vector_obs \
algorithm=bc algorithm.validation_percentage=50 \
runner.dataset.dir_or_file=<absolute_experiment_path>/trajectory_data
...
********** Epoch 24: Iteration 1500 **********
step|path | value
=====|========================================================================|=========
96|train ImitationEvents discrete_accuracy 0/action | 0.948
96|train ImitationEvents policy_loss 0 | 0.150
96|train ImitationEvents policy_entropy 0 | 0.209
96|train ImitationEvents policy_l2_norm 0 | 42.416
96|train ImitationEvents policy_grad_norm 0 | 0.870
step|path | value
=====|========================================================================|=========
96|eval ImitationEvents discrete_accuracy 0/action | 0.947
96|eval ImitationEvents policy_loss 0 | 0.152
96|eval ImitationEvents policy_entropy 0 | 0.207
-> new overall best model -0.15179!
...
As with all trainers, we can watch the training progress with Tensorboard.
tensorboard --logdir outputs/

Once training is complete we can check how the behaviourally cloned policy performs in action.
maze-run env.name=LunarLander-v2 model=vector_obs wrappers=vector_obs \
policy=torch_policy input_dir=outputs/<imitation-learning-experiment>
step|path | value
=====|=====================================================================|=================
1|rollout_stats DiscreteActionEvents action substep_0/action |[len:8033, μ:1.2]
1|rollout_stats BaseEnvEvents reward median_step_count | 186.000
1|rollout_stats BaseEnvEvents reward mean_step_count | 160.660
1|rollout_stats BaseEnvEvents reward total_step_count | 8033.000
1|rollout_stats BaseEnvEvents reward total_episode_count | 50.000
1|rollout_stats BaseEnvEvents reward episode_count | 50.000
1|rollout_stats BaseEnvEvents reward std | 111.266
1|rollout_stats BaseEnvEvents reward mean | 101.243
1|rollout_stats BaseEnvEvents reward min | -164.563
1|rollout_stats BaseEnvEvents reward max | 282.895
With a mean reward of 101 this already looks like a promising starting point for RL fine-tuning.
Fine-Tune a Pre-Trained Policy¶
In the last section we show how to fine-tune the pre-trained policy with a model-free RL learner such as PPO. It is basically a standard PPO training run initialized with the imitation learning output.
maze-run -cn conf_train env.name=LunarLander-v2 model=vector_obs critic=template_state wrappers=vector_obs \
algorithm=ppo runner.eval_repeats=100 runner.critic_burn_in_epochs=10 \
input_dir=outputs/<imitation-learning-experiment>
Once training has started, we can observe the progress with Tensorboard (for the sake of clarity, we renamed the experiment directories for the screenshot below).
The Tensorboard log below compares the following experiments:
a randomly initialized policy trained with learning rate 0.0 (random-PPO-lr0)
a behavioural cloning pre-trained policy trained with learning rate 0.0 (pre_trained-PPO-lr0)
a randomly initialized policy trained with PPO (from_scratch-PPO)
a behavioural cloning pre-trained policy trained with PPO (pre_trained-PPO)
We also included training runs with a learning rate of 0.0 to get a feeling for the initial performance of the two models (randomly initialized vs. pre-trained).

As expected, we see that PPO fine-tuning of the pre-trained model starts at a much higher initial reward level compared to the model trained entirely from scratch.
Although this is quite a simple example, it is still a nice showcase of the usefulness of this two-stage learning paradigm. For scenarios with delayed and/or sparse rewards, following this principle is often crucial to get the RL trainer to start learning at all.
Where to Go Next¶
You can find more details on training and rollouts on the dedicated pages.
You can also read up on how to visualize recorded rollouts.
For further details on the learning algorithms you can visit the Trainers page.
Experiment Configuration¶
Launching experiments with the Maze command line interface (CLI) is based on the Hydra configuration system and hence also closely follows Hydra’s experimentation workflow. In general, there are different options for carrying out and configuring experiments with Maze. (To see experiment configuration in action, check out our project template.)
Overview
Command Line Overrides¶
To quickly play around with parameters in an interactive (temporary) fashion you can utilize Hydra command line overrides to reset parameters specified in the default config (e.g., conf_train).
$ maze-run -cn conf_train env.name=CartPole-v0 algorithm=ppo algorithm.lr=0.0001
rc = RunContext(
algorithm="ppo",
overrides={
"env.name": "CartPole-v0",
"algorithm.lr": 0.0001,
}
)
rc.train()
The example above changes the trainer to PPO and optimizes with a learning rate of 0.0001. You can of course override any other parameter of your training and rollout runs.
For an in-depth explanation of the override concept we refer to our Hydra documentation.
Experiment Config Files¶
For a more persistent way of structuring your experiments you can also make use of Hydra’s built-in Experiment Configuration.
This allows you to maintain multiple experiment config files, each only specifying the changes to the default config (e.g., conf_train).
# @package _global_
# defaults to override
defaults:
- override /algorithm: ppo
- override /wrappers: vector_obs
# overrides
algorithm:
lr: 0.0001
The experiment config above sets the trainer to PPO, the learning rate to 0.0001 and additionally activates the vector_obs wrapper stack.
To start the training run with this config file, run:
$ maze-run -cn conf_train +experiment=cartpole_ppo_wrappers
rc = RunContext(experiment="cartpole_ppo_wrappers")
rc.train()
You can find a more detailed explanation of how experiments are embedded in the overall configuration system in our Hydra experiment documentation.
Hyper Parameter Grid Search¶
To perform a simple grid search over selected hyperparameters you can use Hydra's Sweeper, which converts lists of command line arguments into distinct jobs.
The example below shows how to launch the same experiment with three different learning rates.
$ maze-run -cn conf_train env.name=CartPole-v0 algorithm=ppo \
algorithm.n_epochs=5 algorithm.lr=0.0001,0.0005,0.001 --multirun
rc = RunContext(
algorithm="ppo",
overrides={
"env.name": "CartPole-v0",
"algorithm.n_epochs": 5,
"algorithm.lr": [0.0001,0.0005,0.001]
},
multirun=True
)
rc.train()
We then recommend comparing the different configurations with Tensorboard:
tensorboard --logdir outputs/
Within Tensorboard, the hyperparameters of the grid search are logged as well, which makes comparing runs more convenient, as can be seen in the figure below:

By default, Hydra uses the local (sequential) runner for processing jobs. For setting up a more scalable (local, parallel) grid search we recommend creating an experiment config file. As a starting point, Maze already contains a simple local grid search setting based on the built-in MazeLocalLauncher:
# @package _global_
# defaults to override
defaults:
- override /runner: local
- override /hydra/launcher: local
# set training runner concurrency
runner:
concurrency: 0
# set grid search concurrency
hydra:
launcher:
# maximum number of parallel grid search jobs
# if -1, this is set to the number of CPUs
n_jobs: 4
# Hint: make sure that runner.concurrency * hydra.launcher.n_jobs <= CPUs
To repeat the grid search from above, but this time with multiple parallel workers, run:
$ maze-run -cn conf_train env.name=CartPole-v0 algorithm=ppo \
algorithm.n_epochs=5 algorithm.lr=0.0001,0.0005,0.001 +experiment=grid_search --multirun
Besides the built-in MazeLocalLauncher, there are also more scalable options available with Hydra.
Hyperparameter Optimization¶
Maze also supports hyperparameter optimization beyond vanilla grid search via Nevergrad (in case you have enough resources available).
Note
Hyperparameter optimization is not supported by RunContext yet.
You can start with the experiment template below and adapt it to your needs (for details on how to define the search space, we refer to the Hydra docs and this example).
# @package _global_
# defaults to override
defaults:
- override /algorithm: ppo
- override /hydra/sweeper: nevergrad
- override /hydra/launcher: local
- override /runner: local
# set training runner concurrency
runner:
concurrency: 0
# overrides
hydra:
sweeper:
optim:
# name of the nevergrad optimizer to use
# OnePlusOne is good at low budget, but may converge early
optimizer: OnePlusOne
# total number of function evaluations to perform
budget: 100
# number of parallel workers for performing function evaluations
num_workers: 4
# we want to maximize reward
maximize: true
# default parametrization of the search space
parametrization:
# a linearly-distributed scalar
algorithm.lr:
lower: 0.00001
upper: 0.001
algorithm.entropy_coef:
lower: 0.0000025
upper: 0.025
# Hint: make sure that runner.concurrency * hydra.sweeper.optim.num_workers <= CPUs
To start a hyper parameter optimization, run:
$ maze-run -cn conf_train env.name=Pendulum-v0 \
algorithm.n_epochs=5 +experiment=nevergrad --multirun
Where to Go Next¶
Here you can learn how to set up a custom configuration/experimentation module.
If you would like to learn about more advanced configuration options you can dive into the Hydra configuration system documentation.
Clone this project template repo to start your own Maze project.
Introducing the Perception Module¶
One of the key ingredients for successfully training RL agents in complex environments is their combination with powerful representation learners; in our case PyTorch-based neural networks. These enable the agent to perceive all kinds of observations (e.g. images, audio waves, sensor data, …), unlocking the full potential of the underlying RL-based learning systems.
Maze supports neural network building blocks via the Perception Module, which is responsible for transforming raw observations into standardized, learned latent representations. These representations are then utilized by the Action Spaces and Distributions Module to yield policy as well as critic outputs.

This page provides a general introduction to the Perception Module (which we recommend reading, of course). However, you can also start using the module right away and jump to the template or custom models section.
List of Features¶
Below we list the key features and design choices of the perception module:
Based on PyTorch.
Supports dictionary observation spaces.
Provides a large variety of neural network building blocks and model styles for customizing policy and value networks:
feed forward: dense, convolution, graph convolution and attention, …
recurrent: LSTM, last-step-LSTM, …
general purpose: action and observation masking, self-attention, concatenation, slicing, …
Provides shape inference, allowing custom models to be derived directly from observation space definitions.
Allows environment-specific customization of existing network templates via YAML configuration.
Definition of complex networks explicitly in Python using Maze perception blocks and/or PyTorch.
Generates detailed visualizations of policy and value networks (model graphs) containing the perception building blocks as well as all intermediate representations produced.
Can be easily extended with custom network components if necessary.
Perception Blocks¶
Perception blocks are components for composing models such as policy and value networks within Maze. They implement PyTorch's nn.Module interface and encapsulate neural network functionality into distinct, reusable units. In order to handle all our requirements (listed in the motivation below), every perception block expects a tensor dictionary as input and produces a tensor dictionary as output.

Maze already supports a number of built-in neural network building blocks which are, like all other components, easily extendable.
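To make the dictionary-in, dictionary-out convention concrete, the minimal sketch below pushes a batch of feature vectors through a DenseBlock; the constructor arguments follow the usage in the custom-model examples later on this page, while the concrete shapes are illustrative only.
import torch
from torch import nn
from maze.perception.blocks.feed_forward.dense import DenseBlock

# A dense block mapping the 'observation' tensor to a 'latent' tensor
dense_block = DenseBlock(
    in_keys='observation', out_keys='latent', in_shapes=[(4,)],
    hidden_units=[32, 32], non_lin=nn.ReLU)

# Perception blocks consume and produce tensor dictionaries
out_dict = dense_block({'observation': torch.randn(8, 4)})
print(out_dict['latent'].shape)  # e.g., torch.Size([8, 32])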
Motivation: Maze introduces perception blocks to extend PyTorch’s nn.Module with shape inference to support the following features:
To derive, generate and customize template models directly from observation and action space definitions.
To visualize models and how these process observations to ultimately arrive at an action or value prediction.
To seamlessly apply models at different stages of the RL development processes without the need for extensive input reshaping regardless if we perform a distributed training using parallel rollout workers or if we deploy a single agent in production. The figure below shows a few examples of such scenarios.

Inference Blocks¶
The InferenceBlock, a special perception block, combines multiple perception blocks into one prediction module. This is convenient and allows us to easily reuse semantically connected parts of our models but also enables us to derive and visualize inference graphs of these models. This is feasible as perception blocks operate with input and output tensor dictionaries, which can be easily linked to an inference graph.
The figure below shows a simple example of what such a graph can look like.

Details:
The model depicted in the figure above takes two observations as inputs:
obs_inventory : a 16-dimensional feature vector
obs_screen : a 64 x 64 RGB image
obs_inventory is processed by a DenseBlock resulting in a 32-dimensional latent representation.
obs_screen is processed by a VGG-style model resulting in a 32-dimensional latent representation.
Next, these two representations are concatenated into a joint representation with dimension 64.
Finally we have two LinearOutputBlocks yielding the logits for two distinct action heads:
action_move : a categorical action deciding to move [UP, DOWN, LEFT, RIGHT],
action_use : a multi-binary action deciding which item to use from inventory.
Comments on the visualization: blue boxes are blocks, while red ones are tensors. The color depth of a block (blue) indicates its number of parameters relative to the total number of parameters.
Model Composers¶
Model Composers, as the name suggests, compose the models and as such bring all components of the perception module together under one roof. In particular, they hold:
Definitions of observation and actions spaces.
All defined models, that is, policies (multiple ones in multi-step scenarios) and critics (multiple ones in multi-step scenarios depending on the critic type).
The Distribution Mapper, mapping (possibly custom) probability distributions to action spaces.
Maze supports different types of model composers and we will show how to work with template and custom models in detail later on.
Implementing Custom Perception Blocks¶
In case you would like to implement and use custom components when designing your models you can add new blocks by implementing:
The PerceptionBlock interface common for all perception blocks.
The ShapeNormalizationBlock interface normalizing the input and de-normalizing the output tensor dimensions if required for your block (optional).
The respective forward pass of your block.
The code-snippet below shows a simple toy-example block, wrapping a linear layer into a Maze perception block.
"""Contains a single linear layer block."""
import builtins
from typing import Union, List, Sequence, Dict
import torch
from torch import nn as nn
from maze.core.annotations import override
from maze.perception.blocks.shape_normalization import ShapeNormalizationBlock
Number = Union[builtins.int, builtins.float, builtins.bool]
class MyLinearBlock(ShapeNormalizationBlock):
"""A linear output block holding a single linear layer.
:param in_keys: One key identifying the input tensors.
:param out_keys: One key identifying the output tensors.
:param in_shapes: List of input shapes.
:param output_units: Count of output units.
"""
def __init__(self,
in_keys: Union[str, List[str]],
out_keys: Union[str, List[str]],
in_shapes: Union[Sequence[int], List[Sequence[int]]],
output_units: int):
super().__init__(in_keys=in_keys, out_keys=out_keys, in_shapes=in_shapes, in_num_dims=2, out_num_dims=2)
self.input_units = self.in_shapes[0][-1]
self.output_units = output_units
# initialize the linear layer
self.net = nn.Linear(self.input_units, self.output_units)
@override(ShapeNormalizationBlock)
def normalized_forward(self, block_input: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
"""implementation of :class:`~maze.perception.blocks.shape_normalization.ShapeNormalizationBlock` interface
"""
# extract the input tensor of the first (and here only) input key
input_tensor = block_input[self.in_keys[0]]
# apply the linear layer
output_tensor = self.net(input_tensor)
# return the output tensor as a tensor dictionary
return {self.out_keys[0]: output_tensor}
def __repr__(self):
"""This is the text shown in the graph visualization."""
txt = self.__class__.__name__
txt += f"\nOut Shapes: {self.out_shapes()}"
return txt
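A quick usage sketch for the block defined above (input and output sizes are purely illustrative):
import torch

# A block mapping a 16-dimensional input to 8 output units
block = MyLinearBlock(in_keys='input', out_keys='output', in_shapes=[(16,)], output_units=8)

# Like all perception blocks, it consumes and produces tensor dictionaries
out_dict = block({'input': torch.randn(4, 16)})
print(out_dict['output'].shape)  # torch.Size([4, 8])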
The Bigger Picture¶
The figure below shows how the components introduced in the perception module relate to each other.

Where to Go Next¶
For further details please see the reference documentation.
Action Spaces and Distributions
Working with template models
Working with custom models
Pre-processing and observation normalization
Action Spaces and Distributions¶
In response to the states perceived and rewards received, RL agents interact with their environment by taking appropriate actions. Depending on the problem at hand there are different types of actions an agent must be able to deal with (e.g. categorical, binary, continuous, …).
To support this requirement, Maze introduces the Distribution Module, which builds on top of the Perception Module and allows you to fully customize which probability distributions to link with certain action spaces or even individual action heads.
List of Features¶
The distribution module provides the following key features:
Supports flat dictionary action spaces (nested dict spaces are not yet supported)
Supports a variety of different action spaces and probability distributions
Supports customization of which probability distribution to use for which action space or head
Supports action masking in combination with the perception module
Allows to add custom probability distributions whenever required
Action Spaces and Probability Distributions¶
Maze so far supports the following action space / probability distribution combinations:
Action Space     | Available Distributions
Discrete         | Categorical (default)
Multi-Discrete   | Multi-Categorical (default)
(Multi-)Binary   | Bernoulli (default)
Box (Continuous) | Diagonal-Gaussian (default), Beta, Squashed-Gaussian
Dict             | DictProbabilityDistribution (default)
The DictProbabilityDistribution combines any of the other action spaces and distributions into a joint action space in case your agent has to interact with the environment via different action space types at the same time.
Note that the table above does not always follow a one-to-one mapping. In the case of a Box (Continuous) action space, for example, you can choose between a Diagonal-Gaussian distribution for an unbounded action space, and a Beta or Squashed-Gaussian distribution for a bounded action space. In other cases you might even want to add additional probability distributions according to the nature of the environment you are facing.
To allow for easy customization of the links between action spaces and distributions, Maze introduces the DistributionMapper, for which we show usage examples below.
Example 1: Mapping Action Spaces to Distributions¶
Adding the snippet below to your model config specifies the following:
Use Beta distributions for all Box action spaces.
All other action spaces behave as specified in the defaults.
# @package model
distribution_mapper_config:
- action_space: gym.spaces.Box
distribution: maze.distributions.beta.BetaProbabilityDistribution
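The same mapping can also be set up in plain Python (Example 4 below shows the full workflow); the sketch assumes distribution_mapper_config accepts the same list-of-mappings structure as the YAML entry above, and the action space is illustrative.
from gym import spaces
from maze.distributions.distribution_mapper import DistributionMapper

# Map all Box action spaces to Beta distributions (same mapping as in the YAML above)
distribution_mapper = DistributionMapper(
    action_space=spaces.Dict(spaces={"my_action": spaces.Box(low=-1.0, high=1.0, shape=(2,))}),
    distribution_mapper_config=[
        {"action_space": "gym.spaces.Box",
         "distribution": "maze.distributions.beta.BetaProbabilityDistribution"}
    ])

# The mapper reports the logits shape the Beta distribution expects for this action
print(distribution_mapper.required_logits_shape("my_action"))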
Example 2: Mapping Actions to Distributions¶
Adding the snippet below to your model config specifies the following:
Use Beta distributions for all Box action spaces.
Use a Squashed-Gaussian distribution for the action with key "special_action".
All other action spaces behave as specified in the defaults.
# @package model
distribution_mapper_config:
- action_space: gym.spaces.Box
distribution: maze.distributions.beta.BetaProbabilityDistribution
- action_head: special_action
distribution: maze.distributions.squashed_gaussian.SquashedGaussianProbabilityDistribution
When specifying custom behaviour for distinct action heads, make sure to add them below the more general action space configurations (i.e., get more specific from top to bottom).
Example 3: Using Custom Distributions¶
In case the probability distributions contained in Maze are not sufficient for your use case you can of course add additional custom probability distributions.
# @package model
distribution_mapper_config:
- action_space: gym.spaces.Discrete
distribution: my_package.maze_extensions.distributions.CustomCategoricalProbabilityDistribution
The example above configures a CustomCategoricalProbabilityDistribution to be used for all discrete action spaces. When adding a new distribution, you (1) have to implement the ProbabilityDistribution interface and (2) make sure that it is accessible within your Python path. Besides that, you only have to provide the reference path of the probability distribution you would like to use.
Example 4: Plain Python Configuration¶
For completeness we also provide a code snippet in plain Python showing how to:
Define a simple policy network.
Instantiate a default DistributionMapper.
Use the DistributionMapper to compute the required logits shapes for the Policy network.
Compute action logits from a random observation.
Instantiate the appropriate probability distribution and sample actions.
"""Minimum working example showing how to sample actions from a policy network."""
from typing import Dict, Sequence
import torch
from gym import spaces
from torch import nn
from maze.distributions.distribution_mapper import DistributionMapper
OBSERVATION_NAME = 'my_observation'
ACTION_NAME = 'my_action'
class PolicyNet(nn.Module):
"""Simple feed forward policy network."""
def __init__(self,
obs_shapes: Dict[str, Sequence[int]],
action_logits_shapes: Dict[str, Sequence[int]]):
super().__init__()
self.net = nn.Sequential(
nn.Linear(in_features=obs_shapes[OBSERVATION_NAME][0], out_features=16), nn.Tanh(),
nn.Linear(in_features=16, out_features=action_logits_shapes[ACTION_NAME][0]))
def forward(self, in_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
""" forward pass. """
return {ACTION_NAME: self.net(in_dict[OBSERVATION_NAME])}
# init default distribution mapper
distribution_mapper = DistributionMapper(
action_space=spaces.Dict(spaces={ACTION_NAME: spaces.Discrete(2)}),
distribution_mapper_config={})
# request required action logits shape and init a policy net
logits_shape = distribution_mapper.required_logits_shape(ACTION_NAME)
policy_net = PolicyNet(obs_shapes={OBSERVATION_NAME: (4,)},
action_logits_shapes={ACTION_NAME: logits_shape})
# compute action logits (here from random input)
logits_dict = policy_net({OBSERVATION_NAME: torch.randn(4)})
# init action sampling distribution from model output
dist = distribution_mapper.logits_dict_to_distribution(logits_dict, temperature=1.0)
# sample action (e.g., {my_action: 1})
action = dist.sample()
The Bigger Picture¶
The figure below relates the distribution module with the overall workflow.

The distribution mapper takes the (dictionary) action space as an input and links the action spaces with the respective probability distributions specified in the config. Action logits are learned on top of the representation produced by the perception module where each probability distribution specifies its expected logits shape.
Where to Go Next¶
For further details please see the reference documentation.
Processing raw observations with the Maze Perception Module.
Customizing models with Hydra.
Working with Template Models¶
The Maze template model composer allows us to compose policy and value networks directly from an environment’s observation and action space specification according to a selected model template and a corresponding model config. The central part of a template model composer is the Model Builder holding an Inference Block template (architecture template), which is then instantiated according to the config.
Next, we will introduce the general working principles. However, you can of course directly jump to the examples below to see how to build a feed forward as well as a recurrent policy network using the ConcatModelBuilder or check out how to work with simple single observation and action environments.
List of Features¶
A template model supports the following features:
Works with dictionary observation spaces.
Maps individual observations to modalities via the Observation Modality Mapping.
Allows to individually assign Perception Blocks to modalities via the Modality Config.
Allows picking architecture templates defining the underlying model structure via Maze Model Builders.
Cooperates with the Distributions Module supporting customization of action and value outputs.
Allows individually specifying shared embedding keys for actor-critic models; this enables shared embeddings between actor and critic.

Note
Maze so far does not support "end-to-end" default behaviour but instead provides config templates, which can be adapted to the respective needs. We opted for this route as complete defaults might lead to unintended and non-transparent results.
Model Builders (Architecture Templates)¶
This section lists and describes the available Model Builder architecture templates. Before we describe the builder instances in more detail, we provide some information on the available block types:
Fixed: these blocks are fixed and applied by the model builder per default.
Preset: these blocks are predefined for the respective model builder. They are basically placeholders for which you can specify the perception blocks they should hold.
Custom: these blocks are introduced by the user for processing the respective observation modalities (types) such as features or images.
ConcatModelBuilder (Reference Documentation)

Model builder details:
Processes the individual observations with modality blocks (custom).
Joins the respective modality hidden representations via a concatenation block (fixed).
The resulting representation is then further processed by the hidden and recurrence block (preset).
Example 1: Feed Forward Models¶
In this example we utilize the ConcatModelBuilder to compose a feed forward actor-critic model processing two observations for predicting two actions and one critic (value) output.
Observation Space:
observation_inventory : a 16-dimensional feature vector
observation_screen : a 64 x 64 RGB image
Action Space:
action_move : a categorical action with four options deciding to move [UP, DOWN, LEFT, RIGHT]
action_use : a 16-dimensional multi-binary action deciding which item to use from inventory
The model config is defined as:
# @package model
_target_: maze.perception.models.template_model_composer.TemplateModelComposer
# specify distribution mapping
# (here we use a default distribution mapping)
distribution_mapper_config: []
# specifies the architecture of default models
model_builder:
_target_: maze.perception.builders.ConcatModelBuilder
# Specify up to which keys the embedding should be shared between actor and critic
shared_embedding_keys: ~
# specifies the modality type of each observation
observation_modality_mapping:
observation_inventory: feature
observation_screen: image
# specifies with which block to process a modality
modality_config:
# modality processing
feature:
block_type: maze.perception.blocks.DenseBlock
block_params:
hidden_units: [32, 32]
non_lin: torch.nn.ReLU
image:
block_type: maze.perception.blocks.VGGConvolutionDenseBlock
block_params:
hidden_channels: [8, 16, 32]
hidden_units: [32]
non_lin: torch.nn.ReLU
# preserved keys for the model builder
hidden:
block_type: maze.perception.blocks.DenseBlock
block_params:
hidden_units: [128]
non_lin: torch.nn.ReLU
recurrence: {}
# select policy type
policy:
_target_: maze.perception.models.policies.ProbabilisticPolicyComposer
# select critic type
critic:
_target_: maze.perception.models.critics.StateCriticComposer
Details:
Models are composed by the Maze TemplateModelComposer.
No specific action space and probability distribution overrides are specified.
The model is based on the ConcatModelBuilder architecture template.
No shared embedding is used.
Observation observation_inventory is mapped to the user specified custom modality feature.
Observation observation_screen is mapped to the user specified custom modality image.
Modality Config:
Modalities of type feature are processed with a DenseBlock.
Modalities of type image are processed with a VGGConvolutionDenseBlock.
The concatenated joint representation is processed with another DenseBlock.
No recurrence is employed.
The resulting inference graphs for an actor-critic model are shown below:


Example 2: Recurrent Models¶
In this example we utilize the ConcatModelBuilder to compose a recurrent actor-critic model for the previous example.
# @package model
_target_: maze.perception.models.template_model_composer.TemplateModelComposer
# specify distribution mapping
# (here we use a default distribution mapping)
distribution_mapper_config: []
# specifies the architecture of default models
model_builder:
_target_: maze.perception.builders.ConcatModelBuilder
# Specify up to which keys the embedding should be shared between actor and critic
shared_embedding_keys: ~
# specifies the modality type of each observation
observation_modality_mapping:
observation_inventory: feature
observation_screen: image
# specifies with which block to process a modality
modality_config:
# modality processing
feature:
block_type: maze.perception.blocks.DenseBlock
block_params:
hidden_units: [32, 32]
non_lin: torch.nn.ReLU
image:
block_type: maze.perception.blocks.VGGConvolutionDenseBlock
block_params:
hidden_channels: [8, 16, 32]
hidden_units: [32]
non_lin: torch.nn.ReLU
# preserved keys for the model builder
hidden:
block_type: maze.perception.blocks.DenseBlock
block_params:
hidden_units: [128]
non_lin: torch.nn.ReLU
recurrence:
block_type: maze.perception.blocks.LSTMLastStepBlock
block_params:
hidden_size: 32
num_layers: 1
bidirectional: False
non_lin: torch.nn.SELU
# select policy type
policy:
_target_: maze.perception.models.policies.ProbabilisticPolicyComposer
# select critic type
critic:
_target_: maze.perception.models.critics.StateCriticComposer
Details:
The main part of the model is identical to the example above.
However, the example adds an additional recurrent block (LSTMLastStepBlock) considering not only the present but also the k previous time steps for its action and value predictions.
The resulting inference graphs for a recurrent actor-critic model are shown below:


Example 3: Single Observation and Action Models¶
Even though designed for more complex models which process multiple observations and predict multiple actions at the same time, you can of course also compose models for simpler use cases.
In this example we utilize the ConcatModelBuilder to compose an actor-critic model for OpenAI Gym's CartPole Env. CartPole has an observation space with dimensionality four and a discrete action space with two options.
The model config is defined as:
# @package model
_target_: maze.perception.models.template_model_composer.TemplateModelComposer
# specify distribution mapping
# (here we use a default distribution mapping)
distribution_mapper_config: []
# specifies the architecture of default models
model_builder:
_target_: maze.perception.builders.ConcatModelBuilder
# Specify up to which keys the embedding should be shared between actor and critic
shared_embedding_keys: ~
# specifies the modality type of each observation
observation_modality_mapping:
observation: feature
# specifies with which block to process a modality
modality_config:
# modality processing
feature:
block_type: maze.perception.blocks.DenseBlock
block_params:
hidden_units: [32, 32]
non_lin: torch.nn.ReLU
# preserved keys for the model builder
hidden: {}
recurrence: {}
# select policy type
policy:
_target_: maze.perception.models.policies.ProbabilisticPolicyComposer
# select critic type
critic:
_target_: maze.perception.models.critics.StateCriticComposer
The resulting inference graphs for an actor-critic model are shown below:


Details:
When there is only one observation, as in the present example, concatenation simply acts as an identity mapping of the previous output tensor (in this case observation_DenseBlock).
Where to Go Next¶
You can read up on our general introduction to the Perception Module.
Here we explain how to define and work with custom models in case the template models are not sufficient.
Working with Custom Models¶
The Maze custom model composer enables us to explicitly specify application-specific models directly in Python. Models can be written either with Maze perception blocks or with plain PyTorch, as long as they inherit from PyTorch's nn.Module.
As such, models can be easily created, and even existing models from previous work or well-known papers can be reused with minor adjustments. However, we recommend creating models using the predefined perception blocks in order to speed up writing as well as to take full advantage of features such as shape inference and graphical rendering of the models.
On this page we cover the features and general working principles. Afterwards we demonstrate the custom model composer with three examples:
A simple feed forward model for CartPole.
A more complex recurrent network example.
The CartPole example, this time using plain PyTorch (that is, no Maze perception blocks).
List of Features¶
The custom model composer supports the following features:
Specify complex models directly in Python.
Supports shape inference and shape checks for a given observation space when relying on Maze perception blocks.
Reuse existing PyTorch nn.Modules with minor modifications.
Stores a graphical rendering of the networks if the InferenceBlock is utilized.
Custom weight initialization and action head biasing.
Custom shared embedding between actor and critic.

The Custom Models Signature¶
The constraints we impose on any model used in conjunction with the custom model composer are threefold: first, the network class has to adhere to PyTorch's nn.Module and implement the forward method. Second, a custom network class requires certain constructor arguments depending on the type of network (policy, state critic, …). And lastly, the model has to return a dictionary when calling the forward method.
Policy Networks require the constructor arguments obs_shapes and action_logits_shapes. When models are built in the custom model composer, these two arguments are passed to the model constructor in addition to any other arbitrary arguments specified. obs_shapes is a dictionary mapping observation names to their corresponding shapes. Similarly, action_logits_shapes is a dictionary that maps action names to their corresponding action logits shapes. Both observation and action logits shapes are automatically inferred by the model composer.
implement nn.Module
constructor arguments: obs_shapes and action_logits_shapes
return type of forward method: Here the forward method has to return a dict, where the keys correspond to the actions of the environment.
State Critic Networks require only the constructor argument obs_shapes.
implement nn.Module
constructor arguments: obs_shapes
return type of forward method: The critic networks also have to return a dict, where the key is ‘value’.
Example 1: Simple Networks with Perception Blocks¶
Even though designed for more complex models that process multiple observations and predict multiple actions at the same time you can also compose models for simpler use cases, of course.
In this example we utilize the custom model composer in combination with the perception blocks to compose an actor-critic model for OpenAI Gym’s CartPole using a single dense block in each network. CartPole has an observation space with dimensionality four and a discrete action space with two options.
The policy model can then be defined as:
"""Shows how to use the custom model composer to build a custom policy network."""
from collections import OrderedDict
from typing import Dict, Union, Sequence, List
import numpy as np
import torch
import torch.nn as nn
from maze.perception.blocks.feed_forward.dense import DenseBlock
from maze.perception.blocks.inference import InferenceBlock
from maze.perception.blocks.output.linear import LinearOutputBlock
from maze.perception.weight_init import make_module_init_normc
class CustomCartpolePolicyNet(nn.Module):
"""Simple feed forward policy network.
:param obs_shapes: The shapes of all observations as a dict.
:param action_logits_shapes: The shapes of all actions as a dict structure.
:param non_lin: The nonlinear activation to be used.
:param hidden_units: A list of units per hidden layer.
"""
def __init__(self, obs_shapes: Dict[str, Sequence[int]], action_logits_shapes: Dict[str, Sequence[int]],
non_lin: Union[str, type(nn.Module)], hidden_units: List[int]):
super().__init__()
# Maze relies on dictionaries to represent the inference graph
self.perception_dict = OrderedDict()
# build latent embedding block
self.perception_dict['latent'] = DenseBlock(
in_keys='observation', out_keys='latent', in_shapes=obs_shapes['observation'],
hidden_units=hidden_units, non_lin=non_lin)
# build action head
self.perception_dict['action'] = LinearOutputBlock(
in_keys='latent', out_keys='action', in_shapes=self.perception_dict['latent'].out_shapes(),
output_units=int(np.prod(action_logits_shapes["action"])))
# build inference block
self.perception_net = InferenceBlock(
in_keys='observation', out_keys='action', in_shapes=obs_shapes['observation'],
perception_blocks=self.perception_dict)
# apply weight init
self.perception_net.apply(make_module_init_normc(1.0))
self.perception_dict['action'].apply(make_module_init_normc(0.01))
def forward(self, in_tensor_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
"""Compute forward pass through the network.
:param in_tensor_dict: Input tensor dict.
:return: The computed output of the network.
"""
return self.perception_net(in_tensor_dict)
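Before moving on to the critic, a quick sanity check of the policy network in isolation can be helpful. The shapes below match the CartPole setup of this example (4-dimensional observation, discrete action with two options); in an actual training run the model composer infers them automatically.
import torch
from torch import nn

# CartPole-like shapes: a 4-dimensional observation and two action logits
policy_net = CustomCartpolePolicyNet(
    obs_shapes={'observation': (4,)},
    action_logits_shapes={'action': (2,)},
    non_lin=nn.ReLU,
    hidden_units=[16, 32])

# The forward pass maps an observation dict to a dict of action logits
logits_dict = policy_net({'observation': torch.randn(8, 4)})
print(logits_dict['action'].shape)  # torch.Size([8, 2])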
And the critic model as:
"""Shows how to use the custom model composer to build a custom value network."""
from collections import OrderedDict
from typing import Dict, Union, Sequence, List
import torch
import torch.nn as nn
from maze.perception.blocks.feed_forward.dense import DenseBlock
from maze.perception.blocks.inference import InferenceBlock
from maze.perception.blocks.output.linear import LinearOutputBlock
from maze.perception.weight_init import make_module_init_normc
class CustomCartpoleCriticNet(nn.Module):
"""Simple feed forward critic network.
:param obs_shapes: The shapes of all observations as a dict.
:param non_lin: The nonlinear activation to be used.
:param hidden_units: A list of units per hidden layer.
"""
def __init__(self, obs_shapes: Dict[str, Sequence[int]], non_lin: Union[str, type(nn.Module)],
hidden_units: List[int]):
super().__init__()
# Maze relies on dictionaries to represent the inference graph
self.perception_dict = OrderedDict()
# build latent embedding block
self.perception_dict['latent'] = DenseBlock(
in_keys='observation', out_keys='latent', in_shapes=obs_shapes['observation'], hidden_units=hidden_units,
non_lin=non_lin)
# build value head
self.perception_dict['value'] = LinearOutputBlock(
in_keys='latent', out_keys='value', in_shapes=self.perception_dict['latent'].out_shapes(), output_units=1)
# build inference block
self.perception_net = InferenceBlock(
in_keys='observation', out_keys='value', in_shapes=obs_shapes['observation'],
perception_blocks=self.perception_dict)
# apply weight init
self.perception_net.apply(make_module_init_normc(1.0))
self.perception_dict['value'].apply(make_module_init_normc(0.01))
def forward(self, in_tensor_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
"""Compute forward pass through the network.
:param in_tensor_dict: Input tensor dict.
:return: The computed output of the network.
"""
return self.perception_net(in_tensor_dict)
An example config for the model composer could then look like this:
# @package model
# specify the custom model composer by reference
_target_: maze.perception.models.custom_model_composer.CustomModelComposer
# Specify distribution mapping
# (here we use a default distribution mapping)
distribution_mapper_config: []
policy:
# first specify the policy type
_target_: maze.perception.models.policies.ProbabilisticPolicyComposer
# specify the policy network(s) we would like to use, by reference
networks:
- _target_: docs.source.policy_and_value_networks.code_snippets.custom_cartpole_policy_net.CustomCartpolePolicyNet
# specify the parameters of our model
non_lin: torch.nn.ReLU
hidden_units: [16, 32]
substeps_with_separate_agent_nets: []
critic:
# first specify the critic type (here a state value critic)
_target_: maze.perception.models.critics.StateCriticComposer
# specify the critic network(s) we would like to use, by reference
networks:
- _target_: docs.source.policy_and_value_networks.code_snippets.custom_cartpole_critic_net.CustomCartpoleCriticNet
# specify the parameters of our model
non_lin: torch.nn.ReLU
hidden_units: [16, 32]
Details:
Models are composed by the CustomModelComposer.
No specific action space and probability distribution overrides are specified.
It specifies a probabilistic policy, the policy network to use and its constructor arguments.
It specifies a state critic, the value network to use and its constructor arguments.
Given this config, the resulting inference graphs are shown below:


Example 2: Complex Networks with Perception Blocks¶
Now we consider the more complex example already used in the template model composer.
The observation space is defined as:
observation_screen : a 64 x 64 RGB image
observation_inventory : a 16-dimensional feature vector
The action space is defined as:
action_move : a categorical action with four options deciding to move [UP, DOWN, LEFT, RIGHT]
action_use : a 16-dimensional multi-binary action deciding which item to use from inventory
Since we will build a policy and a state critic network, and both networks should share the same low-level structure, we can create a common base (latent space) network:
"""Shows how to use the custom model composer to build a complex custom embedding networks."""
from collections import OrderedDict
from typing import Dict, Union, Sequence, List
import torch.nn as nn
from maze.perception.blocks.feed_forward.dense import DenseBlock
from maze.perception.blocks.general.concat import ConcatenationBlock
from maze.perception.blocks.joint_blocks.lstm_last_step import LSTMLastStepBlock
from maze.perception.blocks.joint_blocks.vgg_conv_dense import VGGConvolutionDenseBlock
class CustomComplexLatentNet:
"""Simple feed forward policy network.
:param obs_shapes: The shapes of all observations as a dict.
:param non_lin: The nonlinear activation to be used.
:param hidden_units: A list of units per hidden layer.
"""
def __init__(self, obs_shapes: Dict[str, Sequence[int]],
non_lin: Union[str, type(nn.Module)], hidden_units: List[int]):
self.obs_shapes = obs_shapes
# Maze relies on dictionaries to represent the inference graph
self.perception_dict = OrderedDict()
# build latent feature embedding block
self.perception_dict['latent_inventory'] = DenseBlock(
in_keys='observation_inventory', out_keys='latent_inventory', in_shapes=obs_shapes['observation_inventory'],
hidden_units=[128], non_lin=non_lin)
# build latent pixel embedding block
self.perception_dict['latent_screen'] = VGGConvolutionDenseBlock(
in_keys='observation_screen', out_keys='latent_screen', in_shapes=obs_shapes['observation_screen'],
non_lin=non_lin, hidden_channels=[8, 16, 32], hidden_units=[32])
# Concatenate latent features
self.perception_dict['latent_concat'] = ConcatenationBlock(
in_keys=['latent_inventory', 'latent_screen'], out_keys='latent_concat',
in_shapes=self.perception_dict['latent_inventory'].out_shapes() +
self.perception_dict['latent_screen'].out_shapes(), concat_dim=-1)
# Add latent dense block
self.perception_dict['latent_dense'] = DenseBlock(
in_keys='latent_concat', out_keys='latent_dense', hidden_units=hidden_units, non_lin=non_lin,
in_shapes=self.perception_dict['latent_concat'].out_shapes()
)
# Add recurrent block
self.perception_dict['latent'] = LSTMLastStepBlock(
in_keys='latent_dense', out_keys='latent', in_shapes=self.perception_dict['latent_dense'].out_shapes(),
hidden_size=32, num_layers=1, bidirectional=False, non_lin=non_lin
)
Given this base class we can now create the policy network:
"""Shows how to use the custom model composer to build a complex custom policy networks."""
from typing import Dict, Union, Sequence, List
import numpy as np
import torch
import torch.nn as nn
from docs.source.policy_and_value_networks.code_snippets.custom_complex_latent_net import \
CustomComplexLatentNet
from maze.perception.blocks.inference import InferenceBlock
from maze.perception.blocks.output.linear import LinearOutputBlock
from maze.perception.weight_init import make_module_init_normc
class CustomComplexPolicyNet(nn.Module, CustomComplexLatentNet):
"""Simple feed forward policy network.
:param obs_shapes: The shapes of all observations as a dict.
:param action_logits_shapes: The shapes of all actions as a dict structure.
:param non_lin: The nonlinear activation to be used.
:param hidden_units: A list of units per hidden layer.
"""
def __init__(self, obs_shapes: Dict[str, Sequence[int]], action_logits_shapes: Dict[str, Sequence[int]],
non_lin: Union[str, type(nn.Module)], hidden_units: List[int]):
nn.Module.__init__(self)
CustomComplexLatentNet.__init__(self, obs_shapes, non_lin, hidden_units)
# build action heads
for action_key, action_shape in action_logits_shapes.items():
self.perception_dict[action_key] = LinearOutputBlock(
in_keys='latent', out_keys=action_key, in_shapes=self.perception_dict['latent'].out_shapes(),
output_units=int(np.prod(action_shape)))
# build inference block
in_keys = list(self.obs_shapes.keys())
self.perception_net = InferenceBlock(
in_keys=in_keys, out_keys=list(action_logits_shapes.keys()), perception_blocks=self.perception_dict,
in_shapes=[self.obs_shapes[key] for key in in_keys])
# apply weight init
self.perception_net.apply(make_module_init_normc(1.0))
for action_key in action_logits_shapes.keys():
self.perception_dict[action_key].apply(make_module_init_normc(0.01))
def forward(self, in_tensor_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
"""Compute forward pass through the network.
:param in_tensor_dict: Input tensor dict.
:return: The computed output of the network.
"""
return self.perception_net(in_tensor_dict)
… and the critic network:
"""Shows how to use the custom model composer to build a complex custom value networks."""
from typing import Dict, Union, Sequence, List
import torch
import torch.nn as nn
from docs.source.policy_and_value_networks.code_snippets.custom_complex_latent_net import \
CustomComplexLatentNet
from maze.perception.blocks.inference import InferenceBlock
from maze.perception.blocks.output.linear import LinearOutputBlock
from maze.perception.weight_init import make_module_init_normc
class CustomComplexCriticNet(nn.Module, CustomComplexLatentNet):
"""Simple feed forward policy network.
:param obs_shapes: The shapes of all observations as a dict.
:param non_lin: The nonlinear activation to be used.
:param hidden_units: A list of units per hidden layer.
"""
def __init__(self, obs_shapes: Dict[str, Sequence[int]],
non_lin: Union[str, type(nn.Module)], hidden_units: List[int]):
nn.Module.__init__(self)
CustomComplexLatentNet.__init__(self, obs_shapes, non_lin, hidden_units)
# build value head
self.perception_dict['value'] = LinearOutputBlock(
in_keys='latent', out_keys='value', in_shapes=self.perception_dict['latent'].out_shapes(),
output_units=1)
# build inference block
in_keys = list(self.obs_shapes.keys())
self.perception_net = InferenceBlock(
in_keys=in_keys, out_keys='value', in_shapes=[self.obs_shapes[key] for key in in_keys],
perception_blocks=self.perception_dict)
# apply weight init
self.perception_net.apply(make_module_init_normc(1.0))
self.perception_dict['value'].apply(make_module_init_normc(0.01))
def forward(self, in_tensor_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
"""Compute forward pass through the network.
:param in_tensor_dict: Input tensor dict.
:return: The computed output of the network.
"""
return self.perception_net(in_tensor_dict)
An example config for the model composer could then look like this:
# @package model
# specify the custom model composer by reference
_target_: maze.perception.models.custom_model_composer.CustomModelComposer
# Specify distribution mapping
# (here we use a default distribution mapping)
distribution_mapper_config: []
policy:
_target_: maze.perception.models.policies.ProbabilisticPolicyComposer
networks:
# specify the policy network we would like to use, by reference
- _target_: docs.source.policy_and_value_networks.code_snippets.custom_complex_policy_net.CustomComplexPolicyNet
# specify the parameters of our model
non_lin: torch.nn.ReLU
hidden_units: [128]
substeps_with_separate_agent_nets: []
critic:
# first specify the critic type (single step in this example)
_target_: maze.perception.models.critics.StateCriticComposer
networks:
# specify the critic we would like to use, by reference
- _target_: docs.source.policy_and_value_networks.code_snippets.custom_complex_critic_net.CustomComplexCriticNet
# specify the parameters of our model
non_lin: torch.nn.ReLU
hidden_units: [128]
The resulting inference graphs for a recurrent actor-critic model are shown below. Note that the models are identical except for the output layers due to the shared base model.


Example 3: Custom Networks with (plain PyTorch) Python¶
Here, we take a look at how to create a custom model with plain PyTorch. As already mentioned, we still have to specify the constructor arguments obs_shapes and action_logits_shapes but do not necessarily need to use them.
Important: Your models have to use dictionaries with torch.Tensors as values for both inputs and outputs.
For Gym’s CartPole the policy model could be defined like this:
"""Shows how to create a custom cartpole model using no maze perception components."""
from typing import Dict, Sequence
import torch
import torch.nn as nn
import torch.nn.functional as F
class CustomPlainCartpolePolicyNet(nn.Module):
"""Simple feed forward policy network.
:param obs_shapes: The shapes of all observations as a dict.
:param action_logits_shapes: The shapes of all actions as a dict structure.
:param hidden_layer_0: The number of units in layer 0.
:param hidden_layer_1: The number of units in layer 1.
:param use_bias: Specify whether to use a bias in the linear layers.
"""
def __init__(self, obs_shapes: Dict[str, Sequence[int]], action_logits_shapes: Dict[str, Sequence[int]],
hidden_layer_0: int, hidden_layer_1: int, use_bias: bool):
nn.Module.__init__(self)
self.observation_name = list(obs_shapes.keys())[0]
self.action_name = list(action_logits_shapes.keys())[0]
self.l0 = nn.Linear(4, hidden_layer_0, bias=use_bias)
self.l1 = nn.Linear(hidden_layer_0, hidden_layer_1, bias=use_bias)
self.l2 = nn.Linear(hidden_layer_1, 2, bias=use_bias)
def reset_parameters(self) -> None:
"""Reset the parameters of the Model"""
self.l0.reset_parameters()
self.l1.reset_parameters()
self.l2.reset_parameters()
def forward(self, in_tensor_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
"""Compute forward pass through the network.
:param in_tensor_dict: Input tensor dict.
:return: The computed output of the network.
"""
# Retrieve the observation tensor from the input dict
xx_tensor = in_tensor_dict[self.observation_name]
# Compute the forward pass through the network
xx_tensor = F.relu(self.l0(xx_tensor))
xx_tensor = F.relu(self.l1(xx_tensor))
xx_tensor = self.l2(xx_tensor)
# Create the output dictionary with the computed model output
out = dict({self.action_name: xx_tensor})
return out
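To illustrate the dictionary-in / dictionary-out convention, a minimal usage sketch of the network above could look like this (the observation and action key names as well as the shapes are assumptions for illustration):
import torch
# Hypothetical instantiation; key names and shapes are assumptions for illustration.
net = CustomPlainCartpolePolicyNet(
    obs_shapes={'observation': (4,)},
    action_logits_shapes={'action': (2,)},
    hidden_layer_0=16, hidden_layer_1=32, use_bias=True)
# A batch of two dummy CartPole observations is mapped to a dict of action logits.
logits_dict = net({'observation': torch.zeros(2, 4)})
print(logits_dict['action'].shape)  # -> torch.Size([2, 2])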
And the critic model as:
"""Shows how to create a custom cartpole model using no maze perception components."""
from typing import Dict, Sequence
import torch
import torch.nn as nn
import torch.nn.functional as F
class CustomPlainCartpoleCriticNet(nn.Module):
"""Simple feed forward critic network.
:param obs_shapes: The shapes of all observations as a dict.
:param hidden_layer_0: The number of units in layer 0.
:param hidden_layer_1: The number of units in layer 1.
:param use_bias: Specify whether to use a bias in the linear layers.
"""
def __init__(self, obs_shapes: Dict[str, Sequence[int]],
hidden_layer_0: int, hidden_layer_1: int, use_bias: bool):
nn.Module.__init__(self)
self.observation_name = list(obs_shapes.keys())[0]
self.l0 = nn.Linear(4, hidden_layer_0, bias=use_bias)
self.l1 = nn.Linear(hidden_layer_0, hidden_layer_1, bias=use_bias)
self.l2 = nn.Linear(hidden_layer_1, 1, bias=use_bias)
def reset_parameters(self) -> None:
"""Reset the parameters of the Model"""
self.l0.reset_parameters()
self.l1.reset_parameters()
self.l2.reset_parameters()
def forward(self, in_tensor_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
"""Compute forward pass through the network.
:param in_tensor_dict: Input tensor dict.
:return: The computed output of the network.
"""
# Retrieve the observation tensor from the input dict
xx_tensor = in_tensor_dict[self.observation_name]
# Compute the forward pass through the network
xx_tensor = F.relu(self.l0(xx_tensor))
xx_tensor = F.relu(self.l1(xx_tensor))
xx_tensor = self.l2(xx_tensor)
# Create the output dictionary with the computed model output
out = dict({'value': xx_tensor})
return out
An example config for the model composer could then look like this:
# @package model
# specify the custom model composer by reference
_target_: maze.perception.models.custom_model_composer.CustomModelComposer
# Specify distribution mapping
# (here we use a default distribution mapping)
distribution_mapper_config: []
policy:
# first specify the policy type
_target_: maze.perception.models.policies.ProbabilisticPolicyComposer
# specify the policy network(s) we would like to use, by reference
networks:
- _target_: docs.source.policy_and_value_networks.code_snippets.custom_plain_cartpole_policy_net.CustomPlainCartpolePolicyNet
# specify the parameters of our model
hidden_layer_0: 16
hidden_layer_1: 32
use_bias: True
substeps_with_separate_agent_nets: []
critic:
# first specify the critic type (here a state value critic)
_target_: maze.perception.models.critics.StateCriticComposer
# specify the critic network(s) we would like to use, by reference
networks:
- _target_: docs.source.policy_and_value_networks.code_snippets.custom_plain_cartpole_critic_net.CustomPlainCartpoleCriticNet
# specify the parameters of our model
hidden_layer_0: 16
hidden_layer_1: 32
use_bias: True
Note
Since we do not use the inference block in this example, no visual representation of the model can be rendered.
Where to Go Next¶
You can read up on our general introduction to the Perception Module.
As an alternative to custom models you can also use the template model builder.
Learn how to link policy network outputs with probability distributions for action sampling.
Maze Trainers¶
Trainers are the central components of the Maze framework when it comes to optimizing policies using different RL algorithms. To be more specific, Trainers and TrainingRunners are responsible for the following tasks:
manage the model types (actor networks, state-critics, state-action-critic, …),
manage agent environment interaction and trajectory data generation,
compute the loss (specific to the algorithm used),
update the weights in order to decrease the loss and increase the performance,
collect and log training statistics,
manage model checkpoints and the training process (e.g., early stopping).
The figure below provides an overview of the currently supported Trainers.

This page gives a general (high-level) overview of the Trainers and corresponding algorithms supported by the Maze framework. For more details, especially on the implementation, please refer to the API documentation on Trainers. For more details on the training workflow and how to start training runs using the Hydra config system, please refer to the training section.
Supported Spaces¶
If not stated otherwise, Maze Trainers support dictionary spaces for both observations and actions. If the environment you are working with does not yet interact via dictionary spaces, simply wrap it with the built-in DictActionWrapper for actions and DictObservationWrapper for observations. In case of standard Gym environments, just use the GymMazeEnv.
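For example, a standard Gym environment can be wrapped into a dictionary-space Maze environment in a single line (a minimal sketch):
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
# Wrap a standard Gym environment; observations and actions now live in dictionary spaces.
env = GymMazeEnv("CartPole-v0")
obs = env.reset()  # obs is a dict of numpy arrays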
Advantage Actor-Critic (A2C)¶
A2C is a synchronous version of the originally proposed Asynchronous Advantage Actor-Critic (A3C). As a policy gradient method, it maintains a probabilistic policy, computing action selection probabilities, as well as a critic, predicting the state value function. By setting the number of rollout steps as well as the number of parallel environments, one can control the batch size used for updating the policy and value function in each iteration (see the override sketch after the example below).
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International conference on machine learning (pp. 1928-1937).
Example
$ maze-run -cn conf_train env.name=CartPole-v0 algorithm=a2c model=vector_obs critic=template_state
rc = RunContext(
algorithm="a2c",
overrides={"env.name": "CartPole-v0"},
model="vector_obs",
critic="template_state"
)
rc.train()
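The effective batch size mentioned above (number of rollout steps times number of parallel environments) can be adjusted directly via Hydra overrides. A sketch, assuming the n_rollout_steps and concurrency parameters from the default configs listed below:
$ maze-run -cn conf_train env.name=CartPole-v0 algorithm=a2c model=vector_obs critic=template_state \
  algorithm.n_rollout_steps=50 runner.concurrency=4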
Algorithm Parameters | A2CAlgorithmConfig
Default parameters (maze/conf/algorithm/a2c.yaml)
# @package algorithm
# number of epochs to train
n_epochs: 0
# number of updates per epoch
epoch_length: 25
# number of steps used for early stopping
patience: 15
# number of critic (value function) burn in epochs
critic_burn_in_epochs: 0
# Number of steps taken for each rollout
n_rollout_steps: 100
# learning rate
lr: 0.0005
# discounting factor
gamma: 0.98
# weight of policy loss
policy_loss_coef: 1.0
# weight of value loss
value_loss_coef: 0.5
# weight of entropy loss
entropy_coef: 0.00025
# The maximum allowed gradient norm during training
max_grad_norm: 0.0
# Either "cpu" or "cuda"
device: cpu
# bias vs variance trade of factor for Generalized Advantage Estimator (GAE)
gae_lambda: 1.0
# Rollout evaluator (used for best model selection)
rollout_evaluator:
_target_: maze.train.trainers.common.evaluators.rollout_evaluator.RolloutEvaluator
# Run evaluation in deterministic mode (argmax-policy)
deterministic: true
# Number of evaluation trials
n_episodes: 8
Runner Parameters | ACRunner
Default parameters (maze/conf/algorithm_runner/a2c-dev.yaml)
# @package runner
_target_: "maze.train.trainers.common.actor_critic.actor_critic_runners.ACDevRunner"
# model class used for policy and critic updates
trainer_class: maze.train.trainers.a2c.a2c_trainer.A2C
# Number of concurrently executed environments
concurrency: 0
# Number of concurrent evaluation envs
eval_concurrency: 0
Default parameters (maze/conf/algorithm_runner/a2c-local.yaml)
# @package runner
_target_: "maze.train.trainers.common.actor_critic.actor_critic_runners.ACLocalRunner"
# model class used for policy and critic updates
trainer_class: maze.train.trainers.a2c.a2c_trainer.A2C
# Number of concurrently executed environments
concurrency: 0
# Number of concurrent evaluation envs
eval_concurrency: 0
Proximal Policy Optimization (PPO)¶
The PPO algorithm belongs to the class of actor-critic style policy gradient methods. It optimizes a “surrogate” objective function adopted from trust region methods. As such, it alternates between generating trajectory data via agent rollouts in the environment and optimizing the objective function by means of stochastic mini-batch gradient ascent.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
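For reference, the clipped surrogate objective maximized by PPO (the clip_range parameter below corresponds to \(\epsilon\)) can be written as:
\[ L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\ 1-\epsilon,\ 1+\epsilon\right)\hat{A}_t\right)\right], \quad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\mathrm{old}}(a_t \mid s_t)} \]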
Example
$ maze-run -cn conf_train env.name=CartPole-v0 algorithm=ppo model=vector_obs critic=template_state
rc = RunContext(
algorithm="ppo",
overrides={"env.name": "CartPole-v0"},
model="vector_obs",
critic="template_state"
)
rc.train()
Algorithm Parameters | PPOAlgorithmConfig
Default parameters (maze/conf/algorithm/ppo.yaml)
# @package algorithm
# number of epochs to train
n_epochs: 0
# number of updates per epoch
epoch_length: 25
# number of steps used for early stopping
patience: 15
# number of critic (value function) burn in epochs
critic_burn_in_epochs: 0
# Number of steps taken for each rollout
n_rollout_steps: 100
# learning rate
lr: 0.00025
# discounting factor
gamma: 0.98
# bias vs variance trade of factor for Generalized Advantage Estimator (GAE)
gae_lambda: 1.0
# weight of policy loss
policy_loss_coef: 1.0
# weight of value loss
value_loss_coef: 0.5
# weight of entropy loss
entropy_coef: 0.00025
# The maximum allowed gradient norm during training
max_grad_norm: 0.0
# Either "cpu" or "cuda"
device: cpu
# The batch size used for policy and value updates
batch_size: 100
# Number of epochs for policy and value optimization
n_optimization_epochs: 4
# Clipping parameter of surrogate loss
clip_range: 0.2
# Rollout evaluator (used for best model selection)
rollout_evaluator:
_target_: maze.train.trainers.common.evaluators.rollout_evaluator.RolloutEvaluator
# Run evaluation in deterministic mode (argmax-policy)
deterministic: true
# Number of evaluation trials
n_episodes: 8
Runner Parameters | ACRunner
Default parameters (maze/conf/algorithm_runner/ppo-dev.yaml)
# @package runner
_target_: "maze.train.trainers.common.actor_critic.actor_critic_runners.ACDevRunner"
# model class used for policy and critic updates
trainer_class: maze.train.trainers.ppo.ppo_trainer.PPO
# Number of concurrently executed environments
concurrency: 0
# Number of concurrent evaluation envs
eval_concurrency: 0
Default parameters (maze/conf/algorithm_runner/ppo-local.yaml)
# @package runner
_target_: "maze.train.trainers.common.actor_critic.actor_critic_runners.ACLocalRunner"
# model class used for policy and critic updates
trainer_class: maze.train.trainers.ppo.ppo_trainer.PPO
# Number of concurrently executed environments
concurrency: 0
# Number of concurrent evaluation envs
eval_concurrency: 0
Importance Weighted Actor-Learner Architecture (IMPALA)¶
IMPALA is an RL algorithm able to scale to a very large number of machines. Multiple workers collect trajectories (sequences of states, actions and rewards), which are communicated to a learner responsible for updating the policy by utilizing stochastic mini-batch gradient descent and the proposed V-trace correction algorithm. By decoupling rollouts (interactions with the environment) from policy updates, the algorithm is considered off-policy and asynchronous, making it very suitable for compute-intensive environments.
Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., Legg, S., & Kavukcuoglu, K. (2018). Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561.
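For reference, the V-trace value targets from the paper (the vtrace_clip_rho_threshold and vtrace_clip_pg_rho_threshold parameters below correspond to the clipping thresholds applied to \(\rho\)) are defined as:
\[ v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \left( \prod_{i=s}^{t-1} c_i \right) \delta_t V, \qquad \delta_t V = \rho_t \left( r_t + \gamma V(x_{t+1}) - V(x_t) \right), \]
\[ \rho_t = \min\left(\bar{\rho},\ \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\right), \qquad c_i = \min\left(\bar{c},\ \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\right), \]
where \(\mu\) denotes the behavior policy of the actors and \(\pi\) the current learner policy.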
Example
$ maze-run -cn conf_train env.name=CartPole-v0 algorithm=impala model=vector_obs critic=template_state
rc = RunContext(
algorithm="impala",
overrides={"env.name": "CartPole-v0"},
model="vector_obs",
critic="template_state"
)
rc.train()
Algorithm Parameters | ImpalaAlgorithmConfig
Default parameters (maze/conf/algorithm/impala.yaml)
# @package algorithm
# Common Actor critic parameters
# number of epochs to train
n_epochs: 0
# number of updates per epoch
epoch_length: 25
# number of steps used for early stopping
patience: 15
# number of critic (value function) burn in epochs
critic_burn_in_epochs: 0
# number of rolloutstep of each epoch substep
n_rollout_steps: 100
# learning rate
lr: 0.0005
# discount factor
gamma: 0.98
# coefficient of the policy used in the loss calculation
policy_loss_coef: 1.0
# coefficient of the value used in the loss calculation
value_loss_coef: 0.5
# coefficient of the entropy used in the loss calculation
entropy_coef: 0.01
# max grad norm for gradient clipping, ignored if value==0
max_grad_norm: 0
# Device of the learner (either cpu or cuda)
# Note that the actors collecting rollouts are always run on CPU.
device: "cpu"
# Impala specific events ----------------------
# this factor multiplied by the actor_batch_size gives the size of the queue for
# the agents output collected by the learner. Therefor if the all rollouts computed can be at most
# (queue_out_of_sync_factor + num_agents/actor_batch_size) out of sync with learner policy
queue_out_of_sync_factor: 1
# number of actors to combine to one batch
actors_batch_size: 8
# number of actors to be run
num_actors: 8
# A scalar float32 tensor with the clipping threshold for importance weights
# (rho) when calculating the baseline targets (vs). rho^bar in the paper. If None, no clipping is applied.
vtrace_clip_rho_threshold: 1.0
# A scalar float32 tensor with the clipping threshold on rho_s in
# \rho_s \delta log \pi(a|x) (r + \gamma v_{s+1} - V(x_sfrom_importance_weights)). If None, no clipping is
# applied.
vtrace_clip_pg_rho_threshold: 1.0
# Rollout evaluator (used for best model selection)
rollout_evaluator:
_target_: maze.train.trainers.common.evaluators.rollout_evaluator.RolloutEvaluator
# Run evaluation in deterministic mode (argmax-policy)
deterministic: true
# Number of evaluation trials
n_episodes: 8
Runner Parameters | ImpalaRunner
Default parameters (maze/conf/algorithm_runner/impala-dev.yaml)
# @package runner
_target_: "maze.train.trainers.impala.impala_runners.ImpalaDevRunner"
# Number of concurrent evaluation envs
eval_concurrency: 0
Default parameters (maze/conf/algorithm_runner/impala-local.yaml)
# @package runner
_target_: "maze.train.trainers.impala.impala_runners.ImpalaLocalRunner"
# type of startmethod used for multiprocessing: 'forkserver', 'spawn', 'fork', 'dummy'
start_method: forkserver
# Number of concurrent evaluation envs
eval_concurrency: 0
Soft Actor-Critic (from Demonstrations) (SAC, SACfD)¶
An off-policy actor-critic algorithm based on the maximum entropy reinforcement learning framework, with the goal of maximizing the expected reward while at the same time maximizing entropy. SAC exhibits high sample efficiency, is stable across different random seeds, and achieves competitive performance, especially on continuous control tasks. In contrast to A2C, PPO and IMPALA, it utilizes a stochastic state-action critic.
Additionally, our implementation allows the replay buffer to be initialized from existing demonstrations (e.g., recorded rollouts) instead of sampling the initial transitions with the given sampling policy (random by default). This variant is called Soft Actor-Critic from Demonstrations.
Haarnoja, T., Zhou, A., Abbeel, P., Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv preprint arXiv:1801.01290., Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., … & Levine, S. (2018). Soft Actor-Critic Algorithms and Applications. arXiv preprint arXiv:1812.05905., Christodoulou, P. (2019). Soft actor-critic for discrete action settings arXiv preprint arXiv:1910.07207.
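For reference, the maximum entropy objective optimized by SAC (the entropy_coef parameter below corresponds to the temperature \(\alpha\)) is:
\[ J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha\, \mathcal{H}\left(\pi(\cdot \mid s_t)\right) \right] \]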
Example SAC
$ maze-run -cn conf_train env.name=Pendulum-v0 algorithm=sac model=vector_obs critic=template_state_action
rc = RunContext(
algorithm="sac",
overrides={"env.name": "Pendulum-v0"},
model="vector_obs",
critic="template_state_action"
)
rc.train()
Example SACfD
$ maze-run env.name=LunarLander-v2 policy=lunar_lander_heuristics runner.n_episodes=1000
$ maze-run -cn conf_train env.name=LunarLander-v2 algorithm=sacfd model=flatten_concat critic=flatten_concat_state_action runner.initial_demonstration_trajectories.input_data=<absolute_experiment_path>/trajectory_data
from maze.core.agent.heuristic_lunar_lander_policy import HeuristicLunarLanderPolicy
from maze.core.rollout.parallel_rollout_runner import ParallelRolloutRunner
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
from maze.api.run_context import RunContext
# Instantiate an example environment and agent
env = GymMazeEnv("LunarLander-v2")
agent = HeuristicLunarLanderPolicy()
# Run a parallel rollout and collect the trajectories
runner = ParallelRolloutRunner(
n_episodes=1000,
max_episode_steps=0,
n_processes=5,
record_trajectory=True,
record_event_logs=True)
runner.run_with(
env=env,
wrappers={},
agent=agent)
rc = RunContext(
algorithm="sacfd",
overrides={"env.name": "LunarLander-v2",
"runner.initial_demonstration_trajectories.input_data": "<absolute_experiment_path>/trajectory_data"},
model="flatten_concat",
critic="flatten_concat_state_action"
)
rc.train()
Algorithm Parameters | SACAlgorithmConfig
Default parameters (maze/conf/algorithm/sac.yaml)
# @package algorithm
# Number of steps taken for each rollout
n_rollout_steps: 1
# Learning rate
lr: 0.001
# The entropy coefficient to use in the loss computation (called alpha in org paper)
entropy_coef: 0.2
# Discounting factor
gamma: 0.99
# The maximum allowed gradient norm during training
max_grad_norm: 0.0
# Number of actors to combine to one batch
batch_size: 100
# Number of batches to update on in each iteration
num_batches_per_iter: 1
# Number of actors to be run
num_actors: 1
# Parameter weighting the soft update of the target network
tau: 0.005
# Specify in what intervals to update the target networks
target_update_interval: 1
# Either "cpu" or "cuda"
device: cpu
# Specify whether to learn the entropy coefficient or rather use the default one (entropy_coef) [called alpha in paper]
entropy_tuning: true
# Specify an optional multiplier for the target entropy. This value is multiplied with the default target entropy
# computation (this is called alpha tuning in the org paper):
# discrete spaces: target_entropy = target_entropy_multiplier * ( - 0.98 * (-log (1 / |A|))
# continues spaces: target_entropy = target_entropy_multiplier * (- dim(A)) (e.g., -6 for HalfCheetah-v1)
target_entropy_multiplier: 1.0
# Learning rate for entropy tuning
entropy_coef_lr: 0.0007
# The size of the replay buffer
replay_buffer_size: 1000000
# The initial buffer size, where transaction are sampled with the initial sampling policy
initial_buffer_size: 10000
# The policy used to initially fill the replay buffer
initial_sampling_policy:
_target_: maze.core.agent.random_policy.RandomPolicy
# Number of rollouts collected from the actor in each iteration
rollouts_per_iteration: 1
# Specify whether all computed rollouts should be split into transitions before processing them
split_rollouts_into_transitions: true
# Number of epochs to train
n_epochs: 0
# Number of updates per epoch
epoch_length: 100
# Number of steps used for early stopping
patience: 50
# Rollout evaluator (used for best model selection)
rollout_evaluator:
_target_: maze.train.trainers.common.evaluators.rollout_evaluator.RolloutEvaluator
# Run evaluation in deterministic mode (argmax-policy)
deterministic: true
# Number of evaluation trials
n_episodes: 8
Runner Parameters SAC | SACRunner
Default parameters (maze/conf/algorithm_runner/sac-dev.yaml)
# @package runner
_target_: "maze.train.trainers.sac.sac_runners.SACDevRunner"
# Number of concurrent evaluation envs
eval_concurrency: 0
# Specify the Dataset class used to load the trajectory data for training, otherwise the initial replay buffer is
# sampled with the provided initial_sampling_policy
initial_demonstration_trajectories: ~
Default parameters (maze/conf/algorithm_runner/sac-local.yaml)
# @package runner
_target_: "maze.train.trainers.sac.sac_runners.SACDevRunner"
# Number of concurrent evaluation envs
eval_concurrency: 0
# Specify the Dataset class used to load the trajectory data for training, otherwise the initial replay buffer is
# sampled with the provided initial_sampling_policy
initial_demonstration_trajectories: ~
Runner Parameters SACfD | SACRunner
Default parameters (maze/conf/algorithm_runner/sacfd-dev.yaml)
# @package runner
_target_: "maze.train.trainers.sac.sac_runners.SACDevRunner"
# Number of concurrent evaluation envs
eval_concurrency: 0
# Specify the dataset class used to load the trajectory data for training, otherwise the initial replay buffer is
# sampled with the provided initial_sampling_policy
initial_demonstration_trajectories:
_target_: maze.core.trajectory_recording.datasets.in_memory_dataset.InMemoryDataset
input_data: trajectory_data
n_workers: 1
deserialize_in_main_thread: false
trajectory_processor:
_target_: maze.core.trajectory_recording.datasets.trajectory_processor.IdentityWithNextObservationTrajectoryProcessor
Default parameters (maze/conf/algorithm_runner/sacfd-local.yaml)
# @package runner
_target_: "maze.train.trainers.sac.sac_runners.SACDevRunner"
# Number of concurrent evaluation envs
eval_concurrency: 0
# Specify the dataset class used to load the trajectory data for training, otherwise the initial replay buffer is
# sampled with the provided initial_sampling_policy
initial_demonstration_trajectories:
_target_: maze.core.trajectory_recording.datasets.in_memory_dataset.InMemoryDataset
input_data: trajectory_data
n_workers: 5
deserialize_in_main_thread: false
trajectory_processor:
_target_: maze.core.trajectory_recording.datasets.trajectory_processor.IdentityWithNextObservationTrajectoryProcessor
Behavioural Cloning (BC)¶
Behavioural cloning is a simple imitation learning algorithm that infers the behaviour of a “hidden” policy by imitating the actions produced for given observations in a supervised learning setting. As such, it requires a set of training (example) trajectories collected prior to training.
Hussein, A., Gaber, M. M., Elyan, E., & Jayne, C. (2017). Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2), 1-35.
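Conceptually, behavioural cloning minimizes the negative log-likelihood of the demonstrated actions under the current policy (optionally regularized with an entropy bonus, cf. entropy_coef below):
\[ \mathcal{L}(\theta) = - \mathbb{E}_{(s, a) \sim \mathcal{D}} \left[ \log \pi_\theta(a \mid s) \right] \]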
Example: Imitation Learning and Fine-Tuning
Algorithm Parameters | BCAlgorithmConfig
Default parameters (maze/conf/algorithm_runner/bc.yaml)
# @package algorithm
# Number of epochs to train for
n_epochs: 1000
# Optimizer used to update the policy
optimizer:
_target_: torch.optim.Adam
lr: 0.001
# Device to train on
device: cuda
# Batch size
batch_size: 100
# Number of iterations after which to run evaluation (in addition to evaluations at the end of
# each epoch, which are run automatically). If set to None, evaluations will run on epoch end only.
eval_every_k_iterations: 500
# Percentage of the data used for validation.
validation_percentage: 20
# Number of episodes to run during each evaluation rollout (set to 0 to evaluate using validation only)
n_eval_episodes: 8
# Entropy coefficient for policy optimization.
entropy_coef: 0.0
Runner Parameters | BCRunner
Default parameters (maze/conf/algorithm_runner/bc-dev.yaml)
# @package runner
_target_: maze.train.trainers.imitation.bc_runners.BCDevRunner
# Number of concurrent evaluation envs
eval_concurrency: 0
# Specify the Dataset class used to load the trajectory data for training
dataset:
_target_: maze.core.trajectory_recording.datasets.in_memory_dataset.InMemoryDataset
input_data: trajectory_data
n_workers: 1
deserialize_in_main_thread: false
trajectory_processor:
_target_: maze.core.trajectory_recording.datasets.trajectory_processor.IdentityTrajectoryProcessor
Default parameters (maze/conf/algorithm_runner/bc-local.yaml)
# @package runner
_target_: maze.train.trainers.imitation.bc_runners.BCLocalRunner
# Number of concurrent evaluation envs
eval_concurrency: 0
# Specify the Dataset class used to load the trajectory data for training
dataset:
_target_: maze.core.trajectory_recording.datasets.in_memory_dataset.InMemoryDataset
input_data: trajectory_data
n_workers: 5
deserialize_in_main_thread: false
trajectory_processor:
_target_: maze.core.trajectory_recording.datasets.trajectory_processor.IdentityTrajectoryProcessor
Evolutionary Strategies (ES)¶
Evolutionary strategies is a black box optimization algorithm that utilizes direct policy search and can be parallelized very efficiently. Advantages of this method include invariance to action frequencies as well as to delayed rewards. Further, it tolerates extremely long time horizons, since it does not need to compute or approximate a temporally discounted value function. However, it is considered less sample efficient than conventional RL algorithms.
Salimans, T., Ho, J., Chen, X., Sidor, S., & Sutskever, I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
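For reference, ES estimates the gradient of the expected return of a Gaussian-smoothed version of the policy parameters (the noise_stddev parameter below corresponds to \(\sigma\)) as:
\[ \nabla_\theta\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \left[ F(\theta + \sigma \epsilon) \right] = \frac{1}{\sigma}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \left[ F(\theta + \sigma \epsilon)\, \epsilon \right] \]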
Example
$ maze-run -cn conf_train env.name=CartPole-v0 algorithm=es model=vector_obs
rc = RunContext(
algorithm="es",
overrides={"env.name": "CartPole-v0"},
model="vector_obs"
)
rc.train()
Algorithm Parameters | ESAlgorithmConfig
Default parameters (maze/conf/algorithm_runner/es.yaml)
# @package algorithm
# Minimum number of episode rollouts per training iteration (=epoch)
n_rollouts_per_update: 10
# Minimum number of cumulative env steps per training iteration (=epoch).
# The training iteration is only finished, once the given number of episodes
# AND the given number of steps has been reached. One of the two parameters
# can be set to 0.
n_timesteps_per_update: 0
# The number of epochs to train before termination. Pass 0 to train indefinitely
n_epochs: 0
# Limit the episode rollouts to a maximum number of steps. Set to 0 to disable this option.
max_steps: 0
# The optimizer to use to update the policy based on the sampled gradient.
optimizer:
_target_: maze.train.trainers.es.optimizers.adam.Adam
step_size: 0.01
# L2 weight regularization coefficient.
l2_penalty: 0.005
# The scaling factor of the random noise applied during training.
noise_stddev: 0.02
# Support for simulation logic or heuristics on top of a TorchPolicy.
policy_wrapper: ~
Runner Parameters | ESMasterRunner
Default parameters (maze/conf/algorithm_runner/es-dev.yaml)
# @package runner
_target_: "maze.train.trainers.es.ESDevRunner"
# Fixed number of evaluation runs per epoch.
n_eval_rollouts: 10
# Number of float values in the deterministically generated pseudo-random table
shared_noise_table_size: 100000000
Maze RLlib Trainer¶
Finally, the Maze framework also contains an RLlib trainer class. This special class of trainers wraps all necessary and convenient Maze components into RLlib-compatible objects so that Ray-RLlib can be reused to train Maze policies and critics. This enables us to train Maze models with Maze action distributions in Maze environments with almost all RLlib algorithms.
RLlib algorithms are currently not supported by RunContext
.
Example and Details: Maze RLlib Runner
Where to Go Next¶
You can read up on our general introduction to the Maze Training Workflow.
To build and use custom Maze models please refer to Maze Perception Module.
You can also look up the supported Action Spaces and Distributions Module.
Maze RLlib Runner¶
The RLlib Runner allows you to use RLlib Trainers in combination with Maze models and environments. Ray-RLlib is one of the most popular RL frameworks (algorithm collections) within the scientific community but also when it comes to practical relevance. It already comprises an extensive and tuned collection of various RL training algorithms. To gain access to RLlib’s algorithm collection while still retaining all of Maze’s practical features, we introduce the Maze RLlib module. It basically wraps Maze models (including our extensive Perception Module), Maze environments (including wrappers) as well as the customizable Maze action distributions. It further allows us to use the Maze Hydra command-line interface together with RLlib while at the same time using the well-optimized algorithms from RLlib.
This page gives an overview of the RLlib module and provides examples on how to apply it.

List of Features¶
Use Maze environments, models and action distributions in conjunction with RLlib algorithms.
Make full use of the Maze environment customization utils (wrappers, pre-processing, …).
Use the hydra cmd-line interface to start training runs.
Models trained with the Maze RLlib Runner are fully compatible with the remaining framework (except when using the default RLlib models).
Example 1: Training with Maze-RLlib and Hydra¶
Using RLlib algorithms with Maze and Hydra works analogously to starting training with native Maze Trainers. To train the CartPole environment with RLlib’s PPO, run:
$ maze-run -cn conf_rllib env.name=CartPole-v0 rllib/algorithm=ppo
Here, the -cn conf_rllib argument specifies that conf_rllib.yaml (available in the maze-rllib package) is used as our root config file. It specifies how RLlib trainers are used within Maze. (For more on root configuration files, see the Hydra overview.)
Example 2: Overwriting Training Parameters¶
Similar to native Maze trainers, the parametrization of RLlib training runs is also done via Hydra. The main parameters for customizing training are:
Environment (env configuration group): configures which environment the training runs on; this stays the same as for maze-train, for example.
Algorithm (rllib/algorithm configuration group): specifies the algorithm and its configuration (all supported algorithms).
Model (model configuration group): specifies how the models for policies and (optionally) critics should be assembled; this also stays the same as for maze-train.
Runner (rllib/runner configuration group): specifies how training is run (e.g., locally, in development mode). The runner is also the main object responsible for administering the whole training run.
To train with a different algorithm we simply have to specify the rllib/algorithm
parameter:
$ maze-run -cn conf_rllib env.name=CartPole-v0 rllib/algorithm=a3c
Furthermore, we have full access to the algorithm hyperparameters defined by RLlib and can overwrite them. E.g., to change the learning rate and rollout fragment length, execute
$ maze-run -cn conf_rllib env.name=CartPole-v0 rllib/algorithm=a3c \
algorithm.config.lr=0.001 algorithm.config.rollout_fragment_length=50
Example 3: Training with RLlib’s Default Models¶
Finally, it is also possible to utilize the RLlib default model builder by specifying model=rllib. This will load the RLlib default model and parameters, which can again be customized via Hydra:
$ maze-run -cn conf_rllib env.name=CartPole-v0 model=rllib \
model.fcnet_hiddens=[128,128] model.vf_share_layers=False
Supported Algorithms¶
The Bigger Picture¶
The figure below shows an overview of how the RLlib Module connects to the different Maze components in more detail:

Good to Know¶
Tip
Using the argument rllib/runner=dev starts Ray in local mode, sets the number of workers to 1 by default, and increases the log level (resulting in more information being printed). This is especially useful for debugging.
Tip
When watching the training progress of RLlib training runs with Tensorboard, make sure to start Tensorboard with --reload_multifile true, as both Maze and RLlib will dump an event log.
Where to Go Next¶
After training, you might want to rollout the trained policy to further evaluate it or record the actions taken.
To create a custom Maze environment, you might want to review Maze environment hierarchy and creating a Maze environment from scratch.
To build and use custom Maze models please refer to Maze Perception Module.
For more details on Hydra and how to use it go to configuration with Hydra.
You can read up on our general introduction to the Maze training workflow.
Policies, Critics and Agents¶
Depending on the domain and task we are working on we rely on different trainers (algorithm classes) to appropriately address the problem at hand. This in turn implies different requirements for the respective models and contained policy and value estimators.
The figure below provides a conceptual overview of the model classes contained in Maze and relates them to compatible algorithm classes and trainers.

Note that all policy and critics are compatible with Structured Environments.
Policies (Actors)¶
An agent holds one or more policies and acts (selects actions) according to these policies. Each policy consists of one or more policy networks. This might be required, for example, in (1) multi-agent RL settings where each agent acts according to its own distinct policy network or (2) when working with auto-regressive action spaces or multi-step environments.
In case of Policy Gradient Methods, such as the actor-critic learners A2C or PPO, we rely on a probabilistic policy defining a conditional action selection probability distribution \(\pi(a|s)\) given the current state \(s\).
In case of value-based methods, such as DQN, the Q-policy is defined via the state-action value function \(Q(s, a)\) (e.g., by selecting the action with the highest Q value: \(\mathrm{argmax}_a Q(s, a)\)).
Value Functions (Critics)¶
Maze so far supports two different kinds of critics: a standard state critic represented via a scalar value function \(V(S)\), and a state-action critic represented either via a scalar state-action value function \(Q(S, A)\) or its vectorized equivalent \(Q(S)\) predicting the state-action values for all actions at once.
Analogously to policies, each critic holds one or more value networks depending on the current usage scenario we are in (auto-regressive, multi-step, multi-agent, …). The table below provides an overview of the different critic styles.
State Critic \(V(S)\)
Each sub-step or actor gets its individual state critic.
One state critic is shared across all sub-steps or actors.
State-Action Critic \(Q(S, A)\)
Each sub-step or actor gets its individual state-action critic.
One state-action critic is shared across all sub-steps or actors.
Actor-Critics¶
To conveniently work with algorithms such as A2C, PPO, IMPALA or SAC, we provide a TorchActorCritic model that unifies the different policy and critic types into one model.
Where to Go Next¶
For further details please see the reference documentation.
To see how to actually implement policy and critic networks see the Perception Module.
You can see the list of available probability distributions for probabilistic policies.
You can also follow up on the available Maze trainers.
Maze Environment Hierarchy¶
When working with an environment, it is desirable to maintain some modularity in order to be able to, for example, test different configurations of action and observation spaces, modify or record rollouts, or turn an existing flat environment into a structured one.
This page explains how Maze achieves such modularity by breaking down the Maze environment into smaller components and utilizing wrappers. It also provides a high-level overview of what you need to do to use a new or existing custom environment with Maze. (You can find guidance on that at the end of each section.)
For more references on the individual components or on how to write a new environment from scratch, see the Where to Go Next section at the end.

The following sections describe the main components:
Core environment, which implements the main environment mechanics, and works with MazeState and MazeAction objects.
Observation- and ActionConversionInterfaces which turn MazeState and MazeAction objects (custom to the core environment) into actions and observations (instances of Gym-compatible spaces which can be fed into a model).
Maze env, which encapsulates the core environment and implements functionality common to all environments above it (e.g. manages observation and action conversion).
Wrappers, which add a great degree of flexibility by allowing you to encapsulate the environment and observe or modify its behavior.
Structured environment interface, which Maze uses to model more complex scenarios such as multi-step (auto-regressive), multi-agent or hierarchical settings.
Here, we explain what parts a Maze environment is composed of and how to apply wrappers.
Core Environment¶
Core environment implements the main mechanics and functionality of the environment.
Its interface is compatible with the Gym environment interface with functions such as
step
and reset
.
The step
function of the core environment takes a MazeAction
object and returns a MazeState object. There are no strict requirements on how these objects
should look – their structure is dependent on the needs of the core environment. However,
note that these objects should be serializable, so that they can be easily recorded
as part of trajectory data and inspected later.
Besides the Gym interface, the core environment interface also contains a couple of hooks
that make it easy to support various features of Maze, like recording trajectories of your
rollouts and then replaying them in a Jupyter notebook. These methods include, e.g.,
get_renderer()
and get_serializable_components()
. You don’t have to use these
if you don’t need them (e.g. just return an empty dictionary from get_serializable_components()
if there are no additional components you would like to serialize with trajectory data) – but then,
some features of Maze might not be available.
If you want to use a new or existing environment with Maze, the core environment is where you should start. Implement the core environment interface in your environment or encapsulate your environment in a core environment subclass.
Gym-Space Interfaces¶
Observation- and ActionConversionInterfaces translate MazeState and MazeAction objects (custom to the core environment) into actions and observations (instances of Gym-compatible spaces, i.e., usually a dictionary of numpy arrays which can be fed into a model) and vice versa.
It makes sense to extract this functionality into separate objects, as the format of actions and observations often needs to be swapped (to allow for different trained policies or heuristics). Treating space interfaces as separate objects encapsulates their configuration and separates it from the core environment functionality (which does not need to change when only, e.g., the format of the action space is being changed).
If you are creating a new Maze environment, you will need to implement at least one pair of interfaces – one for converting MazeStates into observations that your models can later consume, and the other for converting the actions produced by the model into the MazeActions your environment works with.
For more information on the space interfaces and how to customize your environment with them, refer to Customizing Core and Maze Environments.
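As a rough sketch, an observation conversion for a hypothetical inventory MazeState could look along the following lines (the method names maze_to_space and space, as well as the MazeState attribute, are assumptions here; refer to the ObservationConversionInterface reference documentation for the exact signatures):
import numpy as np
from gym import spaces
class InventoryObservationConversion:
    """Sketch: converts a hypothetical MazeState with an `inventory` attribute into a dict observation."""
    def maze_to_space(self, maze_state) -> dict:
        # Map the environment-specific MazeState to a dict of numpy arrays.
        return {'inventory': np.asarray(maze_state.inventory, dtype=np.float32)}
    def space(self) -> spaces.Dict:
        # Describe the observation space produced by this conversion.
        return spaces.Dict({'inventory': spaces.Box(low=0.0, high=np.inf, shape=(16,), dtype=np.float32)})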
Maze Environment¶
Maze environment encapsulates the core environment together with the space interfaces. Here, the functionality shared across all core environments is implemented – like management of the space interfaces, support for statistics and logging, and more.
Maze environment is the smallest unit that an RL agent can interact with, as it encapsulates the core functionality implemented by the core environment, space interfaces that translate the MazeState and MazeAction so that the model can understand it, and support for other optional features of Maze that you can add (like statistics logging).
If you are creating a new environment, you will likely not need to think about the Maze environment class much, as it is mostly concerned with functionality shared across all Maze environments. You will still need to subclass it to have a distinct Maze environment class for your environment, but usually it is enough to override the initializer; there is no need to modify any of its other functionality.
Wrappers¶
(This section provides an overview. See also Wrappers for more details.)
Wrappers are a very flexible way to modify the behavior of an environment. As the name implies, a wrapper encapsulates the whole environment. This means that the wrapper has complete control over the behavior of the environment and can modify it as needed.
Note also that another wrapper can also be applied to an already wrapped environment.
In this case, each method call (such as step
) will traverse through the whole wrapper
stack, from the outer-most wrapper to the Maze env, with each wrapper being able to
intercept and modify the call.
Maze provides superclasses for commonly used wrapper types:
ObservationWrapper can manipulate the observation before it reaches the agent. Observation wrappers are used, for example, for observation normalization or masking. Usually, this is the most common type of wrapper used.
RewardWrapper can manipulate the reward before it reaches the model.
ActionWrapper can manipulate the action the model produced before it is converted using ActionConversionInterface in Maze environment.
Wrapper is the common superclass of all the wrappers listed above. It can be subclassed directly if you need to provide some more elaborate functionality, like turning your flat environment into a structured multi-step one.
If you are creating a new Maze environment, wrappers are optional. Unless you have some very special needs, the wide variety of wrappers provided by Maze (like observation normalization wrapper or trajectory recording wrapper) should work with any Maze environment out of the box. However, you might need to implement a custom wrapper if you want to modify the behavior of your environment in some more customized manner, like turning your flat environment into a structured multi-step one.
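As an illustration, a custom observation wrapper could look roughly like the following sketch (it assumes that subclasses of ObservationWrapper only need to override a single observation hook, analogous to Gym-style observation wrappers; see the Wrappers documentation for the exact interface):
import numpy as np
from maze.core.wrappers.wrapper import ObservationWrapper
class ClipObservationWrapper(ObservationWrapper):
    """Sketch: clips every observation entry to the range [-1, 1] before it reaches the agent."""
    def observation(self, observation: dict) -> dict:
        return {key: np.clip(value, -1.0, 1.0) for key, value in observation.items()}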
For more information on wrappers and customization, see Wrappers.
Structured Environments¶
This section provides a high-level overview of structured environments. See Control Flows with Structured Environments for more details and examples.
Maze uses the StructuredEnv
concept to model more complex settings, such as multi-step (auto-regressive),
multi-agent or hierarchical settings.
While such settings can indeed be quite complex,
the StructuredEnv
interface itself
is rather simple under the hood. In summary, during each step in the environment:
The agent needs to ask which policy should act next. The environment exposes this using the actor_id method.
The agent then should evaluate the observation using the policy corresponding to the current actor ID, and issue the desired action using the step function in the usual Gym-like manner.
Note that the Actor ID, which identifies the currently active policy, is composed of (1) the sub-step key and (2) the index of the current actor in scope of this sub-step (as in some settings, there might be multiple actors per sub-step key).
Maze uses the StructuredEnv
interface in all settings by default, and other Maze components like
TorchPolicy
support it (and make it convenient to work with) out of the box.
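To make this concrete, a heavily simplified rollout loop over a structured environment might look like the sketch below (the per-sub-step policy lookup and the exact method signatures are assumptions for illustration):
# Simplified structured rollout sketch; `env` is assumed to be a StructuredEnv-compatible
# Maze environment and `policies` a dict mapping sub-step keys to policy objects.
obs = env.reset()
done = False
while not done:
    substep_key, agent_idx = env.actor_id()             # which policy/actor should act next
    action = policies[substep_key].compute_action(obs)  # hypothetical per-sub-step policy lookup
    obs, reward, done, info = env.step(action)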
Where to Go Next¶
After understanding how Maze environment hierarchy works, you might want to:
See how Hydra configuration works and how environments can be customized through it
See more about how to customize an existing environment with wrappers
Get more information on how to write a new Maze environment from scratch
See how Maze environments dispatch events to facilitate statistics collection and other forms of logging
Understand how policies and agents are structured
Also, note that the classes described above (like Core environment and Maze environment) themselves implement a set of interfaces that facilitate some of Maze functions, like EventEnvMixin interfacing the Event system or RenderEnvMixin facilitating rendering. You will likely not need to work with these directly, and hence they are not described here in detail. However, if you need to know more about these, head to the reference documentation.
Maze Event System¶
The Maze event system is a convenient way to monitor notable events happening in the environment as the agent is interacting with it. This page explains the motivation behind it, gives an overview of how it is used in Maze (pointing to other relevant sections), and briefly explains how it works under the hood.
Motivation¶
Standard metrics such as reward and step count provide a high-level overview of how an agent is doing in an environment, but don’t provide more detailed information on the actual behavior.
On the other hand, visualizing or otherwise inspecting the full environment state gives very detailed information on the behavior in some particular time frame, but is difficult to compare and aggregate across episodes or training runs.
In Maze, the event system fills the space in between – providing more information about environment mechanics than watching the reward alone, while making it easy to log, understand, and compare behavior across episodes and rollouts.
What is an event?¶
As the name suggests, an event is something notable that happens during the agent-environment interaction loop. For example, when the inventory is full in the example 2D cutting env, a piece will be discarded and the corresponding event will be fired:
self.inventory_events.piece_discarded(piece=(50, 30))
As can be seen above, events carry a descriptive name, encapsulate the details (like the dimensions of the discarded piece), and are part of a topic (like “inventory events”).
While there are some general events that apply to all environments (like reward-related events or KPIs), in general, environments declare their own topics and events as they see fit.
To understand how to declare and integrate custom events into your environment, see the adding events and KPIs tutorial.
How are events used in Maze?¶
There are three main things events are used for throughout Maze:
Reward aggregation. Reward aggregators declare which events they desire to observe, and then calculate the reward on top of them. This makes it possible to keep reward aggregators decoupled from the environment, which means they can be configured and changed easily. (Check out reward aggregation and the tutorial for more information.)
Statistics and KPIs. Event declarations can be annotated using decorators which specify how they should be aggregated on different levels (i.e., step, episode, and epoch). The statistics system then aggregates the events into statistics during trainings and rollouts, and displays these statistics in Tensorboard and console. This makes it possible to understand the agent’s behavior much better than if only high-level statistics such as reward and step count were observed. (For more information, see how statistics are logged and calculated.)
Raw event data logging. Events and their details are logged in CSV format, which makes them easy to access and analyze later via any custom tools. (While the CSV format should be suitable for most data-analysis tools out there, it is also possible to extend the logging functionality via custom writers if needed.)
For any other custom needs, it is possible to plug into the event system directly
through the Pubsub
or
EventEnvMixin
interfaces.
PubSub: Dispatching and Observing Events¶
Each core environment maintains its own Pubsub
message broker (stands for publisher-subscriber). Using the broker, it is possible to
register event topics (created as described in the tutorial),
register subscribers (which will then collect the dispatched events),
and dispatch events themselves.
# In a core env (which maintains a pubsub broker)
# Create a topic
inventory_events = self.pubsub.create_event_topic(InventoryEvents)
# Register a subscriber (can be a reward aggregator
# or any other class implementing the Subscriber interface)
self.pubsub.register_subscriber(my_subscriber)
# Dispatch an event
inventory_events.piece_discarded(piece=(50, 10))
Note that the subscriber must implement the
Subscriber
interface and declare which
events it wants to be notified about. This pattern is used by
RewardAggregators
, and
the tutorial on adding reward aggregation
is also a good place to start for any other custom needs.
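For illustration, a minimal subscriber could look like the sketch below. The class name is hypothetical, and the availability of query_events on the Subscriber base is an assumption based on the reward aggregator example shown later in this documentation; registration happens via register_subscriber as shown above.
from typing import List

from maze.core.events.pubsub import Subscriber
from maze_envs.logistics.cutting_2d.env.events import InventoryEvents


class DiscardedPieceMonitor(Subscriber):
    """Hypothetical subscriber counting discarded pieces."""

    def get_interfaces(self) -> List:
        """Declare the event topics this subscriber wants to be notified about."""
        return [InventoryEvents]

    def count_discarded(self) -> int:
        """Count the piece_discarded events collected so far (assumes query_events is available)."""
        return len(list(self.query_events(InventoryEvents.piece_discarded)))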
EventEnvMixin Interface: Querying Events¶
The core environment also records all events dispatched during the last time step
and makes them accessible using the EventEnvMixin
interface.
If you only need to query events dispatched during the last timestep, this option
might be more lightweight than registering with the
Pubsub
message broker.
env.get_step_events()
To see the interface in action, you might want to check out the
LogStatsWrapper
, which
uses this interface to query events for aggregation.
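As a quick, illustrative sketch of querying these events after a step (the structure of the returned event records is not detailed here):
# step the environment, then inspect the events dispatched during that step
obs, rewards, dones, info = env.step(action)
for event_record in env.get_step_events():
    print(event_record)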
Where to Go Next¶
After understanding the main concepts of the event system, you might want to:
See how reward aggregation works and how to implement it in an environment from scratch
Check out the statistics logging in Tensorboard and console
Review how the events and KPI aggregation works
Configuration with Hydra¶
Here, we explain the configuration scheme of the Maze framework: how to configure your environment and other components using YAML files, run your experiments via the CLI, and customize the runs via CLI overrides.
The Maze framework utilizes the Hydra configuration framework. These pages aim to give you a quick overview of how Maze uses Hydra and what its capabilities are, so that you can get up to speed quickly without prior Hydra knowledge:
Hydra: Overview¶
The motivation behind using Hydra is primarily to:
Keep separate components (e.g., environment, policy) in individual YAML files which are easier to understand
Run multiple experiments with different components (like using two different environment configurations, or training with PPO vs. A2C) without duplicating the whole config file
Make components/values different from the defaults immediately visible (with, e.g.,
maze-run runner=sequential
)
Below, the core concepts of Hydra as Maze uses them are described:
Introduction explains the core concepts of assembling configuration with Hydra
Config Root & Defaults explains how the root config file works and how default components are specified
Overrides show how you can easily customize the config parameters without duplicating the config file, and have Hydra assemble the config and log it for you
Output Directory shows how Hydra creates separate directories for your runs automatically. It is a bit separated from the previous concepts but still important for running your jobs.
Runner concept section explains how the Hydra config is handled by Maze to launch various kinds of jobs (like rollout or train runs) with different configurations
Introduction¶
Hydra is a configuration framework that, in short, allows you to:
Break down your configuration into multiple smaller YAML files (representing individual components)
Launch your job through CLI providing overrides for individual components or values and have Hydra assemble the config for you
Ad (1): For illustrative purposes, this is an example of what your Hydra config structure can look like:

Ad (2): With the structure above, you could then launch your jobs with specified components (again, this is only for illustrative purposes):
$ maze-run runner=parallel
Or, you can even override any individual value anywhere in the config like this:
$ maze-run runner=parallel runner.n_processes=10
You can also review the basic example and composition example at Hydra docs.
Configuration Root, Groups and Defaults¶
The starting place for a Hydra config is the root configuration file. It lists (1) the individual
configuration groups that you would like to use along with their defaults, and (2) any other
configuration that is universal. A simple root config file might look like this (all of these examples
are snippets taken from maze
config, shortened for brevity):
# These are the individual config components with their defaults
defaults:
  - runner: parallel
  - env: cartpole
  - wrappers: default
    optional: true
# ...
# Other values that are universally applicable (still can be changed with overrides)
log_base_dir: outputs
# ...
The snippet runner: parallel
tells Hydra to look for a file runner/parallel.yaml
and
transplant its contents under the runner:
key. (If optional: true
is specified,
Hydra does not raise an error when such a config file cannot be found.)
Hence, if the runner/parallel.yaml
file looks like this:
n_processes: 5
n_episodes: 50
# ...
the final assembled config would look like this:
runner:
  n_processes: 5
  n_episodes: 50
  # ...
env:
  # ...
Overrides¶
When running your job through a command line, you can customize individual bits of your configuration via command-line arguments.
As briefly demonstrated above, you can override individual defaults in the config file.
For example, when running a Maze rollout, the default runner is parallel
, but
you could specify the sequential runner instead:
$ maze-run runner=sequential
Besides overriding the config components themselves, you can also override individual values in the config:
$ maze-run runner=sequential runner.max_episode_steps=1000
There is also more advanced syntax for adding/removing config keys and other patterns – for this, you can consult Hydra docs regarding basic overrides and extended override syntax.
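For instance, a key that is not part of the config can be added with a leading + and an existing one deleted with a leading ~ (the parameter names below are placeholders, purely for illustration):
$ maze-run runner=sequential +runner.my_new_param=42 ~runner.some_param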
Output Directory¶
Hydra also by default handles the output directory for each job you run.
By default, outputs
is used as the base output directory and a new subdirectory
is created inside for each run. Here, Hydra also logs the configuration for the current
job in the .hydra
subdirectory, so that you can always get back to it.
You can override the hydra output directory as follows:
$ maze-run hydra.run.dir=my_dir
More on the output directory setting can be found in Hydra docs: output/working directory and customizing working directory pattern.
Maze Runner Concept¶
In Maze, the maze-run
command (that you have seen above already) is the single central
utility for launching all sorts of jobs, like training or rollouts.
Under the hood, when you launch such a job, the following happens:
Maze checks the runner part of the Hydra configuration that was passed through the command and instantiates a runner object from it (a subclass of Runner). The runner component of the configuration always specifies the Runner class to be instantiated, along with any other arguments it needs at initialization.
Maze then calls the run method on the instantiated runner and passes it the whole config, as obtained from Hydra.
This enables the maze-run
command to keep a lot of variability without much coupling
of the individual functionalities. For example, rollouts are run through subclasses
of RolloutRunner
and trainings through subclasses of
TrainingRunner
.
You are also free to create your own subclasses for rollouts, trainings or any completely different use cases.
Where to Go Next¶
After understanding the basics of how Maze uses Hydra, you might want to:
Try running a rollout using Hydra configuration through command line to put these ideas into action
Create custom Hydra configuration files for your project
Understand the advanced concepts of Hydra
Hydra: Your Own Configuration Files¶
We encourage you to add custom config or experiment files in your own project. These will make it easy for you to launch different versions of your environments and agents with different parameters.
To be able to use custom configuration files, you first need to create your config module and add it to the Hydra search path. Then, you can either
create your own config components (e.g., when you just need to customize the environment config),
create dedicated experiment config files based on the default (master) config,
or create your own root config file if you have more custom needs.
Step 1: Custom Config Module in Hydra Search Path¶
For this, first, create a module where your config will reside
(let’s say your_project.conf
) and place an __init__.py
file in there.
Second, to make your project available to Hydra, make sure it is either installed
using pip install -e .
, or added to your Python path manually, using for example
export PYTHONPATH="$PYTHONPATH:$PWD/"
when in the project directory.
As a final step, you need to tell Hydra to look for your config files.
This can be done either by specifying your config directory along with each
maze-run
command using the -cd
flag:
$ maze-run -cd your_project/conf ...
Or, to avoid specifying this with every command, you can add your config module
to the Hydra search path by creating the following Hydra plugin
(substitute your_project.conf
with your actual config module path):
# Inside your project in: hydra_plugins/add_custom_config_to_search_path.py
"""Hydra plugin to register additional config packages in the search path."""
from hydra.core.config_search_path import ConfigSearchPath
from hydra.plugins.search_path_plugin import SearchPathPlugin


class AddCustomConfigToSearchPathPlugin(SearchPathPlugin):
    """Hydra plugin to register additional config packages in the search path."""

    def manipulate_search_path(self, search_path: ConfigSearchPath) -> None:
        """Add custom config to search path (part of SearchPathPlugin interface)."""
        search_path.append("project", "pkg://your_project.conf")
Now, you can add additional root config files as well as individual components into your config package.
For more information on search path customization, check Config Search Path and SearchPathPlugins in Hydra docs.
Step 2a: Custom Config Components¶
If what you are after is only providing custom options for some of the components Maze configuration uses (e.g., a custom environment configuration), then it suffices to add these into the relevant directory in your config module and you are good to go.
For example, if you want a custom configuration for the Gym Car Racing env, you might do:
# In your_project/conf/env/car_racing.yaml:
# @package env
_target_: maze.core.wrappers.maze_gym_env_wrapper.GymMazeEnv
env: CarRacing-v0
Then, you might call maze-run
with the env=car_racing
override and it will
load the configuration from your file.
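For instance, a training run using this environment configuration could then be launched like this (the algorithm choice is for illustration only):
$ maze-run -cn conf_train env=car_racing algorithm=ppo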
Depending on your needs, you can mix-and-match your custom configurations
with configurations provided by Maze (e.g. use a custom env
configuration
while using a wrappers
or models
configuration provided by Maze).
Step 2b: Experiment Config¶
Another convenient way to assemble and maintain different configurations of your experiments is Hydra’s built-in Experiment Configuration.
It allows you to customize experiments by only specifying the changes to the default (master) configuration. You can for example change the trainer to PPO, the learning rate to 0.0001 and additionally activate the vector_obs wrapper stack by providing the following experiment configuration:
# @package _global_
# defaults to override
defaults:
- override /algorithm: ppo
- override /wrappers: vector_obs
# overrides
algorithm:
  lr: 0.0001
To start the experiment from this experiment config file, run:
$ maze-run -cn conf_train +experiment=cartpole_ppo_wrappers
For more details on experimenting we refer to the experiment configuration docs.
Step 2c: Custom Root Config¶
If you require even more customization, you will likely need to define your own root config. This is usually useful for custom projects, as it allows you to create custom defaults for the individual config groups.
We suggest you start by copying one of the root configs already available in Maze
(like conf_rollout
or conf_train
, depending on what you need), and then adding
more required keys or removing those that are not needed. However, it is also
not difficult to start from scratch if you know what you need.
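As a purely illustrative sketch (modelled on the root config snippet shown earlier, not an exact copy of conf_rollout or conf_train), such a file might start out like this:
# your_project/conf/my_own_conf.yaml (hypothetical)
defaults:
  - runner: sequential
  - env: cartpole
  - wrappers: default
    optional: true

# universally applicable values (still overridable from the CLI)
log_base_dir: outputs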
Once you create your root config file (let’s say your_project/conf/my_own_conf.yaml
),
it suffices to point Hydra to it via the argument -cn my_own_conf
, so your
command would look like this (for illustrative purposes):
$ maze-run -cn my_own_conf
Then, all the defaults and available components that Hydra will look for depend on what you specified in your new root config.
For an overview of root config, check out config root & defaults.
Step 3: Custom Runners (Optional)¶
If you want to launch different types of jobs than what Maze provides by default, like
implementing a custom training algorithm or deployment scenario that you would like to
run via the CLI, you will benefit from creating a custom Runner
.
You can subclass the desired class in the runner hierarchy (like the
TrainingRunner
if you are implementing a new training scheme, or the general Runner
for some more general concept). Then, just create a custom config file for the runner
config
group that configures your new class, and you are good to go.
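As a rough, hypothetical sketch (the class and module names are placeholders, and the base class import from the Maze runner hierarchy is omitted on purpose – check the reference documentation for the exact interface and the run() signature):
# your_project/runners/deployment_runner.py (hypothetical)
from omegaconf import DictConfig


class DeploymentRunner:  # in practice, subclass the appropriate class from the Maze runner hierarchy
    """Hypothetical runner deploying a trained policy instead of training one."""

    def __init__(self, checkpoint_dir: str):
        self.checkpoint_dir = checkpoint_dir

    def run(self, cfg: DictConfig) -> None:
        """Invoked with the fully assembled Hydra config."""
        # load the trained policy from self.checkpoint_dir and serve it here
        ...
The matching config file for the runner group would then configure this class via _target_ (again, names are placeholders):
# your_project/conf/runner/deployment.yaml (hypothetical)
_target_: your_project.runners.deployment_runner.DeploymentRunner
checkpoint_dir: outputs/my_run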
Where to Go Next¶
After understanding how custom configuration is done, you might want to:
Review the Hydra overview to see how you should structure your custom configuration
Read about the advanced concepts of Hydra
Hydra: Advanced Concepts¶
This page features a collection of more advanced Maze and Hydra features that are used across the framework and power the configuration under the hood:
Factory, which Maze uses to turn configuration into instantiated objects, while allowing passing in already instantiated objects as well.
Interpolation, which allows you to reference parts of configuration from elsewhere.
Specializations, which allow you to load additional configuration files based on particular combinations of selected defaults.
Maze Factory¶
Factory
wraps around
Hydra’s own instantiation functionality and adds features like
type hinting and checking, collections, configuration structure checks,
and the ability to take in already instantiated objects.
Using the factory, classes can accept
ConfigType
(or collections thereof,
CollectionOfConfigType
),
which stands for either an already instantiated object, or a dictionary
with configuration, which the factory will then use to build the instance.
A configuration dictionary consists of the _target_
attribute, along with any
arguments that the instantiated class takes, e.g. (here denoted in YAML, as you
will find it in many places across the framework):
_target_: maze.core.wrappers.maze_gym_env_wrapper.GymMazeEnv
env: CarRacing-v0
The factory then takes in the dictionary configuration (loaded from YAML using Hydra, or from anywhere else) and builds the object for you, checking that it is indeed of the expected type:
from maze.core.env.maze_env import MazeEnv
from maze.core.utils.factory import Factory
env = Factory(MazeEnv).instantiate({
"_target_": "maze.core.wrappers.maze_gym_env_wrapper.GymMazeEnv",
"env": "CarRacing-v0"
})
You can also pass in additional keyword arguments that the factory will then pass on to the constructor together with anything from the configuration dictionary:
from maze.core.env.maze_env import MazeEnv
from maze.core.utils.factory import Factory
env = Factory(MazeEnv).instantiate({
"_target_": "maze.core.wrappers.maze_gym_env_wrapper.GymMazeEnv",
}, env="CarRacing-v0")
If you pass in an already instantiated object instead of a configuration dictionary,
the instantiate
method will only check that it is of the expected type
and return it back. This allows components in Maze to be easily configurable
both from YAML/dictionaries and by passing in already instantiated objects.
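For example, building on the GymMazeEnv snippets above, passing an already constructed environment is also valid; the factory then only performs the type check and hands the object back:
from maze.core.env.maze_env import MazeEnv
from maze.core.utils.factory import Factory
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv

# the factory only verifies the type and returns the object unchanged
existing_env = GymMazeEnv(env="CarRacing-v0")
env = Factory(MazeEnv).instantiate(existing_env)
assert env is existing_env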
Interpolation¶
Hydra is based on OmegaConf and supports interpolation.
Interpolation allows us to reference and reuse a value defined elsewhere in the configuration, without repeating it. For example:
original:
  value: 1 # We want to reference this value elsewhere
some:
  other:
    structure: ${original.value} # Reference
A (somewhat limited) form of interpolation is used also in specializations described below.
Specializations¶
Specializations are parts of config that depend on multiple components.
For example, your wrapper configuration might depend on both
the environment chosen (e.g., gym_pixel_env
or gym_feature_env
) and
your model (e.g., default
or rnn
) – if using an RNN, you might want
to include ObservationStackWrapper, but its configuration also depends on the environment used.
Then, specializations come to the rescue. In your root config file, you can include a specialization like this (for illustrative purposes):
defaults:
  - env: gym_pixel_env
  - model: default
  - env_model: ${defaults.1.env}-${defaults.2.model}
    optional: true
Then, when you run this configuration with env=gym_pixel_env
and model=rnn
,
Hydra will look into the env_model
directory for configuration named gym_pixel_env-rnn.yaml
.
This allows you to capture the dependencies between these two components easily without
having to specify more overrides.
Specializations are well explained in Hydra docs here.
Where to Go Next¶
After understanding advanced Hydra configuration, you might want to:
Create custom Hydra configuration files for your project
Review the root configurations available in the Maze framework (as they are a good basis for your custom configurations)
Hydra: Overview explains the core concepts of configuration assembly, overrides and Maze runners controlling the CLI jobs. Hydra: Your Own Configuration Files shows how to get started with your own configuration in your custom projects. Hydra: Advanced Concepts explains other components and Hydra features that power Maze configuration under the hood, such as the Maze factory, Hydra interpolations and specializations.
Environment Rendering¶
In cases when reviewing the statistics and event logs provided by the event system does not provide enough insight, rendering the environment state in a particular time step is helpful.
Maze supports two rendering modes:
Rendering online during the rollout. This is possible simply by using the sequential rollout runner and setting the rendering flag to true with the following overrides:
runner=sequential runner.render=true
Rendering offline, in a Jupyter notebook, from trajectory data collected earlier. For environments which provide a Maze-compatible renderer, rollouts can be rendered and browsed retroactively. Review collecting and visualizing rollouts for more details. (Unfortunately, this mode is not yet supported for ordinary Gym envs – unless a custom Maze-compatible renderer is provided.)
Structured Environments¶
The basic reinforcement learning formulation assumes a single entity in an environment, enacting one action suggested by a single policy per step to fulfill exactly one task. We refer to this as a flat environment. A classic example for this is the cartpole balancing problem, in which a single entity attempts to fulfill the single task of balancing a cartpole. However, some problems incentivize or even require challenging these assumptions:
Single entity: Plenty of real-world scenarios motivate taking more than one acting entity into account. E.g.: optimizing delivery with a fleet of vehicles involves emergent effects and interdependencies between individual vehicles, such as the availability and suitability of orders for any given vehicle depending on the proximity and activity of other vehicles. Treating them in isolation from each other is inefficient and detrimental to the learning process. While it is possible to have a single agent represent and coordinate all vehicles simultaneously, it can be more efficient to train multiple agents to facilitate collaborative behaviour for one vehicle each.
One action suggested by a single policy: Some use cases, such as cutting raw material according to customer specifications with as little waste as possible, necessarily involve a well-defined sequence of actions. Stock-cutting involves (a) the selection of a piece of suitable size and (b) cutting it in an appropriate manner. We know that (a) is always followed by (b) and that the former is a necessary precondition for the latter. We can incorporate this information in our RL control loop to facilitate a faster learning process by enforcing the execution of policies in a certain order: e.g. select, then cut. This entails that while the action is still chosen by the policy, the policy itself is chosen by the environment. The sequential nature of such actions often lends itself to action masking to increase learning efficiency 1.
Exactly one task: Occasionally, the problem we want to solve cannot be neatly formulated as a single task, but consists of a hierarchy of tasks. This is exemplified by pick and place robots. They solve a complex task, which is reflected by the associated hierarchy of goals: the overall goal requires (a) reaching the target object, (b) grasping the target object, (c) moving the target object to the target location and (d) placing the target object safely in the target location. Solving this task cannot be reduced to a single goal.
Maze addresses these problems by introducing StructuredEnv
. We cover some of its applications and their broader context, including literature and examples, in a series of articles:
Flat Environments¶
Note
- Recommended reads prior to this article:
All instantiable environments in Maze are subclasses of StructuredEnv. Structured environments are discussed in Control Flows with Structured Environments, which we recommend reading prior to this article. Flat environments in our terminology are those utilizing a single agent and a single policy (i.e. a single actor) and conducting one action per step. Within Maze, flat environments are a special case of structured environments.
An exemplary implementation of a flat environment for the stock cutting problem can be found here.
Control Flow¶
Let’s revisit a classic depiction of a RL control flow first:

Simplified control flow within a flat scenario. The agent selects an action, the environment updates its state and computes the reward. There is no need to distinguish between different policies or agents since we only have one of each. actor_id()
should always return the same value.¶
A more general framework however needs to be able to integrate multiple agents and policies into its control flow. Maze does this by implementing actors, which are abstractions introduced in the RL literature to represent one policy applied on or used by one agent. The figure above collapses the concepts of policy, agent and actor into a single entity for the sake of simplicity. The actual control flow for a flat environment in Maze is closer to this:

More accurate control flow for a flat environment in Maze, showing how the actor mechanism integrates agent and policy. Dashed lines denote the exchange of information on demand, as opposed to passing it to or returning it from step()
.¶
A flat environment hence always utilizes the same actor, i.e. the same policy for the same agent. Due to the lack of other actors there is no need for the environment to ever update its active actor ID. The concept of actors is crucial to the flexibility of Maze, since it allows scaling up the number of agents, policies or both. This enables the application of RL to a wider range of use cases and the more efficient exploitation of properties of the respective domains.
Where to Go Next¶
Multi-Stepping¶
Note
- Recommended reads prior to this article:
We define multi-stepping as the execution of more than one action (or sub-step) in a single step. This is motivated by problem settings in which a certain sequence of actions is known a priori. In such cases incorporating this knowledge can significantly increase learning efficiency. The stock cutting problem poses an example: It is known, independently from the specifics of the environment’s state, that fulfilling a single customer order for a piece involves (a) picking a piece at least as big as the ordered item (b) cutting it to the correct size.
While it is not trivial to decide which items to pick for which orders and how to cut them, the sequence of piece selection before cutting is constant - there is no advantage to letting our agent figure it out by itself. Maze permits incorporating this sort of domain knowledge by enabling the selection and execution of more than one action in a single step. This is done by utilizing the actor mechanism to execute multiple policies in a fixed sequence.
In the case of the stock cutting problem two policies could be considered: “select” and “cut”. The piece selection action might be provided to the environment at the beginning of each step, after which the cutting policy - conditioned on the current state with the already selected piece - can be queried to produce an appropriate cutting action.
An implementation of a multi-stepping environment for the stock cutting problem can be found here.
Control Flow¶
In general, the control flow for multi-stepping environments involves at least two policies and one agent. It is easily possible, but not necessary, to include multiple agents in a multi-step scenario. The following image depicts a multi-step setup with one agent and an arbitrary number of sub-steps/policies.

Control flow within a multi-stepping scenario assuming a single agent. The environment keeps track of the active step and adjusts its policy key (via actor_id()
) accordingly. Dashed lines denote the exchange of information on demand, as opposed to passing it to or returning it from step()
.¶
When comparing this to the control flow depicted in the article on flat environments you’ll notice that here we consider several policies and therefore several actors - more specifically, in a setup with n sub-steps (or actions per step) we have at least n actors. Consequently the environment has to update its active actor ID, which is not necessary in flat environments.
Relation to Hierarchical RL¶
Hierarchical RL (HRL) describes a hierarchical formulation of reinforcement learning problems: tasks are broken down into (sequences of) subtasks, which are learned in a modular manner. Multi-stepping shares this property with HRL, since it also decomposes a task into a series of subtasks. Furthermore, the multi-stepping control flow bears strong similarity to the one for hierarchical RL - in fact, multi-stepping could be seen as a special kind of hierarchical RL with a fixed task sequence and a single level of hierarchy.
Relation to Auto-Regressive Action Distributions¶
Multi-stepping is closely related to auto-regressive action distributions (ARAD) as used in DeepMind's Grandmaster level in StarCraft II using multi-agent reinforcement learning. Both ARADs and multi-stepping are motivated by a lack of temporal coherency in the sequence of selected actions: if there is some necessary, recurring order of actions, it should be identified as quickly as possible.
ARADs still execute one action per step, but condition it on the previous state and action instead of the state alone. This allows them to be more sensitive towards such recurring patterns of actions. Multi-stepping allows incorporating domain knowledge about the correct order of actions or tasks without having to rely on learned auto-regressive policies, but depends on the environment to incorporate it. ARAD policies do not presuppose (and cannot make use of) any such prior knowledge.
ARADs are not explicitly implemented in Maze, but can be approximated. This can be done by including prior actions in the observations supplied to the agent, which should condition the used policy on those actions. If relevant domain knowledge is available, we recommend implementing multi-stepping instead.
Where to Go Next¶
Multi-Agent RL¶
Note
- Recommended prior to this article:
Multi-agent reinforcement learning (MARL) describes a setup in which several collaborating or competing agents suggest actions for at least one of an environment's acting entities 1 each. This introduces the additional complexity of emergent effects between those agents. Some problems require, or at least benefit from, deviating from a single-agent formulation, such as the vehicle routing problem, (video) games like Starcraft, traffic coordination, power systems and smart grids and many others.
Maze supports multi-agent learning via structured environments. In order to make a StructuredEnv compatible with such a setup, it needs to keep track of the activities of each agent internally. How this is done, and the sequence in which agents enact their actions, is entirely up to the environment. As is customary for a structured environment, it is required to provide the ID of the active actor via actor_id() (see here for more information on the distinction between actor and agent). There are no further prerequisites to using multiple agents with an environment.
Control Flow¶
It is easily possible, but not necessary, to include multiple policies in a multi-agent scenario. The control flow with multiple agents and a single policy can be summarized like this:

Control flow within a multi-agent scenario assuming a single policy. Dashed lines denote the exchange of information on demand, as opposed to passing it to or returning it from step()
.¶
When comparing this to the control flow depicted in the article on flat environments you’ll notice that here we consider several agents and therefore several actors - more specifically, in a setup with n agents we have at least n actors. Consequently the environment has to update its active actor ID, which is not necessary in flat environments.
Where to Go Next¶
Gym-style flat environments as a special case of structured environments.
Multi-stepping applies the actor mechanism to enact several policies in a single step.
Hierarchical RL by chaining and nesting tasks via policies.
- 1
We use “acting entity” in this context in the sense of something that acts or is manipulated to act in order to solve a given problem. E.g.: In the case of the vehicle routing problem it is neither desired nor should it be possible for an agent to change the layout of the map or how orders are generated, since these factors constitute a part of the problem setting. Instead, the goal is to learn a vehicle routing behaviour that is optimal w.r.t. processing the generated orders - the vehicles are acting entities. In MARL settings it is customary to map one agent to one manipulable entity, hence the term “agent” itself is often used to refer to the manipulable entity it represents.
Hierarchical RL¶
Note
Reinforcement learning is prone to scaling and generalization issues. With large action spaces, it takes a lot of time and resources for agents to learn the desired behaviour successfully. Hierarchical reinforcement learning (HRL) attempts to resolve this by decomposing tasks into a hierarchy of subtasks. This addresses the curse of dimensionality by mitigating or avoiding the exponential growth of the action space.
Beyond reducing the size of the action space HRL also provides an opportunity for easier generalization. Through its modularization of tasks, learned policies for super-tasks may be reused even if the underlying sub-tasks change. This enables transfer learning between different problems in the same domain.
Note that the action space can also be reduced with action masking as used in e.g. StarCraft II: A New Challenge for Reinforcement Learning, which indicates the invalidity of certain actions in the observations provided to the agent. HRL and action masking can be used in combination. The latter doesn't address the issue of generalization and transferability though. Whenever possible and sensible, we recommend using both.
Motivating Example¶
Consider a pick and place robot. It is supposed to move to an object, pick it up, move it to a different location and place it there. It consists of different segments connected via joints that enable free movement in three dimensions and a gripper able to grasp and hold the target object. The gripper may resemble a pair of tongs or be more complex, e.g. be built to resemble a human hand.
A naive approach would present all possible actions, i.e. rotating the arm segments and moving the gripper, in a single action space. If the robot’s arm segments can move in \(n\) and its gripper in \(m\) different ways, a flat action space would consist of \(n * m\) different actions.
The task at hand can be intuitively represented as a hierarchy however: The top-level task is composed of the task sequence of “move”, “grasp”, “move”, “place”. This corresponds to a top-level policy choosing one of three sub-policies enacting primitive actions, i.e. arm or gripper movements. This enables the reusability of individual (sub-)policies for other tasks in the same domain.
Control Flow¶

Control flow within a HRL scenario assuming a single agent. The task hierarchy is built implicitly in step()
. Dashed lines denote the exchange of information on demand, as opposed to passing it to or returning it from step()
.¶
The control flow for HRL scenarios doesn’t obviously reflect the hierarchical aspect. This is because the definition and execution of the task hierarchy happens implicitly in step()
: the environment determines which task is currently active and which task should be active at the end of the current step. This allows for an arbitrarily flexible and complex task dependency graph. The possibility to implement a different ObservationConversionInterface
and ActionConversionInterface
for each policy makes it possible to tailor actions and observations to the respective task.
This control flow bears strong similarity to the one for multi-stepping - in fact, multi-stepping could be seen as a special kind of hierarchical RL with a fixed task sequence and a single level of hierarchy.
Where to Go Next¶
Beyond Flat Environments with Actors¶
StructuredEnv
bakes the concept of actors into its control flow.
An actor describes a specific policy that is applied on - or used by - a specific agent. They are uniquely identified by the agent’s ID and the policy’s key. From a more abstract perspective an actor describes which task should be done (the policy) for whom (the agent respectively the acting entities it represents). In the case of the vehicle routing problem an agent might correspond to a vehicle and a policy might correspond to a task like “pick an order” or “drive to point X”. A StructuredEnv
has exactly one active actor at any time. There can be an arbitrary number of actors. They can be created and destroyed dynamically by the environment, by specifying their ID or marking them as done, respectively. Their lifecycles are thus flexible; they don't have to be available throughout the entirety of the environment's lifecycle.

Overview of control flow with structured environments. Note that the line denoting the communication of the active actor ID is dashed because it is not returned by step()
, but instead queried via actor_id()
.¶
Decoupling actions from steps
The actor mechanism decouples actions from steps, thereby allowing environments to query actions for their actors on demand, not just after a step has been completed. The cardinality between involved actors and steps is therefore up to the environment - one actor can be active throughout multiple steps, one step can utilize several actors, both or neither (i.e. exactly one actor per step).
The discussed stock cutting problem for example might have policies with the keys “selection” or “cutting”, both of which take place in a single step; the pick and place problem might use policies with the keys “reach”, “grasp”, “move” or “place”, all of which last one to several steps.
Support of multiple agents and policies
A multi-agent scenario can be realized by defining actor IDs that reflect the desired number of agents. Several actors can use the same policy, which infers the recommended actions for the respective agents. Note that it is only reasonable to add a new policy if the underlying process is distinct enough from the activity described by the available policies.
In the case of the vehicle routing problem, using separate policies for the activities of "fetch item" and "deliver item" is likely not warranted: even though they describe different phases of the environment lifecycle, they describe virtually the same activity. While Maze provides default policies, you are encouraged to write a customized policy that fits your use case better - see Policies, Critics and Agents for more information.
Selection of active actor
Every StructuredEnv
is required to implement actor_id()
, which returns the ID of the currently active actor. An environment with a single actor, e.g. a flat Gym environment, may return a single-actor signature such as (0, 0). At any time there has to be exactly one active actor ID.
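As an illustration only (the environment stub below is hypothetical, and the exact return type and the ordering of policy key and agent ID are assumptions based on the (0, 0) signature above), a two-policy, single-agent environment could switch its active actor like this:
from typing import Tuple, Union


class TwoStepCuttingEnv:
    """Hypothetical, heavily trimmed environment stub."""

    def __init__(self):
        self._sub_step_index = 0  # toggled inside step() (not shown here)

    def actor_id(self) -> Tuple[Union[str, int], int]:
        """Return the currently active actor; the (policy key, agent ID) ordering is an assumption."""
        policy_key = "selection" if self._sub_step_index == 0 else "cutting"
        return policy_key, 0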
Policy-specific space conversion
Since different policies may benefit from or even require a different preprocessing of their actions and/or observations (especially, but not exclusively, for action masking), Maze requires the specification of a corresponding ActionConversionInterface
and ObservationConversionInterface
class for each policy. This permits tailoring actions and/or observations to the mode of operation of the relevant policy.
The actor concept and the mechanisms supporting it are thus capable of
representing an arbitrary number of agents;
identifying which policy should be applied for which agent via the provision of actor_id();
representing an arbitrary number of actors with flexible lifecycles that may differ from their environment's;
supporting an arbitrary nesting of policies (and - in further abstraction - tasks);
selecting actions via the policy fitting the currently active actor;
preprocessing actions and observations w.r.t. the currently used actor/policy.
This makes it possible to bypass the three restrictions laid out at the outset.
Where to Go Next¶
Read about some of the patterns and capabilities possible with structured environments:
The underlying communication pathways are identical for all instances of StructuredEnv
. Multi-stepping, multi-agent, hierarchical or other setups are all particular manifestations of the structured environment and its actor mechanism. They are orthogonal to each other and can be used in any combination.
- 1
Action masking is used for many problems with large action spaces which would otherwise be intractable, e.g. StarCraft II: A New Challenge for Reinforcement Learning.
High-level API: RunContext¶
This page describes the RunContext, a high-level API for training, rollout and evaluation of Maze agents in plain Python.
Motivation¶
Maze utilizes Hydra to facilitate a powerful configuration mechanism boosting developers’ flexibility in their creation of reinforcement learning projects. Hydra however is geared towards command line usage, so these benefits are not accessible when working with individual Maze components (like Trainer
, Wrapper
, MazeEnv
, …) and composing them manually in Python.
For example, it is not possible to generate components directly from the provided configuration modules. This would however be quite useful, as it allows loading pre-configured (sets of) components. This can be exemplified by the pixel_obs wrapper configuration module, which defines several wrappers useful for the preprocessing, normalization and logging of pixel-space observations. Via the CLI this can be loaded trivially via ... wrappers=pixel_obs ...
- yet there is no obvious way to leverage Maze's Hydra-based configuration system from within a Python script. This also affects other features, such as instantiating objects from a YAML-/dict-based configuration object (which can be very convenient with increasing experiment or application complexity).
This motivates the introduction of RunContext
, a high-level API for training, rollout and evaluation. When working with Maze from within a Python script (as opposed to via the CLI with maze-run
) we highly recommend that you start with RunContext
: It requires very little configuration overhead to get things rolling, yet offers a lot of flexibility if you require additional configuration. While there might be cases where this is not sufficient, we expect that this would not happen too frequently.
Comparison with the CLI (maze-run)¶
We designed RunContext
to be largely congruent with the CLI, i.e. maze-run
. It utilizes Hydra internally and offers the same base functionality, but differs in a couple of ways - RunContext
…
… is a recent addition and still lacks support for a number of capabilities: rolling out a policy is not fully supported yet, and neither are RLlib integration and some of the more advanced Hydra features like multi-runs. These issues (particularly rollout support) are on our todo list however and will be addressed shortly.
… accepts (most) components specified as instantiated complex Python objects, configuration dictionaries, or configuration module names. In contrast, the CLI accepts components specified as configuration module names or as primitive values. As of now, however, this entails that once instantiated Python objects are passed, the customary experiment configuration cannot be logged anymore, due to a lack of knowledge about the corresponding configuration dictionary. This issue is on our roadmap.
… offers a few additional options for convenience's sake, such as output suppression via silent=True or setting the working directory via run_dir='...'.
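For instance, these convenience options can be combined like this (illustrative only):
rc = RunContext(
    env=lambda: GymMazeEnv('CartPole-v0'),
    silent=True,             # suppress console output
    run_dir='my_experiment'  # write all run outputs to this directory
)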
Usage¶
This section aims to convey the principal ideas and features of RunContext
. For further explanation and a detailed discussion of the exposed interface as well as auxiliary components and utilities see here.
Initialization
As mentioned previously, the RunContext
API is largely congruent with the maze-run
CLI. Consequently the initialization can be done in a similar fashion. Here is one example with a particular training run configuration using the CLI, RunContext
initialized with configuration module names and RunContext
initialized with a mix of configuration module names and complex Python objects.
maze-run -cn conf_train env.name=CartPole-v0 algorithm=a2c model=vector_obs critic=template_state
rc = RunContext(
algorithm="a2c",
overrides={"env.name": "CartPole-v0"},
model="vector_obs",
critic="template_state"
)
alg_config = A2CAlgorithmConfig(
n_epochs=1,
epoch_length=25,
patience=15,
critic_burn_in_epochs=0,
n_rollout_steps=100,
lr=0.0005,
gamma=0.98,
gae_lambda=1.0,
policy_loss_coef=1.0,
value_loss_coef=0.5,
entropy_coef=0.00025,
max_grad_norm=0.0,
device='cpu',
rollout_evaluator=RolloutEvaluator(
eval_env=SequentialVectorEnv([lambda: GymMazeEnv("CartPole-v0")]),
n_episodes=1,
model_selection=None,
deterministic=True
)
)
rc = RunContext(
algorithm=alg_config,
overrides={"env.name": "CartPole-v0"},
model="vector_obs",
critic="template_state"
)
Environments cannot be passed in instantiated form, but instead as callable environment factories:
rc = RunContext(env=lambda: GymMazeEnv('CartPole-v0'))
As with the CLI, any attribute in the configuration hierarchy can be overridden, not just the explicitly exposed top-level attributes like env
or algorithm
. This can be achieved using the overrides
dictionary as seen above for "env.name"
. It is also possible to pass complex values:
policy_composer_config = {
    '_target_': 'maze.perception.models.policies.ProbabilisticPolicyComposer',
    'networks': [{
        '_target_': 'maze.perception.models.built_in.flatten_concat.FlattenConcatPolicyNet',
        'non_lin': 'torch.nn.Tanh',
        'hidden_units': [256, 256]
    }],
    "substeps_with_separate_agent_nets": [],
    "agent_counts_dict": {0: 1}
}
rc = RunContext(overrides={"model.policy": policy_composer_config})

policy_composer = ProbabilisticPolicyComposer(
    action_spaces_dict=env.action_spaces_dict,
    observation_spaces_dict=env.observation_spaces_dict,
    distribution_mapper=DistributionMapper(action_space=env.action_space, distribution_mapper_config={}),
    networks=[{
        '_target_': 'maze.perception.models.built_in.flatten_concat.FlattenConcatPolicyNet',
        'non_lin': 'torch.nn.Tanh',
        'hidden_units': [222, 222]
    }],
    substeps_with_separate_agent_nets=[],
    agent_counts_dict={0: 1}
)
rc = RunContext(overrides={"model.policy": policy_composer})
Note that by design, configuration module name resolution is not triggered for attributes in overrides. It is however necessary for some of the explicitly exposed arguments. We therefore strongly recommend passing an argument explicitly if it is explicitly exposed - otherwise a correct assembly of the underlying configuration structure cannot be guaranteed. E.g. if you want to pass an instantiated algorithm configuration like
alg_config = A2CAlgorithmConfig(
n_epochs=1,
epoch_length=25,
deterministic_eval=False,
eval_repeats=2,
patience=15,
critic_burn_in_epochs=0,
n_rollout_steps=100,
lr=0.0005,
gamma=0.98,
gae_lambda=1.0,
policy_loss_coef=1.0,
value_loss_coef=0.5,
entropy_coef=0.00025,
max_grad_norm=0.0,
device='cpu'
)
then pass it via the exposed argument, i.e.
rc = RunContext(algorithm=alg_config)
rather than via
rc = RunContext(overrides={"algorithm": alg_config})
Further examples of how to use Maze with both the CLI and the high-level API can be found here.
Training
Training is straightforward with an initialized RunContext
:
rc.train()
# Or with a specified number of epochs:
rc.train(n_epochs=10)
train()
passes on all accepted arguments to the instantiated trainer. At the very least, the number of epochs to train can be specified; everything else depends on the arguments that the corresponding trainer exposes. See here for further information on trainers in Maze. If no arguments are specified, Maze uses the default values included in the loaded configuration.
Rollout
Rollouts are not supported directly yet, but can be implemented manually:
env_factory = lambda: GymMazeEnv('CartPole-v0')
rc = run_context.RunContext(env=lambda: env_factory())
rc.train()
# Run trained policy.
env = env_factory()
obs = env.reset()
for i in range(10):
action = rc.compute_action(obs)
obs, rewards, dones, info = env.step(action)
Evaluation
To evaluate a trained policy, use the integrated evaluation functionality.
rc = RunContext(env=lambda: GymMazeEnv('CartPole-v0'))
rc.train()
rc.evaluate()
Customizing Core and Maze Envs¶
Whenever simulations reach a certain level of complexity or (ideally) already exist, but have been developed for purposes other than the RL scenario, the Gym-style environment interfaces might not be sufficient anymore to meet all technical requirements (e.g., the state is too complex to be represented as a simple Gym-style numpy array). In the case of existing simulations, such interfaces were probably not taken into account at all, and we have to deal with simulation-specific interfaces and objects.
To cope with such situations Maze introduces a few additional concepts which are summarized in the figure below. Before we continue with some practical examples emphasizing why this structure is useful for environment customization and convenient experimentation, we first describe the concepts and components in a bit more detail. You can also find these components in the reference documentation.

Observation- and ActionConversionInterfaces:
Maze introduces MazeStates and MazeActions, extending Observations and Actions (represented as numerical arrays) to simulation specific generic objects. This grants more freedom in choosing appropriate environment-specific representations to separate the data model from the numerical representation, which in turn greatly simplifies the development and readability of environment and engineered baseline agent implementations.
Action: the Gym-style, machine readable action.
MazeAction: the simulation specific representation of the action (e.g., an arbitrary Python object).
ActionConversionInterface: maps agent actions to environment (simulation) specific MazeActions and defines the respective Gym action space.
Observation: the Gym-style, machine readable observation (e.g., a numpy array).
MazeState: the simulation specific representation of the observation (e.g. an arbitrary Python object).
ObservationConversionInterface: maps simulation MazeStates to Gym-style observations and defines the respective Gym observation space.
Core and Maze Environments:
The same distinction is carried out for environments.
CoreEnv: this is the central environment, which could also be seen as the simulation, forming the basis for actual, RL-trainable environments. CoreEnvs accept MazeAction objects as input and yield MazeState objects as response.
CoreEnv Config: configuration parameters for the CoreEnvironment (the simulation).
MazeEnv: wraps the CoreEnvs as a Gym-style environment in a reusable form, by utilizing the interfaces (mappings) from the MazeState to the observations space and from the MazeAction to the action space.
List of Features¶
Introducing the concepts outlined above allows the following:
Implement and maintain observations and actions as arbitrarily complex, simulation specific objects (MazeStates and MazeActions). In many cases sticking to Gym spaces gets quite cumbersome and makes the development processes unnecessarily complex.
Easily experiment with different observation and action spaces simply by switching the Observation- and ActionConversionInterface.
Train agents based on existing 3rd party simulations (environments) by implementing the Observation- and ActionConversionInterfaces (of course this also requires a Python API to be available).
Easy configuration of the CoreEnv (simulation).
Example: Core- and MazeEnv Configuration¶
The config snippet below shows an example environment configuration for the built-in cutting-2d environment.
# @package env
_target_: maze_envs.logistics.cutting_2d.env.maze_env.Cutting2DEnvironment
# parametrizes the core environment (simulation)
core_env:
  max_pieces_in_inventory: 1000
  raw_piece_size: [100, 100]
  demand_generator:
    _target_: mixed_periodic
    n_raw_pieces: 3
    m_demanded_pieces: 10
    rotate: True
  # defines how rewards are computed
  reward_aggregator:
    _target_: maze_envs.logistics.cutting_2d.reward.default.DefaultRewardAggregator
# defines the conversion of actions to executions
action_conversion:
  - _target_: maze_envs.logistics.cutting_2d.space_interfaces.action_conversion.dict.ActionConversion
    max_pieces_in_inventory: 1000
# defines the conversion of states to observations
observation_conversion:
  - _target_: maze_envs.logistics.cutting_2d.space_interfaces.observation_conversion.dict.ObservationConversion
    max_pieces_in_inventory: 1000
    raw_piece_size: [100, 100]
The config defines:
which MazeEnv to use,
the parametrization of the CoreEnv including reward computation,
how MazeStates are converted to observations and
how actions are converted to MazeActions.
All components together compose a concrete RL problem instance as a trainable environment. In particular, whenever you would like to experiment with specific aspects of your RL problem (e.g. tweak the observation space) you only have to exchange the respective part of your environment configuration.
Note
As showing concrete implementations of a CoreEnv or the Observation- and ActionConversionInterfaces is beyond the scope of this page we refer to the Maze - step by step tutorial for details.
Where to Go Next¶
You might want to get a bigger picture of the Maze environment hierarchy.
Learn how to customize with environment wrappers.
Learn about reward customization and shaping.
See the special wrappers for observation pre-processing and observation normalization.
Customizing / Shaping Rewards¶
In a reinforcement learning problem the overall goal is defined via an appropriate reward signal. In particular, reward is attributed to certain, problem specific key events and the current environment state. During the training process the agent then has to discover a policy (behaviour) that maximizes the cumulative future reward over time. In case of a meaningful reward signal such a policy will be able to successfully address the decision problem at hand.

From a technical perspective, reward customization in Maze is based on the environment state
in combination with the general event system
(which also serves other purposes), and is implemented via
RewardAggregators
.
In summary, after each step, the reward aggregator gets access to the environment state,
along with all the events the environment dispatched
during the step (e.g., a new item was replenished to inventory), and can then calculate arbitrary rewards
based on these. This means it is possible to modify and shape the reward signal based on
different events and their characteristics by plugging in different reward aggregators
without further modifying the environment.
Below we show how to get started with reward customization by configuring the CoreEnv and by implementing a custom reward.
List of Features¶
Maze event-based reward computation allows the following:
Easy experimentation with different reward signals.
Implementation of custom rewards without the need to modify the original environment (simulation).
Computing simple rewards based on environment state, or using the full flexibility of observing all events from the last step.
Combining multiple different objectives into one multi-objective reward signal.
Computation of multiple rewards in the same env, each based on a different set of components (multi agent).
Configuring the CoreEnv¶
The following config snippet shows how to specify reward computation for a CoreEnv
via the field reward_aggregator
.
You only have to set the reference path of the RewardAggregator and reward computation will be
carried out accordingly in all experiments based on this config.
For further details on the remaining entries of this config you can read up on how to customize Core- and MazeEnvs.
# @package env
_target_: maze_envs.logistics.cutting_2d.env.maze_env.Cutting2DEnvironment
# parametrizes the core environment (simulation)
core_env:
  max_pieces_in_inventory: 1000
  raw_piece_size: [100, 100]
  demand_generator:
    _target_: mixed_periodic
    n_raw_pieces: 3
    m_demanded_pieces: 10
    rotate: True
  # defines how rewards are computed
  reward_aggregator:
    _target_: maze_envs.logistics.cutting_2d.reward.default.DefaultRewardAggregator
# defines the conversion of actions to executions
action_conversion:
  - _target_: maze_envs.logistics.cutting_2d.space_interfaces.action_conversion.dict.ActionConversion
    max_pieces_in_inventory: 1000
# defines the conversion of states to observations
observation_conversion:
  - _target_: maze_envs.logistics.cutting_2d.space_interfaces.observation_conversion.dict.ObservationConversion
    max_pieces_in_inventory: 1000
    raw_piece_size: [100, 100]
Implementing a Custom Reward¶
This section contains a concrete implementation of a reward aggregator for the built-in cutting environment (which bases its reward solely on the events from the last step, as that is more suitable than checking current environment state).
In summary, the reward aggregator first declares which events it is interested in (the get_interfaces method). At the end of the step, after all the events have been accumulated, the reward aggregator is asked to calculate the reward (the summarize_reward method). This is the core of the reward computation – you can see how the events are queried and the reward assembled based on their values.
"""Assigns negative reward for relying on raw pieces for delivering an order."""
from typing import List, Optional
from maze.core.annotations import override
from maze.core.env.maze_state import MazeStateType
from maze.core.events.pubsub import Subscriber
from maze_envs.logistics.cutting_2d.env.events import InventoryEvents
from maze.core.env.reward import RewardAggregatorInterface
class RawPieceUsageRewardAggregator(RewardAggregatorInterface):
"""
Reward scheme for the 2D cutting env penalizing raw piece usage.
:param reward_scale: Reward scaling factor.
"""
def __init__(self, reward_scale: float):
super().__init__()
self.reward_scale = reward_scale
@override(Subscriber)
def get_interfaces(self) -> List:
"""
Specification of the event interfaces this subscriber wants to receive events from.
Every subscriber must implement this configuration method.
:return: A list of interface classes.
"""
return [InventoryEvents]
@override(RewardAggregatorInterface)
def summarize_reward(self, maze_state: Optional[MazeStateType] = None) -> float:
"""
Summarize reward based on the orders and pieces to cut, and return it as a scalar.
:param maze_state: Not used by this reward aggregator.
:return: the summarized scalar reward.
"""
# iterate replenishment events and assign reward accordingly
reward = 0.0
for _ in self.query_events(InventoryEvents.piece_replenished):
reward -= 1.0
# rescale reward with provided factor
reward *= self.reward_scale
return reward
When adding a new reward aggregator you (1) have to implement the
RewardAggregatorInterface
and
(2) make sure that it is accessible within your Python path.
Besides that you only have to provide the reference path of the reward_aggregator
to use:
reward_aggregator:
  _target_: my_project.custom_reward.RawPieceUsageRewardAggregator
  reward_scale: 0.1
Where to Go Next¶
Additional options for customizing environments can be found under the entry “Environment Customization” in the sidebar.
For further technical details we highly recommend to read up on the Maze event system.
To see another application of the event system you can read up on the Maze logging system.
Environment Wrappers¶
Environment wrappers are an elegant way to modify and customize environments for RL training and experimentation. As the name already suggests, they wrap an existing environment and allow you to modify different parts of the agent-environment interaction loop including observations, actions, the reward or any other internals of the environment itself.

To gain access to the functionality of Maze environment wrappers you simply have to add a wrapper stack to your Hydra configuration. To get started, just copy one of our Hydra config snippets below or use them directly within Python.
Note
Wrappers have already been introduced in OpenAI's Gym and elegantly expose methods and attributes of all nested envs. However, wrapping destroys the class hierarchy, so querying the base classes is not straightforward. Maze environment wrappers fix the behaviour of isinstance() for arbitrarily nested wrappers.
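The following minimal sketch illustrates this behaviour; it assumes a GymMazeEnv wrapped by the TimeLimitWrapper as in the examples below.
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
from maze.core.wrappers.time_limit_wrapper import TimeLimitWrapper

# build a small wrapper stack
env = GymMazeEnv("CartPole-v0")
env = TimeLimitWrapper.wrap(env, max_episode_steps=1000)

# the wrapped env still reports the type of the underlying environment
assert isinstance(env, GymMazeEnv)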
List of Features¶
Maze environment wrappers allow the following:
Easy customization of environments: (observations, actions, reward, internals)
Convenient development of concepts such as observation pre-processing and observation normalization.
Preserves the class hierarchy of nested environments.
Example 1: Customizing Environments with Wrappers¶
To make use of Maze environment wrappers just add a config snippet as listed below.
# @package wrappers
maze.core.wrappers.random_reset_wrapper.RandomResetWrapper:
min_skip_steps: 0
max_skip_steps: 100
maze.core.wrappers.time_limit_wrapper.TimeLimitWrapper:
max_episode_steps: 1000
Details:
It applies the specified wrappers in the defined order from top to bottom.
Adds a RandomResetWrapper randomly skipping the first 0 to 100 frames
Adds a TimeLimitWrapper restricting the maximum temporal horizon of the environment
Example 2: Using Custom Wrappers¶
In case the built-in wrappers provided with Maze are not sufficient for your use case you can of course implement and add additional custom wrappers.
# @package wrappers
my_project.wrappers.custom_wrapper.CustomObserverWrapper:
parameter_1: 0.5
parameter_2: 1000
When adding a new environment wrapper you (1) have to implement the Wrapper interface and (2) make sure that it is accessible within your Python path. Besides that, you only have to provide the reference path of the wrapper to use, plus any parameters the wrapper initializer needs. A minimal sketch of such a custom wrapper is shown below.
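This is only a sketch of what the CustomObserverWrapper referenced in the config above could look like. The class and parameter names are taken from that snippet, the reward-scaling behaviour is made up purely for illustration, and depending on the Maze version the Wrapper base class may require implementing further methods.
from maze.core.env.maze_env import MazeEnv
from maze.core.wrappers.wrapper import Wrapper


class CustomObserverWrapper(Wrapper[MazeEnv]):
    """Illustrative wrapper that post-processes the reward returned by the wrapped environment."""

    def __init__(self, env: MazeEnv, parameter_1: float, parameter_2: int):
        Wrapper.__init__(self, env)
        self.parameter_1 = parameter_1
        self.parameter_2 = parameter_2

    def step(self, action):
        """Forward the step call to the wrapped env and modify the result."""
        obs, reward, done, info = self.env.step(action)
        return obs, reward * self.parameter_1, done, info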
Example 3: Plain Python Configuration¶
If you are not working with the Maze command line interface but still want to use wrappers directly within Python you can start with the code snippet below.
"""Contains an example showing how to add wrappers."""
from maze.core.wrappers.random_reset_wrapper import RandomResetWrapper
from maze.core.wrappers.time_limit_wrapper import TimeLimitWrapper
# instantiate the environment
env = ...
# apply wrappers
env = RandomResetWrapper.wrap(env, min_skip_steps=0, max_skip_steps=100)
env = TimeLimitWrapper.wrap(env, max_episode_steps=1000)
Built-in Wrappers¶
Maze already comes with built-in environment wrappers. You can find a list and further details on the functionality of the respective wrappers in the reference documentation.
For the observation pre-processing and observation normalization wrappers we also provide more extensive documentation in the dedicated sections below.
Where to Go Next¶
For further details please see the reference documentation.
Special wrapper for observation pre-processing and observation normalization.
You might also want to read up on the Maze environment hierarchy.
Observation Pre-Processing¶
Sometimes it is required to pre-process or modify observations before passing them through our policy or value networks. This might be, for example, the conversion of a three-channel RGB image to a single-channel grayscale image or the one-hot encoding of a categorical observation such as the current month into a feature vector of length 12. Maze supports observation pre-processing via the PreProcessingWrapper.

This means that to gain access to observation pre-processing and to the features listed below you simply have to add the PreProcessingWrapper to your wrapper stack in your Hydra configuration.
To get started you can also just copy one of our Hydra config snippets or use it directly from Python.
List of Features¶
Maze observation pre-processing supports:
Gym dictionary observation spaces
Individual pre-processors for all sub-observations of these dictionary spaces
Cascaded pre-processing pipelines for a single observation (e.g. first convert an image to grayscale before inserting an additional dimension from the left for CNN processing)
The option to keep both, the original as well as the pre-processed observation
Implicit update of affected observation spaces according to the pre-processor functionality
Example 1: Observation Specific Pre-Processors¶
This example adds pre-processing to two observations (rgb_image and categorical_feature) contained in a dictionary observation space.
# @package wrappers
maze.core.wrappers.observation_preprocessing.preprocessing_wrapper.PreProcessingWrapper:
pre_processor_mapping:
- observation: rgb_image
_target_: maze.preprocessors.Rgb2GrayPreProcessor
keep_original: true
config:
num_flatten_dims: 2
- observation: categorical_feature
_target_: maze.preprocessors.OneHotPreProcessor
keep_original: false
config: {}
Details:
Adds a grayscale-converted version of observation rgb_image to the observation space but also keeps the original observation.
Replaces the observation categorical_feature with a one-hot encoded version and drops the original observation.
Observation space after pre-processing: {rgb_image, rgb_image-rgb2gray, categorical_feature-one_hot}
Example 2: Cascaded Pre-Processing¶
This example shows how to apply multiple pre-processors in sequence to a single observation.
# @package wrappers
maze.core.wrappers.observation_preprocessing.preprocessing_wrapper.PreProcessingWrapper:
pre_processor_mapping:
- observation: rgb_image
_target_: maze.preprocessors.Rgb2GrayPreProcessor
keep_original: false
config:
rgb_dim: -1
- observation: rgb_image-rgb2gray
_target_: maze.preprocessors.ResizeImgPreProcessor
keep_original: false
config:
target_size: [96, 96]
transpose: false
- observation: rgb_image-rgb2gray-resize_img
_target_: maze.preprocessors.UnSqueezePreProcessor
keep_original: false
config:
dim: -3
Details:
Converts observation rgb_image into a grayscale image, then scales this grayscale image to a size of 96 x 96 pixels and finally inserts an additional dimension at index -3 to prepare the observation for CNN processing.
None of the intermediate observations is kept as we are only interested in the final result here.
Observation space after pre-processing: {rgb_image-rgb2gray-resize_img}
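If you prefer working directly in Python, the same cascaded pipeline can be passed to the PreProcessingWrapper as a list of mappings (analogous to Example 4 below). The environment is only sketched here and is assumed to expose an rgb_image observation; adapt the key to your own environment.
from maze.core.wrappers.observation_preprocessing.preprocessing_wrapper import PreProcessingWrapper

# the same three cascaded pre-processors as in the YAML snippet above
pre_processor_mapping = [
    {"observation": "rgb_image",
     "_target_": "maze.preprocessors.Rgb2GrayPreProcessor",
     "keep_original": False,
     "config": {"rgb_dim": -1}},
    {"observation": "rgb_image-rgb2gray",
     "_target_": "maze.preprocessors.ResizeImgPreProcessor",
     "keep_original": False,
     "config": {"target_size": [96, 96], "transpose": False}},
    {"observation": "rgb_image-rgb2gray-resize_img",
     "_target_": "maze.preprocessors.UnSqueezePreProcessor",
     "keep_original": False,
     "config": {"dim": -3}},
]

# a Maze environment yielding an "rgb_image" observation (assumption, plug in your own)
env = ...
env = PreProcessingWrapper.wrap(env, pre_processor_mapping=pre_processor_mapping)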
Example 3: Using Custom Pre-Processors¶
In case the built-in pre-processors provided with Maze are not sufficient for your use case you can of course implement and add additional custom processors.
# @package wrappers
maze.core.wrappers.observation_preprocessing.preprocessing_wrapper.PreProcessingWrapper:
pre_processor_mapping:
- observation: rgb_image
_target_: my_project.preprocessors.custom.CustomPreProcessor
keep_original: true
config:
num_flatten_dims: 2
When adding a new pre-processor you (1) have to implement the PreProcessor interface and (2) make sure that it is accessible within your Python path. Besides that you only have to provide the reference path of the pre-processor to use.
Observations will be tagged with the filename of your custom preprocessor (e.g. rgb_image -> rgb_image-custom).
Example 4: Plain Python Configuration¶
If you are not working with the Maze command line interface but still want to reuse observation pre-processing directly within Python you can start with the code snippet below.
"""Contains an example showing how to use observation pre-processing directly from python."""
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
from maze.core.wrappers.observation_preprocessing.preprocessing_wrapper import PreProcessingWrapper
# this is the pre-processor config as a python dict
config = {
"pre_processor_mapping": [
{"observation": "observation",
"_target_": "maze.preprocessors.Rgb2GrayPreProcessor",
"keep_original": False,
"config": {"rgb_dim": -1}},
]
}
# instantiate a maze environment
env = GymMazeEnv("CarRacing-v0")
# wrap the environment for observation pre-processing
env = PreProcessingWrapper.wrap(env, pre_processor_mapping=config["pre_processor_mapping"])
# after this step the training env yields pre-processed observations
pre_processed_obs = env.reset()
Built-in Pre-Processors¶
Maze already provides built-in pre-processors. You can find a list and further details on the functionality of the respective processors in the reference documentation.
Where to Go Next¶
After pre-processing your observations you might also want to normalize them for efficient neural network processing using the ObservationNormalizationWrapper.
Learn about more general environment wrappers.
Observation Normalization¶
For efficient RL training it is crucial that the inputs (e.g. observations) to our models (e.g. policy and value networks) follow a certain distribution and exhibit values living within a certain range. To ensure this precondition Maze provides general and customizable functionality for normalizing the observations returned by the respective environments via the ObservationNormalizationWrapper.

This means to gain access to observation normalization and to the features listed below you simply have to add the ObservationNormalizationWrapper to your wrapper stack in your Hydra configuration.
To get started you can also just copy one of our Hydra config snippets below or use it directly within Python.
List of Features¶
So far observation normalization supports:
Different normalization strategies (e.g., mean-zero-std-one, range [0, 1], …)
Estimating normalization statistics from observations collected by interacting with the respective environment (prior to training)
Providing an action sampling policy for collecting these normalization statistics
Manual specification of normalization statistics in case they are known beforehand
Excluding observations such as action masks from normalization
Preserving these statistics for continuing a training run, subsequent rollouts or deploying an agent
Gym dictionary observation spaces
Extendability with custom observation normalization strategies on the fly
As not all of the features listed above might be required right from the beginning you can find Hydra config examples with increasing complexity below.
Example 1: Normalization with Estimated Statistics¶
This example applies default observation normalization to all observations with statistics automatically estimated via sampling.
# @package wrappers
maze.core.wrappers.observation_normalization.observation_normalization_wrapper.ObservationNormalizationWrapper:
# default behaviour
default_strategy: maze.normalization_strategies.MeanZeroStdOneObservationNormalizationStrategy
default_strategy_config:
clip_range: [~, ~]
axis: ~
default_statistics: ~
statistics_dump: statistics.pkl
sampling_policy:
_target_: maze.core.agent.random_policy.RandomPolicy
exclude: ~
manual_config: ~
Details:
Applies mean zero - standard deviation one normalization to all observations contained in the dictionary observation space
Does not clip observations after normalization
Does not compute individual normalization statistics along different axes of the observation vector / matrix
Dumps the normalization statistics to the file “statistics.pkl”
Estimates the required statistics from observations collected via random sampling
Does not exclude any observations from normalization
Does not provide any normalization statistics manually
Example 2: Normalization with Manual Statistics¶
In this example, we manually specify both the default normalization strategy and its corresponding default statistics. This is useful, e.g., when working with RGB pixel observation spaces. However, it requires to know the normalization statistics beforehand.
# @package wrappers
maze.core.wrappers.observation_normalization.observation_normalization_wrapper.ObservationNormalizationWrapper:
# default behaviour
default_strategy: maze.normalization_strategies.RangeZeroOneObservationNormalizationStrategy
default_strategy_config:
clip_range: [0, 1]
axis: ~
default_statistics:
min: 0
max: 255
statistics_dump: statistics.pkl
sampling_policy:
_target_: maze.core.agent.random_policy.RandomPolicy
exclude: ~
manual_config: ~
Details:
Add range-zero-one normalization with manually set statistics to all observations
Clips the normalized observation to range [0, 1] in case something goes wrong. (As this example expects RGB pixel observations to have values between 0 and 255 this should not have an effect.)
Subtracts 0 from each value contained in the observation vector / matrix and then divides it by 255
The remaining settings do not have an effect here
Example 3: Custom Normalization and Excluding Observations¶
This advanced example shows how to utilize the full feature set of observation normalization. For explanations please see the comments and details below.
# @package wrappers
maze.core.wrappers.observation_normalization.observation_normalization_wrapper.ObservationNormalizationWrapper:
# default behaviour
default_strategy: maze.normalization_strategies.MeanZeroStdOneObservationNormalizationStrategy
default_strategy_config:
clip_range: [~, ~]
axis: ~
default_statistics: ~
statistics_dump: statistics.pkl
sampling_policy:
_target_: maze.core.agent.random_policy.RandomPolicy
# observation with key action_mask gets excluded from normalization
exclude: [action_mask]
manual_config:
# observation pixel_image uses manually specified normalization statistics
pixel_image:
strategy: maze.normalization_strategies.RangeZeroOneObservationNormalizationStrategy
strategy_config:
clip_range: [0, 1]
axis: ~
statistics:
min: 0
max: 255
# observation feature_vector estimates normalization statistics via sampling
feature_vector:
strategy: maze.normalization_strategies.MeanZeroStdOneObservationNormalizationStrategy
strategy_config:
clip_range: [-3, 3]
# normalization statistics are computed along the first axis
axis: [0]
Details:
The default behaviour for observations without manual config is identical to example 1
observation pixel_image: behaves as the default in example 2
observation feature_vector:
By setting axis to [0] in the strategy_config each element in the observation gets normalized with an element-wise mean and standard deviation.
Why? A feature_vector has shape (d,). After collecting N observations for computing the normalization statistics we arrive at a stacked feature_vector matrix of shape (N, d). By computing the normalization statistics along axis [0] we get normalization statistics with shape (d,) again, which can be applied in an element-wise fashion.
Additionally each element in the vector is clipped to range [-3, 3].
Note that even though a manual config is provided for some observations, you can still decide whether you would like to use predefined manual statistics or estimate them from sampled observations.
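To make the shape argument above concrete, here is a plain NumPy illustration (this is not the Maze API, just the arithmetic behind axis=[0] statistics):
import numpy as np

N, d = 1000, 16                          # N sampled observations, feature_vector of shape (d,)
stacked = np.random.randn(N, d)          # stacked feature_vector matrix of shape (N, d)

mean = stacked.mean(axis=0)              # shape (d,): one mean per vector element
std = stacked.std(axis=0)                # shape (d,): one standard deviation per vector element

normalized = (stacked - mean) / std      # applied element-wise
normalized = np.clip(normalized, -3, 3)  # additionally clip each element to the range [-3, 3]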
Example 4: Using Custom Normalization Strategies¶
In case the normalization strategies provided with Maze are not sufficient for your use case you can of course implement and add your own strategies.
# @package wrappers
maze.core.wrappers.observation_normalization.observation_normalization_wrapper.ObservationNormalizationWrapper:
# default behaviour
default_strategy: my_project.normalization_strategies.custom.CustomObservationNormalizationStrategy
default_strategy_config:
clip_range: [~, ~]
axis: ~
default_statistics: ~
statistics_dump: statistics.pkl
sampling_policy:
_target_: maze.core.agent.random_policy.RandomPolicy
exclude: ~
manual_config: ~
When adding a new normalization strategy you (1) have to implement the ObservationNormalizationStrategy interface and (2) make sure that it is accessible within your Python path. Besides that you only have to provide the reference path of the normalization strategy to use.
Example 5: Plain Python Configuration¶
If you are not working with the Maze command line interface but still want to reuse observation normalization directly within Python you can start with the code snippet below. It shows how to:
instantiate an observation normalized environment
estimate normalization statistics via sampling
reuse the estimated statistics for normalization for subsequent tasks such as training or rollouts
"""Contains an example showing how to use observation normalization directly from python."""
from maze.core.agent.random_policy import RandomPolicy
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
from maze.core.wrappers.observation_normalization.observation_normalization_wrapper import \
ObservationNormalizationWrapper
from maze.core.wrappers.observation_normalization.observation_normalization_utils import \
obtain_normalization_statistics
# instantiate a maze environment
env = GymMazeEnv("CartPole-v0")
# this is the normalization config as a python dict
normalization_config = {
"default_strategy": "maze.normalization_strategies.MeanZeroStdOneObservationNormalizationStrategy",
"default_strategy_config": {"clip_range": (None, None), "axis": 0},
"default_statistics": None,
"statistics_dump": "statistics.pkl",
"sampling_policy": RandomPolicy(env.action_spaces_dict),
"exclude": None,
"manual_config": None
}
# 1. PREPARATION: first we estimate normalization statistics
# ----------------------------------------------------------
# wrap the environment for observation normalization
env = ObservationNormalizationWrapper.wrap(env, **normalization_config)
# before we can start working with normalized observations
# we need to estimate the normalization statistics
normalization_statistics = obtain_normalization_statistics(env, n_samples=1000)
# 2. APPLICATION (training, rollout, deployment)
# ----------------------------------------------
# instantiate a maze environment
training_env = GymMazeEnv("CartPole-v0")
# wrap the environment for observation normalization
training_env = ObservationNormalizationWrapper.wrap(training_env, **normalization_config)
# reuse the estimated statistics in our training environment(s)
training_env.set_normalization_statistics(normalization_statistics)
# after this step the training env yields normalized observations
normalized_obs = training_env.reset()
Built-in Normalization Strategies¶
Normalization strategies specify how input observations are normalized.
Maze already comes with built-in normalization strategies. You can find a list and further details on the functionality of the respective strategies in the reference documentation.
The Bigger Picture¶
The figure below shows how observation normalization is embedded in the overall interaction loop and sets the involved components into context.
It is located in between the
ObservationConversionInterface
(which converts environment MazeStates into machine readable observations) and the agent.

According to the sampling_policy specified in the config the wrapper collects observations from the interaction loop and uses these to estimate the normalization statistics given the provided normalization strategies.
The statistics get dumped to the pickle file specified in the config for subsequent rollouts or deploying the agent.
If normalization statistics are known beforehand this stage can be skipped by simply providing the statistics manually in the wrapper config.
Where to Go Next¶
Before normalizing your observations you first might want to pre-process them with the PreProcessingWrapper.
Learn about more general environment wrappers.
Tricks of the Trade¶
This page contains a short list of tips and best practices that have been quite useful in our work over the last couple of years and will hopefully also make it easier for you to train your agents. However, you should be aware that not every item below will work in each and every application scenario. Nonetheless, if you are stuck, most of them are certainly worth a try!
Note
Below you find a subjective and certainly not complete collection of RL tips and tricks that will hopefully continue to grow over time. However, if you stumble upon something crucial that is missing from the list and that you would like to share with us and the RL community, do not hesitate to get in touch and discuss it with us!
Learning and Optimization¶
Use action masking whenever possible! This can be crucial as it has the potential to drastically reduce the exploration space of your problem, which usually leads to a reduced learning time and better overall results. In some cases action masking also mitigates the need for reward shaping, as invalid actions are excluded from sampling and there is no need to penalize them with negative rewards any more. If you want to learn more we recommend checking out the tutorial on structured environments and action masking.
Make sure that your step rewards are in a reasonable range (e.g., [-1, 1]) and do not span several orders of magnitude. If this is not the case you might want to apply reward scaling or clipping (see RewardScalingWrapper and RewardClippingWrapper) or manually shape your reward.
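A minimal sketch of how such wrappers could be applied is shown below. The module paths and parameter names (scale, min_val, max_val) are assumptions here, so please double-check them against the reference documentation.
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
from maze.core.wrappers.reward_clipping_wrapper import RewardClippingWrapper  # assumed module path
from maze.core.wrappers.reward_scaling_wrapper import RewardScalingWrapper  # assumed module path

env = GymMazeEnv("CartPole-v0")
# scale step rewards into a smaller range (parameter name assumed)
env = RewardScalingWrapper.wrap(env, scale=0.1)
# clip whatever remains to [-1, 1] (parameter names assumed)
env = RewardClippingWrapper.wrap(env, min_val=-1.0, max_val=1.0)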
Reward and Key Performance Indicator (KPI) Monitoring
When optimizing multi-target objectives (e.g., a weighted sum of sub-rewards) consider monitoring the contributing rewards on an individual basis. Even though the overall reward appears to not improve anymore, it might still be the case that the contributing sub-rewards change or fluctuate in the background. This indicates that the policy, and in turn the behaviour of your agent, is still changing. In such settings we recommend watching the learning progress by monitoring KPIs.
Models and Networks¶
Design use-case- and task-specific custom network architectures whenever required. In a straightforward case this might be a CNN when processing image observations, but it could also be a Graph Convolution Network (GCN) when working with graph or grid observations. To do so, you might want to check out the Perception Module, the built-in network building blocks as well as the section on how to work with custom models.
Further, you might want to consider behavioural cloning (BC) to design and tweak:
the network architectures
the observations that are fed into these models
This requires that an imitation learning dataset fulfilling the pre-conditions for supervised learning is available. If so, incorporating BC into the model and observation design process can save a lot of time and compute as you are now training in a supervised learning setting. Intuition: If a network architecture, given the corresponding observations, is able to fit an offline trajectory dataset (without severe over-fitting) it might also be a good choice for actual RL training. If this is relevant to you, you can follow up on how to employ imitation learning with Maze.
When facing bounded continuous action spaces use Squashed Gaussian or Beta probability distributions for your action heads instead of an unbounded Gaussian. This avoids action clipping and limits the space of explorable actions to valid regions. You can learn in the section about distributions and action heads how to easily switch between different probability distributions using the DistributionMapper.
If you would like to incorporate prior knowledge about the selection frequency of certain actions you could consider biasing the output layers of these action heads towards the expected sampling distribution after randomly initializing the weights of your networks (e.g., compute_sigmoid_bias).
Observations¶
For efficient RL training it is crucial that the inputs (e.g., observations) to our models (e.g., policy and value networks) follow a certain distribution and exhibit values within certain ranges. To ensure this precondition consider normalizing your observations before actual training by either:
manually specifying normalization statistics (e.g., divide by 255 for uint8 RGB image observations)
compute statistics from observations sampled by interacting with the environment
As this is a recurring, boilerplate code heavy task, Maze already provides built-in customizable functionality for normalizing the observations.
When feeding categorical observations to your models, consider converting them to their one-hot encoded vectorized counterparts. This representation is better suited for neural network processing and is common practice, for example, in Natural Language Processing (NLP). In Maze you can achieve this via observation pre-processing and the OneHotPreProcessor.
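For intuition, the following plain NumPy snippet (not the Maze API) shows what one-hot encoding a categorical observation such as the current month looks like:
import numpy as np

month = 4                               # categorical observation, e.g. "May" encoded as index 4
one_hot = np.zeros(12, dtype=np.float32)
one_hot[month] = 1.0                    # feature vector of length 12, as fed to the network
print(one_hot)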
Cheat Sheet¶
Run a rollout to test an environment with random action sampling:
maze-run -cn conf_rollout env.name=CartPole-v1 policy=random_policy
Run a rollout and render the state of the environment:
maze-run -cn conf_rollout env.name=CartPole-v1 policy=random_policy \
runner=sequential runner.render=true
Train a policy with evolutionary strategies (ES):
maze-run -cn conf_train env.name=CartPole-v1 algorithm=es model=vector_obs
Train a policy with an actor-critic trainer such as A2C:
maze-run -cn conf_train env.name=CartPole-v1 algorithm=a2c \
model=vector_obs critic=template_state
Resume training from a previous model state:
maze-run -cn conf_train env.name=CartPole-v1 algorithm=a2c \
model=vector_obs critic=template_state input_dir=outputs/<experiment-dir>
Run a rollout of a policy, trained with the command above:
maze-run -cn conf_rollout env.name=CartPole-v1 model=vector_obs \
policy=torch_policy input_dir=outputs/<experiment-dir>
Integrating an Existing Gym Environment¶
Maze supports a seamless integration of existing OpenAI Gym environments. This holds for already registered, built-in Gym environments but also for any other custom environment following the Gym environment interface.
To get full Maze feature support for Gym environments we first have to transform them into Maze environments. This page shows how this is easily accomplished via the GymMazeEnv.

In short, a Gym environment is transformed into a MazeEnv by wrapping it with the GymMazeEnv. Under the hood the GymMazeEnv automatically:
Transforms the Gym environment into a GymCoreEnv.
Transforms the observation and action spaces into dictionary spaces via the GymObservationConversion and GymActionConversion interfaces.
Packs the GymCoreEnv into a MazeEnv which is fully compatible with all other Maze components and modules.
To get a better understanding of the overall structure please see the Maze environment hierarchy.
Instantiating a Gym Environment as a Maze Environment¶
The config snippet below shows how to instantiate an existing, already registered Gym environment as a GymMazeEnv referenced by its environment name (here CartPole-v0).
# @package env
_target_: maze.core.wrappers.maze_gym_env_wrapper.make_gym_maze_env
name: CartPole-v0
To achieve the same result directly with plain Python you can start with the code snippet below.
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
env = GymMazeEnv(env="CartPole-v0")
In case your custom Gym environment is not yet registered with Gym, you can also explicitly instantiate the environment before passing it to the GymMazeEnv.
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
custom_gym_env = CustomGymEnv()
env = GymMazeEnv(env=custom_gym_env)
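CustomGymEnv is just a stand-in for your own environment. For completeness, here is a purely illustrative toy implementation following the classic Gym interface (obs, reward, done, info) that could be wrapped this way:
import gym
import numpy as np


class CustomGymEnv(gym.Env):
    """A toy Gym environment used only to illustrate wrapping with GymMazeEnv."""

    def __init__(self):
        self.observation_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)

    def reset(self):
        # return the initial observation
        return self.observation_space.sample()

    def step(self, action):
        # return a random observation, a constant reward and never terminate
        return self.observation_space.sample(), 1.0, False, {}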
Test your own Gym Environment with Maze¶
If you already have a project set up and would like to test Maze with your own environment the quickest way to get started is to:
First, make sure that your project is either installed or available in your PYTHONPATH.
Second, add an environment factory function similar to the one shown in the snippet below to your project (e.g., my_project/env_factory.py).
from maze.core.env.maze_env import MazeEnv
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
def make_env(name: str) -> MazeEnv:
custom_gym_env = CustomGymEnv()
return GymMazeEnv(custom_gym_env)
That’s all we need to do. You can now start training an agent for your environment by running:
$ maze-run -cn conf_train env._target_=my_project.env_factory.make_env
This basically updates the original gym_env config via Hydra overrides. Note that the argument name is unused so far but is required to adhere to the gym_env config signature. When creating your own config files you can of course tailor this signature to your needs.
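If your environment is registered with Gym you could, as a hypothetical variant, forward the name argument instead of hard-coding a custom env:
from maze.core.env.maze_env import MazeEnv
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv


def make_env(name: str) -> MazeEnv:
    # instantiate the registered Gym environment by name (e.g., "CartPole-v0")
    return GymMazeEnv(env=name)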
Where to Go Next¶
For technical details on the GymMazeEnv please see the reference documentation.
Next you might be interested in how to train an agent for your environment.
You might also want to read up on the Maze environment hierarchy for the bigger picture.
Structured Environments and Action Masking¶
This tutorial provides a step by step guide explaining how to implement a decision problem
as a structured environment and how to train an agent for such a StructuredEnv
with a structured Maze Trainer.
The examples are again based on the online version of the Guillotine 2D Cutting Stock Problem
which is a perfect fit for introducing the underlying concepts.
In particular, we will see how to evolve the performance of an RL agent by going through the following stages:
Flat Gym-style environment with vanilla feed forward models
Structured environment (e.g., with hierarchical sub-steps) with task specific policy networks
Structured environment (e.g., with hierarchical sub-steps) with masked task specific policy networks

Before diving into this tutorial we recommend familiarizing yourself with Control Flows with Structured Environments and the basic Maze - step by step tutorial.
The remainder of this tutorial is structured as follows:
Turning a “flat” MazeEnv into a StructuredEnv¶
In this part of the tutorial we will learn how to reformulate an RL problem in order to turn it from a “flat” Gym-style environment into a structured environment.
The complete code for this part of the tutorial can be found here
# relevant files
- cutting_2d
- main.py
- env
- struct_env.py
Analyzing the Problem Structure¶
Before we start implementing the structured environment let's first revisit the cutting 2D problem. In particular, we turn our attention to the joint action space consisting of the following components:
Action \(a_0\): cutting piece selection (decides which piece from inventory to use for cutting)
Action \(a_1\): cutting orientation selection (decides on the orientation of the cut)
Action \(a_2\): cutting order selection (decides which cut to take first; x or y)

Analysis of Action Space and Problem:
We are facing a combinatorial action space with \(O(N \cdot 2 \cdot 2)\) possible actions the agent has to choose from in each step. \(N\) is the maximum number of pieces stored in the inventory.
Sampling from this joint action space might result in invalid cutting configurations. This is because the three sub-actions are treated independently from each other. For the problem at hand this is obviously not the case.
It would be much more intuitive to sample the sub-actions sequentially and conditioned on each other. (E.g., it seems to be easier to decide on the cutting order and orientation once we know the piece we will cut from.)
Implementing the Structured Environment¶
We now address the issues discovered in the previous section and re-formulate the cutting 2D problem as a
StructuredEnv
with the following two sub-steps:
Select cutting piece from inventory given inventory state and customer order.
Select cutting configuration (cutting order and cutting orientation) given customer order and inventory cutting piece selected in the previous sub-step.
This could also be described with the modified agent-environment interaction loop shown in the figure below. Note that both the observation and the action space differ between the selection and the cutting sub-step. For the present example, reward is only granted once the cutting sub-step (i.e., the second step) is complete.

Note
Conceptually structured environments and conditional sub-steps are related to auto-regressive action spaces where subsequent actions are sampled conditioned on their predecessors. [e.g. DeepMind (2019), “Grandmaster level in StarCraft II using multi-agent reinforcement learning.”]
The code for the StructuredCutting2DEnvironment
below implements exactly this interaction pattern.
from copy import deepcopy
from typing import Dict, Any, Union, Tuple, Optional, List
import gym
import numpy as np
from maze.core.env.maze_action import MazeActionType
from maze.core.env.maze_env import MazeEnv
from maze.core.env.maze_state import MazeStateType
from maze.core.env.structured_env import StructuredEnv, ActorID
from maze.core.env.structured_env_spaces_mixin import StructuredEnvSpacesMixin
from maze.core.wrappers.wrapper import Wrapper
from .maze_env import maze_env_factory
class StructuredCutting2DEnvironment(Wrapper[MazeEnv], StructuredEnv, StructuredEnvSpacesMixin):
"""Structured environment version of the cutting 2D environment.
The environment alternates between the two sub-steps:
- Select cutting piece
- Select cutting configuration (cutting order and cutting orientation)
:param maze_env: The "flat" cutting 2D environment to wrap.
"""
def __init__(self, maze_env: MazeEnv):
Wrapper.__init__(self, maze_env)
# define sub-step action spaces
self._action_spaces_dict = {
0: gym.spaces.Dict({"piece_idx": maze_env.action_space["piece_idx"]}),
1: gym.spaces.Dict({"cut_rotation": maze_env.action_space["cut_rotation"],
"cut_order": maze_env.action_space["cut_order"]})
}
# define sub-step observation spaces
flat_space = maze_env.observation_space
self._observation_spaces_dict = {
0: flat_space,
1: gym.spaces.Dict({"selected_piece": flat_space["ordered_piece"],
"ordered_piece": flat_space["ordered_piece"]})
}
self._flat_obs = None
self._action_0 = None
self._sub_step_key = 0
self._last_reward = None # Last reward obtained from the underlying environment
def step(self, action):
"""Generic step function alternating between the two sub-steps.
:return: obs, rew, done, info
"""
# sub-step: Select cutting piece
if self._sub_step_key == 0:
sub_step_result = self._selection_step(action)
# sub-step: Select cutting configuration
elif self._sub_step_key == 1:
sub_step_result = self._cutting_step(action)
else:
raise ValueError("Sub-step id {} not allowed for this environment!".format(self._sub_step_key))
# alternate step index
self._sub_step_key = np.mod(self._sub_step_key + 1, 2)
return sub_step_result
def reset(self) -> Any:
"""Resets the environment and returns the initial state.
:return: The initial state after resetting.
"""
self._flat_obs = self.env.reset()
self._flat_obs["ordered_piece"] = self._flat_obs["ordered_piece"]
self._sub_step_key = 0
return self._obs_selection_step(self._flat_obs)
@staticmethod
def _obs_selection_step(flat_obs: Dict[str, np.array]) -> Dict[str, np.array]:
"""Formats initial observation / observation available for the first sub-step."""
return deepcopy(flat_obs)
@staticmethod
def _obs_cutting_step(flat_obs: Dict[str, np.array], selected_piece_idx: int) -> Dict[str, np.array]:
"""Formats observation available for the second sub-step."""
return {"selected_piece": flat_obs["inventory"][selected_piece_idx],
"ordered_piece": flat_obs["ordered_piece"]}
def _selection_step(self, action: Dict[str, int]) -> Tuple[Dict[str, np.ndarray], float, bool, Dict]:
"""Cutting piece selection step."""
self._action_0 = action
obs = self._obs_cutting_step(self._flat_obs, action["piece_idx"])
return obs, 0.0, False, {}
def _cutting_step(self, action: Dict[str, int]) -> Tuple[Dict[str, np.ndarray], float, bool, Dict]:
"""Cutting rotation and cutting order selection step."""
flat_action = {"piece_idx": self._action_0["piece_idx"],
"cut_rotation": action["cut_rotation"],
"cut_order": action["cut_order"]}
self._flat_obs, self._last_reward, done, info = self.env.step(flat_action)
self._flat_obs["ordered_piece"] = self._flat_obs["ordered_piece"]
return self._obs_selection_step(self._flat_obs), self._last_reward, done, info
def actor_id(self) -> ActorID:
"""Returns the currently executed actor along with the policy id. The id is unique only with
respect to the policies (every policy has its own actor 0).
Note that identities of done actors can not be reused in the same rollout.
:return: The current actor, as tuple (policy id, actor number).
"""
return ActorID(step_key=self._sub_step_key, agent_id=0)
def get_actor_rewards(self) -> Optional[np.ndarray]:
"""Returns rewards attributed to individual actors after the step has been done. This is necessary,
as after the first sub-step (i.e., piece selection), the full reward is not yet available, so zero
reward is returned instead. The second (= last) sub-step then returns joint reward for all (both) actors.
With this method, we can attribute parts of the reward to the individual actors, which is useful for example
if each has its own separate critic.
In this case, we attribute half of the reward to each actor.
"""
return np.array([self._last_reward / 2.0] * 2)
@property
def agent_counts_dict(self) -> Dict[Union[str, int], int]:
"""Returns the count of agents for individual sub-steps (or -1 for dynamic agent count).
This env has two sub-steps (0 and 1), in each of which one agent gets to act. Hence, we return
{0: 1, 1: 1}.
"""
return {0: 1, 1: 1}
def is_actor_done(self) -> bool:
"""Returns True if the just stepped actor is done, which is different to the done flag of the environment."""
return False
@property
def action_space(self) -> gym.spaces.Dict:
"""Implementation of :class:`~maze.core.env.structured_env_spaces_mixin.StructuredEnvSpacesMixin` interface."""
return self._action_spaces_dict[self._sub_step_key]
@property
def observation_space(self) -> gym.spaces.Dict:
"""Implementation of :class:`~maze.core.env.structured_env_spaces_mixin.StructuredEnvSpacesMixin` interface."""
return self._observation_spaces_dict[self._sub_step_key]
@property
def action_spaces_dict(self) -> Dict[Union[int, str], gym.spaces.Dict]:
"""Implementation of :class:`~maze.core.env.structured_env_spaces_mixin.StructuredEnvSpacesMixin` interface."""
return self._action_spaces_dict
@property
def observation_spaces_dict(self) -> Dict[Union[int, str], gym.spaces.Dict]:
"""Implementation of :class:`~maze.core.env.structured_env_spaces_mixin.StructuredEnvSpacesMixin` interface."""
return self._observation_spaces_dict
def seed(self, seed: int = None) -> None:
"""Sets the seed for this environment's random number generator(s).
:param seed: the seed integer initializing the random number generator.
"""
self.env.seed(seed)
def close(self) -> None:
"""Performs any necessary cleanup."""
self.env.close()
def get_observation_and_action_dicts(self, maze_state: MazeStateType, maze_action: MazeActionType,
first_step_in_episode: bool) \
-> Tuple[Optional[Dict[Union[int, str], Any]], Optional[Dict[Union[int, str], Any]]]:
"""Convert the flat action and MazeAction from Maze env into the structured ones.
Note that both MazeState and MazeAction needs to be supplied together, otherwise actions/observations for the
individual sub-steps cannot be produced.
"""
assert maze_state is not None and maze_action is not None,\
"This wrapper needs both MazeState and MazeAction for the conversion (as there are multiple sub-steps)."
observation_dict, action_dict = self.env.get_observation_and_action_dicts(maze_state, maze_action,
first_step_in_episode)
assert len(observation_dict.items()) == 1 and len(action_dict.items()) == 1, "wrapped env should be single-step"
flat_action = list(action_dict.values())[0]
flat_obs = list(observation_dict.values())[0]
flat_obs["ordered_piece"] = flat_obs["ordered_piece"]
obs_dict = {
0: self._obs_selection_step(flat_obs),
1: self._obs_cutting_step(flat_obs, flat_action["piece_idx"])
}
act_dict = {
0: {k: flat_action[k] for k in ["piece_idx"]},
1: {k: flat_action[k] for k in ["cut_rotation", "cut_order"]}
}
return obs_dict, act_dict
def struct_env_factory(max_pieces_in_inventory: int, raw_piece_size: Tuple[int, int],
static_demand: List[Tuple[int, int]]) -> StructuredCutting2DEnvironment:
"""Convenience factory function that compiles a trainable structured environment.
(for argument details see: Cutting2DEnvironment)
"""
# init maze environment including observation and action interfaces
env = maze_env_factory(max_pieces_in_inventory=max_pieces_in_inventory,
raw_piece_size=raw_piece_size,
static_demand=static_demand)
# convert flat to structured environment
return StructuredCutting2DEnvironment(env)
Test Script¶
The following snippet first instantiates the structured environment and then performs one cycle of the structured agent environment interaction loop.
""" Test script CoreEnv """
from tutorial_maze_env.part06_struct_env.env.struct_env import struct_env_factory
def main():
# init maze environment including observation and action interfaces
struct_env = struct_env_factory(max_pieces_in_inventory=200,
raw_piece_size=(100, 100),
static_demand=[(30, 15)])
# reset env
obs_step1 = struct_env.reset()
print("action_space 1: ", struct_env.action_space)
print("observation_space 1:", struct_env.observation_space)
print("observation 1: ", obs_step1.keys())
# take first env step
action_1 = struct_env.action_space.sample()
obs_step2, rew, done, info = struct_env.step(action=action_1)
print("action_space 2: ", struct_env.action_space)
print("observation_space 2:", struct_env.observation_space)
print("observation 2: ", obs_step2.keys())
# take second env step
action_2 = struct_env.action_space.sample()
obs_step1, rew, done, info = struct_env.step(action=action_2)
if __name__ == "__main__":
""" main """
main()
Running the script will print the following output. Note that the observation and action spaces alternate from sub-step to sub-step.
action_space 1: Dict(piece_idx:Discrete(200))
observation_space 1: Dict(inventory:Box(200, 2), inventory_size:Box(1,), order:Box(2,))
observation 1: dict_keys(['inventory', 'inventory_size', 'order'])
action_space 2: Dict(order:Discrete(2), rotation:Discrete(2))
observation_space 2: Dict(order:Box(2,), selected_piece:Box(1, 2))
observation 2: dict_keys(['selected_piece', 'order'])
In the next part of this tutorial we will train an agent on this structured environment.
Training the Structured Environment¶
In this part of the tutorial we will learn how to train an agent with a Maze trainer implicitly supporting a Structured Environment. We will also design a policy network architecture matching the task at hand.
The complete code for this part of the tutorial can be found here
# relevant files
- cutting_2d
- conf
- env
- tutorial_cutting_2d_flat.yaml
- tutorial_cutting_2d_struct.yaml
- model
- tutorial_cutting_2d_flat.yaml
- tutorial_cutting_2d_struct.yaml
- wrappers
- tutorial_cutting_2d.yaml
- models
- actor.py
- critic.py
A Simple Problem Setting¶
To emphasize the effects of action masking throughout this tutorial we devise a simple problem instance of the cutting 2d environment with the following properties:

Given the raw piece size and the items in the static demand (which appear in an alternating fashion) we can cut six customer orders from one raw inventory piece. When limiting the episode length to 180 time steps, the optimal solution with respect to new raw pieces taken from inventory is 31 (180 orders / 6 orders per raw piece = 30 raw pieces, plus 1 because the environment adds a new piece whenever the current one has been cut).
Task-Specific Actor-Critic Model¶
For this advanced tutorial we make use of Maze custom models to compose an actor-critic architecture that is geared towards the respective sub-tasks. Our structured environment requires two policies, one for piece selection and one for cutting parametrization. For each of the two sub-step policies we also build a distinct state critic (see StepStateCriticComposer for details). Note that it would also be possible to employ a SharedStateCriticComposer to compute the advantages for both policies.
The images below show the four network architectures (click to enlarge). For further details on how to build the models we refer to the accompanying repository and the section on how to work with custom models.
Some notes on the models:
The selection policy takes the current inventory and the ordered piece as input and predicts a selection probability (piece_idx) for each inventory option.
The cutting policy takes the ordered piece and the selected piece (previous step) as input and predicts cutting rotation and cutting order.
The critic models have an analogous structure but predict the state-value instead of action logits.
Multi-Step Training¶
Given the models designed in the previous section we are now ready to train our first agent on a Structured Environment.
We already mentioned that Maze trainers directly support the training of Structured Environments
such as the StructuredCutting2DEnvironment
implemented in the previous part of this tutorial.
To start training a cutting policy with the PPO trainer, run:
maze-run -cn conf_train env=tutorial_cutting_2d_struct wrappers=tutorial_cutting_2d \
model=tutorial_cutting_2d_struct algorithm=ppo
As usual, we can watch the training progress with Tensorboard.
tensorboard --logdir outputs

We can see that the reward slowly approaches the optimum. Note that the performance of this agent is already much better than the vanilla Gym-style model we employed in the introductory tutorial (compare evolution of rewards above).
However, the event logs also reveal that the agent initially samples many invalid actions (e.g, invalid_cut and invalid_piece_selected). This is sample inefficient and slows down the learning progress.
Next, we will further improve the agent by avoiding sampling of these invalid choices via action masking.
Adding Step-Conditional Action Masking¶
In this part of the tutorial we will learn how to substantially increase the sample efficiency of our agents by adding sub-step conditional action masking to the structured environment.
The complete code for this part of the tutorial can be found here
# relevant files
- cutting_2d
- main.py
- env
- struct_env_masked.py
In particular, we will add two different masks:
Inventory_mask: allows selecting cutting pieces only from inventory slots that actually hold a piece large enough to fulfill the customer order.
Rotation_mask: allows specifying only valid cutting rotations (e.g., the ordered piece fits into the cutting piece from inventory). Note that providing this mask is only possible once the cutting piece has been selected in the first sub-step, hence the name step-conditional masking.
The figure below provides a sketch of the two masks.

Only the first two inventory pieces are able to fit the customer order. The four rightmost inventory slots do not hold a piece at all and are also masked. When rotating the piece by 90° for cutting, the customer order would not fit into the selected inventory piece, which is why we can simply mask this option.
Masked Structured Environment¶
One way to incorporate the two masks in our structured environment is to simply inherit from the initial version and extend it by the following changes:
Add the two masks to the observation spaces (e.g., inventory_mask and cutting_mask).
Compute the actual mask for the two sub-steps in the respective functions (e.g., _obs_selection_step and _obs_cutting_step).
from copy import deepcopy
from typing import Dict, List, Tuple
import gym
import numpy as np
from tutorial_maze_env.part06_struct_env.env.maze_env import maze_env_factory
from tutorial_maze_env.part06_struct_env.env.struct_env import StructuredCutting2DEnvironment
from maze.core.env.maze_env import MazeEnv
class MaskedStructuredCutting2DEnvironment(StructuredCutting2DEnvironment):
"""Structured environment version of the cutting 2D environment.
The environment alternates between the two sub-steps:
- Select cutting piece
- Select cutting configuration (cutting order and cutting orientation)
:param maze_env: The "flat" cutting 2D environment to wrap.
"""
def __init__(self, maze_env: MazeEnv):
super().__init__(maze_env)
# add masks to observation spaces
max_inventory = self.observation_conversion.max_pieces_in_inventory
self._observation_spaces_dict[0].spaces["inventory_mask"] = \
gym.spaces.Box(low=np.float32(0), high=np.float32(1), shape=(max_inventory,), dtype=np.float32)
self._observation_spaces_dict[1].spaces["cutting_mask"] = \
gym.spaces.Box(low=np.float32(0), high=np.float32(1), shape=(2,), dtype=np.float32)
@staticmethod
def _obs_selection_step(flat_obs: Dict[str, np.array]) -> Dict[str, np.array]:
"""Formats initial observation / observation available for the first sub-step."""
observation = deepcopy(flat_obs)
# prepare inventory mask
sorted_order = np.sort(observation["ordered_piece"].flatten())
sorted_inventory = np.sort(observation["inventory"], axis=1)
observation["inventory_mask"] = np.all(observation["inventory"] > 0, axis=1).astype(np.float32)
for i in np.nonzero(observation["inventory_mask"])[0]:
# exclude pieces which do not fit
observation["inventory_mask"][i] = np.all(sorted_order <= sorted_inventory[i])
return observation
@staticmethod
def _obs_cutting_step(flat_obs: Dict[str, np.array], selected_piece_idx: int) -> Dict[str, np.array]:
"""Formats observation available for the second sub-step."""
selected_piece = flat_obs["inventory"][selected_piece_idx]
ordered_piece = flat_obs["ordered_piece"]
# prepare cutting action mask
cutting_mask = np.zeros((2,), dtype=np.float32)
selected_piece = selected_piece.squeeze()
if np.all(flat_obs["ordered_piece"] <= selected_piece):
cutting_mask[0] = 1.0
if np.all(flat_obs["ordered_piece"][::-1] <= selected_piece):
cutting_mask[1] = 1.0
return {"selected_piece": selected_piece,
"ordered_piece": ordered_piece,
"cutting_mask": cutting_mask}
def struct_env_factory(max_pieces_in_inventory: int, raw_piece_size: Tuple[int, int],
static_demand: List[Tuple[int, int]]) -> StructuredCutting2DEnvironment:
"""Convenience factory function that compiles a trainable structured environment.
(for argument details see: Cutting2DEnvironment)
"""
# init maze environment including observation and action interfaces
env = maze_env_factory(max_pieces_in_inventory=max_pieces_in_inventory,
raw_piece_size=raw_piece_size,
static_demand=static_demand)
# convert flat to structured environment
return MaskedStructuredCutting2DEnvironment(env)
Test Script¶
When re-running the main script of the previous section with the masked version of the structured environment we now get the following output:
action_space 1: Dict(piece_idx:Discrete(200))
observation_space 1: Dict(inventory:Box(200, 2), inventory_size:Box(1,), ordered_piece:Box(2,), inventory_mask:Box(200,))
observation 1: dict_keys(['inventory', 'inventory_size', 'ordered_piece', 'inventory_mask'])
action_space 2: Dict(cut_order:Discrete(2), cut_rotation:Discrete(2))
observation_space 2: Dict(ordered_piece:Box(2,), selected_piece:Box(2,), cutting_mask:Box(2,))
observation 2: dict_keys(['selected_piece', 'ordered_piece', 'cutting_mask'])
As expected, both masks are contained in the respective observations and spaces. In the next section we will utilize these masks to enhance the sample efficiency of our trainers.
Training with Action Masking¶
In this part of the tutorial we will retrain the environment with step-conditional action masking activated and benchmark it with the initial, unmasked version.
The complete code for this part of the tutorial can be found here
# relevant files
- cutting_2d
- conf
- env
- tutorial_cutting_2d_flat.yaml
- tutorial_cutting_2d_struct.yaml
- tutorial_cutting_2d_struct_masked.yaml
- model
- tutorial_cutting_2d_flat.yaml
- tutorial_cutting_2d_struct.yaml
- tutorial_cutting_2d_struct_masked.yaml
- wrappers
- tutorial_cutting_2d.yaml
- models
- actor.py
- critic.py
Masked Policy Models¶
Before we can retrain the masked version of the structured environment we first need to specify how the masks are employed within the models. For this purpose we extend the two policy models with an ActionMaskingBlock applied to the respective logits. The resulting models are shown below:
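Conceptually, action masking adds the logarithm of the mask to the logits before the softmax, which drives the probability of invalid actions to (numerically) zero. The following plain NumPy snippet illustrates this idea; it is not the Maze ActionMaskingBlock implementation.
import numpy as np

logits = np.array([1.2, -0.3, 0.7, 2.1], dtype=np.float32)
mask = np.array([1.0, 0.0, 1.0, 0.0], dtype=np.float32)  # 1 = valid action, 0 = invalid

# adding log(mask) pushes invalid logits towards -inf
masked_logits = logits + np.log(np.maximum(mask, 1e-38))

# softmax over the masked logits
probs = np.exp(masked_logits - masked_logits.max())
probs /= probs.sum()
print(probs)  # invalid actions receive (numerically) zero probability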
Retraining with Masking¶
maze-run -cn conf_train env=tutorial_cutting_2d_struct_masked wrappers=tutorial_cutting_2d \
model=tutorial_cutting_2d_struct_masked algorithm=ppo
Once training has finished we can again inspect the progress with Tensorboard. To get a better feeling for the effect of action masking we benchmark the following versions of the environment:
Flat Gym-style environment with vanilla feed forward models (red)
Structured Environment (e.g., with hierarchical sub-steps) with task specific policy networks (orange)
Structured Environment (e.g., with hierarchical sub-steps) with masked, task specific policy networks (blue)

First of all, we can observe a massive increase in learning speed when activating action masking. In fact, the reward of the masked model starts at a much higher initial value. We can also observe a substantial improvement when switching from the vanilla feed forward Gym-style example (red) to the structured environment using task-specific custom models (orange).
In Depth Inspection of Learning Progress¶
In this section we make use of the Maze event logging system to learn more about the learning progress and behaviour of the respective versions.

When looking at the cutting events we see that the agent utilizing action masking performs only valid cutting attempts right from the beginning of the training process. Avoiding the stage where the agent has to learn via reward shaping which cuts are actually possible allows it to focus on learning how to cut efficiently. The two other versions have to go through exactly this stage.
The same observation holds for the piece selection policy where again a lot of invalid attempts take place for the two unmasked versions.
Finally, when looking at the inventory statistics we can see that the masked agent keeps very few pieces in inventory (pieces in inventory) which is why it never has to discard any piece (pieces discarded) that might be required to fulfill upcoming customer orders.
Combining Maze with other RL Frameworks¶
This tutorial explains how to use general Maze features in combination with existing RL frameworks. In particular, we will apply observation normalization before optimizing a policy with the stable-baselines3 A2C trainer. When adding new features to Maze we put a strong emphasis on reusability, so that you can make use of as many of these features as possible while still sticking to the optimization framework you are most comfortable or familiar with.
Since RLlib already has a dedicated spot within Maze we rely on stable-baselines3 for this tutorial. However, it is important to note that the examples below will also work with any other Python-based RL framework compatible with Gym environments.
We provide two different versions showing how to arrive at an observation normalized environment. The first one is written in plain Python, while the second reproduces the Python example with a Hydra configuration.
Note
Although this tutorial explains how to reuse observation normalization, it is of course not limited to this single feature. So if you find this useful we definitely recommend browsing through our Environment Customization section in the sidebar.
Reusing Environment Customization Features¶
The basis for this tutorial is the official getting started snippet of stable-baselines showing how to train and run A2C on a CartPole environment. We added a few comments to make things a bit more explicit.
If you would like to run this example yourself make sure to install stable-baselines3 first.
"""
Getting started example from:
https://stable-baselines3.readthedocs.io/en/master/guide/quickstart.html
"""
import gym
from stable_baselines3 import A2C
# ENV INSTANTIATION
# -----------------
env = gym.make('CartPole-v0')
# TRAINING AND ROLLOUT
# --------------------
model = A2C('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)
obs = env.reset()
for i in range(1000):
action, _state = model.predict(obs, deterministic=True)
obs, reward, done, info = env.step(action)
env.render()
if done:
obs = env.reset()
Below you find exactly the same example but with an observation normalized environment. The following modifications compared to the example above are required:
Instantiate a GymMazeEnv instead of a standard Gym environment
Wrap the environment with the ObservationNormalizationWrapper
Estimate normalization statistics from actual environment interactions
As you might already have experienced, re-coding these steps for different environments and experiments can get quite cumbersome. The wrapper also dumps the estimated statistics to a file (statistics.pkl) so they can be reused later on for agent deployment.
"""
Contains an example showing how to train
an observation normalized maze environment with stable-baselines.
"""
from maze.core.agent.random_policy import RandomPolicy
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
from maze.core.wrappers.no_dict_spaces_wrapper import NoDictSpacesWrapper
from maze.core.wrappers.observation_normalization.observation_normalization_utils import \
obtain_normalization_statistics
from maze.core.wrappers.observation_normalization.observation_normalization_wrapper import \
ObservationNormalizationWrapper
from stable_baselines3 import A2C
# ENV INSTANTIATION: a GymMazeEnv instead of a gym.Env
# ----------------------------------------------------
env = GymMazeEnv('CartPole-v0')
# OBSERVATION NORMALIZATION
# -------------------------
# we wrap the environment with the ObservationNormalizationWrapper
# (you can find details on this in the section on observation normalization)
env = ObservationNormalizationWrapper(
env=env,
default_strategy="maze.normalization_strategies.MeanZeroStdOneObservationNormalizationStrategy",
default_strategy_config={"clip_range": (None, None), "axis": 0},
default_statistics=None, statistics_dump="statistics.pkl",
sampling_policy=RandomPolicy(env.action_spaces_dict),
exclude=None, manual_config=None)
# next we estimate the normalization statistics by
# (1) collecting observations by randomly sampling 1000 transitions from the environment
# (2) computing the statistics according to the defined normalization strategy
normalization_statistics = obtain_normalization_statistics(env, n_samples=1000)
env.set_normalization_statistics(normalization_statistics)
# after this step all observations returned by the environment will be normalized
# stable-baselines does not support dict spaces so we have to remove them
env = NoDictSpacesWrapper(env)
# TRAINING AND ROLLOUT (remains unchanged)
# ----------------------------------------
model = A2C('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)
obs = env.reset()
for i in range(1000):
action, _state = model.predict(obs, deterministic=True)
obs, reward, done, info = env.step(action)
env.render()
if done:
obs = env.reset()
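For agent deployment you would typically reload the dumped statistics instead of re-estimating them. The following is a minimal sketch, assuming the dump written by the wrapper is a plain pickle of the statistics object (the statistics_dump argument above suggests this, but the exact file format is not spelled out in this tutorial):
import pickle

# load the statistics estimated during training ...
with open("statistics.pkl", "rb") as f:
    statistics = pickle.load(f)

# ... and set them on a freshly wrapped environment instead of estimating them again
env.set_normalization_statistics(statistics)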
Reusing the Hydra Configuration System¶
This example is identical to the previous one, but instead of instantiating everything directly in Python it utilizes the Hydra configuration system.
"""
Contains an example showing how to train an observation normalized maze environment
instantiated from a hydra config with stable-baselines.
"""
from maze.core.utils.config_utils import make_env_from_hydra
from maze.core.wrappers.no_dict_spaces_wrapper import NoDictSpacesWrapper
from maze.core.wrappers.observation_normalization.observation_normalization_utils import \
obtain_normalization_statistics
from stable_baselines3 import A2C
# ENV INSTANTIATION: from hydra config file
# -----------------------------------------
env = make_env_from_hydra("conf")
# OBSERVATION NORMALIZATION
# -------------------------
# next we estimate the normalization statistics by
# (1) collecting observations by randomly sampling 1000 transitions from the environment
# (2) computing the statistics according to the defined normalization strategy
normalization_statistics = obtain_normalization_statistics(env, n_samples=1000)
env.set_normalization_statistics(normalization_statistics)
# stable-baselines does not support dict spaces so we have to remove them
env = NoDictSpacesWrapper(env)
# TRAINING AND ROLLOUT (remains unchanged)
# ----------------------------------------
model = A2C('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)
obs = env.reset()
for i in range(1000):
action, _state = model.predict(obs, deterministic=True)
obs, reward, done, info = env.step(action)
env.render()
if done:
obs = env.reset()
This is the corresponding hydra config:
# @package _global_
# defines environment to instantiate
env:
_target_: maze.core.wrappers.maze_gym_env_wrapper.GymMazeEnv
env: "CartPole-v0"
# defines wrappers to apply
wrappers:
# Observation Normalization Wrapper
maze.core.wrappers.observation_normalization.observation_normalization_wrapper.ObservationNormalizationWrapper:
default_strategy: maze.normalization_strategies.MeanZeroStdOneObservationNormalizationStrategy
default_strategy_config:
clip_range: [~, ~]
axis: 0
default_statistics: ~
statistics_dump: statistics.pkl
sampling_policy:
_target_: maze.core.agent.random_policy.RandomPolicy
exclude: ~
manual_config: ~
Where to Go Next¶
You can learn more about the Hydra configuration system.
As observation normalization is not the focus of this section, we recommend reading up on it in the dedicated section.
You might also be interested in observation pre-processing and the remaining environment customization options (see sidebar Environment Customization).
You can also check out the built-in Maze Trainers with full dictionary space support for observations and actions.
You can also make use of the full Maze environment hierarchy.
Plain Python Training Example (high-level)¶
This tutorial demonstrates how to train an A2C agent with Maze in plain Python utilizing RunContext. In the process it introduces and explains some of Maze's most important components and concepts, going from a plain vanilla setup to an increasingly customized configuration.
This is complementary to the article on low-level training in plain Python, which guides through the same setup (but without RunContext support).
Environment Setup¶
We will first prepare our environment for use with Maze. In order to use Maze’s parallelization capabilities, it is necessary to define a factory function that returns a MazeEnv
of your environment. This is easily done for Gym environments:
def cartpole_env_factory():
""" Env factory for the cartpole MazeEnv """
# Registered gym environments can be instantiated first and then provided to GymMazeEnv:
cartpole_env = gym.make("CartPole-v0")
maze_env = GymMazeEnv(env=cartpole_env)
# Another possibility is to supply the gym env string to GymMazeEnv directly:
# maze_env = GymMazeEnv(env="CartPole-v0")
return maze_env
env = cartpole_env_factory()
If you have your own environment (that is not a gym.Env) you must transform it into a MazeEnv yourself, as is shown here, and have your factory return that. If it is a custom gym env it can be instantiated with our wrapper as shown above.
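To make the latter case concrete, the sketch below wraps a hypothetical, non-registered gym.Env in a factory. The toy environment (MyToyGymEnv) is purely illustrative and not part of Maze:
import gym
import numpy as np
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv

class MyToyGymEnv(gym.Env):
    """ Hypothetical toy environment returning random 4-dimensional observations. """
    def __init__(self):
        self.observation_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):
        obs = self.observation_space.sample()
        return obs, 0.0, False, {}

def my_toy_env_factory() -> GymMazeEnv:
    """ Factory returning the custom gym env wrapped as a MazeEnv. """
    return GymMazeEnv(env=MyToyGymEnv())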
Algorithm Setup¶
We use A2C for this example. The algorithm_config for A2C can be found here. The hyperparameters will be supplied to Maze with an algorithm-dependent AlgorithmConfig object. The one for A2C is A2CAlgorithmConfig
. We will use the default parameters, which can also be found here.
algorithm_config = A2CAlgorithmConfig(
n_epochs=5,
epoch_length=25,
patience=15,
critic_burn_in_epochs=0,
n_rollout_steps=100,
lr=0.0005,
gamma=0.98,
gae_lambda=1.0,
policy_loss_coef=1.0,
value_loss_coef=0.5,
entropy_coef=0.00025,
max_grad_norm=0.0,
device='cpu',
rollout_evaluator=RolloutEvaluator(
eval_env=SequentialVectorEnv([cartpole_env_factory]),
n_episodes=1,
model_selection=None,
deterministic=True
)
)
Having defined our environment and configured our algorithm we’re ready to train:
rc = maze.api.run_context.RunContext(env=cartpole_env_factory, algorithm=algorithm_config)
rc.train()
Custom Model Setup¶
While Maze composes default models automatically if none are specified, it can be advisable to create customized networks taking full advantage of the available data. For this reason Maze supports plugging in customized policy and value networks.
Our goal is to hence train an agent with A2C using customized policy and critic networks:
rc = maze.api.run_context.RunContext(
env=cartpole_env_factory,
algorithm=algorithm_config,
policy=...,
critic=...
)
rc.train()
Here we will pay special attention to the format required by Maze. When creating your own models, it is important to know three things:
Maze works with dictionaries throughout, which means that arguments for the constructor and the input and return values of the forward method are dicts with user-defined keys. In a nutshell, instances of MazeEnv can have different steps indicating the currently active task. Each step is associated with a policy, so an environment with different steps can also have different policies. By default environments have only step 0. The required format for models is explained in more detail here.
Policy network and value network constructors have required arguments: for policy nets, these are obs_shapes and action_logit_shapes; for value nets, this is obs_shapes.
Policies and critics are not passed directly, but via composer objects, i.e. classes of type BasePolicyComposer or CriticComposerInterface, respectively. Such composer classes are able to generate policy instances.
Policy Customization
To instantiate e.g. a ProbabilisticPolicyComposer, we require the following arguments:
The policy network.
A specification of the probability distribution as an instance of DistributionMapper.
Dictionaries describing the action and observation spaces.
The numbers of agents active in the corresponding steps.
The IDs of substeps in which agents do not share the same networks.
Policy Network. First, let us create the policy network itself as a simple linear mapping network respecting the required constraints:
class CartpolePolicyNet(nn.Module):
""" Simple linear policy net for demonstration purposes. """
def __init__(self, obs_shapes: Dict[str, Sequence[int]], action_logit_shapes: Dict[str, Sequence[int]]):
super().__init__()
self.net = nn.Sequential(
nn.Linear(
in_features=obs_shapes['observation'][0],
out_features=action_logit_shapes['action'][0]
)
)
def forward(self, x_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
# Since x_dict has to be a dictionary in Maze, we extract the input for the network.
x = x_dict['observation']
# Do the forward pass.
logits = self.net(x)
# Since the return value has to be a dict again, put the
# forward pass result into a dict with the correct key.
logits_dict = {'action': logits}
return logits_dict
# Instantiate our custom policy net.
policy_net = CartpolePolicyNet(
obs_shapes={'observation': env.observation_space.spaces['observation'].shape},
action_logit_shapes={'action': (env.action_space.spaces['action'].n,)}
)
Optionally, we can wrap our policy network with a TorchModelBlock
, which applies shape normalization (see ShapeNormalizationBlock
):
policy_net = TorchModelBlock(
in_keys='observation',
out_keys='action',
in_shapes=env.observation_space.spaces['observation'].shape,
in_num_dims=[2],
out_num_dims=2,
net=policy_net
)
Since Maze offers the capability of supporting multiple actors, we need to map each policy_net
to its corresponding actor ID. As we have only one, this mapping is trivial:
policy_networks = [policy_net] # Alternative: {0: policy_net}
Policy Distribution. Initializing the proper probability distribution for the policy is rather easy with Maze. Simply provide the DistributionMapper with the environment's action space and you automatically get the proper distribution to use.
distribution_mapper = DistributionMapper(action_space=env.action_space, distribution_mapper_config={})
Optionally, you can specify a different distribution with the distribution_mapper_config
argument. Using a
CategoricalProbabilityDistribution
for a discrete action space would be done with
distribution_mapper = DistributionMapper(
action_space=action_space,
distribution_mapper_config=[{
"action_space": gym.spaces.Discrete,
"distribution": "maze.distributions.categorical.CategoricalProbabilityDistribution"}])
Since the standard distribution taken by Maze for a discrete action space is a Categorical distribution anyway (as can be seen here), both definitions of distribution_mapper
have the same result. For more information about the DistributionMapper, see Action Spaces and Distributions.
Policy Composer. The remaining arguments (action and observation space dictionaries, numbers of agents per step, ID of substeps with non-shared networks) are trivial in our case, as they can easily be derived from an instance of our environment. We can thus now set up a policy composer with our custom policy:
policy_composer = ProbabilisticPolicyComposer(
action_spaces_dict=env.action_spaces_dict,
observation_spaces_dict=env.observation_spaces_dict,
distribution_mapper=distribution_mapper,
networks=policy_networks,
# We have only one agent and network, thus this is an empty list.
substeps_with_separate_agent_nets=[],
# We have only one step and one agent.
agent_counts_dict={0: 1}
)
Once we have our policy composer, we are ready to train.
rc = maze.api.run_context.RunContext(
env=cartpole_env_factory,
algorithm=algorithm_config,
policy=policy_composer
)
rc.train()
Critic Customization
Customizing the critic can be done quite similarly to the policy customization, the main difference being that we do not need a probability distribution.
First we define our value network.
class CartpoleValueNet(nn.Module):
""" Simple linear value net for demonstration purposes. """
def __init__(self, obs_shapes: Dict[str, Sequence[int]]):
super().__init__()
self.value_net = nn.Sequential(nn.Linear(in_features=obs_shapes['observation'][0], out_features=1))
def forward(self, x_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
""" Forward method. """
# The same as for the policy can be said about the value
# net: Inputs and outputs have to be dicts.
x = x_dict['observation']
value = self.value_net(x)
value_dict = {'value': value}
return value_dict
We instantiate our value network and wrap it in a TorchModelBlock, as done for the policy network.
value_networks = {
    0: TorchModelBlock(
        in_keys='observation', out_keys='value',
        in_shapes=env.observation_space.spaces['observation'].shape,
        in_num_dims=[2],
        out_num_dims=2,
        net=CartpoleValueNet(obs_shapes={'observation': env.observation_space.spaces['observation'].shape})
    )
}
Instantiating the Critic. This step is analogous to the instantiation of the policy above. In Maze, critics can have different forms (see Value Functions (Critics)). Here, we use a simple shared critic. Shared means that the same critic will be used for all sub-steps (in a multi-step setting) and all actors. Since we only have one actor in this example and are in a one-step setting, the TorchSharedStateCritic reduces to a vanilla StateCritic (aka a state-dependent value function).
critic_composer = SharedStateCriticComposer(
observation_spaces_dict=env.observation_spaces_dict,
agent_counts_dict={0: 1},
networks=value_networks,
stack_observations=True
)
Training
Having instantiated customized policy and critic composers we can train our model:
rc = run_context.RunContext(
env=cartpole_env_factory,
algorithm=algorithm_config,
policy=policy_composer,
critic=critic_composer
)
rc.train()
Distributed Training
If we want to train in a distributed manner, it is sufficient to pick the appropriate runner. For now, we might want to parallelize by distributing our environments over several processes. This can be done with the local runner, which is straightforward to use:
algorithm_config.rollout_evaluator.eval_env = SubprocVectorEnv([cartpole_env_factory])
rc = run_context.RunContext(
env=cartpole_env_factory,
algorithm=algorithm_config,
policy=policy_composer,
critic=critic_composer,
runner="local"
)
rc.train(n_epochs=1)
Evaluation
We can evaluate our performance with a RolloutEvaluator
. In order for this to work with our environment, we wrap it with a LogStatsWrapper
to ensure it has the logging capabilities required by the RolloutEvaluator
.
evaluator = RolloutEvaluator(
eval_env=LogStatsWrapper.wrap(cartpole_env_factory(), logging_prefix="eval"),
n_episodes=3,
model_selection=None
)
evaluator.evaluate(rc.policy)
Full Python Code¶
Here is the code without documentation for easier copy-pasting:
"""
Training and rollout of a policy in plain Python.
"""
from typing import Sequence, Dict
import gym
import torch
import torch.nn as nn
from maze.api.utils import RunMode
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
from maze.train.parallelization.vector_env.subproc_vector_env import SubprocVectorEnv
from maze.utils.log_stats_utils import setup_logging
from maze.core.agent.torch_actor_critic import TorchActorCritic
from maze.train.trainers.a2c.a2c_trainer import A2C
from maze.train.trainers.common.model_selection.best_model_selection import BestModelSelection
from maze.train.parallelization.vector_env.sequential_vector_env import SequentialVectorEnv
from maze.train.trainers.common.evaluators.rollout_evaluator import RolloutEvaluator
from maze.core.wrappers.log_stats_wrapper import LogStatsWrapper
from maze.perception.models.critics.shared_state_critic_composer import SharedStateCriticComposer
from maze.train.trainers.a2c.a2c_algorithm_config import A2CAlgorithmConfig
from maze.api import run_context
from maze.distributions.distribution_mapper import DistributionMapper
from maze.perception.blocks.general.torch_model_block import TorchModelBlock
from maze.perception.models.policies import ProbabilisticPolicyComposer
def cartpole_env_factory() -> GymMazeEnv:
""" Env factory for the cartpole MazeEnv """
# Registered gym environments can be instantiated first and then provided to GymMazeEnv:
cartpole_env = gym.make("CartPole-v0")
maze_env = GymMazeEnv(env=cartpole_env)
# Another possibility is to supply the gym env string to GymMazeEnv directly:
# maze_env = GymMazeEnv(env="CartPole-v0")
return maze_env
class CartpolePolicyNet(nn.Module):
""" Simple linear policy net for demonstration purposes. """
def __init__(self, obs_shapes: Dict[str, Sequence[int]], action_logit_shapes: Dict[str, Sequence[int]]):
super().__init__()
self.net = nn.Sequential(
nn.Linear(
in_features=obs_shapes['observation'][0],
out_features=action_logit_shapes['action'][0]
)
)
def forward(self, x_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
# Since x_dict has to be a dictionary in Maze, we extract the input for the network.
x = x_dict['observation']
# Do the forward pass.
logits = self.net(x)
# Since the return value has to be a dict again, put the forward pass result into a dict with the
# correct key.
logits_dict = {'action': logits}
return logits_dict
class CartpoleValueNet(nn.Module):
""" Simple linear value net for demonstration purposes. """
def __init__(self, obs_shapes: Dict[str, Sequence[int]]):
super().__init__()
self.value_net = nn.Sequential(nn.Linear(in_features=obs_shapes['observation'][0], out_features=1))
def forward(self, x_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
""" Forward method. """
# The same as for the policy can be said about the value net. Inputs and outputs have to be dicts.
x = x_dict['observation']
value = self.value_net(x)
value_dict = {'value': value}
return value_dict
def train(n_epochs: int) -> int:
"""
Trains agent in pure Python.
:param n_epochs: Number of epochs to train.
:return: 0 if successful.
"""
# Environment setup
# -----------------
env = cartpole_env_factory()
# Algorithm setup
# ---------------
algorithm_config = A2CAlgorithmConfig(
n_epochs=5,
epoch_length=25,
patience=15,
critic_burn_in_epochs=0,
n_rollout_steps=100,
lr=0.0005,
gamma=0.98,
gae_lambda=1.0,
policy_loss_coef=1.0,
value_loss_coef=0.5,
entropy_coef=0.00025,
max_grad_norm=0.0,
device='cpu',
rollout_evaluator=RolloutEvaluator(
eval_env=SequentialVectorEnv([cartpole_env_factory]),
n_episodes=1,
model_selection=None,
deterministic=True
)
)
# Custom model setup
# ------------------
# Policy customization
# ^^^^^^^^^^^^^^^^^^^^
# Policy network.
policy_net = CartpolePolicyNet(
obs_shapes={'observation': env.observation_space.spaces['observation'].shape},
action_logit_shapes={'action': (env.action_space.spaces['action'].n,)}
)
policy_networks = [policy_net]
# Policy distribution.
distribution_mapper = DistributionMapper(action_space=env.action_space, distribution_mapper_config={})
# Policy composer.
policy_composer = ProbabilisticPolicyComposer(
action_spaces_dict=env.action_spaces_dict,
observation_spaces_dict=env.observation_spaces_dict,
# Derive distribution from environment's action space.
distribution_mapper=distribution_mapper,
networks=policy_networks,
# We have only one agent and network, thus this is an empty list.
substeps_with_separate_agent_nets=[],
# We have only one step and one agent.
agent_counts_dict={0: 1}
)
# Critic customization
# ^^^^^^^^^^^^^^^^^^^^
# Value networks.
value_networks = {
0: TorchModelBlock(
in_keys='observation', out_keys='value',
in_shapes=env.observation_space.spaces['observation'].shape,
in_num_dims=[2],
out_num_dims=2,
net=CartpoleValueNet({'observation': env.observation_space.spaces['observation'].shape})
)
}
# Critic composer.
critic_composer = SharedStateCriticComposer(
observation_spaces_dict=env.observation_spaces_dict,
agent_counts_dict={0: 1},
networks=value_networks,
stack_observations=True
)
# Training
# ^^^^^^^^
rc = run_context.RunContext(
env=cartpole_env_factory,
algorithm=algorithm_config,
policy=policy_composer,
critic=critic_composer,
runner="dev"
)
rc.train(n_epochs=n_epochs)
# Distributed training
# ^^^^^^^^^^^^^^^^^^^^
algorithm_config.rollout_evaluator.eval_env = SubprocVectorEnv([cartpole_env_factory])
rc = run_context.RunContext(
env=cartpole_env_factory,
algorithm=algorithm_config,
policy=policy_composer,
critic=critic_composer,
runner="local"
)
rc.train(n_epochs=n_epochs)
# Evaluation
# ^^^^^^^^^^
print("-----------------")
evaluator = RolloutEvaluator(
eval_env=LogStatsWrapper.wrap(cartpole_env_factory(), logging_prefix="eval"),
n_episodes=1,
model_selection=None
)
evaluator.evaluate(rc.policy)
return 0
if __name__ == '__main__':
train(n_epochs=1)
Plain Python Training Example (low-level)¶
This tutorial demonstrates how to train an A2C agent with Maze in plain Python without utilizing RunContext. In the process it introduces and explains some of Maze's most important components and concepts.
This is complementary to the article on high-level training in plain Python, which guides through the same setup (but with RunContext support).
Environment Setup¶
We will first prepare our environment for use with Maze. In order to use Maze’s parallelization capabilities, it is necessary to define a factory function that returns a MazeEnv of your environment. This is easily done for Gym environments:
def cartpole_env_factory():
""" Env factory for the cartpole MazeEnv """
# Registered gym environments can be instantiated first and then provided to GymMazeEnv:
cartpole_env = gym.make("CartPole-v0")
maze_env = GymMazeEnv(env=cartpole_env)
# Another possibility is to supply the gym env string to GymMazeEnv directly:
# maze_env = GymMazeEnv(env="CartPole-v0")
return maze_env
If you have your own environment (that is not a gym.Env) you must transform it into a MazeEnv yourself, as is shown here, and have your factory return that. If it is a custom gym env it can be instantiated with our wrapper as shown above.
We instantiate one environment. This will be used for convenient access to observation and action spaces later.
env = cartpole_env_factory()
observation_space = env.observation_space
action_space = env.action_space
Model Setup¶
Now that the environment setup is done, let us develop the policy and value networks that will be used. We will pay special attention to the format required by Maze. When creating your own models, it is important to know two things:
Maze works with dictionaries throughout, which means that arguments for the constructor and the input and return values of the forward method are dicts with user-defined keys.
Policy network and value network constructors have required arguments: for policy nets, these are obs_shapes and action_logit_shapes; for value nets, this is obs_shapes.
The required format is explained in more detail here. With this in mind, let us create a simple linear mapping network with the required constraints:
class CartpolePolicyNet(nn.Module):
""" Simple linear policy net for demonstration purposes. """
def __init__(self, obs_shapes: Sequence[int], action_logit_shapes: Sequence[int]):
super().__init__()
self.net = nn.Sequential(
nn.Linear(in_features=obs_shapes[0], out_features=action_logit_shapes[0])
)
def forward(self, x_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
# Since x_dict has to be a dictionary in Maze, we extract the input for the network.
x = x_dict['observation']
# Do the forward pass.
logits = self.net(x)
# Since the return value has to be a dict again, put the
# forward pass result into a dict with the correct key.
logits_dict = {'action': logits}
return logits_dict
# Instantiate our custom policy net.
policy_net = CartpolePolicyNet(
obs_shapes=env.observation_space.spaces['observation'].shape,
action_logit_shapes=(env.action_space.spaces['action'].n,)
)
and the corresponding value network:
class CartpoleValueNet(nn.Module):
""" Simple linear value net for demonstration purposes. """
def __init__(self, obs_shapes: Sequence[int]):
super().__init__()
self.value_net = nn.Sequential(nn.Linear(in_features=obs_shapes[0], out_features=1))
def forward(self, x_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
""" Forward method. """
# The same as for the policy can be said about the value
# net: Inputs and outputs have to be dicts.
x = x_dict['observation']
value = self.value_net(x)
value_dict = {'value': value}
return value_dict
Policy Setup
For a policy, we need a parametrization for the policy (provided by the policy network) and a probability distribution we can sample from. We will subsequently define and instantiate each of these.
Policy Network
Instantiate a policy with the correct shapes of observation and action spaces.
policy_net = CartpolePolicyNet(
obs_shapes=observation_space.spaces['observation'].shape,
action_logit_shapes=(action_space.spaces['action'].n,))
We can use one of Maze's capabilities, shape normalization (see ShapeNormalizationBlock), with these models by wrapping them with the TorchModelBlock.
maze_wrapped_policy_net = TorchModelBlock(
in_keys='observation', out_keys='action',
in_shapes=observation_space.spaces['observation'].shape, in_num_dims=[2],
out_num_dims=2, net=policy_net)
Since Maze offers the capability of supporting multiple actors, we need to map each policy_net to its corresponding actor ID. As we have only one policy, this is a trivial mapping:
policy_networks = {0: maze_wrapped_policy_net}
Policy Distribution
Initializing the proper probability distribution for the policy is rather easy with Maze. Simply provide the DistributionMapper with the action space and you automatically get the proper distribution to use.
distribution_mapper = DistributionMapper(action_space=action_space, distribution_mapper_config={})
Optionally, you can specify a different distribution with the distribution_mapper_config argument. Using a Categorical distribution for a discrete action space would be done with
distribution_mapper = DistributionMapper(
action_space=action_space,
distribution_mapper_config=[{
"action_space": gym.spaces.Discrete,
"distribution": "maze.distributions.categorical.CategoricalProbabilityDistribution"}])
Since the standard distribution taken by Maze for a discrete action space is a Categorical distribution anyway (as can be seen here), both definitions of the distribution_mapper have the same result. For more information about the DistributionMapper, see Action Spaces and Distributions.
Instantiating the Policy
We have both necessary ingredients to define a policy: a parametrization, given by the policy network, and a distribution. With these, we can instantiate a policy. This is done with the TorchPolicy class:
torch_policy = TorchPolicy(networks=policy_networks,
distribution_mapper=distribution_mapper,
device='cpu')
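At this point the policy can already be queried for actions, which is handy as a quick sanity check. This is only a sketch assuming the compute_action interface of Maze policies (observation, optional maze_state, deterministic flag); the exact signature may differ slightly between Maze versions:
# query the (yet untrained) policy for a single action
obs = env.reset()
action = torch_policy.compute_action(observation=obs, maze_state=None, deterministic=True)
print(action)  # a dict action, e.g. {'action': 0} for CartPole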
Critic Setup
The setup of a critic (or value function) is similar to the setup of a policy, the main difference being that we do not need a probability distribution.
Value Network
value_net = CartpoleValueNet(obs_shapes=observation_space.spaces['observation'].shape)
maze_wrapped_value_net = TorchModelBlock(
in_keys='observation', out_keys='value',
in_shapes=observation_space.spaces['observation'].shape, in_num_dims=[2],
out_num_dims=2, net=value_net)
value_networks = {0: maze_wrapped_value_net}
Instantiating the Critic
This step is analogous to the instantiation of the policy above. In Maze, critics can have different forms (see Value Functions (Critics)). Here, we use a simple shared critic. Shared means that the same critic will be used for all sub-steps (in a multi-step setting) and all actors. Since we only have one actor in this example and are in a one-step setting, the TorchSharedStateCritic reduces to a vanilla StateCritic (aka a state-dependent value function).
torch_critic = TorchSharedStateCritic(networks=value_networks, obs_spaces_dict=env.observation_spaces_dict,
                                      device='cpu', stack_observations=False)
Initializing the ActorCritic Model.
In Maze, policies and critics are encapsulated by an ActorCritic model. Details about this can be found in Actor-Critics. We will use A2C to train the cartpole env. The correct ActorCritic model to use for A2C is the TorchActorCritic:
actor_critic_model = TorchActorCritic(policy=torch_policy, critic=torch_critic, device='cpu')
Trainer Setup¶
The last steps will be the instantiations of the algorithm and corresponding trainer. We use A2C for this example. The algorithm_config for A2C can be found here. The hyperparameters will be supplied to Maze with an algorithm-dependent AlgorithmConfig object. The one for A2C is A2CAlgorithmConfig. We will use the default parameters, which can also be found here.
algorithm_config = A2CAlgorithmConfig(
n_epochs=5,
epoch_length=25,
patience=15,
critic_burn_in_epochs=0,
n_rollout_steps=100,
lr=0.0005,
gamma=0.98,
gae_lambda=1.0,
policy_loss_coef=1.0,
value_loss_coef=0.5,
entropy_coef=0.00025,
max_grad_norm=0.0,
device='cpu',
rollout_evaluator=RolloutEvaluator(
eval_env=SequentialVectorEnv([cartpole_env_factory]),
n_episodes=1,
model_selection=None,
deterministic=True
)
)
In order to use the distributed trainers, we create a vector environment (i.e., multiple environment instances encapsulated to be stepped simultaneously) using the environment factory function:
train_envs = SequentialVectorEnv(
[cartpole_env_factory for _ in range(2)], logging_prefix="train")
eval_envs = SequentialVectorEnv(
[cartpole_env_factory for _ in range(2)], logging_prefix="eval")
(In this case, we create sequential vector environments, i.e. all environment instances are located in the main process and stepped sequentially. When we are ready to scale the training, we might want to use e.g. sub-process distributed vector environments.)
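When scaling up later, the same factories could be supplied to a sub-process based vector env instead. A minimal sketch, assuming SubprocVectorEnv accepts the same logging_prefix argument as SequentialVectorEnv:
from maze.train.parallelization.vector_env.subproc_vector_env import SubprocVectorEnv

# each environment instance lives in its own process and is stepped in parallel
train_envs = SubprocVectorEnv([cartpole_env_factory for _ in range(2)], logging_prefix="train")
eval_envs = SubprocVectorEnv([cartpole_env_factory for _ in range(2)], logging_prefix="eval")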
For this example, we want to save the parameters of the best model in terms of mean achieved reward. This is done with the BestModelSelection class, an instance of which will be provided to the trainer.
model_selection = BestModelSelection(dump_file="params.pt", model=actor_critic_model)
We can now instantiate an A2C trainer. As in the full code listing below, the training environments are passed in via a RolloutGenerator:
a2c_trainer = A2C(
    rollout_generator=RolloutGenerator(train_envs),
    evaluator=algorithm_config.rollout_evaluator,
    algorithm_config=algorithm_config,
    model=actor_critic_model,
    model_selection=model_selection
)
Train the Agent¶
Before starting the training, we will enable logging by calling
log_dir = '.'
setup_logging(job_config=None, log_dir=log_dir)
Now, we can train the agent.
a2c_trainer.train()
To get an out-of-sample estimate of our performance, we evaluate on the evaluation envs:
a2c_trainer.evaluate(deterministic=False, repeats=1)
Full Python Code¶
Here is the code without documentation for easier copy-pasting:
""" Rollout of a policy in plain Python. """
from typing import Dict, Sequence
import gym
import torch
import torch.nn as nn
from maze.core.agent.torch_actor_critic import TorchActorCritic
from maze.core.agent.torch_policy import TorchPolicy
from maze.core.agent.torch_state_critic import TorchSharedStateCritic
from maze.core.rollout.rollout_generator import RolloutGenerator
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
from maze.distributions.distribution_mapper import DistributionMapper
from maze.perception.blocks.general.torch_model_block import TorchModelBlock
from maze.train.parallelization.vector_env.sequential_vector_env import SequentialVectorEnv
from maze.train.trainers.a2c.a2c_algorithm_config import A2CAlgorithmConfig
from maze.train.trainers.a2c.a2c_trainer import A2C
from maze.train.trainers.common.evaluators.rollout_evaluator import RolloutEvaluator
from maze.train.trainers.common.model_selection.best_model_selection import BestModelSelection
from maze.utils.log_stats_utils import setup_logging
# Environment Setup
# =================
# Environment Factory
# -------------------
# Define environment factory
def cartpole_env_factory():
""" Env factory for the cartpole MazeEnv """
# Registered gym environments can be instantiated first and then provided to GymMazeEnv:
cartpole_env = gym.make("CartPole-v0")
maze_env = GymMazeEnv(env=cartpole_env)
# Another possibility is to supply the gym env string to GymMazeEnv directly:
# maze_env = GymMazeEnv(env="CartPole-v0")
return maze_env
# Model Setup
# ===========
# Policy Network
# --------------
class CartpolePolicyNet(nn.Module):
""" Simple linear policy net for demonstration purposes. """
def __init__(self, obs_shapes: Dict[str, Sequence[int]], action_logit_shapes: Dict[str, Sequence[int]]):
super().__init__()
self.net = nn.Sequential(
nn.Linear(in_features=obs_shapes['observation'][0],
out_features=action_logit_shapes['action'][0])
)
def forward(self, x_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
# Since x_dict has to be a dictionary in Maze, we extract the input for the network.
x = x_dict['observation']
# Do the forward pass.
logits = self.net(x)
# Since the return value has to be a dict again, put the forward pass result into a dict with the
# correct key.
logits_dict = {'action': logits}
return logits_dict
# Value Network
# -------------
class CartpoleValueNet(nn.Module):
""" Simple linear value net for demonstration purposes. """
def __init__(self, obs_shapes: Dict[str, Sequence[int]]):
super().__init__()
self.value_net = nn.Sequential(nn.Linear(in_features=obs_shapes['observation'][0], out_features=1))
def forward(self, x_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
""" Forward method. """
# The same as for the policy can be said about the value net. Inputs and outputs have to be dicts.
x = x_dict['observation']
value = self.value_net(x)
value_dict = {'value': value}
return value_dict
def train(n_epochs):
# Instantiate one environment. This will be used for convenient access to observation
# and action spaces.
env = cartpole_env_factory()
observation_space = env.observation_space
action_space = env.action_space
# Policy Setup
# ------------
# Policy Network
# ^^^^^^^^^^^^^^
# Instantiate policy with the correct shapes of observation and action spaces.
policy_net = CartpolePolicyNet(
obs_shapes={'observation': observation_space.spaces['observation'].shape},
action_logit_shapes={'action': (action_space.spaces['action'].n,)})
maze_wrapped_policy_net = TorchModelBlock(
in_keys='observation', out_keys='action',
in_shapes=observation_space.spaces['observation'].shape, in_num_dims=[2],
out_num_dims=2, net=policy_net)
policy_networks = {0: maze_wrapped_policy_net}
# Policy Distribution
# ^^^^^^^^^^^^^^^^^^^
distribution_mapper = DistributionMapper(
action_space=action_space,
distribution_mapper_config={})
# Optionally, you can specify a different distribution with the distribution_mapper_config argument. Using a
# Categorical distribution for a discrete action space would be done via
distribution_mapper = DistributionMapper(
action_space=action_space,
distribution_mapper_config=[{
"action_space": gym.spaces.Discrete,
"distribution": "maze.distributions.categorical.CategoricalProbabilityDistribution"}])
# Instantiating the Policy
# ^^^^^^^^^^^^^^^^^^^^^^^^
torch_policy = TorchPolicy(networks=policy_networks, distribution_mapper=distribution_mapper, device='cpu')
# Value Function Setup
# --------------------
# Value Network
# ^^^^^^^^^^^^^
value_net = CartpoleValueNet(obs_shapes={'observation': observation_space.spaces['observation'].shape})
maze_wrapped_value_net = TorchModelBlock(
in_keys='observation', out_keys='value',
in_shapes=observation_space.spaces['observation'].shape, in_num_dims=[2],
out_num_dims=2, net=value_net)
value_networks = {0: maze_wrapped_value_net}
# Instantiate the Value Function
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch_critic = TorchSharedStateCritic(networks=value_networks, obs_spaces_dict=env.observation_spaces_dict,
device='cpu', stack_observations=False)
# Initializing the ActorCritic Model.
# -----------------------------------
actor_critic_model = TorchActorCritic(policy=torch_policy, critic=torch_critic, device='cpu')
# Instantiating the Trainer
# =========================
algorithm_config = A2CAlgorithmConfig(
n_epochs=n_epochs,
epoch_length=25,
patience=15,
critic_burn_in_epochs=0,
n_rollout_steps=100,
lr=0.0005,
gamma=0.98,
gae_lambda=1.0,
policy_loss_coef=1.0,
value_loss_coef=0.5,
entropy_coef=0.00025,
max_grad_norm=0.0,
device='cpu',
rollout_evaluator=RolloutEvaluator(
eval_env=SequentialVectorEnv([cartpole_env_factory]),
n_episodes=1,
model_selection=None,
deterministic=True
)
)
# Distributed Environments
# ------------------------
# In order to use the distributed trainers, the previously created env factory is supplied to one of Maze's
# distribution classes:
train_envs = SequentialVectorEnv([cartpole_env_factory for _ in range(2)], logging_prefix="train")
eval_envs = SequentialVectorEnv([cartpole_env_factory for _ in range(2)], logging_prefix="eval")
# Initialize best model selection.
model_selection = BestModelSelection(dump_file="params.pt", model=actor_critic_model)
a2c_trainer = A2C(rollout_generator=RolloutGenerator(train_envs),
evaluator=algorithm_config.rollout_evaluator,
algorithm_config=algorithm_config,
model=actor_critic_model,
model_selection=model_selection)
# Train the Agent
# ===============
# Before starting the training, we will enable logging by calling
log_dir = '.'
setup_logging(job_config=None, log_dir=log_dir)
# Now, we can train the agent.
a2c_trainer.train()
return 0
if __name__ == '__main__':
train(n_epochs=5)
Tensorboard and Command Line Logging¶
This page gives a brief overview of the Tensorboard and command line logging facilities of Maze. We will show examples based on the cutting-2D Maze environment to make things a bit more interesting.
To understand the underlying concepts we recommend reading the sections on event and KPI logging as well as on the Maze event system.
Tensorboard Logging¶
To watch the training progress with Tensorboard start it by running:
tensorboard --logdir outputs/
and view it with your browser at http://localhost:6006/.
You will get an output similar to the one shown in the image below.

To keep everything organized and avoid having to browse through tons of pages we group the contained items into semantically connected sections:
Since Maze allows you to use different environments for training and evaluation, each logging section has a train_ or eval_ prefix to show if the corresponding stats were logged as part of the training or the evaluation environment.
The BaseEnvEvents sections (i.e., eval_BaseEnvEvents and train_BaseEnvEvents) contain general statistics such as rewards or step counts. These sections are always present, independent of the environment used.
Other sections are specific to the environment used. In the example above, these are the CuttingEvents and the InventoryEvents.
Finally, there is one more section containing stats of the trainer used, called train_NameOfTrainerEvents. It contains statistics such as policy loss, gradient norm or value loss and is not present for the evaluation environment.
The gallery below shows some additional useful examples and features of the Maze Tensorboard log (click the images to display them in large).
Command Line Logging¶
Whenever you start a training run you will also get a command line output similar to the one shown below. Analogously to the Tensorboard log, Maze distinguishes between train and eval outputs and groups the items into semantically connected output blocks.
step|path | value
=====|============================================================================|====================
1|train MultiStepActorCritic..time_rollout ······················| 1.091
1|train MultiStepActorCritic..learning_rate ······················| 0.000
1|train MultiStepActorCritic..policy_loss 0 | -0.000
1|train MultiStepActorCritic..policy_grad_norm 0 | 0.001
1|train MultiStepActorCritic..policy_entropy 0 | 1.593
1|train MultiStepActorCritic..policy_loss 1 | -0.000
1|train MultiStepActorCritic..policy_grad_norm 1 | 0.008
1|train MultiStepActorCritic..policy_entropy 1 | 0.295
1|train MultiStepActorCritic..critic_value 0 | -0.199
1|train MultiStepActorCritic..critic_value_loss 0 | 116.708
1|train MultiStepActorCritic..critic_grad_norm 0 | 0.500
1|train MultiStepActorCritic..time_update ······················| 1.642
1|train DiscreteActionEvents action substep_0/piece_idx | [len:4000, μ:54.8]
1|train BaseEnvEvents reward median_step_count | 200.000
1|train BaseEnvEvents reward mean_step_count | 200.000
1|train BaseEnvEvents reward total_step_count | 4000.000
1|train BaseEnvEvents reward total_episode_count | 20.000
1|train BaseEnvEvents reward episode_count | 20.000
1|train BaseEnvEvents reward std | 1.465
1|train BaseEnvEvents reward mean | -71.950
1|train BaseEnvEvents reward min | -75.000
1|train BaseEnvEvents reward max | -70.000
1|train DiscreteActionEvents action substep_1/order | [len:4000, μ:0.5]
1|train DiscreteActionEvents action substep_1/rotation | [len:4000, μ:0.5]
1|train InventoryEvents piece_replenished mean_episode_total | 71.950
1|train InventoryEvents pieces_in_inventory step_max | 163.000
1|train InventoryEvents pieces_in_inventory step_mean | 69.946
1|train CuttingEvents valid_cut mean_episode_total | 200.000
1|train BaseEnvEvents kpi max/raw_piece_usage_..| 0.375
1|train BaseEnvEvents kpi min/raw_piece_usage_..| 0.350
1|train BaseEnvEvents kpi std/raw_piece_usage_..| 0.007
1|train BaseEnvEvents kpi mean/raw_piece_usage..| 0.360
Time required for epoch: 19.43s
Update epoch - 1
step|path | value
=====|============================================================================|====================
2|eval DiscreteActionEvents action substep_0/piece_idx | [len:800, μ:53.2]
2|eval BaseEnvEvents reward median_step_count | 200.000
2|eval BaseEnvEvents reward mean_step_count | 200.000
2|eval BaseEnvEvents reward total_step_count | 1600.000
2|eval BaseEnvEvents reward total_episode_count | 8.000
2|eval BaseEnvEvents reward episode_count | 4.000
2|eval BaseEnvEvents reward std | 1.414
2|eval BaseEnvEvents reward mean | -71.000
2|eval BaseEnvEvents reward min | -73.000
2|eval BaseEnvEvents reward max | -69.000
2|eval DiscreteActionEvents action substep_1/order | [len:800, μ:0.5]
2|eval DiscreteActionEvents action substep_1/rotation | [len:800, μ:0.5]
2|eval InventoryEvents piece_replenished mean_episode_total | 71.000
2|eval InventoryEvents pieces_in_inventory step_max | 145.000
2|eval InventoryEvents pieces_in_inventory step_mean | 68.031
2|eval CuttingEvents valid_cut mean_episode_total | 200.000
2|eval BaseEnvEvents kpi max/raw_piece_usage_..| 0.365
2|eval BaseEnvEvents kpi min/raw_piece_usage_..| 0.345
2|eval BaseEnvEvents kpi std/raw_piece_usage_..| 0.007
2|eval BaseEnvEvents kpi mean/raw_piece_usage..| 0.355
Where to Go Next¶
For further details please see the reference documentation.
For the bigger picture we refer to event and KPI logging as well as the Maze event system.
You might also be interested in observation distribution logging and action distribution logging.
Event and KPI Logging¶
Monitoring only standard metrics such as reward or episode step count is not always sufficiently informative about the agent's behaviour and the problem at hand. To tackle this issue and to enable better inspection and logging tools for both agents and environments, we introduce an event and key performance indicator (KPI) logging system. It is based on the more general event system and allows us to log and monitor environment-specific metrics.
The figure below shows a conceptual overview of the logging system. In the remainder of this page we will go through the components in more detail.

Events¶
In this section we describe the event logging system from a usage perspective. To understand how this is embedded in the broader context of a Maze environment, we refer to the environments and KPI section of our step by step tutorial as well as the dedicated section on the underlying event system.
In general, events can be defined for any component involved in the RL process (e.g., environments, agents, …). They get fired by the respective component whenever they occur during the agent-environment interaction loop. For logging, events are collected and aggregated via the LogStatsWrapper. To provide full flexibility, Maze allows you to customize which statistics are computed at which stage of the aggregation process via event decorators (step, episode, epoch).
The code snippet below contains an example for an event called invalid_piece_selected, borrowed from the cutting-2D tutorial.
class CuttingEvents(ABC):
"""Events related to the cutting process."""
@define_epoch_stats(np.mean, output_name="mean_episode_total")
@define_episode_stats(sum)
@define_step_stats(len)
def invalid_piece_selected(self):
"""An invalid piece is selected for cutting."""
The snippet defines the following statistics aggregation hierarchy:
Step Statistics [@define_step_stats(len)]: in each environment step, events \(e_i\) are collected as lists of events \(\{e_i\}\). The function len associated with the decorator counts how often such an event occurred in the current step, \(Stats_{Step}=|\{e_i\}|\) (e.g., the length of the invalid_piece_selected event list).
Episode Statistics [@define_episode_stats(sum)]: defines how the \(S\) step statistics are aggregated into episode statistics (e.g., by simply summing them up: \(Stats_{Episode}=\sum^S Stats_{Step}\)).
Epoch Statistics [@define_epoch_stats(np.mean, output_name="mean_episode_total")]: a training epoch consists of \(N\) episodes. This stage defines how these \(N\) episode statistics are averaged to epoch statistics (e.g., the mean of the contained episodes: \(Stats_{Epoch}=(\sum^N Stats_{Episode})/N\)).
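As a purely illustrative computation (independent of the Maze API), the aggregation defined by these three decorators behaves as follows on toy data:
import numpy as np

# events fired in each of S=3 environment steps of one episode (toy data)
events_per_step = [["invalid_piece_selected"] * 2, [], ["invalid_piece_selected"]]

# step statistics: len counts the events fired per step
stats_step = [len(e) for e in events_per_step]   # [2, 0, 1]

# episode statistics: sum over the step statistics
stats_episode = sum(stats_step)                  # 3

# epoch statistics: mean over N episodes (here N=2, the second episode's total given directly)
stats_epoch = np.mean([stats_episode, 1])        # 2.0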
The figure below provides a visual summary of the entire event statistics aggregation hierarchy as well as its relation to KPIs which will be explained in the next section. In Tensorboard and on the command line these events get then logged in dedicated sections (e.g., as CuttingEvents).

Key Performance Indicators (KPIs)¶
In applied RL settings the reward is not always the metric we ultimately aim to optimize from an economic perspective. Sometimes rewards are heavily shaped to get the agent to learn the right behaviour, which makes them hard to interpret for humans. For such cases Maze supports computing and logging additional Key Performance Indicators (KPIs) along with the reward via the KpiCalculator implemented as part of the CoreEnv (like reward, KPIs are logged as BaseEnvEvents).
In contrast to events, KPIs are computed in aggregated form at the end of an episode, triggered by the reset() method of the LogStatsWrapper. This is why we can compute them in a normalized fashion (e.g., divided by the total number of steps in an episode). Conceptually, KPIs live on the same level as episode statistics in the logging hierarchy (see figure above).
For further details on how to implement a concrete KPI calculator we refer to the KPI section of our tutorial.
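As a rough, hedged sketch of what such a calculator can look like: the base-class location and the calculate_kpis signature below are assumptions based on this documentation and the tutorial, and the attributes accessed on its arguments are purely hypothetical placeholders:
from typing import Dict

from maze.core.log_events.kpi_calculator import KpiCalculator

class CuttingKpiCalculator(KpiCalculator):
    """ Hypothetical KPI calculator reporting raw piece usage per step. """

    def calculate_kpis(self, episode_event_log, last_maze_state) -> Dict[str, float]:
        # normalize by the episode length so the KPI is comparable across episodes
        n_steps = len(episode_event_log.step_event_logs)     # assumed attribute of the episode event log
        raw_pieces_used = last_maze_state.n_raw_pieces_cut   # hypothetical attribute of the maze state
        return {"raw_piece_usage_per_step": raw_pieces_used / max(n_steps, 1)}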
Plain Python Configuration¶
When working with the CLI and Hydra configs all components necessary for logging are automatically instantiated under the hood. In case you would like to test or run your logging setup directly from Python you can start with the snippet below.
from docs.tutorial_maze_env.part04_events.env.maze_env import maze_env_factory
from maze.utils.log_stats_utils import SimpleStatsLoggingSetup
from maze.core.wrappers.log_stats_wrapper import LogStatsWrapper
# init maze environment
env = maze_env_factory(max_pieces_in_inventory=200, raw_piece_size=[100, 100],
static_demand=(30, 15))
# wrap environment with logging wrapper
env = LogStatsWrapper(env, logging_prefix="main")
# register a console writer and connect the writer to the statistics logging system
with SimpleStatsLoggingSetup(env):
# reset environment and run interaction loop
obs = env.reset()
for i in range(15):
action = env.action_space.sample()
obs, reward, done, info = env.step(action)
To get access to event and KPI logging we need to wrap the environment with the LogStatsWrapper. To simplify the statistics logging setup we rely on the SimpleStatsLoggingSetup helper class.
When running the script you will get an output as shown below. Note that statistics of both events and KPIs are printed along with default reward or action statistics.
step|path | value
=====|==========================================================================|====================
1|main DiscreteActionEvents action substep_0/order | [len:15, μ:0.5]
1|main DiscreteActionEvents action substep_0/piece_idx | [len:15, μ:82.3]
1|main DiscreteActionEvents action substep_0/rotation | [len:15, μ:0.7]
1|main BaseEnvEvents reward median_step_count | 15.000
1|main BaseEnvEvents reward mean_step_count | 15.000
1|main BaseEnvEvents reward total_step_count | 15.000
1|main BaseEnvEvents reward total_episode_count | 1.000
1|main BaseEnvEvents reward episode_count | 1.000
1|main BaseEnvEvents reward std | 0.000
1|main BaseEnvEvents reward mean | -29.000
1|main BaseEnvEvents reward min | -29.000
1|main BaseEnvEvents reward max | -29.000
1|main InventoryEvents piece_replenished mean_episode_total | 3.000
1|main InventoryEvents pieces_in_inventory step_max | 200.000
1|main InventoryEvents pieces_in_inventory step_mean | 200.000
1|main CuttingEvents invalid_cut mean_episode_total | 14.000
1|main InventoryEvents piece_discarded mean_episode_total | 2.000
1|main CuttingEvents valid_cut mean_episode_total | 1.000
1|main BaseEnvEvents kpi max/raw_piece_usage_..| 0.000
1|main BaseEnvEvents kpi min/raw_piece_usage_..| 0.000
1|main BaseEnvEvents kpi std/raw_piece_usage_..| 0.000
1|main BaseEnvEvents kpi mean/raw_piece_usage..| 0.000
Where to Go Next¶
You can learn more about the general event system.
For a more implementation-oriented summary you can visit the events and KPI section of our tutorial.
To see another application of the event system you can read up on reward customization and shaping.
Action Distribution Visualization¶
There are situations where it turns out to be extremely useful to watch the evolution of an agent’s sampling behaviour throughout the training process. Looking at the action sampling distribution often provides a first intuition about the agent’s behaviour without the need to look at individual rollouts.
However, most importantly, it is a great debugging tool, immediately revealing if:
the weights of the policy collapsed during training (e.g., the agent always samples the same actions even though this does not make sense for the environment at hand).
observations are properly normalized and the weights of the policy are initialized accordingly to result in a healthy initial sampling behaviour of the untrained model (e.g., each discrete action is taken a similar number of times when starting training).
biasing the weights of the policy output layer results in the expected sampling behaviour (e.g., initially sampling an action twice as often as the remaining ones).
the agent actually starts learning (i.e., the sampling distribution changes throughout the training epochs).
To activate action logging you only have to add the
MazeEnvMonitoringWrapper
to your environment wrapper stack in your yaml config:
# @package wrappers
maze.core.wrappers.monitoring_wrapper.MazeEnvMonitoringWrapper:
observation_logging: false
action_logging: true
reward_logging: false
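If you are working in plain Python rather than with yaml configs, the same wrapper can be applied directly, mirroring the observation logging snippet shown later on this page:
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
from maze.core.wrappers.monitoring_wrapper import MazeEnvMonitoringWrapper

env = GymMazeEnv(env="LunarLander-v2")
env = MazeEnvMonitoringWrapper.wrap(env, observation_logging=False, action_logging=True, reward_logging=False)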
Action sampling distributions are then visualized on a per-epoch basis in the IMAGES tab of Tensorboard. By using the slider above the images you can step through the training epochs and see how the sampling distribution evolves over time.
Discrete and Multi Binary Actions¶
Each action space has a dedicated visualization assigned. Discrete and multi-binary action spaces are visualized via histograms. The example below shows an action sampling distribution for the discrete version of LunarLander-v2. The indices on the x-axis correspond to the available actions:
Action \(a_0\) - do nothing
Action \(a_1\) - fire left orientation engine
Action \(a_2\) - fire main engine
Action \(a_3\) - fire right orientation engine

We can see that action \(a_2\) (fire main engine) is taken most often, which is reasonable for this environment.
Continuous Actions¶
Continuous actions (Box spaces) are visualized via violin plots. The example below shows an action sampling distribution for LunarLanderContinuous-v2. The indices on the x-axis correspond to the available actions:
Action \(a_1\) - controls the main engine:
\(a_1 \in [-1, 0]\): off
\(a_1 \in (0, 1]\): throttle from 50% to 100% power (can't work with less than 50%).
Action \(a_2\) - controls the orientation engines:
\(a_2 \in [-1.0, -0.5]\): fire left engine
\(a_2 \in [0.5, 1.0]\): fire right engine
\(a_2 \in (-0.5, 0.5)\): off

For the first action, corresponding to the main engine, values closer to 1.0 are sampled more often which is similar to the discrete case above.
Where to Go Next¶
You might also be interested in logging observation distributions.
Observation Logging¶
Maze provides the following options to monitor and inspect the observations presented to your policy and value networks throughout the training process:
Warning
Observation visualization and logging are supported as opt-in features via dedicated wrappers. We recommend using them only for debugging and inspection purposes. Once everything is on track and training works as expected, we suggest removing (deactivating) the wrappers, especially when dealing with environments with large observations. If you forget to remove them, training might get slow and the memory consumption of Tensorboard might explode!
Observation Distribution Visualization¶
Watching the evolution of distributions and value ranges of observations is especially useful for debugging your experiments and training runs as it reveals if:
observations stay within an expected value range.
observation normalization is applied correctly.
observations drift as the agent’s behaviour evolves throughout training.
To activate observation logging you only have to add the
MazeEnvMonitoringWrapper
to your environment wrapper stack in your yaml config:
# @package wrappers
maze.core.wrappers.monitoring_wrapper.MazeEnvMonitoringWrapper:
observation_logging: true
action_logging: false
reward_logging: false
If you are using plain Python you can start with the code snippet below.
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
from maze.core.wrappers.monitoring_wrapper import MazeEnvMonitoringWrapper
env = GymMazeEnv(env="CartPole-v0")
env = MazeEnvMonitoringWrapper.wrap(env, observation_logging=True, action_logging=False, reward_logging=False)
For both cases observations will be logged and distribution plots will be added to Tensorboard.
Maze visualizes observations on a per-epoch basis in the DISTRIBUTIONS and HISTOGRAMS tab of Tensorboard. By using the slider above the graphs you can step through the training epochs and see how the observation distribution evolves over time.
Below you see an example for both versions (just click the figure to view it in large).


Note that two different versions of the observation distribution are logged:
observation_original: distribution of the original observation returned by the environment.
observation_processed: distribution of the observation after processing (e.g. pre-processing or normalization).
This is useful to verify if the applied observation processing steps yield the expected result.
Observation Visualization¶
Maze additionally provides the option to directly visualize observations presented to your policy and value networks as images in Tensorboard.
To activate observation visualization you only have to add the
ObservationVisualizationWrapper
to your environment wrapper stack in your yaml config:
# @package wrappers
maze.core.wrappers.observation_visualization_wrapper.ObservationVisualizationWrapper:
plot_function: my_project.visualization_functions.plot_1c_image_stack
and provide a reference to a custom plotting function (here, plot_1c_image_stack
).
from typing import List, Optional, Tuple
import numpy as np
import matplotlib.pyplot as plt
def plot_1c_image_stack(value: List[np.ndarray], groups: Tuple[str, str], **kwargs) -> Optional[plt.Figure]:
"""Plots a stack of single channel images with shape [N_STACK x H x W] using imshow.
:param value: A list of image stacks.
:param groups: A tuple containing step key and observation name.
:param kwargs: Additional plotting relevant arguments.
"""
# extract step key and observation name to enter appropriate plotting branch
step_key, obs_name = groups
fig = None
# check which observation of the dict-space to visualize
if step_key == 'step_key_0' and obs_name == 'observation-rgb2gray-resize_img':
# randomly select one observation
idx = np.random.randint(0, len(value))
obs = value[idx]
assert obs.ndim == 3
n_channels = obs.shape[0]
min_val, max_val = np.min(obs), np.max(obs)
# plot the observation
fig = plt.figure(figsize=(max(5, 5 * n_channels), 5))
for i, img in enumerate(obs):
plt.subplot(1, n_channels, i+1)
plt.imshow(img, interpolation="nearest", vmin=min_val, vmax=max_val, cmap="magma")
plt.colorbar()
return fig
The function above visualizes the observation observation-rgb2gray-resize_img (a single-channel image stack) as a subplot containing three individual images:

Where to Go Next¶
You might also be interested in logging action distributions.
You can learn more about observation pre-processing and observation normalization.
Runner Concept¶
In Maze, Runners are the entities responsible for launching and administering any job you start from the command line (such as training or rollouts). They interpret the configuration and make sure the appropriate components (models, trainers, etc.) are created, configured, and launched.
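For instance, the runner can be selected as a plain configuration override on the command line. This is only a sketch; the available runner names depend on the job type and your Maze version:
$ maze-run -cn conf_train env.name=CartPole-v0 algorithm=a2c runner=dev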
For a more detailed description of the runner concept, see Hydra overview. If you need to write custom runners for your project, see the documentation for custom configuration.