Maze: Applied Reinforcement Learning with Python

Maze is an application-oriented Reinforcement Learning framework with the vision to:
  • Enable AI-based optimization for a wide range of industrial decision processes.
  • Make RL as a technology accessible to industry and developers.
Our ultimate goal is to cover the complete development life cycle of RL applications, ranging from simulation engineering to agent development, training and deployment.

Getting Started

Installation

To install Maze with pip, run:

pip install -U maze-rl

Note

Pip does not install PyTorch; you need to make sure it is available in your Python environment.
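One common way to install it (assuming the default build is sufficient for your setup; see the official PyTorch installation instructions for platform-specific builds) is:

pip install torch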

Note

Maze is compatible with Python 3.7 to 3.9. We encourage you to start with Python 3.7, as many popular environments like Atari or Box2D cannot easily be installed in newer Python environments. If you use a Python 3.9 environment, you might need to install a few additional dependencies because of this OpenAI gym issue (on Debian systems: sudo apt install libjpeg8-dev zlib1g-dev; more info on building Pillow).

If you want to use RLlib in combination with Maze, optionally install it with:

pip install ray[rllib]==1.4.1 tensorflow

(Installing RLlib is only required if you would like to use the Maze RLlib Runner)


To install the bleeding-edge development version from GitHub, first clone the repository:

git clone https://github.com/enlite-ai/maze.git
cd maze

Finally, install the project with pip in development mode and you are ready to start developing.

pip install -e .

Alternatively, you can install directly from the GitHub repository with pip:

pip install git+https://github.com/enlite-ai/maze.git

A First Example

This example shows how to train and roll out a policy for the CartPole environment with A2C. It also gives a small glimpse into the Maze framework.

Training and Rollouts

To train a policy with the synchronous advantage actor-critic (A2C), run:

$ maze-run -cn conf_train env.name=CartPole-v0 algorithm=a2c algorithm.n_epochs=5

All training outputs including model weights will be stored in outputs/<exp-dir>/<time-stamp> (for example outputs/gym_env-flatten_concat-a2c-None-local/2021-02-23_23-09-25/).

To perform rollouts for evaluating the trained policy, run:

$ maze-run env.name=CartPole-v0 policy=torch_policy input_dir=outputs/<exp-dir>/<time-stamp>

This performs 50 rollouts and prints the resulting performance statistics to the command line:

 step|path                                                              |             value
=====|==================================================================|==================
    1|rollout_stats  DiscreteActionEvents  action  substep_0/action     | [len:7900, μ:0.5]
    1|rollout_stats  BaseEnvEvents         reward  median_step_count    |           157.500
    1|rollout_stats  BaseEnvEvents         reward  mean_step_count      |           158.000
    1|rollout_stats  BaseEnvEvents         reward  total_step_count     |          7900.000
    1|rollout_stats  BaseEnvEvents         reward  total_episode_count  |            50.000
    1|rollout_stats  BaseEnvEvents         reward  episode_count        |            50.000
    1|rollout_stats  BaseEnvEvents         reward  std                  |            31.843
    1|rollout_stats  BaseEnvEvents         reward  mean                 |           158.000
    1|rollout_stats  BaseEnvEvents         reward  min                  |            83.000
    1|rollout_stats  BaseEnvEvents         reward  max                  |           200.000

To see the policy directly in action you can also perform sequential rollouts with rendering:

$ maze-run env.name=CartPole-v0 policy=torch_policy input_dir=outputs/<exp-dir>/<time-stamp> \
  runner=sequential runner.render=true

Note

Managed rollouts are not yet fully supported by our Python API, but will follow soon.

[Figure: CartPole rollout rendering]

Tensorboard

To watch the training progress with Tensorboard, start it by running:

tensorboard --logdir outputs/

and view it with your browser at http://localhost:6006/.

[Figure: Tensorboard screenshot]

Training Outputs

For easier reproducibility Maze writes the full configuration compiled with Hydra to the command line and preserves it in the TEXT tab of Tensorboard along with the original training command.

algorithm:
  critic_burn_in_epochs: 0
  deterministic_eval: false
  device: cpu
  entropy_coef: 0.00025
  epoch_length: 25
  eval_repeats: 2
  gae_lambda: 1.0
  gamma: 0.98
  lr: 0.0005
  max_grad_norm: 0.0
  n_epochs: 5
  n_rollout_steps: 100
  patience: 15
  policy_loss_coef: 1.0
  value_loss_coef: 0.5
env:
  _target_: maze.core.wrappers.maze_gym_env_wrapper.make_gym_maze_env
  name: CartPole-v0
input_dir: ''
log_base_dir: outputs
model:
...

You will also find PDFs showing the inference graphs of the policy and critic networks in the experiment output directory. This turns out to be extremely useful when playing around with model architectures or when returning to experiments at a later stage.

[Figures: CartPole policy and critic network inference graphs]

Maze - Step by Step

This tutorial provides a step-by-step guide explaining how to implement your own Maze environment and get the best out of its features. We will do this based on the online version of the Guillotine 2D Cutting Stock Problem, as it is still relatively simple but exhibits the problem structure required to introduce all relevant Maze concepts.

Before diving into this tutorial we recommend reading up on the Maze Environment Hierarchy. You can of course also do this along the way, following the provided links to explanations of the required concepts as we get there.

The remainder of this tutorial is structured as follows:

Cutting-2D Problem Specification

This page introduces the problem we would like to address with a Deep Reinforcement Learning agent: an online version of the Guillotine 2D Cutting Stock Problem.

Description of Problem:

[Figure: Problem overview]
  • In each step there is one new incoming customer order generated according to a certain demand pattern.

  • This customer order has to be fulfilled by cutting the exact x/y-dimensions from a set of available candidate pieces in the inventory.

  • A new raw piece is transferred to the inventory every time the current raw piece in inventory is used to cut and deliver a customer order.

  • The goal is to use as few raw pieces as possible throughout the episode, which can be achieved by following a clever cutting policy.

Agent-Environment Interaction Loop:

To make the problem more explicit from an RL perspective we formulate it according to the agent-environment interaction loop shown below.

[Figure: Agent-environment interaction loop]
  • The State contains the dimensions of the currently pending customer order and all pieces in inventory.

  • The Reward is specified to discourage the usage of raw inventory pieces.

  • The Action is a joint action consisting of the following components (see image below for details):

    • Action \(a_0\): Cutting piece selection (decides which piece from inventory to use for cutting)

    • Action \(a_1\): Cutting orientation selection (decides the orientation of the customer order)

    • Action \(a_2\): Cutting order selection (decides which cut to take first; x or y)

[Figure: Cutting action parameters]
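For illustration, such a joint action can be represented as a dictionary with one entry per component. The key names below anticipate the dict action space defined later in this tutorial and are meant purely as a sketch:

action = {
    "piece_idx": 3,     # action a_0: which inventory piece to cut
    "cut_rotation": 0,  # action a_1: keep the ordered orientation
    "cut_order": 1,     # action a_2: cut along the y-axis first
}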

Given this description of the problem we will now proceed with implementing a corresponding simulation.

Implementing the CoreEnv

The complete code for this part of the tutorial can be found here

# file structure
- cutting_2d
    - main.py
    - env
        - core_env.py
        - inventory.py
        - maze_state.py
        - maze_action.py
CoreEnv

The first component we need to implement is the Core Environment, which defines the main mechanics and functionality of the environment.

For this example we will call it Cutting2DCoreEnvironment. As for any other Gym environment we need to implement several methods according to the CoreEnv interface. We will start with the very basic components and add more and more features (complexity) throughout this tutorial:

  • step(): Implements the cutting mechanics.

  • reset(): Resets the environment as well as the piece inventory.

  • seed(): Sets the random state of the environment for reproducibility.

  • close(): Can be used for cleanup.

  • get_maze_state(): Returns the current MazeState of the environment.

You can find the implementation of the basic version of the Cutting2DCoreEnvironment below.

env/core_env.py
from typing import Union, Tuple, Dict, Any

import numpy as np

from maze.core.env.core_env import CoreEnv
from maze.core.env.structured_env import ActorID

from .maze_state import Cutting2DMazeState
from .maze_action import Cutting2DMazeAction
from .inventory import Inventory


class Cutting2DCoreEnvironment(CoreEnv):
    """Environment for cutting 2D pieces based on the customer demand. Works as follows:
     - Keeps inventory of 2D pieces available for cutting and fulfilling the demand.
     - Produces a new demand for one piece in every step (here a static demand).
     - The agent should decide which piece from inventory to cut (and how) to fulfill the given demand.
     - What remains from the cut piece is put back in inventory.
     - All the time, one raw (full-size) piece is available in inventory.
       (If it gets cut, it is replenished in the next step.)
     - Rewards are calculated to motivate the agent to consume as few raw pieces as possible.
     - If inventory gets full, the oldest pieces get discarded.

    :param max_pieces_in_inventory: Size of the inventory.
    :param raw_piece_size: Size of a fresh raw (= full-size) piece.
    :param static_demand: Order to issue in each step.
    """

    def __init__(self, max_pieces_in_inventory: int, raw_piece_size: (int, int),
                 static_demand: (int, int)):
        super().__init__()

        self.max_pieces_in_inventory = max_pieces_in_inventory
        self.raw_piece_size = tuple(raw_piece_size)
        self.current_demand = static_demand

        # setup environment
        self._setup_env()

    def _setup_env(self):
        """Setup environment."""
        self.inventory = Inventory(self.max_pieces_in_inventory, self.raw_piece_size)
        self.inventory.replenish_piece()

    def step(self, maze_action: Cutting2DMazeAction) \
            -> Tuple[Cutting2DMazeState, np.array, bool, Dict[Any, Any]]:
        """Summary of the step (simplified, not necessarily respecting the actual order in the code):
        1. Check if the selected piece to cut is valid (i.e. in inventory, large enough etc.)
        2. Attempt the cutting
        3. Replenish a fresh piece if needed and return an appropriate reward

        :param maze_action: Cutting MazeAction to take.
        :return: maze_state, reward, done, info
        """

        info, reward = {}, 0
        replenishment_needed = False

        # check if valid piece id was selected
        if maze_action.piece_id >= self.inventory.size():
            info['error'] = 'piece_id_out_of_bounds'
        # perform cutting
        else:
            piece_to_cut = self.inventory.pieces[maze_action.piece_id]

            # attempt the cut
            if self.inventory.cut(maze_action, self.current_demand):
                info['msg'] = "valid_cut"
                replenishment_needed = piece_to_cut == self.raw_piece_size
            else:
                # assign a negative reward for invalid cutting attempts
                info['error'] = "invalid_cut"
                reward = -2

        # check if replenishment is required
        if replenishment_needed:
            self.inventory.replenish_piece()
            # assign negative reward if a piece has to be replenished
            reward = -1

        # compile env state
        maze_state = self.get_maze_state()

        return maze_state, reward, False, info

    def get_maze_state(self) -> Cutting2DMazeState:
        """Returns the current Cutting2DMazeState of the environment."""
        return Cutting2DMazeState(self.inventory.pieces, self.max_pieces_in_inventory,
                                  self.current_demand, self.raw_piece_size)

    def reset(self) -> Cutting2DMazeState:
        """Resets the environment to initial state."""
        self._setup_env()
        return self.get_maze_state()

    def close(self):
        """No additional cleanup necessary."""

    def seed(self, seed: int) -> None:
        """Seed random state of environment."""
        # No randomness in the env at this point
        pass

    # --- let's ignore everything below this line for now ---

    def get_renderer(self) -> Any:
        pass

    def get_serializable_components(self) -> Dict[str, Any]:
        pass

    def is_actor_done(self) -> bool:
        pass

    def actor_id(self) -> ActorID:
        pass

    def agent_counts_dict(self) -> Dict[Union[str, int], int]:
        pass
Environment Components

To keep the implementation of the core environment short and clean we introduce a dedicated Inventory class providing functionality for:

  • maintaining the inventory of available cutting pieces

  • replenishing new raw inventory pieces if required

  • the cutting logic of the environment

env/inventory.py
from .maze_action import Cutting2DMazeAction


class Inventory:
    """Holds the inventory of 2D pieces and performs cutting.
    :param max_pieces_in_inventory: Size of the inventory. If full, the oldest pieces get discarded.
    :param raw_piece_size: Size of a fresh raw (= full-size) piece.
    """

    def __init__(self, max_pieces_in_inventory: int, raw_piece_size: (int, int)):
        self.max_pieces_in_inventory = max_pieces_in_inventory
        self.raw_piece_size = raw_piece_size

        self.pieces = []

    # == Inventory management ==

    def is_full(self) -> bool:
        """Checks weather all slots in the inventory are in use."""
        return len(self.pieces) == self.max_pieces_in_inventory

    def store_piece(self, piece: (int, int)) -> None:
        """Store the given piece.
        :param piece: Piece to store.
        """
        # If we would run out of storage space, discard the oldest piece first
        if self.is_full():
            self.pieces.pop(0)
        self.pieces.append(piece)

    def replenish_piece(self) -> None:
        """Add a fresh raw piece to inventory."""
        self.store_piece(self.raw_piece_size)

    # == Cutting ==

    def cut(self, maze_action: Cutting2DMazeAction, ordered_piece: (int, int)) -> bool:
        """Attempt to perform the cutting. Remains of the cut piece are put back to inventory.

        :param maze_action: the cutting maze_action to perform
        :param ordered_piece: Dimensions of the piece that we should produce
        :return: True if the cutting was successful, False on error.
        """
        if maze_action.rotate:
            ordered_piece = ordered_piece[::-1]

        # Check the piece ID is valid
        if maze_action.piece_id >= len(self.pieces):
            return False

        # Check whether the cut is possible
        if any([ordered_piece[dim] > available_size for dim, available_size
                in enumerate(self.pieces[maze_action.piece_id])]):
            return False

        # Perform the cut
        cutting_order = [1, 0] if maze_action.reverse_cutting_order else [0, 1]
        piece_to_cut = list(self.pieces.pop(maze_action.piece_id))
        for dim in cutting_order:
            residual = piece_to_cut.copy()
            residual[dim] = piece_to_cut[dim] - ordered_piece[dim]
            piece_to_cut[dim] = ordered_piece[dim]
            if residual[dim] > 0:
                self.store_piece(tuple(residual))

        return True

    # == State representation ==

    def size(self) -> int:
        """Current size of the inventory."""
        return len(self.pieces)
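To illustrate the guillotine cutting logic, here is a small usage sketch (assuming the package layout used by the test script below; it is not part of the tutorial files): cutting a 30 x 15 order from the 100 x 100 raw piece with the default cutting order puts two residual pieces, 70 x 100 and 30 x 85, back into the inventory.

from tutorial_maze_env.part01_core_env.env.inventory import Inventory
from tutorial_maze_env.part01_core_env.env.maze_action import Cutting2DMazeAction

inventory = Inventory(max_pieces_in_inventory=10, raw_piece_size=(100, 100))
inventory.replenish_piece()  # the inventory now holds one raw 100 x 100 piece

# cut a 30 x 15 order from piece 0 (no rotation, x-dimension first)
maze_action = Cutting2DMazeAction(piece_id=0, rotate=False, reverse_cutting_order=False)
assert inventory.cut(maze_action, ordered_piece=(30, 15))

print(inventory.pieces)  # [(70, 100), (30, 85)]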
MazeState and MazeAction

As motivated and explained in more detail in our tutorial on Customizing Core and Maze Envs, CoreEnvs rely on MazeState and MazeAction objects for interacting with an agent.

For the present case this is a Cutting2DMazeState

env/maze_state.py
class Cutting2DMazeState:
    """Cutting 2D environment MazeState representation.
    :param inventory: A list of pieces in inventory.
    :param max_pieces_in_inventory: Max number of pieces in inventory (inventory size).
    :param current_demand: Piece that should be produced in the next step.
    :param raw_piece_size: Size of a raw piece.
    """

    def __init__(self, inventory: [(int, int)], max_pieces_in_inventory: int,
                 current_demand: (int, int), raw_piece_size: (int, int)):
        self.inventory = inventory.copy()
        self.max_pieces_in_inventory = max_pieces_in_inventory
        self.current_demand = current_demand
        self.raw_piece_size = raw_piece_size

and a Cutting2DMazeAction defining which inventory piece to cut in which cutting order and orientation.

env/maze_action.py
class Cutting2DMazeAction:
    """Environment cutting MazeAction object.
    :param piece_id: ID of the piece to cut.
    :param rotate: Whether to rotate the ordered piece.
    :param reverse_cutting_order: Whether to cut along Y axis first (not X first as normal).
    """

    def __init__(self, piece_id: int, rotate: bool, reverse_cutting_order: bool):
        self.piece_id = piece_id
        self.rotate = rotate
        self.reverse_cutting_order = reverse_cutting_order

These two classes are utilized in the CoreEnv code above.

Test Script

The following snippet will instantiate the environment and run it for 15 steps.

main.py
""" Test script CoreEnv """
from tutorial_maze_env.part01_core_env.env.core_env import Cutting2DCoreEnvironment
from tutorial_maze_env.part01_core_env.env.maze_action import Cutting2DMazeAction


def main():
    # init and reset core environment
    core_env = Cutting2DCoreEnvironment(max_pieces_in_inventory=200, raw_piece_size=[100, 100],
                                        static_demand=(30, 15))
    maze_state = core_env.reset()
    # run interaction loop
    for i in range(15):
        # create cutting maze_action
        maze_action = Cutting2DMazeAction(piece_id=0, rotate=False, reverse_cutting_order=False)
        # take actual environment step
        maze_state, reward, done, info = core_env.step(maze_action)
        print(f"reward {reward} | done {done} | info {info}")


if __name__ == "__main__":
    """ main """
    main()

When running the script you should get the following command line output:

reward -1 | done False | info {'msg': 'valid_cut'}
reward 0 | done False | info {'msg': 'valid_cut'}
reward 0 | done False | info {'msg': 'valid_cut'}
...

Adding a Renderer

The complete code for this part of the tutorial can be found here

# file structure
- cutting_2d
    - main.py  # modified
    - env
        - core_env.py  # modified
        - inventory.py
        - maze_state.py
        - maze_action.py
        - renderer.py  # new
Renderer

To check whether our implementation of the environment works as expected, and to later observe how trained agents behave, we add a Renderer as the next step in this tutorial.

For implementing the renderer we rely on matplotlib to ensure that it is compatible with the Maze Rollout Visualization Tools.

The Cutting2DRenderer will show the selected piece (the MazeAction) on the left, along with the current MazeState of the inventory on the right as shown here.

env/renderer.py
from typing import Tuple, Optional

import numpy as np
import matplotlib.patches as patches
import matplotlib.pyplot as plt

from maze.core.annotations import override
from maze.core.log_events.step_event_log import StepEventLog
from maze.core.rendering.renderer import Renderer
from .maze_action import Cutting2DMazeAction
from .maze_state import Cutting2DMazeState


class Cutting2DRenderer(Renderer):
    """Rendering class for the 2D cutting env.

    The ``Cutting2DRenderer`` will show the selected piece (the maze_action) on the left,
    plus the current state of the inventory on the right
    """

    @override(Renderer)
    def render(self, maze_state: Cutting2DMazeState, maze_action: Optional[Cutting2DMazeAction], events: StepEventLog) -> None:
        """
        Render maze_state and maze_action of the cutting 2D env.

        :param maze_state: MazeState to render
        :param maze_action: MazeAction to render
        :param events: Events logged during the step (not used)
        """

        plt.figure("Cutting 2D", figsize=(8, 4))
        plt.clf()

        # The maze_action taken

        plt.subplot(121, aspect='equal')
        if maze_action is not None:
            self._plot_maze_action(maze_action, "MazeAction", maze_state)
        else:
            self._add_title("MazeAction (none)")

        # The inventory state
        plt.subplot(122, aspect='equal')
        self._plot_inventory(maze_state, maze_action)

        plt.tight_layout()
        plt.draw()
        plt.pause(0.1)

    def _plot_maze_action(self, maze_action: Cutting2DMazeAction, title: str, maze_state: Cutting2DMazeState):
        piece_to_cut = maze_state.inventory[maze_action.piece_id]
        if maze_action.rotate:
            piece_to_cut = piece_to_cut[::-1]

        plt.xlim([0, maze_state.raw_piece_size[0]])
        plt.ylim([0, maze_state.raw_piece_size[1]])

        self._draw_piece(piece_to_cut)
        self._draw_piece(maze_state.current_demand, highlight=True)
        self._draw_cutting_lines(maze_state.current_demand, piece_to_cut, maze_action.reverse_cutting_order)
        self._add_title(title)

    def _plot_inventory(self, maze_state: Cutting2DMazeState, maze_action: Cutting2DMazeAction):

        # plot inventory pieces
        inventory_piece_dims = np.vstack(maze_state.inventory)
        inventory_piece_dims = np.sort(inventory_piece_dims, axis=1)
        plt.plot(inventory_piece_dims[:, 0], inventory_piece_dims[:, 1], "ko",
                 alpha=0.5, label="inventory pieces")
        # plot current demand
        current_demand = sorted(maze_state.current_demand)
        plt.plot(current_demand[0], current_demand[1], "o",
                 color=(0.7, 0.2, 0.2), alpha=0.75, label="current demand")
        # plot maze_action
        piece_to_cut = inventory_piece_dims[maze_action.piece_id]
        plt.plot(piece_to_cut[0], piece_to_cut[1], "bo",
                 alpha=0.75, label="cutting inventory piece")
        plt.grid()
        plt.legend()
        plt.axis("equal")
        self._add_title("Inventory Pieces")

    @staticmethod
    def _draw_piece(piece: Tuple[int, int], highlight: bool = False):
        plt.gca().add_patch(patches.Rectangle((0, 0), piece[0], piece[1],
                                              facecolor=(0.7, 0.2, 0.2) if highlight else (0.8, 0.8, 0.8)))

    @staticmethod
    def _add_title(title: str):
        plt.title(title, fontdict=dict(fontsize=16, fontweight='bold', horizontalalignment='left'), loc='left')

    @staticmethod
    def _draw_cutting_lines(ordered_piece: Tuple[int, int], piece_to_cut: Tuple[int, int], reverse_cutting_order: bool):
        """Draw the cutting lines.

        :param ordered_piece: Size of the ordered piece
        :param piece_to_cut: Piece which we are cutting
        :param reverse_cutting_order: If we should cut along Y axis first (instead of X first)
        """

        if reverse_cutting_order:
            h_x = (0, piece_to_cut[0])
            h_y = (ordered_piece[1], ordered_piece[1])
            v_x = (ordered_piece[0], ordered_piece[0])
            v_y = (0, ordered_piece[1])
        else:
            h_x = (0, ordered_piece[0])
            h_y = (ordered_piece[1], ordered_piece[1])
            v_x = (ordered_piece[0], ordered_piece[0])
            v_y = (0, piece_to_cut[1])

        plt.plot(h_x, h_y, color='black', linestyle="--")
        plt.plot(v_x, v_y, color='black', linestyle="--")
Updating the CoreEnv

To make use of the renderer we simply have to instantiate it in the constructor of the CoreEnv and make it accessible via the get_renderer() method.

env/core_env.py
from .renderer import Cutting2DRenderer
...


class Cutting2DCoreEnvironment(CoreEnv):

    def __init__(self, max_pieces_in_inventory: int, raw_piece_size: (int, int), static_demand: (int, int)):
        super().__init__()

        # initialize rendering
        self.renderer = Cutting2DRenderer()
        ...

    def get_renderer(self) -> Cutting2DRenderer:
        """Cutting 2D renderer module."""
        return self.renderer
Test Script

The following snippet will instantiate the environment and run it for 15 steps.

main.py
""" Test script CoreEnv """
from tutorial_maze_env.part02_renderer.env.core_env import Cutting2DCoreEnvironment
from tutorial_maze_env.part02_renderer.env.maze_action import Cutting2DMazeAction


def main():
    # init and reset core environment
    core_env = Cutting2DCoreEnvironment(max_pieces_in_inventory=200, raw_piece_size=[100, 100],
                                        static_demand=(30, 15))
    maze_state = core_env.reset()
    # run interaction loop
    for i in range(15):
        # create cutting maze_action
        maze_action = Cutting2DMazeAction(piece_id=0, rotate=False, reverse_cutting_order=False)

        # render current state along with next maze_action
        core_env.renderer.render(maze_state, maze_action, None)

        # take actual environment step
        maze_state, reward, done, info = core_env.step(maze_action)
        print(f"reward {reward} | done {done} | info {info}")


if __name__ == "__main__":
    """ main """
    main()

When running the script you should get the following command line output:

reward -1 | done False | info {'msg': 'valid_cut'}
reward 0 | done False | info {'msg': 'valid_cut'}
reward 0 | done False | info {'msg': 'valid_cut'}
...

and a rendering of the current MazeState and MazeAction in each time step similar to the image shown below:

[Figure: Rendering of the current MazeState and MazeAction]

The dashed line represents the cutting configuration specified with the MazeAction.

Implementing the MazeEnv

The complete code for this part of the tutorial can be found here

# file structure
- cutting_2d
    - main.py  # modified
    - env
        - core_env.py  # modified
        - inventory.py
        - maze_state.py
        - maze_action.py
        - renderer.py
        - maze_env.py  # new
    - space_interfaces
        - dict_action_conversion.py  # new
        - dict_observation_conversion.py  # new
MazeEnv

The MazeEnv wraps the CoreEnv as a Gym-style environment in a reusable form, by utilizing the interfaces (mappings) from the MazeState to the observation and from the MazeAction to the action. After implementing the MazeEnv we will be ready to perform our first training run. To learn more about the usability and advantages of this concept you can follow up on Customizing Core and Maze Envs.

In the remainder of this part of the tutorial we will implement the Cutting2DEnvironment (MazeEnv) as well as a corresponding set of interfaces.

env/maze_env.py
from maze.core.env.core_env import CoreEnv
from maze.core.env.maze_env import MazeEnv
from maze.core.env.action_conversion import ActionConversionInterface
from maze.core.env.observation_conversion import ObservationConversionInterface

from .core_env import Cutting2DCoreEnvironment
from ..space_interfaces.dict_observation_conversion import ObservationConversion
from ..space_interfaces.dict_action_conversion import ActionConversion


class Cutting2DEnvironment(MazeEnv[Cutting2DCoreEnvironment]):
    """Maze environment for 2d cutting.

    :param core_env: The underlying core environment.
    :param action_conversion: An action conversion interface.
    :param observation_conversion: An observation conversion interface.
    """

    def __init__(self,
                 core_env: CoreEnv,
                 action_conversion: ActionConversionInterface,
                 observation_conversion: ObservationConversionInterface):
        super().__init__(core_env=core_env,
                         action_conversion_dict={0: action_conversion},
                         observation_conversion_dict={0: observation_conversion})


def maze_env_factory(max_pieces_in_inventory: int, raw_piece_size: (int, int),
                     static_demand: (int, int)) -> Cutting2DEnvironment:
    """Convenience factory function that compiles a trainable maze environment.
    (for argument details see: Cutting2DCoreEnvironment)
    """

    # init core environment
    core_env = Cutting2DCoreEnvironment(max_pieces_in_inventory=max_pieces_in_inventory,
                                        raw_piece_size=raw_piece_size,
                                        static_demand=static_demand)

    # init maze environment including observation and action interfaces
    action_conversion = ActionConversion(max_pieces_in_inventory=max_pieces_in_inventory)
    observation_conversion = ObservationConversion(raw_piece_size=raw_piece_size,
                                                   max_pieces_in_inventory=max_pieces_in_inventory)
    return Cutting2DEnvironment(core_env, action_conversion, observation_conversion)

The MazeEnv is instantiated with the underlying CoreEnv and the two interfaces for MazeStates and MazeActions. For convenience we also add a maze_env_factory to instantiate the MazeEnv from the original environment parameter set. This will be useful in the next part of the tutorial where we will train an agent based on this environment.

ObservationConversionInterface

The ObservationConversionInterface converts CoreEnv MazeState objects into machine readable Gym-style observations and defines the respective Gym observation space. In the present case the observation is defined as a dictionary with the following structure:

  • inventory: 2d array representing all pieces currently in inventory

  • inventory_size: count of pieces currently in inventory

  • ordered_piece: 2d vector representing the customer order (current demand)

space_interfaces/dict_observation_conversion.py
import numpy as np
from typing import Dict
from gym import spaces

from maze.core.annotations import override
from maze.core.env.observation_conversion import ObservationConversionInterface
from ..env.maze_state import Cutting2DMazeState


class ObservationConversion(ObservationConversionInterface):
    """Cutting 2d environment state to dictionary observation.

    :param max_pieces_in_inventory: Size of the inventory. If inventory gets full, the oldest pieces get discarded.
    :param raw_piece_size: Size of a fresh raw (= full-size) piece
    """

    def __init__(self, raw_piece_size: (int, int), max_pieces_in_inventory: int):
        self.max_pieces_in_inventory = max_pieces_in_inventory
        self.raw_piece_size = raw_piece_size

    @override(ObservationConversionInterface)
    def maze_to_space(self, maze_state: Cutting2DMazeState) -> Dict[str, np.ndarray]:
        """Converts core environment state to a machine readable agent observation."""

        # Convert inventory to numpy array and stretch it to full size (filling with zeros)
        inventory_state = list(maze_state.inventory)  # copy so that the maze_state is not modified
        inventory_state += [(0, 0)] * (self.max_pieces_in_inventory - len(maze_state.inventory))

        # Compile dict space observation
        return {'inventory': np.asarray(inventory_state, dtype=np.float32),
                'inventory_size': np.asarray([len(maze_state.inventory)], dtype=np.float32),
                'ordered_piece': np.asarray(maze_state.current_demand, dtype=np.float32)}

    @override(ObservationConversionInterface)
    def space_to_maze(self, observation: Dict[str, np.ndarray]) -> Cutting2DMazeState:
        """Converts agent observation to core environment state (not required for this example)."""
        raise NotImplementedError

    @override(ObservationConversionInterface)
    def space(self) -> spaces.Dict:
        """Return the Gym dict observation space based on the given params.

        :return: Gym space object
            - inventory: max_pieces_in_inventory x 2 (x/y-dimensions of pieces in inventory)
            - inventory_size: scalar number of pieces in inventory
            - ordered_piece: 2d vector holding x/y-dimension of customer ordered piece
        """
        return spaces.Dict({
            'inventory': spaces.Box(low=np.zeros((self.max_pieces_in_inventory, 2), dtype=np.float32),
                                    high=np.vstack([[self.raw_piece_size[0] + 1, self.raw_piece_size[1] + 1]] *
                                                   self.max_pieces_in_inventory).astype(np.float32),
                                    dtype=np.float32),
            'inventory_size': spaces.Box(low=np.float32(0), high=self.max_pieces_in_inventory + 1,
                                         shape=(1,), dtype=np.float32),
            'ordered_piece': spaces.Box(low=np.float32(0), high=np.float32(max(self.raw_piece_size) + 1),
                                         shape=(2,), dtype=np.float32)
        })
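As a quick, hypothetical sanity check (not part of the tutorial files; it assumes the package layout used by the test scripts), converting a small MazeState could look as follows:

from tutorial_maze_env.part03_maze_env.env.maze_state import Cutting2DMazeState
from tutorial_maze_env.part03_maze_env.space_interfaces.dict_observation_conversion import ObservationConversion

observation_conversion = ObservationConversion(raw_piece_size=(100, 100), max_pieces_in_inventory=3)
maze_state = Cutting2DMazeState(inventory=[(70, 100)], max_pieces_in_inventory=3,
                                current_demand=(30, 15), raw_piece_size=(100, 100))
observation = observation_conversion.maze_to_space(maze_state)
# observation['inventory']      -> [[70., 100.], [0., 0.], [0., 0.]]  (padded to the inventory size)
# observation['inventory_size'] -> [1.]
# observation['ordered_piece']  -> [30., 15.]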
ActionConversionInterface

The ActionConversionInterface converts agent actions into CoreEnv MazeAction objects and defines the respective Gym action space. In the present case the action is defined as a dictionary with the following structure:

  • piece_idx: id of the inventory piece that should be used for cutting

  • cut_rotation: defines whether to rotate the piece for cutting or not

  • cut_order: defines the cutting order (xy vs. yx)

space_interfaces/dict_action_conversion.py
from typing import Dict
from gym import spaces
from maze.core.env.action_conversion import ActionConversionInterface

from ..env.maze_action import Cutting2DMazeAction
from ..env.maze_state import Cutting2DMazeState


class ActionConversion(ActionConversionInterface):
    """Converts agent actions to actual environment maze_actions.

    :param max_pieces_in_inventory: Size of the inventory
    """

    def __init__(self, max_pieces_in_inventory: int):
        self.max_pieces_in_inventory = max_pieces_in_inventory

    def space_to_maze(self, action: Dict[str, int], maze_state: Cutting2DMazeState) -> Cutting2DMazeAction:
        """Converts agent dictionary action to environment MazeAction object."""
        return Cutting2DMazeAction(piece_id=action["piece_idx"],
                                  rotate=bool(action["cut_rotation"]),
                                  reverse_cutting_order=bool(action["cut_order"]))

    def maze_to_space(self, maze_action: Cutting2DMazeAction) -> Dict[str, int]:
        """Converts environment MazeAction object to agent dictionary action."""
        return {"piece_idx": maze_action.piece_id,
                "cut_rotation": int(maze_action.rotate),
                "cut_order": int(maze_action.reverse_cutting_order)}

    def space(self) -> spaces.Dict:
        """Returns Gym dict action space."""
        return spaces.Dict({
            "piece_idx": spaces.Discrete(self.max_pieces_in_inventory),  # Which piece should be cut
            "cut_rotation": spaces.Discrete(2),  # Rotate: (yes / no)
            "cut_order": spaces.Discrete(2)      # Cutting order: (xy / yx)
        })
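And a tiny hypothetical usage sketch (again not part of the tutorial files) showing how a sampled agent action is converted into a MazeAction with the ActionConversion class defined above:

action_conversion = ActionConversion(max_pieces_in_inventory=200)
action = action_conversion.space().sample()  # e.g. {'piece_idx': 42, 'cut_rotation': 1, 'cut_order': 0}
maze_action = action_conversion.space_to_maze(action, maze_state=None)  # maze_state is unused in this implementation
# maze_action.piece_id -> 42, maze_action.rotate -> True, maze_action.reverse_cutting_order -> False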
Updating the CoreEnv

For the sake of completeness we also show two more minor modifications required in the CoreEnv, which are not too important for this tutorial at the moment. In short, the StructuredEnv interface supports interaction patterns beyond standard Gym environments to model for example hierarchical or multi-agent RL problems. We will get back to this in our more advanced tutorials.

The code below defines that the current version of the environment requires only one actor (id 0) with a single policy (id 0) that is never done.

env/core_env.py
from maze.core.env.structured_env import ActorID


class Cutting2DCoreEnvironment(CoreEnv):

    ...

    def is_actor_done(self) -> bool:
        """Returns True if the just stepped actor is done, which is different to the done flag of the environment."""
        return False

    def actor_id(self) -> ActorID:
        """Returns the currently executed actor along with the policy id. The id is unique only with
        respect to the policies (every policy has its own actor 0).
        Note that identities of done actors can not be reused in the same rollout.

        :return: The current actor, as tuple (policy id, actor number).
        """
        return ActorID(step_key=0, agent_id=0)

    ...
Test Script

The following snippet will instantiate the environment and run it for 15 steps.

Note that (compared to the previous example) we are now:

  • working with observations and actions instead of MazeStates and MazeActions

  • able to sample actions from the action_space object

main.py
""" Test script CoreEnv """
from tutorial_maze_env.part03_maze_env.env.maze_env import maze_env_factory


def main():
    # init maze environment including observation and action interfaces
    env = maze_env_factory(max_pieces_in_inventory=10,
                           raw_piece_size=[100, 100],
                           static_demand=(30, 15))

    # reset environment
    obs = env.reset()
    # run interaction loop
    for i in range(15):
        # sample random action
        action = env.action_space.sample()

        # take actual environment step
        obs, reward, done, info = env.step(action)
        print(f"reward {reward} | done {done} | info {info}")


if __name__ == "__main__":
    """ main """
    main()

When running the script you should get the following command line output:

reward -1 | done False | info {'msg': 'valid_cut'}
reward 0 | done False | info {'msg': 'valid_cut'}
reward 0 | done False | info {'msg': 'valid_cut'}
reward 0 | done False | info {'error': 'piece_id_out_of_bounds'}
reward 0 | done False | info {'error': 'piece_id_out_of_bounds'}
...

Training the MazeEnv

The complete code for this part of the tutorial can be found here

# file structure
- cutting_2d
    - main.py
    - env ...
    - space_interfaces ...
    - conf
        - env
            - tutorial_cutting_2d_basic.yaml  # new
        - model
            - tutorial_cutting_2d_basic.yaml  # new
        - wrappers
            - tutorial_cutting_2d_basic.yaml  # new

Note

Hydra only accepts .yaml as file extension.

Hydra Configuration

The entire Maze workflow is powered by the Hydra configuration system. To perform our first training run via the Maze CLI we have to add a few more config files. Going into the details of the config structure is beyond the scope of this tutorial; however, we provide some information on the parts relevant for this example.

The config file for the maze_env_factory looks as follows:

conf/env/tutorial_cutting_2d_basic.yaml
# @package env
_target_: tutorial_maze_env.part03_maze_env.env.maze_env.maze_env_factory

# parametrizes the core environment
max_pieces_in_inventory: 200
raw_piece_size: [100, 100]
static_demand: [30, 15]

Additionally, we provide a wrapper config; for details we refer to Customizing Environments with Wrappers.

conf/wrappers/tutorial_cutting_2d_basic.yaml
# @package wrappers

# limits the maximum number of time steps of an episode
maze.core.wrappers.time_limit_wrapper.TimeLimitWrapper:
  max_episode_steps: 200

# flattens the dictionary observations to work with DenseLayers
maze.core.wrappers.observation_preprocessing.preprocessing_wrapper.PreProcessingWrapper:
  pre_processor_mapping:
    - observation: inventory
      _target_: maze.preprocessors.FlattenPreProcessor
      keep_original: false
      config:
        num_flatten_dims: 2

# monitoring wrapper
maze.core.wrappers.monitoring_wrapper.MazeEnvMonitoringWrapper:
  observation_logging: false
  action_logging: true
  reward_logging: false

To learn more about the model config in conf/model/tutorial_cutting_2d_basic.yaml you can visit the introduction on how to work with template models.

Training an Agent

Once the config is set up, we are ready to start our first training run (below with the PPO algorithm) via the CLI:

maze-run -cn conf_train env=tutorial_cutting_2d_basic wrappers=tutorial_cutting_2d_basic \
model=tutorial_cutting_2d_basic algorithm=ppo
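Individual config parameters can be overridden in the same Hydra style as in the first example, for instance with a (hypothetical) smaller inventory:

maze-run -cn conf_train env=tutorial_cutting_2d_basic wrappers=tutorial_cutting_2d_basic \
model=tutorial_cutting_2d_basic algorithm=ppo env.max_pieces_in_inventory=100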

Running the trainer should print a command line output similar to the one shown below.

 step|path                                                                        |               value
=====|============================================================================|====================
   12|train     MultiStepActorCritic..time_epoch            ······················|              24.333
   12|train     MultiStepActorCritic..time_rollout          ······················|               0.754
   12|train     MultiStepActorCritic..learning_rate         ······················|               0.000
   12|train     MultiStepActorCritic..policy_loss           0                     |              -0.016
   12|train     MultiStepActorCritic..policy_grad_norm      0                     |               0.015
   12|train     MultiStepActorCritic..policy_entropy        0                     |               0.686
   12|train     MultiStepActorCritic..critic_value          0                     |             -56.659
   12|train     MultiStepActorCritic..critic_value_loss     0                     |              33.026
   12|train     MultiStepActorCritic..critic_grad_norm      0                     |               0.500
   12|train     MultiStepActorCritic..time_update           ······················|               1.205
   12|train     DiscreteActionEvents  action                substep_0/order       |   [len:8000, μ:0.5]
   12|train     DiscreteActionEvents  action                substep_0/piece_idx   | [len:8000, μ:169.2]
   12|train     DiscreteActionEvents  action                substep_0/rotation    |   [len:8000, μ:1.0]
   12|train     BaseEnvEvents         reward                median_step_count     |             200.000
   12|train     BaseEnvEvents         reward                mean_step_count       |             200.000
   12|train     BaseEnvEvents         reward                total_step_count      |           96000.000
   12|train     BaseEnvEvents         reward                total_episode_count   |             480.000
   12|train     BaseEnvEvents         reward                episode_count         |              40.000
   12|train     BaseEnvEvents         reward                std                   |              34.248
   12|train     BaseEnvEvents         reward                mean                  |            -186.450
   12|train     BaseEnvEvents         reward                min                   |            -259.000
   12|train     BaseEnvEvents         reward                max                   |            -130.000

To get a nicer view of these numbers we can also take a look at the stats with Tensorboard.

tensorboard --logdir outputs

You can view it with your browser at http://localhost:6006/.

[Figure: Tensorboard view of the training run]

For now we can only inspect standard metrics such as reward statistics or mean_step_counts per episode. Unfortunately, this is not too informative with respect to the cutting problem we are currently addressing. In the next part we will show how to make logging much more informative by introducing events and KPIs.

Adding Events and KPIs

The complete code for this part of the tutorial can be found here

# file structure
- cutting_2d
    - main.py  # modified
    - env
        - core_env.py  # modified
        - inventory.py  # modified
        - maze_state.py
        - maze_action.py
        - renderer.py
        - maze_env.py
        - events.py  # new
        - kpi_calculator.py  # new
    - space_interfaces
        - dict_action_conversion.py
        - dict_observation_conversion.py
    - conf ...
Events

In the previous section we trained the initial version of our cutting environment and learned how to watch the training process with command line and Tensorboard logging. However, watching only standard metrics such as reward or episode step count is not always very informative with respect to the agent's behaviour and the problem at hand.

For example, we might be interested in how often an agent selects an invalid cutting piece or specifies an invalid cutting setting. To tackle this issue and to enable better inspection and logging tools we introduce an event system that will also be reused in the reward customization section of this tutorial.

In particular, we introduce two event types related to the cutting process as well as inventory management. For each event we can define which statistics are computed at which stage of the aggregation process (step, episode, epoch) via event decorators:

  • @define_step_stats(len): Events \(e_i\) are collected as a list of events \(\{e_i\}\). The len function counts how often such an event occurred in the current environment step \(Stats_{Step}=|\{e_i\}|\).

  • @define_episode_stats(sum): Defines how the \(S\) step statistics should be aggregated to episode statistics by simply summing them up: \(Stats_{Episode}=\sum^S Stats_{Step}\)

  • @define_epoch_stats(np.mean, output_name="mean_episode_total"): A training epoch consists of N episodes. This decorator defines that epoch statistics should be the average of the contained episodes: \(Stats_{Epoch}=(\sum^N Stats_{Episode})/N\)

Below we will see that these statistics will now be considered by the logging system as InventoryEvents and CuttingEvents. For more details on event decorators and the underlying working principles we refer to the dedicated section on event and KPI logging.
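As a concrete, purely hypothetical illustration of this pipeline: if invalid_cut is fired twice during a single environment step, @define_step_stats(len) yields a step statistic of \(2\); @define_episode_stats(sum) then sums these step statistics over the episode, giving, say, an episode total of \(35\); and @define_epoch_stats(np.mean, output_name="mean_episode_total") finally reports the average of these episode totals over all episodes of the epoch.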

env/events.py
from abc import ABC

import numpy as np
from maze.core.log_stats.event_decorators import define_step_stats, define_episode_stats, define_epoch_stats


class CuttingEvents(ABC):
    """Events related to the cutting process."""

    @define_epoch_stats(np.mean, output_name="mean_episode_total")
    @define_episode_stats(sum)
    @define_step_stats(len)
    def invalid_piece_selected(self):
        """An invalid piece is selected for cutting."""

    @define_epoch_stats(np.mean, output_name="mean_episode_total")
    @define_episode_stats(sum)
    @define_step_stats(len)
    def valid_cut(self, current_demand: (int, int), piece_to_cut: (int, int), raw_piece_size: (int, int),
                  cutting_area: float):
        """A valid cut was performed."""

    @define_epoch_stats(np.mean, output_name="mean_episode_total")
    @define_episode_stats(sum)
    @define_step_stats(len)
    def invalid_cut(self, current_demand: (int, int), piece_to_cut: (int, int), raw_piece_size: (int, int)):
        """Invalid cutting parameters have been specified."""


class InventoryEvents(ABC):
    """Events related to inventory management."""

    @define_epoch_stats(np.mean, output_name="mean_episode_total")
    @define_episode_stats(sum)
    @define_step_stats(len)
    def piece_discarded(self, piece: (int, int)):
        """The inventory is full and a piece has been discarded."""

    @define_epoch_stats(np.mean, input_name="step_mean", output_name="step_mean")
    @define_epoch_stats(max, input_name="step_max", output_name="step_max")
    @define_episode_stats(np.mean, output_name="step_mean")
    @define_episode_stats(max, output_name="step_max")
    @define_step_stats(None)
    def pieces_in_inventory(self, value: int):
        """Reports the count of pieces currently in the inventory."""

    @define_epoch_stats(np.mean, output_name="mean_episode_total")
    @define_episode_stats(sum)
    @define_step_stats(len)
    def piece_replenished(self):
        """A new raw cutting piece has been replenished."""
KPI Calculator

The goal of the cutting 2d environment is to learn a cutting policy that requires as few raw inventory pieces as possible for fulfilling the upcoming customer demand. This metric is exactly what we define as the KPI to watch and optimize: the raw_piece_usage_per_step.

As you will see below, the logging system considers such KPIs and prints their statistics along with the remaining BaseEnvEvents.

env/kpi_calculator.py
from typing import Dict

from maze.core.env.maze_state import MazeStateType
from maze.core.log_events.kpi_calculator import KpiCalculator
from maze.core.log_events.episode_event_log import EpisodeEventLog
from .events import InventoryEvents


class Cutting2dKpiCalculator(KpiCalculator):
    """KPIs for 2D cutting environment.
    The following KPIs are available: Raw pieces used per step
    """

    def calculate_kpis(self, episode_event_log: EpisodeEventLog, last_maze_state: MazeStateType) -> Dict[str, float]:
        """Calculates the KPIs at the end of episode."""

        # get overall step count of episode
        step_count = len(episode_event_log.step_event_logs)
        # count raw inventory piece replenishment events
        raw_piece_usage = 0
        for _ in episode_event_log.query_events(InventoryEvents.piece_replenished):
            raw_piece_usage += 1
        # compute step normalized raw piece usage
        return {"raw_piece_usage_per_step": raw_piece_usage / step_count}
Updating CoreEnv and Inventory

There are also a few changes we have to make in the CoreEnvironment:

  • initialize the Publisher-Subscriber system and the KPI Calculator

  • create the event topics for cutting and inventory events when setting up the environment

  • trigger the respective events in the step function instead of writing them into the info dictionary

env/core_env.py
...
from maze.core.events.pubsub import Pubsub
from .events import CuttingEvents, InventoryEvents
from .kpi_calculator import Cutting2dKpiCalculator


class Cutting2DCoreEnvironment(CoreEnv):

    def __init__(self, max_pieces_in_inventory: int, raw_piece_size: (int, int), static_demand: (int, int)):
        super().__init__()

        ...

        # init pubsub for event to reward routing
        self.pubsub = Pubsub(self.context.event_service)

        # KPIs calculation
        self.kpi_calculator = Cutting2dKpiCalculator()

    def _setup_env(self):
        """Setup environment."""
        inventory_events = self.pubsub.create_event_topic(InventoryEvents)
        self.inventory = Inventory(self.max_pieces_in_inventory, self.raw_piece_size, inventory_events)
        self.inventory.replenish_piece()

        self.cutting_events = self.pubsub.create_event_topic(CuttingEvents)

    def step(self, maze_action: Cutting2DMazeAction) -> Tuple[Cutting2DMazeState, np.array, bool, Dict[Any, Any]]:
        """Summary of the step (simplified, not necessarily respecting the actual order in the code):
        1. Check if the selected piece to cut is valid (i.e. in inventory, large enough etc.)
        2. Attempt the cutting
        3. Replenish a fresh piece if needed and return an appropriate reward

        :param maze_action: Cutting MazeAction to take.
        :return: maze_state, reward, done, info
        """

        info, reward = {}, 0
        replenishment_needed = False

        # check if valid piece id was selected
        if maze_action.piece_id >= self.inventory.size():
            self.cutting_events.invalid_piece_selected()
        # perform cutting
        else:
            piece_to_cut = self.inventory.pieces[maze_action.piece_id]

            # attempt the cut
            if self.inventory.cut(maze_action, self.current_demand):
                self.cutting_events.valid_cut(current_demand=self.current_demand, piece_to_cut=piece_to_cut,
                                              raw_piece_size=self.raw_piece_size)
                replenishment_needed = piece_to_cut == self.raw_piece_size
            else:
                # assign a negative reward for invalid cutting attempts
                self.cutting_events.invalid_cut(current_demand=self.current_demand, piece_to_cut=piece_to_cut,
                                                raw_piece_size=self.raw_piece_size)
                reward = -2

        # check if replenishment is required
        if replenishment_needed:
            self.inventory.replenish_piece()
            # assign negative reward if a piece has to be replenished
            reward = -1

        # step execution finished, write step statistics
        self.inventory.log_step_statistics()

        # compile env state
        maze_state = self.get_maze_state()

        return maze_state, reward, False, info

    def get_kpi_calculator(self) -> Cutting2dKpiCalculator:
        """KPIs are supported."""
        return self.kpi_calculator

For the inventory we proceed analogously and also trigger the respective events.

env/inventory.py
...
from .events import InventoryEvents


class Inventory:
    """Holds the inventory of 2D pieces and performs cutting.
    :param max_pieces_in_inventory: Size of the inventory. If full, the oldest pieces get discarded.
    :param raw_piece_size: Size of a fresh raw (= full-size) piece.
    :param inventory_events: Inventory event dispatch proxy.
    """

    def __init__(self, max_pieces_in_inventory: int, raw_piece_size: (int, int),
                 inventory_events: InventoryEvents):
        ...

        self.inventory_events = inventory_events

    def store_piece(self, piece: (int, int)) -> None:
        """Store the given piece.
        :param piece: Piece to store.
        """
        # If we would run out of storage space, discard the oldest piece first
        if self.is_full():
            self.pieces.pop(0)
            self.inventory_events.piece_discarded(piece=piece)

        self.pieces.append(piece)

    def replenish_piece(self) -> None:
        """Add a fresh raw piece to inventory."""
        self.store_piece(self.raw_piece_size)
        self.inventory_events.piece_replenished()

    def log_step_statistics(self):
        """Log inventory statistics once per step"""
        self.inventory_events.pieces_in_inventory(self.size())
Test Script

The following snippet will instantiate the environment and run it for 15 steps.

To get access to event and KPI logging we need to wrap the environment with the LogStatsWrapper. To simplify the statistics logging setup we rely on the SimpleStatsLoggingSetup helper class.

main.py
""" Test script CoreEnv """
from maze.utils.log_stats_utils import SimpleStatsLoggingSetup
from maze.core.wrappers.log_stats_wrapper import LogStatsWrapper
from tutorial_maze_env.part04_events.env.maze_env import maze_env_factory


def main():
    # init maze environment including observation and action interfaces
    env = maze_env_factory(max_pieces_in_inventory=200,
                           raw_piece_size=[100, 100],
                           static_demand=(30, 15))

    # wrap environment with logging wrapper
    env = LogStatsWrapper(env, logging_prefix="main")

    # register a console writer and connect the writer to the statistics logging system
    with SimpleStatsLoggingSetup(env):
        # reset environment
        obs = env.reset()
        # run interaction loop
        for i in range(15):
            # sample random action
            action = env.action_space.sample()

            # take actual environment step
            obs, reward, done, info = env.step(action)


if __name__ == "__main__":
    """ main """
    main()

When running the script you will get an output as shown below. Note that statistics of both events and KPIs are printed along with the default reward and action statistics.

 step|path                                                                      |               value
=====|==========================================================================|====================
    1|main    DiscreteActionEvents  action                substep_0/order       |     [len:15, μ:0.5]
    1|main    DiscreteActionEvents  action                substep_0/piece_idx   |    [len:15, μ:82.3]
    1|main    DiscreteActionEvents  action                substep_0/rotation    |     [len:15, μ:0.7]
    1|main    BaseEnvEvents         reward                median_step_count     |              15.000
    1|main    BaseEnvEvents         reward                mean_step_count       |              15.000
    1|main    BaseEnvEvents         reward                total_step_count      |              15.000
    1|main    BaseEnvEvents         reward                total_episode_count   |               1.000
    1|main    BaseEnvEvents         reward                episode_count         |               1.000
    1|main    BaseEnvEvents         reward                std                   |               0.000
    1|main    BaseEnvEvents         reward                mean                  |             -29.000
    1|main    BaseEnvEvents         reward                min                   |             -29.000
    1|main    BaseEnvEvents         reward                max                   |             -29.000
    1|main    InventoryEvents       piece_replenished     mean_episode_total    |               3.000
    1|main    InventoryEvents       pieces_in_inventory   step_max              |             200.000
    1|main    InventoryEvents       pieces_in_inventory   step_mean             |             200.000
    1|main    CuttingEvents         invalid_cut           mean_episode_total    |              14.000
    1|main    InventoryEvents       piece_discarded       mean_episode_total    |               2.000
    1|main    CuttingEvents         valid_cut             mean_episode_total    |               1.000
    1|main    BaseEnvEvents         kpi                   max/raw_piece_usage_..|               0.000
    1|main    BaseEnvEvents         kpi                   min/raw_piece_usage_..|               0.000
    1|main    BaseEnvEvents         kpi                   std/raw_piece_usage_..|               0.000
    1|main    BaseEnvEvents         kpi                   mean/raw_piece_usage..|               0.000

Training with Events and KPIs

The complete code for this part of the tutorial can be found here

# file structure
- cutting_2d
    - main.py
    - env ...
    - space_interfaces ...
    - conf
        - env
            - tutorial_cutting_2d_events.yaml  # new
        - model
            - tutorial_cutting_2d_events.yaml  # new
        - wrappers
            - tutorial_cutting_2d_events.yaml  # new
Hydra Configuration

The entire structure of this example is identical to the one on training the MazeEnv. Everything related to the event system was already added in the section on adding events and KPIs, and the trainers pick up these changes implicitly.

Training an Agent

To retrain the agent on the environment extended with event and KPI logging, run

maze-run -cn conf_train env=tutorial_cutting_2d_events wrappers=tutorial_cutting_2d_events \
model=tutorial_cutting_2d_events algorithm=ppo

Running the trainer should print an extended command line output similar to the one shown below. In addition to the base events, we now also get statistics logs for CuttingEvents, InventoryEvents and KPIs.

 step|path                                                                        |               value
=====|============================================================================|====================
    6|train     MultiStepActorCritic..time_epoch            ······················|              24.548
    6|train     MultiStepActorCritic..time_rollout          ······················|               0.762
    6|train     MultiStepActorCritic..learning_rate         ······················|               0.000
    6|train     MultiStepActorCritic..policy_loss           0                     |              -0.020
    6|train     MultiStepActorCritic..policy_grad_norm      0                     |               0.013
    6|train     MultiStepActorCritic..policy_entropy        0                     |               0.760
    6|train     MultiStepActorCritic..critic_value          0                     |             -49.238
    6|train     MultiStepActorCritic..critic_value_loss     0                     |              50.175
    6|train     MultiStepActorCritic..critic_grad_norm      0                     |               0.500
    6|train     MultiStepActorCritic..time_update           ······················|               1.210
    6|train     DiscreteActionEvents  action                substep_0/order       |   [len:8000, μ:0.0]
    6|train     DiscreteActionEvents  action                substep_0/piece_idx   | [len:8000, μ:174.2]
    6|train     DiscreteActionEvents  action                substep_0/rotation    |   [len:8000, μ:1.0]
    6|train     BaseEnvEvents         reward                median_step_count     |             200.000
    6|train     BaseEnvEvents         reward                mean_step_count       |             200.000
    6|train     BaseEnvEvents         reward                total_step_count      |           48000.000
    6|train     BaseEnvEvents         reward                total_episode_count   |             240.000
    6|train     BaseEnvEvents         reward                episode_count         |              40.000
    6|train     BaseEnvEvents         reward                std                   |              38.427
    6|train     BaseEnvEvents         reward                mean                  |            -182.175
    6|train     BaseEnvEvents         reward                min                   |            -323.000
    6|train     BaseEnvEvents         reward                max                   |            -119.000
    6|train     InventoryEvents       piece_replenished     mean_episode_total    |              15.325
    6|train     InventoryEvents       piece_discarded       mean_episode_total    |              67.400
    6|train     InventoryEvents       pieces_in_inventory   step_max              |             200.000
    6|train     InventoryEvents       pieces_in_inventory   step_mean             |             200.000
    6|train     CuttingEvents         valid_cut             mean_episode_total    |             116.075
    6|train     CuttingEvents         invalid_cut           mean_episode_total    |              83.925
    6|train     BaseEnvEvents         kpi                   max/raw_piece_usage_..|               0.135
    6|train     BaseEnvEvents         kpi                   min/raw_piece_usage_..|               0.020
    6|train     BaseEnvEvents         kpi                   std/raw_piece_usage_..|               0.028
    6|train     BaseEnvEvents         kpi                   mean/raw_piece_usage..|               0.077

Of course, these changes are also reflected in the Tensorboard log, which you can again view in your browser at http://localhost:6006/ after starting Tensorboard:

tensorboard --logdir outputs

As you can see, we now have two additional sections available: train_CuttingEvents and train_InventoryEvents.

_images/tb_event_sections.png

A closer look at these events reveals that the agent actually starts to learn something meaningful: the number of invalid cuts decreases, which implies that the number of valid cuts increases and we are able to fulfill the current customer demand.

_images/tb_event_details.png

Adding Reward Customization

The complete code for this part of the tutorial can be found here

# file structure
- cutting_2d
    - main.py  # modified
    - env
        - core_env.py  # modified
        - inventory.py
        - maze_state.py
        - maze_action.py
        - renderer.py
        - maze_env.py  # modified
        - events.py
        - kpi_calculator.py
    - space_interfaces
        - dict_action_conversion.py
        - dict_observation_conversion.py
    - reward
        - default_reward.py  # new
Reward

In this part of the tutorial we introduce how to reuse the event system for reward shaping and customization via the RewardAggregatorInterface.

In Maze, reward aggregators usually calculate reward from the current environment state, events that happened during the last step, or a combination thereof. Calculating reward from state is generally simpler, but not a good fit for this environment – here, the reward is more concerned with what happened (was an invalid cut attempted? A new raw piece replenished?) than with the current state (i.e., the inventory state after the step). Hence, the reward calculation here is based on events (which is in general more flexible than using the environment state only).

The DefaultRewardAggregator does the following:

  • Requests the required event interfaces via get_interfaces (here CuttingEvents and InventoryEvents).

  • Collects rewards and penalties according to relevant events.

  • Aggregates the individual event rewards and penalties to a single scalar reward signal.

Note that this reward aggregator can have any form as long as it provides a scalar reward function that can be used for training. This gives a lot of flexibility in shaping rewards without the need to change the actual implementation of the environment (more on this topic).

reward/default_reward.py
from typing import List, Optional

from maze.core.annotations import override
from maze.core.env.maze_state import MazeStateType
from maze.core.env.reward import RewardAggregatorInterface
from ..env.events import CuttingEvents, InventoryEvents


class DefaultRewardAggregator(RewardAggregatorInterface):
    """Default reward scheme for the 2D cutting env.

    :param invalid_action_penalty: Negative reward assigned for an invalid cutting specification.
    :param raw_piece_usage_penalty: Negative reward assigned for starting a new raw inventory piece.
    """

    def __init__(self, invalid_action_penalty: float, raw_piece_usage_penalty: float):
        super().__init__()
        self.invalid_action_penalty = invalid_action_penalty
        self.raw_piece_usage_penalty = raw_piece_usage_penalty

    @override(RewardAggregatorInterface)
    def get_interfaces(self):
        """Specification of the event interfaces this subscriber wants to receive events from.
        Every subscriber must implement this configuration method.
        :return: A list of interface classes"""
        return [CuttingEvents, InventoryEvents]

    @override(RewardAggregatorInterface)
    def summarize_reward(self, maze_state: Optional[MazeStateType] = None) -> float:
        """Assign rewards and penalties according to respective events.

        :param maze_state: Not used by this reward aggregator.
        :return: Scalar reward for the current environment step.
        """

        rewards: List[float] = []

        # penalty for starting a new raw inventory piece
        for _ in self.query_events(InventoryEvents.piece_replenished):
            rewards.append(self.raw_piece_usage_penalty)

        # penalty for selecting an invalid piece for cutting
        for _ in self.query_events(CuttingEvents.invalid_piece_selected):
            rewards.append(self.invalid_action_penalty)

        # penalty for specifying invalid cutting parameters
        for _ in self.query_events(CuttingEvents.invalid_cut):
            rewards.append(self.invalid_action_penalty)

        return sum(rewards)
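
To illustrate this flexibility, the sketch below shows a hypothetical alternative aggregator (not part of the tutorial code) that additionally rewards valid cuts. It only re-combines the events already dispatched by the environment; nothing in the environment implementation needs to change.

reward/custom_reward.py (hypothetical sketch)
from typing import Optional

from maze.core.annotations import override
from maze.core.env.maze_state import MazeStateType
from maze.core.env.reward import RewardAggregatorInterface
from ..env.events import CuttingEvents, InventoryEvents


class ValidCutBonusRewardAggregator(RewardAggregatorInterface):
    """Hypothetical variant of the default aggregator that also rewards valid cuts."""

    def __init__(self, invalid_action_penalty: float, raw_piece_usage_penalty: float, valid_cut_bonus: float):
        super().__init__()
        self.invalid_action_penalty = invalid_action_penalty
        self.raw_piece_usage_penalty = raw_piece_usage_penalty
        self.valid_cut_bonus = valid_cut_bonus

    @override(RewardAggregatorInterface)
    def get_interfaces(self):
        """Event interfaces this subscriber wants to receive events from."""
        return [CuttingEvents, InventoryEvents]

    @override(RewardAggregatorInterface)
    def summarize_reward(self, maze_state: Optional[MazeStateType] = None) -> float:
        """Combine event-based bonuses and penalties into a single scalar reward."""
        reward = 0.0

        # bonus for every successful cut
        for _ in self.query_events(CuttingEvents.valid_cut):
            reward += self.valid_cut_bonus

        # penalties, exactly as in the default aggregator
        for _ in self.query_events(CuttingEvents.invalid_piece_selected):
            reward += self.invalid_action_penalty
        for _ in self.query_events(CuttingEvents.invalid_cut):
            reward += self.invalid_action_penalty
        for _ in self.query_events(InventoryEvents.piece_replenished):
            reward += self.raw_piece_usage_penalty

        return reward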
Updating the Core- and MazeEnv

We also have to make a few modifications in the CoreEnv:

  • Initialize the reward aggregator in the constructor.

  • Instead of accumulating the reward in the if-else branches of the step function, we summarize it only once at the end.

env/core_env.py
...

class Cutting2DCoreEnvironment(CoreEnv):
    """Environment for cutting 2D pieces based on the customer demand. Works as follows:
    ...
    :param reward_aggregator: Either an instantiated aggregator or a configuration dictionary.
    """

    def __init__(self, max_pieces_in_inventory: int, raw_piece_size: (int, int), static_demand: (int, int),
                 reward_aggregator: RewardAggregatorInterface):
        super().__init__()

        ...

        # init reward and register it with pubsub
        self.reward_aggregator = reward_aggregator
        self.pubsub.register_subscriber(self.reward_aggregator)

    def step(self, maze_action: Cutting2DMazeAction) -> Tuple[Cutting2DMazeState, np.array, bool, Dict[Any, Any]]:
        """Summary of the step (simplified, not necessarily respecting the actual order in the code):
        1. Check if the selected piece to cut is valid (i.e. in inventory, large enough etc.)
        2. Attempt the cutting
        3. Replenish a fresh piece if needed and return an appropriate reward

        :param maze_action: Cutting maze_action to take.
        :return: state, reward, done, info
        """

        info = {}
        replenishment_needed = False

        # check if valid piece id was selected
        if maze_action.piece_id >= self.inventory.size():
            self.cutting_events.invalid_piece_selected()
        # perform cutting
        else:
            piece_to_cut = self.inventory.pieces[maze_action.piece_id]

            # attempt the cut
            if self.inventory.cut(maze_action, self.current_demand):
                self.cutting_events.valid_cut(current_demand=self.current_demand, piece_to_cut=piece_to_cut,
                                              raw_piece_size=self.raw_piece_size)
                replenishment_needed = piece_to_cut == self.raw_piece_size
            else:
                # this event is turned into a penalty by the reward aggregator
                self.cutting_events.invalid_cut(current_demand=self.current_demand, piece_to_cut=piece_to_cut,
                                                raw_piece_size=self.raw_piece_size)

        # check if replenishment is required
        if replenishment_needed:
            self.inventory.replenish_piece()
            # the piece_replenished event dispatched above is turned into a penalty by the reward aggregator

        # step execution finished, write step statistics
        self.inventory.log_step_statistics()

        # compile env state
        maze_state = self.get_maze_state()

        # aggregate reward from events
        reward = self.reward_aggregator.summarize_reward(maze_state)

        return maze_state, reward, False, info

Finally, we update the maze_env_factory function for instantiating the trainable MazeEnv, and we are all set up for training with event-based, customized rewards.

env/maze_env.py
...


def maze_env_factory(max_pieces_in_inventory: int, raw_piece_size: (int, int),
                     static_demand: (int, int)) -> Cutting2DEnvironment:
    """Convenience factory function that compiles a trainable maze environment.
    (for argument details see: Cutting2DCoreEnvironment)
    """

    # init reward aggregator
    reward_aggregator = DefaultRewardAggregator(invalid_action_penalty=-2, raw_piece_usage_penalty=-1)

    # init core environment
    core_env = Cutting2DCoreEnvironment(max_pieces_in_inventory=max_pieces_in_inventory,
                                        raw_piece_size=raw_piece_size,
                                        static_demand=static_demand,
                                        reward_aggregator=reward_aggregator)

    # init maze environment including observation and action interfaces
    action_conversion = ActionConversion(max_pieces_in_inventory=max_pieces_in_inventory)
    observation_conversion = ObservationConversion(raw_piece_size=raw_piece_size,
                                                   max_pieces_in_inventory=max_pieces_in_inventory)
    return Cutting2DEnvironment(core_env, action_conversion, observation_conversion)
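
With the default penalties chosen above (invalid_action_penalty=-2, raw_piece_usage_penalty=-1), the per-step reward can be worked out directly from the events dispatched in the step function. The snippet below is purely illustrative (the step outcomes are hypothetical), but it shows what the aggregator returns for three typical situations.

# worked example of the event-based reward with the default penalties above
invalid_action_penalty = -2
raw_piece_usage_penalty = -1

# step A: the agent specifies an invalid cut -> one invalid_cut event
reward_step_a = invalid_action_penalty           # -2

# step B: a valid cut consumes a full raw piece -> one piece_replenished event
reward_step_b = raw_piece_usage_penalty          # -1

# step C: a valid cut on a leftover piece -> no penalty events at all
reward_step_c = 0                                #  0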
Where to Go Next

As the reward is implemented via a reward aggregator that is methodologically identical to the initial version, there is no need to retrain the model for now. However, we highly recommend proceeding with the more advanced tutorial on Structured Environments and Action Masking.

API Documentation

This page provides an overview of the Maze API documentation

Environment Interfaces

This page contains the reference documentation for environment interfaces.

maze.core.env

Environment interfaces:

BaseEnv

Interface definition for reinforcement learning environments defining the minimum required functionality for being considered an environment.

ActorID

Identifies an actor in the environment.

StructuredEnv

Interface for environments with sub-step structure, which is generally enough to cover multi-step, hierarchical and multi-agent environments.

CoreEnv

Interface definition for core environments forming the basis for actual RL trainable environments.

StructuredEnvSpacesMixin

This interface complements the StructuredEnv by action and observation spaces.

MazeEnv

Base class for (gym style) environments wrapping a core environment and defining state and execution interfaces.

RenderEnvMixin

Interface for rendering functionality in environments (compatible with gym env).

RecordableEnvMixin

This interface provides a standard way of exposing internal MazeState and MazeAction objects for trajectory data recording.

SerializableEnvMixin

This interface provides a standard way of exposing environment components whose state should be serialized together with the environment state object when for example recording trajectory data.

TimeEnvMixin

This interface provides a standard way of exposing environment time to external components and wrappers.

EventEnvMixin

This interface provides a standard way of attaching environment events to the log statistics system.

SimulatedEnvMixin

Environment interface for simulated environments.

Interfaces for additional components:

ObservationConversionInterface

Interface specifying the conversion of abstract environment state to the gym-compatible observation.

ActionConversionInterface

Interface specifying the conversion of agent actions to actual environment MazeActions.

MazeStateType

Internal indicator of special typing constructs.

MazeActionType

Internal indicator of special typing constructs.

RewardAggregatorInterface

Event aggregation object for reward customization and shaping.

EnvironmentContext

This class keeps track of services that can be employed by all objects of the agent-environment loop.

Environment Wrappers

This page contains the reference documentation for environment wrappers. Here you can find a more extensive write up on how to work with these.

Interfaces and Utilities

These are the wrapper interfaces, base classes and interfaces:

Wrapper

A transparent environment Wrapper that works with any manifestation of BaseEnv.

Types of Wrappers:

ObservationWrapper

A Wrapper with typing support modifying the environments observation.

ActionWrapper

A Wrapper with typing support modifying the agents action.

RewardWrapper

A Wrapper with typing support modifying the reward before passed to the agent.

WrapperFactory

Handles dynamic registration of Wrapper sub-classes.

Built-in Wrappers

Below you find the reference documentation for environment wrappers.

General Wrappers:

LogStatsWrapper

A statistics logging wrapper for BaseEnv.

MazeEnvMonitoringWrapper

A MazeEnv monitoring wrapper logging events for observations, actions and rewards.

ObservationVisualizationWrapper

An observation visualization wrapper that allows applying custom observation visualization functions, which are then shown in Tensorboard.

TimeLimitWrapper

Wrapper to limit the environment step count, equivalent to gym.wrappers.time_limit.

RandomResetWrapper

A wrapper skipping the first few steps by taking random actions.

SortedSpacesWrapper

This class wraps a given StructuredEnvSpacesMixin env to ensure that all observation- and action-spaces are sorted alphabetically.

NoDictSpacesWrapper

Wraps observations and actions by replacing dictionary spaces with the sole contained sub-space.

ObservationWrappers:

DictObservationWrapper

Wraps a single observation into a dictionary space.

ObservationStackWrapper

A wrapper stacking the observations of multiple subsequent time steps.

NoDictObservationWrapper

Wraps observations by replacing the dictionary observation space with the sole contained sub-space.

ActionWrappers:

DictActionWrapper

Wraps either a single action space or a tuple action space into dictionary space.

NoDictActionWrapper

Wraps actions by replacing the dictionary action space with the sole contained sub-space.

SplitActionsWrapper

Splits an action into separate ones.

DiscretizeActionsWrapper

The DiscretizeActionsWrapper provides functionality for discretizing individual continuous actions into discrete ones.

RewardWrappers:

RewardScalingWrapper

Scales original step reward by a multiplicative scaling factor.

RewardClippingWrapper

Clips original step reward to range [min, max].

ReturnNormalizationRewardWrapper

Normalizes step reward by dividing through the standard deviation of the discounted return.
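
The arithmetic behind the first two reward wrappers is straightforward. The snippet below is purely illustrative (the scaling factor and clipping range are arbitrary example values, not defaults):

# illustrative arithmetic only, not the wrapper implementations
raw_reward = -29.0

scaled = raw_reward * 0.1                    # RewardScalingWrapper-style scaling
clipped = max(-1.0, min(1.0, raw_reward))    # RewardClippingWrapper-style clipping to [-1, 1]

print(scaled, clipped)  # -2.9 -1.0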

Observation Pre-Processing Wrapper

Below you find the reference documentation for observation pre-processing. Here you can find a more extensive write up on how to work with the observation pre-processing package.

These are interfaces and components required for observation pre-processing:

PreProcessingWrapper

An observation pre-processing wrapper.

PreProcessor

Interface for observation pre-processors.

These are the available built-in maze.pre_processors compatible with the PreProcessingWrapper:

FlattenPreProcessor

An array flattening pre-processor.

OneHotPreProcessor

A one-hot encoding pre-processor for categorical features.

ResizeImgPreProcessor

An image resizing pre-processor.

TransposePreProcessor

An array transposition pre-processor.

UnSqueezePreProcessor

An un-squeeze pre-processor.

Rgb2GrayPreProcessor

An RGB-to-grayscale conversion pre-processor.

Observation Normalization Wrapper

Below you find the reference documentation for observation normalization. Here you can find a more extensive write up on how to work with the observation normalization package.

These are interfaces and utility functions required for observation normalization:

ObservationNormalizationWrapper

An observation normalization wrapper.

ObservationNormalizationStrategy

Abstract base class for normalization strategies.

obtain_normalization_statistics

Obtain the normalization statistics of a given environment.

estimate_observation_normalization_statistics

Helper function estimating normalization statistics.

make_normalized_env_factory

Wrap an existing env factory to assign the passed normalization statistics.

These are the available built-in maze.normalization_strategies compatible with the ObservationNormalizationWrapper:

MeanZeroStdOneObservationNormalizationStrategy

Normalizes observations to have zero mean and standard deviation one.

RangeZeroOneObservationNormalizationStrategy

Normalizes observations to value range [0, 1].
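
The math behind these two strategies is plain standardization and min-max scaling. The numpy snippet below only illustrates that arithmetic; in practice the ObservationNormalizationWrapper estimates the required statistics from sampled observations for you.

import numpy as np

observations = np.array([[2.0, 10.0], [4.0, 30.0], [6.0, 50.0]])

# MeanZeroStdOneObservationNormalizationStrategy: zero mean, unit standard deviation
mean, std = observations.mean(axis=0), observations.std(axis=0)
standardized = (observations - mean) / std

# RangeZeroOneObservationNormalizationStrategy: map every feature to [0, 1]
obs_min, obs_max = observations.min(axis=0), observations.max(axis=0)
scaled_01 = (observations - obs_min) / (obs_max - obs_min)

print(standardized)
print(scaled_01)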

Gym Environment Wrapper

Below you find the reference documentation for wrapping gym environments. Here you can find a more extensive write up on how to integrate Gym environments within Maze.

These are the contained components:

GymMazeEnv

Wraps a Gym env into a Maze environment.
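
A minimal usage sketch (assuming a Gym CartPole installation; the import path follows the maze.core.wrappers package layout):

from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv

# wrap the Gym CartPole env into a Maze environment with dictionary observation/action spaces
env = GymMazeEnv("CartPole-v0")

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())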

make_gym_maze_env

Initializes a GymMazeEnv by registered Gym env name (id).

GymCoreEnv

Wraps a Gym environment into a maze core environment.

GymRenderer

A Maze-style Gym renderer.

GymObservationConversion

A dummy conversion interface asserting that the observation is packed into a dictionary space.

GymActionConversion

A dummy conversion interface asserting that the action is packed into a dictionary space.

Event System, Logging & Statistics

This page contains the reference documentation for the event and logging system.

Event System

These are interfaces, classes and utility functions of the event system:

Subscriber

Event aggregation object.

Pubsub

Implementation of a message broker (Pubsub stands for publish and subscribe).

event_topic_factory

Constructs a proxy instance of the event interface, as required by EventService and LogStatsAggregator.

EventScope

Base class for all services that integrate with the event system and therefore use EventService as their backend.

EventService

Manages the recording of event invocations and provides simple event routing functionality.

EventCollection

A collection of EventRecord instances that can be queried by event specification.

EventRecord

This auxiliary class is used to record calls to the event interface.

Event Logging

These are the components of the event system:

StepEventLog

Logs all events dispatched by the environment during one step.

EpisodeEventLog

Keeps logs of all events dispatched by an environment during one episode.

KpiCalculator

Interface for calculating KPI metrics.

LogEventsWriterRegistry

Handles registration of event log writers.

LogEventsWriter

Interface for modules writing out the event log data.

LogEventsWriterTSV

Writes event logs into TSV files.

EventRow

Represents one row into the output file for the LogEventsWriterTSV.

SimpleEventLoggingSetup

Simple setup for logging of environment events with all their attributes.

ObservationEvents

Event topic class with logging statistics based only on observations, therefore applicable to any valid reinforcement learning environment.

ActionEvents

Event topic class with logging statistics based only on Gym space actions, therefore applicable to any valid reinforcement learning environment.

RewardEvents

Event topic class with logging statistics based only on rewards, therefore applicable to any valid reinforcement learning environment.

create_categorical_plot

Checks the type of value and calls the correct plotting function accordingly.

create_histogram

Creates simple matplotlib histogram of value.

create_relative_bar_plot

Counts the categories in value and prepares a relative bar plot of these.

create_violin_distribution

Creates simple matplotlib violin plot of value.

Statistics Logging

These are the components of the statistics logging system:

LogStatsEnv

Interface to access logging statistics generated by the environment.

LogStatsWriterConsole

Log statistics writer implementation for the console, mainly for debugging purposes.

LogStatsWriterTensorboard

Log statistics writer implementation for Tensorboard.

LogStatsLevel

Log statistics aggregation levels.

LogStatsConsumer

An interface to receive log statistics.

LogStatsAggregator

Complements the event system by providing aggregation functionality.

LogStatsWriter

A minimal interface concrete log statistics writers must implement.

GlobalLogState

Internal class that encapsulates the global state of the logging system.

LogStatsLogger

Auxiliary class returned by get_stats_logger.

register_log_stats_writer

Set the concrete writer implementation that will receive all successive statistics logging.

log_stats

Helper function.

increment_log_step

Notifies the logging system that the current step is finished.

get_stats_logger

Creates an object that can be used to pipe LogStatAggregator instances with the logging writers.

define_step_stats

Event method decorator, defines a new step statistics calculation for this event.

define_episode_stats

Event method decorator, defines a new episode statistics calculation for this event.

define_epoch_stats

Event method decorator, defines a new epoch statistics calculation for this event.

define_stats_grouping

Event method decorator, defines a grouping of all calculated statistics by an attribute.

define_plot

Event method decorator, defines a plot.
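
These decorators are stacked on the methods of an event topic class to define how raw event records are aggregated into step, episode and epoch statistics. The sketch below shows how such a topic might be declared; the decorator arguments mirror the "mean_episode_total" statistics shown in the tutorial output above, while the import path and exact signatures should be taken as assumptions rather than a verbatim excerpt.

from abc import ABC

import numpy as np

from maze.core.log_stats.event_decorators import define_epoch_stats, define_episode_stats, define_step_stats


class InventoryEvents(ABC):
    """Sketch of an event topic interface with statistics aggregation decorators."""

    @define_epoch_stats(np.mean, output_name="mean_episode_total")  # average the episode totals per epoch
    @define_episode_stats(sum)                                      # sum the per-step counts per episode
    @define_step_stats(len)                                         # count the events dispatched per step
    def piece_discarded(self, piece):
        """An inventory piece was discarded."""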

histogram

the histogram reducer function

LogStatsValue

Basic data structure for log statistics

LogStatsGroup

Basic data structure for log statistics

LogStatsKey

Basic data structure for log statistics

LogStats

Basic data structure for log statistics

Rendering

These are interfaces, classes and utility functions for the rendering system:

Renderer

Interface for renderers of individual environments.

StepStatsRenderer

Simple statistics rendering based on episode event logs.

EventStatsRenderer

Renders customizable statistics on top of event logs.

NotebookEventLogsViewer

Event logs viewer for Jupyter Notebooks, built using ipython widgets.

NotebookTrajectoryViewer

Trajectory viewer for Jupyter Notebooks, built using ipython widgets.

KeyboardControlledTrajectoryViewer

Render trajectory data with the possibility to browse back and forward through the episode steps using keyboard.

RendererArg

Interface for classes exposing arguments available at renderers.

IntRangeArg

Represents an argument which can take on a value of integer in a particular range.

OptionsArrayArg

Represents an argument where a single value can be chosen from an array of allowed options.

Trajectory Recorder

These are interfaces, classes and utility functions for recording trajectory data:

InMemoryDataset

Base class of trajectory data set for imitation learning that keeps all loaded data in memory.

DataLoadWorker

Data loading worker used to map states to actual observations.

TrajectoryProcessor

Base class for processing individual trajectories.

IdentityTrajectoryProcessor

Identity processing method

DeadEndClippingTrajectoryProcessor

Implementation of the dead-end-clipping preprocessor.

SpacesRecord

Record of spaces (i.e., raw action, observation, and associated data) from a single sub-step.

StepKeyType

The central part of internal API.

StructuredSpacesRecord

Records spaces (i.e., raw actions and observations) from a single environment step.

StateRecord

Keeps trajectory data for one step.

TrajectoryRecord

Common functionality of trajectory records.

StateTrajectoryRecord

Holds state record data (i.e., Maze states and actions, independent of the current action and observation space representations).

SpacesTrajectoryRecord

Holds structured spaces records (i.e., raw actions and observations recorded during a rollout).

MonitoringSetup

Simple setup for environment monitoring.

SimpleTrajectoryRecordingSetup

Simple setup for trajectory data recording.

TrajectoryWriterRegistry

Handles registration of trajectory data writers.

TrajectoryWriter

Interface for modules serializing the trajectory data.

TrajectoryWriterFile

Simple trajectory data writer.

General and Rollout Runners

This page contains the reference documentation for all kinds of runners.

General Runners

These are the basic interfaces, classes and utility functions of runners:

Runner

Runner interface for running Maze from CLI.

maze_run

Run a CLI task based on the provided configuration.

Rollout Runners

These are interfaces, classes and utility functions for rollout runners:

Here you can find the documentation for training runners.

RolloutRunner

General abstract class for rollout runners.

RolloutGenerator

Rolls out a given policy in a given environment, recording the trajectory (in the form of raw actions and observations).

SequentialRolloutRunner

Runs rollout in the local process.

ParallelRolloutRunner

Runs rollout in multiple processes in parallel.

ParallelRolloutWorker

Class encapsulating functionality performed in worker processes.

EpisodeRecorder

Keeps the statistics and event logs from the last episode so that it can then be shipped to the main process.

EpisodeStatsReport

Tuple for passing episode stats from workers to the main process.

ExceptionReport

Tuple for passing error reports from the workers to the main process.

Policies, Critics and Agents

This page contains the reference documentation for policies, critics and agents.

maze.core.agent

Policies:

FlatPolicy

Generic flat policy interface.

Policy

Structured policy class designed to work with structured environments.

TorchPolicy

Encapsulates multiple torch policies along with a distribution mapper for training and rollouts in structured environments.

PolicySubStepOutput

Dataclass for holding the output of the policy’s compute full output method

PolicyOutput

A structured representation of a policy output over a full (flat) environment step.

DefaultPolicy

Encapsulates one or more policies identified by policy IDs.

RandomPolicy

Implements a random structured policy.

DummyCartPolePolicy

Dummy structured policy for the CartPole env.

SerializedTorchPolicy

Structured policy used for rollouts of trained models.

Critics:

StateCritic

Structured state critic class designed to work with structured environments.

StateCriticStepOutput

State Critic step output; holds the output of a critic for an individual env step.

StateCriticOutput

Critic output holds the output of a critic for one full flat env step.

StateCriticStepInput

State Critic input for a single sub-step of the env, holding the tensor_dict and the actor_ids corresponding to where the embedding logits were retrieved if applicable, otherwise just the corresponding actor.

StateCriticInput

State Critic input defined as its own type, since it has to be explicitly built to be compatible with shared embedding networks.

TorchStateCritic

Encapsulates multiple torch state critics for training in structured environments.

TorchSharedStateCritic

One critic is shared across all sub-steps or actors (default to use for standard gym-style environments).

TorchStepStateCritic

Each sub-step or actor gets its individual critic.

TorchDeltaStateCritic

First sub-step gets a regular critic, subsequent sub-steps predict a delta w.r.t. the previous sub-step's critic.

StateActionCritic

Structured state action critic class designed to work with structured environments.

TorchStateActionCritic

Encapsulates multiple torch state action critics for training in structured environments.

TorchSharedStateActionCritic

One critic is shared across all sub-steps or actors (default to use for standard gym-style environments).

TorchStepStateActionCritic

Each sub-step or actor gets its individual critic.

Models:

TorchModel

Base class for any torch model.

TorchActorCritic

Encapsulates a structured torch policy and critic for training actor-critic algorithms in structured environments.

Agent Deployment

This page contains the reference documentation for the Maze agent deployment components.

AgentDeployment

Encapsulates an agent, space interfaces and a stack of wrappers, to make the agent’s MazeActions accessible to an external env.

PolicyExecutor

Executes the provided policies in an Agent Deployment setting.

ActionCandidates

Action object for encapsulation of multiple action objects along with their respective probabilities.

MazeActionCandidates

MazeAction object for encapsulation of multiple MazeAction objects along with their respective probabilities.

ActionConversionCandidatesInterface

Wrapper for action conversion interface when working with multiple candidate actions/MazeActions.

ExternalCoreEnv

Acts as a CoreEnv in the env stack in agent deployment scenario.

Perception Module

This page contains the reference documentation of Maze Perception Module.

maze.perception.blocks

These are basic neural network building blocks and interfaces:

PerceptionBlock

Interface for all perception blocks.

ShapeNormalizationBlock

Perception block normalizing the input and de-normalizing the output tensor dimensions.

InferenceBlock

An inference block combining multiple perception blocks into one prediction module.

InferenceGraph

Models a perception module inference graph.

Feed Forward: these are built-in feed forward building blocks:

DenseBlock

A block containing multiple subsequent dense layers.

VGGConvolutionBlock

A block containing multiple subsequent vgg style convolutions.

StridedConvolutionBlock

A block containing multiple subsequent strided convolution layers.

GraphConvBlock

A block containing multiple subsequent graph convolution stacks.

GraphAttentionBlock

A block containing multiple subsequent graph (multi-head) attention stacks.

MultiHeadAttentionBlock

Implementation of a torch MultiHeadAttention block.

PointNetFeatureBlock

PointNet block allowing to embed a variable sized set of point observations into a fixed size feature vector via the PointNet mechanics.

Recurrent: these are built-in recurrent building blocks:

LSTMBlock

A block containing multiple subsequent LSTM layers followed by a final time-distributed dense layer with explicit non-linearity.

General: these are built-in general purpose building blocks:

FlattenBlock

A flattening block.

CorrelationBlock

A feature correlation block.

ConcatenationBlock

A feature concatenation block.

FunctionalBlock

A block applying a custom callable.

GlobalAveragePoolingBlock

A global average pooling block.

MaskedGlobalPoolingBlock

A block applying global pooling with optional masking.

MultiIndexSlicingBlock

A multi-index-slicing block.

RepeatToMatchBlock

A repeat-to-match block.

SelfAttentionConvBlock

Implementation of a self-attention block as described by reference: https://arxiv.org/abs/1805.08318

SelfAttentionSeqBlock

Implementation of a self-attention block as described by reference: https://arxiv.org/abs/1706.03762

SliceBlock

A slicing block.

ActionMaskingBlock

An action masking block.

TorchModelBlock

A block transforming a common nn.Module to a shape-normalized Maze perception block.

Joint: these are built-in joint building blocks combining multiple perception blocks:

FlattenDenseBlock

A block containing a flattening stage followed by a dense layer block.

VGGConvolutionDenseBlock

A block containing multiple subsequent vgg style convolution stacks followed by flattening and a dense layer block.

VGGConvolutionGAPBlock

A block containing multiple subsequent vgg style convolution stacks followed by global average pooling.

StridedConvolutionDenseBlock

A block containing multiple subsequent strided convolutions followed by flattening and a dense layer block.

LSTMLastStepBlock

A block containing a LSTM perception block followed by a Slicing Block keeping only the output of the final time step.

maze.perception.builders

These are template model builders:

BaseModelBuilder

Base class for perception default model builders.

ConcatModelBuilder

A model builder that first processes individual observations, concatenates the resulting latent spaces and then processes this concatenated output to action and value outputs.
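
Conceptually, this builder follows a "process, concatenate, predict" template: each observation is embedded separately, the latent vectors are concatenated, and the result is fed into the output heads. The torch sketch below only illustrates this idea; it is not the Maze builder itself and all names in it are made up for the example.

import torch
import torch.nn as nn


class ConcatStyleNet(nn.Module):
    """Toy network following the 'process, concatenate, predict' template."""

    def __init__(self, obs_dims, latent_dim=32, n_actions=4):
        super().__init__()
        # one embedding per observation
        self.embeddings = nn.ModuleList([nn.Linear(dim, latent_dim) for dim in obs_dims])
        # shared head operating on the concatenated latent space
        self.head = nn.Linear(latent_dim * len(obs_dims), n_actions)

    def forward(self, observations):
        latents = [torch.relu(emb(obs)) for emb, obs in zip(self.embeddings, observations)]
        return self.head(torch.cat(latents, dim=-1))


net = ConcatStyleNet(obs_dims=[6, 3])
logits = net([torch.zeros(1, 6), torch.zeros(1, 3)])  # -> shape (1, 4)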

maze.perception.models

These are model composers and components:

BaseModelComposer

Abstract baseclass and interface definitions for model composers.

TemplateModelComposer

Composes template models from configs.

CustomModelComposer

Composes models from explicit model definitions.

SpacesConfig

Represents configuration of environment spaces (action & observation) used for model config.

These are maze.perception.models.policies

BasePolicyComposer

Interface for policy (actor) network composers.

ProbabilisticPolicyComposer

Composes networks for probabilistic policies.

These are maze.perception.models.critics:

CriticComposerInterface

Interface for critic (value function) network composers.

BaseStateCriticComposer

Interface for critic (value function) network composers.

SharedStateCriticComposer

One critic is shared across all sub-steps or actors (default to use for standard gym-style environments).

StepStateCriticComposer

Each sub-step or actor gets its individual critic.

DeltaStateCriticComposer

First sub-step gets a regular critic, subsequent sub-steps predict a delta w.r.t. the previous sub-step's critic.

StateCriticComposer

alias of maze.perception.models.critics.step_state_critic_composer.StepStateCriticComposer

BaseStateActionCriticComposer

Interface for state action (Q) critic network composers.

SharedStateActionCriticComposer

One critic is shared across all sub-steps or actors (default to use for standard gym-style environments).

StepStateActionCriticComposer

Each sub-step or actor gets its individual critic.

StateActionCriticComposer

alias of maze.perception.models.critics.step_state_action_critic_composer.StepStateActionCriticComposer

These are the maze.perception.models.built_in models:

FlattenConcatBaseNet

Base flatten and concatenation model for policies and critics.

FlattenConcatPolicyNet

Flatten and concatenation policy model.

FlattenConcatStateValueNet

Flatten and concatenation state value model.

maze.perception.perception_utils

These are some helper functions when working with the perception module:

observation_spaces_to_in_shapes

Convert an observation space to the input shapes for the neural networks

flatten_spaces

Merges an iterable of dictionary spaces (usually observations or actions from subsequent sub-steps) into a single dictionary containing all the items.

stack_and_flatten_spaces

Merges an iterable of dictionary spaces (usually observations or actions from subsequent sub-steps) into a single dictionary containing all the items.

convert_to_torch

Converts any struct to torch.Tensors.

convert_to_numpy

Converts torch tensors to numpy arrays.

maze.perception.weight_init

These are some helper functions for initializing model weights:

make_module_init_normc

Compiles normc weight initialization function initializing module weights with normc_initializer and biases with zeros.

compute_sigmoid_bias

Compute the bias value for a sigmoid activation function such as in multi-binary action spaces (Bernoulli distributions).

Action Spaces and Distributions Module

This page contains the reference documentation of Maze Action Spaces and Distributions Module.

These are interfaces, classes and utility functions:

ProbabilityDistribution

Base class for all probability distributions.

TorchProbabilityDistribution

Base class for wrapping Torch probability distributions.

DistributionMapper

Provides a mapping of spaces and action heads to the respective probability distributions to be used.

atanh

Computes the arc-tangent hyperbolic.

tensor_clamp

Clamping with tensor and broadcast support.

These are built-in Torch probability distributions:

CategoricalProbabilityDistribution

Categorical Torch probability distribution.

BernoulliProbabilityDistribution

Bernoulli Torch probability distribution for multi-binary action spaces.

DiagonalGaussianProbabilityDistribution

Diagonal Gaussian (Normal) Torch probability distribution.

SquashedGaussianProbabilityDistribution

Tanh-squashed diagonal Gaussian (Normal) Torch probability distribution.

BetaProbabilityDistribution

Beta Torch probability distribution.

These are combined probability distributions:

MultiCategoricalProbabilityDistribution

Multi-categorical probability distribution.

DictProbabilityDistribution

Dictionary probability distribution.

Core Utilities

These are general interfaces, classes and utility functions:

override

Annotation for documenting method overrides.

unused

Function to annotate unused variables.

set_seeds_globally

Set random seeds for numpy, torch and python random number generators.

MazeSeeding

Manages the random seeding for maze.

flat_structured_space

Compiles a flat gym.spaces.Dict space from a structured environment space.

flat_structured_shapes

Flatten a dict of shape dicts to a single dict

read_config

Read YAML file into a dict

list_to_dict

Convert lists to int-indexed dicts.

EnvFactory

Helper class to instantiate an environment from configuration with the help of the Registry.

make_env_from_hydra

Create an environment instance from the hydra configuration, given the overrides.

Factory

Supports the creation of instances from configuration, that can be plugged into the environments (like demand generators or reward schemes).

ConfigType

Shorthand type for configuration corresponding to a single object.

CollectionOfConfigType

Shorthand type for a list or a dictionary of object parameters from the config files.

CumulativeMovingMeanStd

Maintains cumulative moving mean and std of incoming numpy arrays along axis 0.

Utilities

A collection of smaller auxiliary functions and classes:

maze.utils

SimpleStatsLoggingSetup

Helper class to simplify the statistics logging setup.

clear_global_state

Resets the seed and global state to ensure that consecutive tests run under the same preconditions.

setup_logging

Setup tensorboard logging, derive the logging directory from the script name.

Timeout

Timeout class, fires a TimeoutError after the given number of seconds elapsed.

tensorboard_to_pandas

Convert the tensorboard log to a pandas DataFrame.

Process

A wrapper for multiprocessing.Process that supports exception handling and return objects.

BColors

Colored command line output formatting

maze.hydra_plugins

MazeLocalLauncher

Custom Hydra launcher distributing the jobs in separate processes on the local machine.

LauncherConfig

Hardcoded launcher configuration, linking the hydra/launcher=local override to the MazeLocalLauncher class

Trainers and Training Runners

This page contains the reference documentation for trainers and training runners:

General

These are general interfaces, classes and utility functions for trainers and training runners:

Trainer

Interface for trainers.

TrainingRunner

Base class for training runner implementations.

TrainConfig

Top-level configuration structure.

ModelConfig

Model configuration structure.

AlgorithmConfig

Base class for all specific algorithm configurations.

ModelSelectionBase

Base class for model selection strategies.

BestModelSelection

Best model selection strategy.

Evaluator

Abstract interface for policy evaluation.

MultiEvaluator

Evaluates the given policy using multiple different evaluators (ran in sequence).

RolloutEvaluator

Evaluates a given policy by rolling it out and collecting the mean reward.

ValueTransform

Value transformation (e.g. scale reduction, see ReduceScaleValueTransform below).

ReduceScaleValueTransform

Scale reduction value transform according to Pohlen et al (2018).

support_to_scalar

Convert support vector to scalar by probability weighted interpolation.

scalar_to_support

Converts tensor of scalars into probability support vectors corresponding to the provided range.
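
The support-to-scalar conversion is a probability-weighted sum over the support values. The numpy snippet below only illustrates that arithmetic (support range and probabilities are arbitrary example values):

import numpy as np

support = np.arange(-2, 3)                      # support values [-2, -1, 0, 1, 2]
probs = np.array([0.0, 0.1, 0.2, 0.5, 0.2])     # probability mass over the support

scalar = float(np.sum(probs * support))         # probability-weighted interpolation
print(scalar)                                   # 0.8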

BaseReplayBuffer

Abstract interface for all replay buffer implementations.

UniformReplayBuffer

Replay buffer for off-policy learning.

Trainers

These are interfaces, classes and utility functions for built-in trainers:

Actor-Critics (AC)

ACRunner

Abstract baseclass of AC runners.

ACDevRunner

Runner for single-threaded training, based on SequentialVectorEnv.

ACLocalRunner

Runner for locally distributed training, based on SubprocVectorEnv.

ActorCritic

Base class for actor critic trainers.

ActorCriticEvents

Event interface, defining statistics emitted by the A2CTrainer.

A2C

Advantage Actor Critic.

A2CAlgorithmConfig

Algorithm parameters for multi-step A2C model.

PPO

Proximal Policy Optimization trainer.

PPOAlgorithmConfig

Algorithm parameters for multi-step PPO model.

IMPALA

Multi step advantage actor critic.

ImpalaAlgorithmConfig

Algorithm parameters for Impala.

ImpalaEvents

Events specific to the IMPALA algorithm, used to record and analyse its behaviour in more detail.

ImpalaRunner

Common superclass for IMPALA runners, implementing the main training controls.

ImpalaDevRunner

Runner for single-threaded training, based on SequentialVectorEnv.

ImpalaLocalRunner

Runner for locally distributed training, based on SubprocVectorEnv.

log_probs_from_logits_and_actions_and_spaces

Computes action log-probs from policy logits, actions and action_spaces.

from_logits

V-trace for softmax policies.

from_importance_weights

V-trace from log importance weights.

get_log_rhos

Given the selected log_probs for multi-discrete actions of the behaviour and target policies, we compute the log_rhos needed to calculate the V-trace targets.
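
In V-trace, the (log) importance weights are simply the difference between the target- and behaviour-policy action log-probabilities. The snippet below illustrates that relation with made-up numbers; it does not reproduce the signature of the Maze helper.

import numpy as np

target_log_probs = np.array([-0.7, -1.2, -0.3])     # log pi_target(a|s)
behaviour_log_probs = np.array([-1.0, -1.0, -0.5])  # log pi_behaviour(a|s)

log_rhos = target_log_probs - behaviour_log_probs
rhos = np.exp(log_rhos)                              # importance sampling ratios
print(rhos)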

SAC

Multi step soft actor critic.

SACAlgorithmConfig

Algorithm parameters for SAC.

SACEvents

Events specific to the SAC algorithm, used to record and analyse its behaviour in more detail.

SACRunner

Common superclass for SAC runners, implementing the main training controls.

SACDevRunner

Runner for single-threaded training, based on SequentialVectorEnv.

Evolutionary Strategies (ES)

ESTrainer

Trainer class for OpenAI Evolution Strategies.

ESAlgorithmConfig

Algorithm parameters for evolution strategies model.

ESEvents

Event interface, defining statistics emitted by the ESTrainer.

ESMasterRunner

Baseclass of ES training master runners (serves as basis for dev and other runners).

ESDevRunner

Runner config for single-threaded training, based on ESDummyDistributedRollouts.

SharedNoiseTable

A fixed length vector of deterministically generated pseudo-random floats.

Optimizer

Abstract baseclass of an optimizer to be used with ES.

SGD

Stochastic gradient descent with momentum

Adam

Adam optimizer

ESRolloutResult

Result structure for distributed rollouts.

ESDummyDistributedRollouts

Implementation of the ES distribution by running the rollouts synchronously in the same process.

ESDistributedRollouts

Abstract base class of ES rollout distribution.

ESAbortException

This exception is raised if the current rollout is intentionally aborted.

ESRolloutWorkerWrapper

The rollout generation is bound to a single worker environment by implementing it as a Wrapper class.

get_flat_parameters

Get the parameters of all sub-policies as a single flat vector.

set_flat_parameters

Overwrite the parameters of all sub-policies by a single flat vector.

Imitation Learning (IL) and Learning from Demonstrations (LfD)

ImitationEvents

Event interface defining statistics emitted by the imitation learning trainers.

BCRunner

Dev runner for imitation learning.

BCTrainer

Trainer for behavioral cloning learning.

BCAlgorithmConfig

Algorithm parameters for behavioral cloning.

BCValidationEvaluator

Evaluates a given policy on validation data.

BCLoss

Loss function for behavioral cloning.

Utilities

stack_numpy_dict_list

Stack list of dictionaries holding numpy arrays as values.

unstack_numpy_list_dict

Inverse of stack_numpy_dict_list().

compute_gradient_norm

Computes the cumulative gradient norm of all provided parameters.

stack_torch_dict_list

Stack list of dictionaries holding torch tensors as values.

Parallelization

This page contains the reference documentation for the parallelization module.

Vectorized Environments

These are interfaces, classes and utility functions for vectorized environments:

VectorEnv

Abstract base class for vectorised environments.

StructuredVectorEnv

Common superclass for the structured vectorised env implementations in Maze.

SequentialVectorEnv

Creates a simple wrapper for multiple environments, calling each environment in sequence on the current Python process.

SubprocVectorEnv

Creates a multiprocess wrapper for multiple environments, distributing each environment to its own process.

CloudpickleWrapper

Uses cloudpickle to serialize contents (otherwise multiprocessing tries to use pickle).

SinkHoleConsumer

Sink hole statistics consumer.

disable_epoch_level_stats

Disable collection of statistics on epoch level to save memory.

Distributed Actors

These are interfaces, classes and utility functions for distributed actors:

DistributedActors

The base class for all distributed actors.

SequentialDistributedActors

Dummy implementation of distributed actors that creates the actors as a list and collects their outputs sequentially in the local process.

SubprocDistributedActors

Basic distributed actors module using Python's multiprocessing.Process.

BaseDistributedWorkersWithBuffer

The base class for all distributed workers with buffer.

DummyDistributedWorkersWithBuffer

Dummy implementation of distributed workers with buffer creates the workers as a list.

Utilities

Reusable components used in multiple distribution scenarios:

BroadcastingContainer

Synchronizes policy updates and other information across workers on local machine.

BroadcastingManager

A wrapper around BaseManager, used for managing the broadcasting container in multiprocessing scenarios.

Run Context

This page contains the reference documentation for Maze’ high-level Python API RunContext. This article documents in detail how to work with a RunContext as well as its benefits and limitations

Utilities

ConfigurationAuditor

Checks specified RunContext configuration for consistency and prepares it for the initialization procedure.

ConfigurationLoader

A ConfigurationLoader loads and post-processes a particular configuration for RunContext.

RunMode

Available run modes for Python API, associated with the corresponding base config module names.

RunContextError

Exception indicating Error in RunContext.

InvalidSpecificationError

Exception indicating Error due to inconsistent specification in RunContext.

Run Context

RunContext

RunContext offers convenient access to consistently configured training and rollout capabilities with minimal setup, yet is flexible enough to enable manipulation of every configurable aspect of Maze.
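
A minimal sketch of the high-level API, along the lines of the example in the Maze README (default algorithm settings assumed; adjust the environment and training length to your needs):

from maze.api.run_context import RunContext
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv

# train with default settings on CartPole
rc = RunContext(env=lambda: GymMazeEnv("CartPole-v0"))
rc.train(n_epochs=1)

# run the trained policy in the environment
env = GymMazeEnv("CartPole-v0")
obs = env.reset()
for _ in range(10):
    action = rc.compute_action(obs)
    obs, reward, done, info = env.step(action)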

You can also find an extensive overview of Maze in the table of contents as well as the API documentation.

Spotlights

Below we list some of Maze's key features. The list is far from exhaustive, but it is nonetheless a nice starting point to dive into the framework.

Warning

This is a preliminary, non-stable release of Maze. It is not yet complete and not all of our interfaces have settled yet. Hence, there might be some breaking changes on our way towards the first stable release.

This project is powered by enliteAI. Any questions or feedback? Just get in touch.

Documentation Overview

Below you find an overview of the general Maze framework documentation, which is beyond the API documentation. The listed pages motivate and explain the underlying concepts but most importantly also provide code snippets and minimum working examples to quickly get you started.

Training

Here we show how to train a policy on a standard Gym or custom environment using algorithms and models from Maze. This guide focuses on the main mechanics of Maze training runs and also gives some pointers on how to customize training with custom environments (using the tutorial Maze 2D-cutting environment as an example), models, etc.

The figure below shows a conceptual overview of the Maze training workflow.

_images/training.png


In order to fully understand the configuration mechanisms used here, you should familiarize yourself with how Maze makes use of the Hydra configuration framework.

Example 1: Your First Training Run

We can train a policy from scratch on the Cartpole environment with default settings using the command:

$ maze-run -cn conf_train env=gym_env env.name=CartPole-v0

The -cn conf_train argument specifies that we would like to use conf_train.yaml as our root config file. This is needed because, by default, the configuration for rollouts is used.

Furthermore, we specify that gym_env configuration should be used, with CartPole-v0 as the Gym environment name. (For more information on how to read and customize the default configuration files, see Hydra overview.)

Such a training run consists of these main stages, loaded based on the default configuration provided by Maze:

  1. The full configuration is assembled via Hydra based on the available config files, the defaults set in the root config, and the overrides you provide via the CLI (see Hydra overview to understand more about this process). RunContext allows you to do this from within your Python script and offers some convenient functionality on top, e.g. passing object instances. See here for more information on the differences between the Python API, i.e. RunContext, and the CLI, i.e. maze-run, for training and rollouts.

  2. Hydra creates the output directory where all output files will be stored.

  3. The full configuration of the job is logged: (1) to standard output, (2) as a text entry to your Tensorboard logs, and (3) as a YAML file in the output directory.

  4. If the observation normalization wrapper is present, observation normalization statistics are collected and stored (note that no wrappers are applied by default).

  5. Policies and critics are initialized and their graphical depictions saved.

  6. The training starts, statistics are displayed in console and stored to a Tensorboard file, and current best model versions are saved (by default to state_dict.pt file).

  7. Once the training is done, final evaluation runs are performed and the final model versions are saved. (When the training ends depends on the training runner. Usually this is specified via the runner.n_epochs argument, but training can also stop early if there is no further improvement.)

As the job is running, you should see the statistics from the training and evaluation runs printed in your console, as mentioned in step 6:

...
********** Iteration 3 **********
 step|path                                                                                    |               value
=====|========================================================================================|====================
    4|eval                  DiscreteActionEvents  action                substep_0/action      |    [len:281, μ:0.5]
    4|eval                  BaseEnvEvents         reward                median_step_count     |              18.500
    4|eval                  BaseEnvEvents         reward                mean_step_count       |              28.100
    4|eval                  BaseEnvEvents         reward                total_step_count      |             928.000
    4|eval                  BaseEnvEvents         reward                total_episode_count   |              40.000
    4|eval                  BaseEnvEvents         reward                episode_count         |              10.000
    4|eval                  BaseEnvEvents         reward                std                   |              16.447
    4|eval                  BaseEnvEvents         reward                mean                  |              28.100
    4|eval                  BaseEnvEvents         reward                min                   |              16.000
    4|eval                  BaseEnvEvents         reward                max                   |              66.000
-> new overall best model 28.10000!
...

This main structure remains similar for all environment and training configurations.

Example 2: Customizing with Provided Components

When your Maze job is launched using maze-run from the CLI, the following happens under the hood:

  1. A job configuration is assembled by putting the available configuration files together with the overrides you specify as arguments to the run command. More on that can be found on the configuration documentation page, specifically in the Hydra overview.

  2. The complete assembled configuration is handed over to the Maze runner specified in the configuration (in the runner group). This runner then launches and manages the training (or any other) job.

The common points for customizing the training run correspond to the configuration groups listed in the training root config file, namely:

  • Environment (env configuration group), configuring which environment the training runs on, as well as customizing any other inner configuration of the environment, if available (like raw piece size in 2D cutting environment)

  • Training algorithm (algorithm configuration group), specifying the algorithm used and configuration for it

  • Model (model configuration group), specifying how the models for policies and (optionally) critics should be assembled

  • Runner (runner configuration group), specifying options for how the training is run (e.g. locally, in development mode, or using Ray on a Kubernetes cluster). The runner is also the main object responsible for administering the whole training run (and runners are thus specific to individual algorithms used).

Maze provides a host of configuration files useful for working with standard Gym environments and environments provided by Maze (such as the 2D cutting environment). Hence, to use these, it suffices to supply appropriate overrides, without writing any additional configuration files.

By default, the gym_env configuration is used, which allows us to specify the Gym env that we would like to instantiate:

$ maze-run -cn conf_train env=gym_env env.name=LunarLander-v2

With appropriate overrides, we can also include vector observation model and wrappers (providing normalization):

$ maze-run -cn conf_train env=gym_env env.name=LunarLander-v2 wrappers=vector_obs model=vector_obs

Alternatively, we could use the tutorial Cutting 2D environment:

$ maze-run -cn conf_train env=tutorial_cutting_2d_struct_masked \
wrappers=tutorial_cutting_2d model=tutorial_cutting_2d_struct_masked

Further, by default, the algorithm used is Evolution Strategies (the implementation is provided by Maze). To use a different algorithm, e.g. PPO with a shared critic, we just need to add the appropriate overrides:

$ maze-run -cn conf_train algorithm=ppo env=tutorial_cutting_2d_struct_masked \
  wrappers=tutorial_cutting_2d model=tutorial_cutting_2d_struct_masked

To see all the configuration files available out-of-the-box, check out the maze/conf package.

Example 3: Resuming Previous Training Runs

In case a training run fails (e.g. because your server goes down), there is no need to restart training from scratch. You can simply pass a previous experiment as input_dir, and the Maze trainers will initialize the model weights as well as all other relevant artifacts (such as normalization statistics) from the provided directory. Below you find a few examples of where this might be useful.

This is the initial training run:

$ maze-run -cn conf_train env=gym_env env.name=LunarLander-v2 algorithm=ppo

Once trained, we can resume this run with:

$ maze-run -cn conf_train env=gym_env env.name=LunarLander-v2 algorithm=ppo input_dir=outputs/<experiment-dir>

We could also resume training with a refined learning rate:

$ maze-run -cn conf_train env=gym_env env.name=LunarLander-v2 algorithm=ppo \
  algorithm.lr=0.0001 input_dir=outputs/<experiment-dir>

or even with a different (compatible) trainer such as A2C:

$ maze-run -cn conf_train env=gym_env env.name=LunarLander-v2 algorithm=a2c input_dir=outputs/<experiment-dir>

Training in Your Custom Project

While the default environments and configurations are nice to get started quickly or to test different approaches in standard scenarios, the primary focus of Maze is fully custom environments and models solving real-world problems (which are of course much more fun as well!).

The best place to start with a custom environment is the Maze step-by-step tutorial (already mentioned in the previous section), which shows how to implement a custom Maze environment from scratch, along with the respective configuration files (see also Hydra: Your Own Configuration Files).

Then, you can easily launch your environment by supplying your own configuration file (here we use one from the tutorial):

$ maze-run -cn conf_train env=tutorial_cutting_2d_struct_masked \
  wrappers=tutorial_cutting_2d model=tutorial_cutting_2d_struct_masked

For links to more customization options (like building custom models with Maze Perception Module), check out the Where to Go Next section.

While customizing other configuration groups listed in the previous section (e.g., algorithm, runner) is not needed as often, all of these can be customized in an analogous way (i.e., implement your own components that plug into the framework instead of the default ones, and then add your own config to be able to configure them from the command line). When using the Python API with RunContext, you can also bypass configuration files and plug in your instantiated components directly.

Plain Python Training

Maze also offers training from within your Python script. This can be done manually, by creating all necessary components yourself, or in a managed fashion, by utilizing RunContext, which provides managed training and rollout capabilities.

Managed Setup

RunContext initializes a context for running training and rollout with a shared configuration. Its functionality and interfaces are mostly congruent with the CLI’s; however, there are some significant differences (e.g. the ability to pass instantiated Python objects instead of relying exclusively on configuration dictionaries or files). See here for a more thorough introduction.
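
As a minimal sketch (assuming RunContext is imported from maze.api.run_context and that the algorithm and overrides arguments behave as described in the RunContext documentation), a managed training run roughly equivalent to the CLI examples above could look like this:

from maze.api.run_context import RunContext

# Managed training run, roughly equivalent to:
#   maze-run -cn conf_train env.name=CartPole-v0 algorithm=a2c algorithm.n_epochs=5
rc = RunContext(
    algorithm="a2c",
    overrides={"env.name": "CartPole-v0", "algorithm.n_epochs": 5},
)

# Launch the managed training run; outputs are written to the usual Hydra output directory
rc.train()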

Manual Setup

In most use cases, it will probably be more convenient to launch training directly from the CLI and just implement your custom components (wrappers, environments, models, etc.) as needed. However, the inner architecture of Maze should be sufficiently modular to allow you to modify just the parts that you want.

Because each of the algorithms included in Maze has slightly different needs, the usage will likely differ slightly. However, regardless of which algorithm you intend to use, the TrainingRunner subclasses offer good examples of what components you will need for launching training directly from Python.

Specifically, you’ll need to concentrate on the setup and run methods; setup takes as an argument the fully assembled Hydra configuration (which is printed to the command line every time you launch a job).

Together, these methods do roughly the following:

  • Instantiates the environment and policy components (some of this functionality is provided by the shared TrainingRunner superclass, as a large part of that is common for all training runners)

  • Assembles the policy and critics into a structured policy

  • Instantiates the trainer and any other components needed for training

  • Launches the training

TrainingRunner.setup initializes the runner and TrainingRunner.run launches the training.

For example, these are the setup and run methods taken directly from the evolution strategies runner:

    @override(TrainingRunner)
    def setup(self, cfg: DictConfig) -> None:
        """
        Setup the training master node.
        """

        super().setup(cfg)

        # --- init the shared noise table ---
        print("********** Init Shared Noise Table **********")
        self.shared_noise = SharedNoiseTable(count=self.shared_noise_table_size)

        # --- initialize policies ---

        torch_policy = TorchPolicy(networks=self._model_composer.policy.networks,
                                   distribution_mapper=self._model_composer.distribution_mapper, device="cpu")
        torch_policy.seed(self.maze_seeding.agent_global_seed)

        # support policy wrapping
        if self._cfg.algorithm.policy_wrapper:
            policy = Factory(Policy).instantiate(
                self._cfg.algorithm.policy_wrapper, torch_policy=torch_policy)
            assert isinstance(policy, Policy) and isinstance(policy, TorchModel)
            torch_policy = policy

        print("********** Trainer Setup **********")
        self._trainer = ESTrainer(
            algorithm_config=cfg.algorithm,
            torch_policy=torch_policy,
            shared_noise=self.shared_noise,
            normalization_stats=self._normalization_statistics
        )

        # initialize model from input_dir
        self._init_trainer_from_input_dir(trainer=self._trainer, state_dict_dump_file=self.state_dict_dump_file,
                                          input_dir=cfg.input_dir)

        self._model_selection = BestModelSelection(dump_file=self.state_dict_dump_file, model=torch_policy,
                                                   dump_interval=self.dump_interval)

    @override(TrainingRunner)
    def run(
            self,
            n_epochs: Optional[int] = None,
            distributed_rollouts: Optional[ESDistributedRollouts] = None,
            model_selection: Optional[ModelSelectionBase] = None
    ) -> None:
        """
        See :py:meth:`~maze.train.trainers.common.training_runner.TrainingRunner.run`.
        :param distributed_rollouts: The distribution interface for experience collection.
        :param n_epochs: Number of epochs to train.
        :param model_selection: Optional model selection class, receives model evaluation results.
        """

        print("********** Run Trainer **********")

        env = self.env_factory()
        env.seed(self.maze_seeding.generate_env_instance_seed())

        # run with pseudo-distribution, without worker processes
        self._trainer.train(
            n_epochs=self._cfg.algorithm.n_epochs if n_epochs is None else n_epochs,
            distributed_rollouts=self.create_distributed_rollouts(
                env=env, shared_noise=self.shared_noise,
                agent_instance_seed=self.maze_seeding.generate_agent_instance_seed()
            ) if distributed_rollouts is None else distributed_rollouts,
            model_selection=self._model_selection if model_selection is None else model_selection
        )
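
Putting these pieces together, a manual launch might look roughly like the sketch below. It reuses read_hydra_config and the Factory pattern shown elsewhere in this documentation; the exact import paths and config contents are assumptions, so treat this as a starting point rather than a verified recipe.

from maze.core.utils.config_utils import read_hydra_config
from maze.core.utils.factory import Factory
from maze.train.trainers.common.training_runner import TrainingRunner

# Assemble the full training configuration (same mechanism the CLI uses);
# config group overrides are passed as keyword arguments, analogous to CLI overrides
cfg = read_hydra_config(config_module="maze.conf", config_name="conf_train", env="gym_env")

# Instantiate the training runner declared in the runner config group,
# then set it up with the full config and launch the training
runner = Factory(TrainingRunner).instantiate(cfg.runner)
runner.setup(cfg)
runner.run()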

Where to Go Next

Rollouts

During rollouts, the agent interacts with a given environment, issuing actions obtained from a given policy (be it a heuristic or a trained policy).

Usually, the purpose of rollouts is either evaluation (or even deployment) of a given policy in a given environment, or collection of trajectory data. Collected trajectory data can later be used for further learning (e.g. imitation learning) or for inspecting the policy behavior more closely using trajectory viewers.

_images/rollouts.png

The First Rollout

Rollouts can be run from the command line, using the maze-run command. Rollout configuration (conf_rollout) is used by default. Hence, to run your first rollout, it suffices to execute:

$ maze-run env=gym_env env.name=CartPole-v0

This runs a rollout of a random policy on the CartPole environment. Statistics from the rollout are printed to the console, and trajectory data with event logs are stored in the output directory automatically configured by Hydra.

Alternatively, we might configure the rollouts to run just one episode in sequential mode and render the env (but more on that and other configuration options below):

$ maze-run env=gym_env env.name=CartPole-v0 runner=sequential runner.n_episodes=1 runner.render=true

Rollout Runner Configuration

Rollouts are run by rollout runners, which are agent- and environment-agnostic (for configuring environments and agents, see the following section).

By default, rollouts are run in multiple processes in parallel (as can be seen in the rollout configuration file, which lists runner: parallel in the defaults), and are handled by the ParallelRolloutRunner.

Alternatively, rollouts can be run sequentially in a single process by opting for the sequential runner configuration:

$ maze-run env=gym_env env.name=CartPole-v0 runner=sequential

This is mainly useful when running a single episode only or for debugging, as sequential rollouts are much slower.

The available configuration options for both scenarios are listed in the Hydra runner package (conf/runner/).

These are the parameters for parallel rollout runner:

# @package _group_
_target_: maze.core.rollout.parallel_rollout_runner.ParallelRolloutRunner

# Number of processes to run the rollouts in concurrently
n_processes: 5

# Total number of episodes to run
n_episodes: 50

# Max steps per episode to perform
max_episode_steps: 200

# If true, trajectory data will be recorded and stored in the trajectory_data directory
record_trajectory: true

# If true, event logs will be recorded and stored in the event_logs directory
record_event_logs: true

# (Note that the default output directory is handled by Hydra)

Using these parameters, we can modify the rollout to run in only 3 processes and to comprise 100 episodes with at most 50 steps each:

$ maze-run env=gym_env env.name=CartPole-v0 runner.n_processes=3 \
  runner.n_episodes=100 runner.max_episode_steps=50

(Alternatively, you can create your own configuration file and supply it to the maze-run command, as described in the Hydra primer section.)

Environment and Policy Configuration

The environment and policy are configured using the env and policy Hydra packages, respectively. Rollout runners are environment- and agent-agnostic and will attempt to instantiate the types specified in the config files using the Maze Factory.

The environment is expected to conform to the StructuredEnv interface, and the agent to the StructuredPolicy interface.
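
For illustration, the same Factory mechanism can also be used directly from Python. The sketch below builds a random policy from a config dictionary of the same shape a policy config file would provide; the Policy and Factory import paths are assumptions based on the class names used in the runner code above.

from maze.core.agent.policy import Policy
from maze.core.utils.factory import Factory
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv

env = GymMazeEnv("CartPole-v0")

# Config dictionary of the same shape a policy yaml file would provide
policy_config = {
    "_target_": "maze.core.agent.random_policy.RandomPolicy",
    "action_spaces_dict": env.action_spaces_dict,
}

# The Factory checks that the instantiated object conforms to the Policy interface
policy = Factory(Policy).instantiate(policy_config)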

For agents, there are the following example config files:

  • policy/random_policy.yaml for instantiating a class that conforms to the StructuredPolicy interface directly

  • policy/cutting_2d_greedy_policy (in maze-envs/logistics) for wrapping (potentially multiple) flat policies into a structured policy

  • policy/torch_policy (in maze/train) for loading and rolling out a policy trained using the Maze framework

Hence, after training a policy on the tutorial Cutting 2D environment:

$ maze-run -cn conf_train env=tutorial_cutting_2d_struct_masked \
  wrappers=tutorial_cutting_2d model=tutorial_cutting_2d_struct_masked

We can roll it out using:

$ maze-run policy=torch_policy env=tutorial_cutting_2d_struct_masked wrappers=tutorial_cutting_2d \
  model=tutorial_cutting_2d_struct_masked input_dir=outputs/[training-output-dir]

Note that for this to work, input_dir must point to the output directory of the training run (the model state dict and other configuration will be loaded from there).

Plain Python Configuration

Rollout runners are primarily designed to support running through Hydra from the command line. That being said, you can of course instantiate and use the runners directly in Python if you have special needs.

from maze.core.agent.dummy_cartpole_policy import DummyCartPolePolicy
from maze.core.rollout.sequential_rollout_runner import SequentialRolloutRunner
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv

# Instantiate an example environment and agent
env = GymMazeEnv("CartPole-v0")
agent = DummyCartPolePolicy()

# Run a sequential rollout with rendering
# (including an example wrapper the environment will be wrapped in)
sequential = SequentialRolloutRunner(
    n_episodes=10,
    max_episode_steps=100,
    record_trajectory=True,
    record_event_logs=True,
    render=True)
sequential.run_with(
    env=env,
    wrappers={"maze.core.wrappers.reward_scaling_wrapper.RewardScalingWrapper": {"scale": 0.1}},
    agent=agent)

Using the snippet above, you can run a rollout on any agent and environment directly from Python (parallel rollouts can be run similarly).

However, note that the rollout runners are currently designed to be run only once (which is their main use case for runs initiated from the command line). Running them repeatedly might cause issues, especially with statistics and event logging, as the runners initiate new writers every time (so you might get duplicate outputs), and some of these operations are order-sensitive (especially for parallel rollouts, where some state might be carried over to child processes).

Where to Go Next

If you collected trajectory data during the rollout, you might want to:

Deployment

In an experimental setting, deploying an agent means running a rollout on a given test environment and evaluating the results. However, in a real-world scenario, when we are dealing with a production environment, running a rollout is usually not so easily feasible.

The main difference between experimental rollouts and real-world environments is that production environments often do not follow the Gym model and cannot be easily stepped. Instead, the control flow is inverted, with the environment querying the agent for an action whenever it is ready.

_images/agent_deployment.png

The catch here is that in the Gym environment model, the wrappers that modify the environment behavior are considered to be a part of the environment (i.e., they are stepped together with it). However, during deployment, the production environment expects the wrapper stack to be maintained by the agent (after all, the production environment should not concern itself with the likes of observation post-processing and step-skipping).

The AgentDeployment component in Maze deals with exactly this: it packages the policy together with the wrapper stack and other components, so that in a production setting you only need to call act to get a processed action back, with things like statistics logging and observation frame stacking staying intact.

Building a Deployment Agent

A deployment agent can be built in two ways. The first is from an already-instantiated policy and environment (which may include a stack of wrappers):

from maze.core.agent_deployment.agent_deployment import AgentDeployment
from maze.test.shared_test_utils.dummy_env.agents.dummy_policy import DummyGreedyPolicy
from maze.test.shared_test_utils.helper_functions import build_dummy_maze_env

agent_deployment = AgentDeployment(
    policy=DummyGreedyPolicy(),
    env=build_dummy_maze_env()
)

The second is by providing configuration dictionaries for the policy and environment (and, optionally, for the wrappers), obtained from Hydra or elsewhere:

from maze.core.agent_deployment.agent_deployment import AgentDeployment
from maze.core.utils.config_utils import read_hydra_config

# Note: Needs to be a rollout config (`conf_rollout`), so the policy config is present as well
cfg = read_hydra_config(config_module="maze.conf", config_name="conf_rollout", env="gym_env")

agent_deployment = AgentDeployment(
    policy=cfg.policy,
    env=cfg.env,
    wrappers=cfg.wrappers
)

(The configuration structure here is shared with rollouts. To better understand it, see Rollouts.)

Alternatively, you can mix and match these approaches, providing an already-instantiated policy and an environment config, or vice versa.

After that, you can already start querying the agent for actions using the act method:

maze_action = agent_deployment.act(maze_state, reward, done, info)

When the episode is done, you should close the agent deployment. At this point, the agent deployment resets the env to write out statistics and ensure all wrappers finish the episode properly.

Note

Ensure that you query the agent for actions from a single episode only, in the order the states are encountered. Otherwise, parts of the wrapper stack (like stats logging or observation frame stacking) might become inconsistent, leading to wrong observations being passed to the policy.

Note

Currently, the Agent Deployment supports a single episode only. Once the episode is done, close the deployment and initialize a new instance. Support for continued resets will likely be added in the future.

The full working example below demonstrates agent deployment on the Gym CartPole environment. We initialize the agent deployment and then use external_env to simulate an external production environment from which we obtain states:

import gym

from maze.core.agent.random_policy import RandomPolicy
from maze.core.agent_deployment.agent_deployment import AgentDeployment
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv

env = GymMazeEnv("CartPole-v0")
policy = RandomPolicy(action_spaces_dict=env.action_spaces_dict)

agent_deployment = AgentDeployment(
    policy=policy,
    env=env
)

# Simulate an external production environment that does not use Maze
external_env = gym.make("CartPole-v0")

maze_state = external_env.reset()
reward, done, info = 0, False, {}

for i in range(10):
    # Query the agent deployment for maze action, then step the environment with it
    maze_action = agent_deployment.act(maze_state, reward, done, info)
    maze_state, reward, done, info = external_env.step(maze_action)

agent_deployment.close(maze_state, reward, done, info)

Notice that above, we are dealing with Maze states and Maze actions, i.e., in the format they come directly from the environment. The translation to policy-friendly format of actions and observations is handled as part of the wrapper stack (where they are passed through action/observation conversion interfaces and the individual wrappers).

How does this work under the hood?

When initializing the AgentDeployment, an existing environment including the wrapper stack is taken (or first initialized, if environment configuration was passed in).

Then, the core environment (the component just under the maze environment, providing the core functionality; see Maze Environment Hierarchy) is swapped out for the so-called ExternalCoreEnv, which is then executed in a run loop with the provided policy on a separate thread.

The external core env hijacks the execution of the step function and pauses the thread, waiting until a new maze_state object (with the associated reward, etc.) is passed in from the AgentDeployment, which runs on the main thread.

Despite what this looks like at first glance, this is not a concurrent setup. The threads are used only to hijack the execution in the step function; they never run concurrently during env stepping.

Either the main thread or the second thread is always paused. First, the environment on the second thread waits to obtain the next maze state from the agent deployment on the main thread. Then, the agent deployment waits for the environment run loop to reach the next step call and return the processed action.

Where to Go Next

Collecting and Visualizing Rollouts

While the Event System provides an overview of notable events happening during an episode through statistics and event logs, it is often necessary to dig deeper and visualize the full environment state at a given time step.

With the Maze Trajectory Viewer, it is possible to replay past episodes from collected trajectory data in a Jupyter Notebook.

_images/trajectory_viewer-overview.png

Requirements

Note

Rollout visualization in a notebook is not currently available for Gym environments.

The trajectory viewer notebook requires the environment to implement a Maze-compatible Renderer based on matplotlib. The tutorial 2D cutting environment serves as a perfect example – see the Adding a Renderer section to understand how to implement one.

Unfortunately, Maze does not yet support rendering from trajectory data for standard Gym environments. For such environments, you can render only during the rollout itself by setting the corresponding option on the sequential renderer (i.e., provide the following overrides for rollouts: runner=sequential runner.render=true).

Trajectory Data Collection

When using a compliant environment, past trajectories can be rendered directly from the trajectory data. These are usually collected using the rollout runners via the CLI.

To simply collect trajectory data of a heuristic policy on the tutorial Cutting 2D environment, run:

$ maze-run env=tutorial_cutting_2d_flat policy=tutorial_cutting_2d_greedy_policy

Alternatively (and closer to a real training setting), you might want to first train an RL policy on the tutorial 2D cutting environment:

$ maze-run -cn conf_train env=tutorial_cutting_2d_struct_masked \
  wrappers=tutorial_cutting_2d model=tutorial_cutting_2d_struct_masked

and then roll it out to collect the trajectory data (make sure to substitute the input_dir value for your actual training output directory):

$ maze-run policy=torch_policy env=tutorial_cutting_2d_struct_masked wrappers=tutorial_cutting_2d \
  model=tutorial_cutting_2d_struct_masked input_dir=outputs/[training-output-dir]

Once the rollout has run, take note of the outputs directory created by Hydra, where the trajectory data will be logged: by default inside the trajectory_data subdirectory, one pickle file per episode (identified by a UUID generated for each episode).

(Whether trajectory data is recorded during a rollout is set using the runner.record_trajectory flag, which is on by default.)

Trajectory Visualization

Maze includes a Jupyter Notebook in evaluation/viewer.ipynb that will guide you through the process. You only need to supply a path to the outputs directory where your trajectory data reside. The renderer will be automatically built from the trajectory data.

(Note that the notebook also lists example trajectory data in case you do not have any on hand.)

Once an episode is selected and loaded, it is possible to skim back and forth in time using the notebook widget slider (controllable by mouse or keyboard).

_images/trajectory_viewer-screen.png

Where to Go Next

  • To understand in more detail how to train a policy and then roll it out to collect trajectory data, check out Trainings and Rollouts.

  • Rendering and reviewing each time step in detail comes with a lot of overhead. In case you just want to see and easily compare notable events that happened across different episodes, you might want to review the Event system and how it is used to log statistics, KPIs, and raw events.

Imitation Learning and Fine-Tuning

Imitation learning refers to the task of learning a policy by imitating the behaviour of an existing teacher policy, usually represented as a fixed set of example trajectories. In some scenarios we might even have direct access to the actual teacher policy itself, allowing us to generate as many training trajectories as required. Imitation learning is especially useful for initializing a policy to quick-start a subsequent training-by-interaction run, or for settings where no training environment is available at all (e.g., offline RL).

Since imitation learning involves rollouts, it is not yet supported by RunContext. A guide for managed pure-Python imitation learning will be provided together with rollout support.

_images/imitation_workflow.png

Collect Training Trajectory Data

This section explains how to roll out a policy for collecting example trajectories. As the training trajectories might already be available (e.g., collected in practice), this step is optional.

As an example environment we pick the discrete version of the LunarLander environment, as it already provides a heuristic policy which we can use to collect our training trajectories for imitation learning.

_images/lunar_lander.png

But first, let’s check if the policy actually does something meaningful by running a few rendered rollouts:

maze-run env.name=LunarLander-v2 policy=lunar_lander_heuristics \
runner=sequential runner.render=true runner.n_episodes=3

Hopefully this looks good and we can continue with actually collecting example trajectories for imitation learning.

The command below performs 3 rollouts of the heuristic policy and records them to the output directory.

maze-run env.name=LunarLander-v2 policy=lunar_lander_heuristics runner.n_episodes=3

You will get the following output summarizing the statistics of the rollouts.

 step|path                                                                  |           value
=====|======================================================================|================
    1|rollout_stats    DiscreteActionEvents  action   substep_0/action      |[len:583, μ:1.2]
    1|rollout_stats    BaseEnvEvents         reward   median_step_count     |         200.000
    1|rollout_stats    BaseEnvEvents         reward   mean_step_count       |         194.333
    1|rollout_stats    BaseEnvEvents         reward   total_step_count      |         583.000
    1|rollout_stats    BaseEnvEvents         reward   total_episode_count   |           3.000
    1|rollout_stats    BaseEnvEvents         reward   episode_count         |           3.000
    1|rollout_stats    BaseEnvEvents         reward   std                   |          51.350
    1|rollout_stats    BaseEnvEvents         reward   mean                  |         190.116
    1|rollout_stats    BaseEnvEvents         reward   min                   |         121.352
    1|rollout_stats    BaseEnvEvents         reward   max                   |         244.720

The trajectories will be dumped in a file structure similar to the one shown below.

- outputs/<experiment_path>
    - maze_cli.log
    - event_logs
    - trajectory_data
        - 00653455-d7e2-4737-a82b-d6d1bfce12f7.pkl
        - ...

The pickle files contain the distinct episodes recorded as StateTrajectoryRecord objects, each containing a sequence of StateRecord objects, which keep the trajectory data for one step (state, action, reward, …).
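
If you want to quickly inspect such a file from Python, a minimal sketch could look as follows. The step_records attribute name is an assumption based on the record description above; adjust it if your Maze version exposes the per-step data differently.

import pickle

# Substitute one of the episode files from your trajectory_data directory
episode_file = "outputs/<experiment_path>/trajectory_data/<episode-uuid>.pkl"

with open(episode_file, "rb") as f:
    trajectory_record = pickle.load(f)  # a StateTrajectoryRecord

# step_records is assumed to hold one StateRecord per step (state, action, reward, ...)
print(f"episode length: {len(trajectory_record.step_records)}")
print(f"total reward:   {sum(step.reward for step in trajectory_record.step_records)}")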

Learn from Example Trajectories

Given the trajectories recorded in the previous step, we now train a policy with behavioural cloning, a simple version of imitation learning.

To do so, we simply provide the trajectory data as an argument and run:

maze-run -cn conf_train env.name=LunarLander-v2 model=vector_obs wrappers=vector_obs \
algorithm=bc algorithm.validation_percentage=50 \
runner.dataset.dir_or_file=<absolute_experiment_path>/trajectory_data

...
********** Epoch 24: Iteration 1500 **********
 step|path                                                                    |    value
=====|========================================================================|=========
   96|train     ImitationEvents       discrete_accuracy     0/action          |    0.948
   96|train     ImitationEvents       policy_loss           0                 |    0.150
   96|train     ImitationEvents       policy_entropy        0                 |    0.209
   96|train     ImitationEvents       policy_l2_norm        0                 |   42.416
   96|train     ImitationEvents       policy_grad_norm      0                 |    0.870
 step|path                                                                    |    value
=====|========================================================================|=========
   96|eval      ImitationEvents       discrete_accuracy     0/action          |    0.947
   96|eval      ImitationEvents       policy_loss           0                 |    0.152
   96|eval      ImitationEvents       policy_entropy        0                 |    0.207
-> new overall best model -0.15179!
...

As with all trainers, we can watch the training progress with Tensorboard.

tensorboard --logdir outputs/
_images/tb_imitation.png

Once training is complete, we can check how the behaviourally cloned policy performs in action.

maze-run env.name=LunarLander-v2 model=vector_obs wrappers=vector_obs \
policy=torch_policy input_dir=outputs/<imitation-learning-experiment>
 step|path                                                                 |           value
=====|=====================================================================|=================
    1|rollout_stats    DiscreteActionEvents  action    substep_0/action    |[len:8033, μ:1.2]
    1|rollout_stats    BaseEnvEvents         reward    median_step_count   |          186.000
    1|rollout_stats    BaseEnvEvents         reward    mean_step_count     |          160.660
    1|rollout_stats    BaseEnvEvents         reward    total_step_count    |         8033.000
    1|rollout_stats    BaseEnvEvents         reward    total_episode_count |           50.000
    1|rollout_stats    BaseEnvEvents         reward    episode_count       |           50.000
    1|rollout_stats    BaseEnvEvents         reward    std                 |          111.266
    1|rollout_stats    BaseEnvEvents         reward    mean                |          101.243
    1|rollout_stats    BaseEnvEvents         reward    min                 |         -164.563
    1|rollout_stats    BaseEnvEvents         reward    max                 |          282.895

With a mean reward of 101 this already looks like a promising starting point for RL fine-tuning.

Fine-Tune a Pre-Trained Policy

In this last section we show how to fine-tune the pre-trained policy with a model-free RL learner such as PPO. This is basically a standard PPO training run initialized with the imitation learning output.

maze-run -cn conf_train env.name=LunarLander-v2 model=vector_obs critic=template_state wrappers=vector_obs \
algorithm=ppo runner.eval_repeats=100 runner.critic_burn_in_epochs=10 \
input_dir=outputs/<imitation-learning-experiment>

Once training has started, we can observe the progress with Tensorboard (for the sake of clarity, we renamed the experiment directories for the screenshot below).

The Tensorboard log below compares the following experiments:

  • a randomly initialized policy trained with learning rate 0.0 (random-PPO-lr0)

  • a behavioural cloning pre-trained policy trained with learning rate 0.0 (pre_trained-PPO-lr0)

  • a randomly initialized policy trained with PPO (from_scratch-PPO)

  • a behavioural cloning pre-trained policy trained with PPO (pre_trained-PPO)

We also included training runs with a learning rate of 0.0 to get a feeling for the initial performance of the two models (randomly initialized vs. pre-trained).

_images/tb_reward_comparison.png

As expected, we see that PPO fine-tuning of the pre-trained model starts at a much higher initial reward level than the model trained entirely from scratch.

Although this is quite a simple example, it still nicely showcases the usefulness of this two-stage learning paradigm. For scenarios with delayed and/or sparse rewards, following this principle is often crucial to get the RL trainer to start learning at all.

Where to Go Next

Experiment Configuration

Launching experiments with the Maze command line interface (CLI) is based on the Hydra configuration system and hence also closely follows Hydra’s experimentation workflow. In general, there are different options for carrying out and configuring experiments with Maze. (To see experiment configuration in action, check out our project template.)

Command Line Overrides

To quickly play around with parameters in an interactive (temporary) fashion, you can utilize Hydra command line overrides to change parameters specified in the default config (e.g., conf_train).

$ maze-run -cn conf_train env.name=CartPole-v0 algorithm=ppo algorithm.lr=0.0001

The example above changes the trainer to PPO and optimizes with a learning rate of 0.0001. You can of course override any other parameter of your training and rollout runs.

For an in-depth explanation of the override concept, we refer to our Hydra documentation.

Experiment Config Files

For a more persistent way of structuring your experiments, you can also make use of Hydra’s built-in Experiment Configuration.

This allows you to maintain multiple experiment config files, each specifying only the changes to the default config (e.g., conf_train).

conf/experiment/cartpole_ppo_wrappers.yaml
# @package _global_

# defaults to override
defaults:
  - override /algorithm: ppo
  - override /wrappers: vector_obs

# overrides
algorithm:
  lr: 0.0001

The experiment config above sets the trainer to PPO, the learning rate to 0.0001 and additionally activates the vector_obs wrapper stack.

To start the training run with this config file, run:

$ maze-run -cn conf_train +experiment=cartpole_ppo_wrappers

You can find a more detailed explanation of how experiments are embedded in the overall configuration system in our Hydra experiment documentation.

Hyperparameter Optimization

Maze also supports hyperparameter optimization beyond vanilla grid search via Nevergrad (in case you have enough resources available).

Note

Hyperparameter optimization is not supported by RunContext yet.

You can start with the experiment template below and adapt it to your needs (for details on how to define the search space, we refer to the Hydra docs and this example).

conf/experiment/nevergrad.yaml
# @package _global_

# defaults to override
defaults:
  - override /algorithm: ppo
  - override /hydra/sweeper: nevergrad
  - override /hydra/launcher: local
  - override /runner: local

# set training runner concurrency
runner:
  concurrency: 0

# overrides
hydra:
  sweeper:
    optim:
      # name of the nevergrad optimizer to use
      # OnePlusOne is good at low budget, but may converge early
      optimizer: OnePlusOne
      # total number of function evaluations to perform
      budget: 100
      # number of parallel workers for performing function evaluations
      num_workers: 4
      # we want to maximize reward
      maximize: true

    # default parametrization of the search space
    parametrization:
      # a linearly-distributed scalar
      algorithm.lr:
        lower: 0.00001
        upper: 0.001
      algorithm.entropy_coef:
        lower: 0.0000025
        upper: 0.025

# Hint: make sure that runner.concurrency * hydra.sweeper.optim.num_workers <= CPUs

To start a hyperparameter optimization, run:

$ maze-run -cn conf_train env.name=Pendulum-v0 \
  algorithm.n_epochs=5 +experiment=nevergrad --multirun

Where to Go Next

Introducing the Perception Module

One of the key ingredients for successfully training RL agents in complex environments is their combination with powerful representation learners; in our case PyTorch-based neural networks. These enable the agent to perceive all kinds of observations (e.g. images, audio waves, sensor data, …), unlocking the full potential of the underlying RL-based learning systems.

Maze supports neural network building blocks via the Perception Module, which is responsible for transforming raw observations into standardized, learned latent representations. These representations are then utilized by the Action Spaces and Distributions Module to yield policy as well as critic outputs.

_images/perception_overview.png

This page provides a general introduction to the Perception Module (which we of course recommend reading). However, you can also start using the module right away and jump to the template or custom models section.

List of Features

Below we list the key features and design choices of the perception module:

  • Based on PyTorch.

  • Supports dictionary observation spaces.

  • Provides a large variety of neural network building blocks and model styles for customizing policy and value networks:

    • feed forward: dense, convolution, graph convolution and attention, …

    • recurrent: LSTM, last-step-LSTM, …

    • general purpose: action and observation masking, self-attention, concatenation, slicing, …

  • Provides shape inference, allowing custom models to be derived directly from observation space definitions.

  • Allows for environment-specific customization of existing network templates via YAML configuration.

  • Allows the definition of complex networks explicitly in Python using Maze perception blocks and/or PyTorch.

  • Generates detailed visualizations of policy and value networks (model graphs) containing the perception building blocks as well as all intermediate representations produced.

  • Can be easily extended with custom network components if necessary.

Perception Blocks

Perception blocks are components for composing models such as policy and value networks within Maze. They implement PyTorch’s nn.Module interface and encapsulate neural network functionality into distinct, reusable units. In order to handle all our requirements (listed in the motivation below), every perception block expects a tensor dictionary as input and produces a tensor dictionary as output.

_images/perception_block.png

Maze already supports a number of built-in neural network building blocks which are, like all other components, easily extendable.

Motivation: Maze introduces perception blocks to extend PyTorch’s nn.Module with shape inference to support the following features:

  1. To derive, generate and customize template models directly from observation and action space definitions.

  2. To visualize models and how these process observations to ultimately arrive at an action or value prediction.

  3. To seamlessly apply models at different stages of the RL development process without the need for extensive input reshaping, regardless of whether we perform distributed training using parallel rollout workers or deploy a single agent in production. The figure below shows a few examples of such scenarios.

_images/perception_dim_specification.png

Inference Blocks

The InferenceBlock, a special perception block, combines multiple perception blocks into one prediction module. This is convenient, allows us to easily reuse semantically connected parts of our models, and also enables us to derive and visualize inference graphs of these models. This is feasible because perception blocks operate on input and output tensor dictionaries, which can be easily linked into an inference graph.

The figure below shows a simple example of what such a graph can look like.

_images/inference_graph_example.png

Details:

  • The model depicted in the figure above takes two observations as inputs:

    • obs_inventory : a 16-dimensional feature vector

    • obs_screen : a 64 x 64 RGB image

  • obs_inventory is processed by a DenseBlock resulting in a 32-dimensional latent representation.

  • obs_screen is processed by a VGG-style model resulting in a 32-dimensional latent representation.

  • Next, these two representations are concatenated into a joint representation with dimension 64.

  • Finally, two LinearOutputBlocks yield the logits for the two distinct action heads.

Comments on visualization: Blue boxes are blocks, while red ones are tensors. The color depth of blocks (blue) indicates the number of parameters relative to the total number of parameters.

Model Composers

Model Composers, as the name suggests, compose the models and as such bring all components of the perception module together under one roof. In particular, they hold:

  • Definitions of observation and action spaces.

  • All defined models, that is, policies (multiple ones in multi-step scenarios) and critics (multiple ones in multi-step scenarios depending on the critic type).

  • The Distribution Mapper, mapping (possibly custom) probability distributions to action spaces.

Maze supports different types of model composers and we will show how to work with template and custom models in detail later on.

Implementing Custom Perception Blocks

In case you would like to implement and use custom components when designing your models, you can add new blocks by implementing:

  • The PerceptionBlock interface common for all perception blocks.

  • The ShapeNormalizationBlock interface normalizing the input and de-normalizing the output tensor dimensions if required for your block (optional).

  • The respective forward pass of your block.

The code snippet below shows a simple toy example, wrapping a linear layer into a Maze perception block.

"""Contains a single linear layer block."""
import builtins
from typing import Union, List, Sequence, Dict

import torch
from torch import nn as nn

from maze.core.annotations import override
from maze.perception.blocks.shape_normalization import ShapeNormalizationBlock

Number = Union[builtins.int, builtins.float, builtins.bool]


class MyLinearBlock(ShapeNormalizationBlock):
    """A linear output block holding a single linear layer.

    :param in_keys: One key identifying the input tensors.
    :param out_keys: One key identifying the output tensors.
    :param in_shapes: List of input shapes.
    :param output_units: Count of output units.
    """

    def __init__(self,
                 in_keys: Union[str, List[str]],
                 out_keys: Union[str, List[str]],
                 in_shapes: Union[Sequence[int], List[Sequence[int]]],
                 output_units: int):
        super().__init__(in_keys=in_keys, out_keys=out_keys, in_shapes=in_shapes, in_num_dims=2, out_num_dims=2)

        self.input_units = self.in_shapes[0][-1]
        self.output_units = output_units

        # initialize the linear layer
        self.net = nn.Linear(self.input_units, self.output_units)

    @override(ShapeNormalizationBlock)
    def normalized_forward(self, block_input: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        """implementation of :class:`~maze.perception.blocks.shape_normalization.ShapeNormalizationBlock` interface
        """
        # extract the input tensor of the first (and here only) input key
        input_tensor = block_input[self.in_keys[0]]
        # apply the linear layer
        output_tensor = self.net(input_tensor)
        # return the output tensor as a tensor dictionary
        return {self.out_keys[0]: output_tensor}

    def __repr__(self):
        """This is the text shown in the graph visualization."""
        txt = self.__class__.__name__
        txt += f"\nOut Shapes: {self.out_shapes()}"
        return txt
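
A quick usage sketch for the block above (the shapes are arbitrary illustration values; the call goes through the shape normalization of the base class):

import torch

# 16-dimensional input features, batch of 8 samples
block = MyLinearBlock(in_keys="observation", out_keys="latent",
                      in_shapes=[(16,)], output_units=32)

output_dict = block({"observation": torch.randn(8, 16)})
print(output_dict["latent"].shape)  # expected: torch.Size([8, 32])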

The Bigger Picture

The figure below shows how the components introduced in the perception module relate to each other.

_images/perception_bigger_picture.png

Where to Go Next

  • For further details please see the reference documentation.

  • Action Spaces and Distributions

  • Working with template models

  • Working with custom models

  • Pre-processing and observation normalization

Action Spaces and Distributions

In response to the states perceived and rewards received, RL agents interact with their environment by taking appropriate actions. Depending on the problem at hand there are different types of actions an agent must be able to deal with (e.g. categorical, binary, continuous, …).

To support this requirement, Maze introduces the Distribution Module, which builds on top of the Perception Module and allows you to fully customize which probability distributions to link with certain action spaces or even individual action heads.

List of Features

The distribution module provides the following key features:

  • Supports flat dictionary action spaces (nested dict spaces are not yet supported)

  • Supports a variety of different action spaces and probability distributions

  • Supports customization of which probability distribution to use for which action space or head

  • Supports action masking in combination with the perception module

  • Allows adding custom probability distributions whenever required

Action Spaces and Probability Distributions

Maze so far supports the following action space - probability distribution combinations.

Action Space      |  Available Distributions
==================|======================================================
Discrete          |  Categorical (default)
Multi-Discrete    |  Multi-Categorical (default)
(Multi-)Binary    |  Bernoulli (default)
Box (Continuous)  |  Diagonal-Gaussian (default), Beta, Squashed-Gaussian
Dict              |  DictProbabilityDistribution (default)

The DictProbabilityDistribution combines any of the other action spaces and distributions into a joint action space, in case your agent has to interact with the environment via different action space types at the same time.

Note that the table above does not always follow a one-to-one mapping. For a Box (continuous) action space, for example, you can choose a Diagonal-Gaussian distribution for an unbounded action space, or a Beta or Squashed-Gaussian distribution for a bounded action space. In other cases you might even want to add additional probability distributions according to the nature of the environment you are facing.

To allow for easy customization of the links between action spaces and distributions, Maze introduces the DistributionMapper, for which we show usage examples below.

Example 1: Mapping Action Spaces to Distributions

Adding the snippet below to your model config specifies the following:

  • Use Beta distributions for all Box action spaces.

  • All other action spaces behave as specified in the defaults.

# @package model
distribution_mapper_config:
  - action_space: gym.spaces.Box
    distribution: maze.distributions.beta.BetaProbabilityDistribution

Example 2: Mapping Actions to Distributions

Adding the snippet below to your model config specifies the following:

  • Use Beta distributions for all Box action spaces.

  • Use a Squashed-Gaussian distribution for the action with key “special_action”.

  • All other action spaces behave as specified in the defaults.

# @package model
distribution_mapper_config:
  - action_space: gym.spaces.Box
    distribution: maze.distributions.beta.BetaProbabilityDistribution
  - action_head: special_action
    distribution: maze.distributions.squashed_gaussian.SquashedGaussianProbabilityDistribution

When specifying custom behaviour for distinct action heads, make sure to add them below the more general action space configurations (i.e., get more specific from top to bottom).

Example 3: Using Custom Distributions

In case the probability distributions contained in Maze are not sufficient for your use case, you can of course add additional custom probability distributions.

# @package model
distribution_mapper_config:
  - action_space: gym.spaces.Discrete
    distribution: my_package.maze_extensions.distributions.CustomCategoricalProbabilityDistribution

The example above specifies that a CustomCategoricalProbabilityDistribution should be used for all discrete action spaces. When adding a new distribution you (1) have to implement the ProbabilityDistribution interface and (2) make sure that it is accessible within your Python path. Besides that, you only have to provide the reference path of the probability distribution you would like to use.
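
As a rough skeleton for such a module (the base class path and method names below are assumptions; consult the distribution reference documentation for the exact interface), a custom distribution could simply subclass one of the built-in distributions and adjust the behaviour you care about:

# my_package/maze_extensions/distributions.py
from maze.distributions.categorical import CategoricalProbabilityDistribution


class CustomCategoricalProbabilityDistribution(CategoricalProbabilityDistribution):
    """Illustrative categorical distribution that always samples greedily."""

    def sample(self):
        # fall back to the deterministic (arg-max) sample of the parent class
        return self.deterministic_sample()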

Example 4: Plain Python Configuration

For completeness we also provide a code snippet in plain Python showing how to:

  • Define a simple policy network.

  • Instantiate a default DistributionMapper.

  • Use the DistributionMapper to compute the required logits shapes for the Policy network.

  • Compute action logits from a random observation.

  • Instantiate the appropriate probability distribution and sample actions.

"""Minimum working example showing how to sample actions from a policy network."""
from typing import Dict, Sequence

import torch
from gym import spaces
from torch import nn

from maze.distributions.distribution_mapper import DistributionMapper

OBSERVATION_NAME = 'my_observation'
ACTION_NAME = 'my_action'


class PolicyNet(nn.Module):
    """Simple feed forward policy network."""

    def __init__(self,
                 obs_shapes: Dict[str, Sequence[int]],
                 action_logits_shapes: Dict[str, Sequence[int]]):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features=obs_shapes[OBSERVATION_NAME][0], out_features=16), nn.Tanh(),
            nn.Linear(in_features=16, out_features=action_logits_shapes[ACTION_NAME][0]))

    def forward(self, in_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        """ forward pass. """
        return {ACTION_NAME: self.net(in_dict[OBSERVATION_NAME])}


# init default distribution mapper
distribution_mapper = DistributionMapper(
    action_space=spaces.Dict(spaces={ACTION_NAME: spaces.Discrete(2)}),
    distribution_mapper_config={})

# request required action logits shape and init a policy net
logits_shape = distribution_mapper.required_logits_shape(ACTION_NAME)
policy_net = PolicyNet(obs_shapes={OBSERVATION_NAME: (4,)},
                       action_logits_shapes={ACTION_NAME: logits_shape})

# compute action logits (here from random input)
logits_dict = policy_net({OBSERVATION_NAME: torch.randn(4)})

# init action sampling distribution from model output
dist = distribution_mapper.logits_dict_to_distribution(logits_dict, temperature=1.0)

# sample action (e.g., {my_action: 1})
action = dist.sample()

The Bigger Picture

The figure below relates the distribution module with the overall workflow.

_images/distribution_module.png

The distribution mapper takes the (dictionary) action space as an input and links the action spaces with the respective probability distributions specified in the config. Action logits are learned on top of the representation produced by the perception module where each probability distribution specifies its expected logits shape.

Where to Go Next

  • For further details please see the reference documentation.

  • Processing raw observations with the Maze Perception Module.

  • Customizing models with Hydra.

Working with Template Models

The Maze template model composer allows us to compose policy and value networks directly from an environment’s observation and action space specification according to a selected model template and a corresponding model config. The central part of a template model composer is the Model Builder holding an Inference Block template (architecture template), which is then instantiated according to the config.

Next, we will introduce the general working principles. However, you can of course directly jump to the examples below to see how to build a feed forward as well as a recurrent policy network using the ConcatModelBuilder or check out how to work with simple single observation and action environments.

List of Features

A template model supports the following features:

  • Works with dictionary observation spaces.

  • Maps individual observations to modalities via the Observation Modality Mapping.

  • Allows individually assigning Perception Blocks to modalities via the Modality Config.

  • Allows picking architecture templates defining the underlying modal structure via Maze Model Builders.

  • Cooperates with the Distributions Module supporting customization of action and value outputs.

  • Allows individually specifying shared embedding keys for actor-critic models; this enables shared embeddings between actor and critic.

_images/template_model_builder.png

Note

Maze so far does not support “end-to-end” default behaviour but instead provides config templates, which can be adapted to the respective needs. We opted for this route as complete defaults might lead to unintended and non-transparent results.

Model Builders (Architecture Templates)

This section lists and describes the available Model Builder architecture templates. Before we describe the builder instances in more detail, we provide some information on the available block types:

  • Fixed: these blocks are fixed and are applied by the model builder by default.

  • Preset: these blocks are predefined for the respective model builder. They are basically placeholders for which you can specify the perception blocks they should hold.

  • Custom: these blocks are introduced by the user for processing the respective observation modalities (types) such as features or images.

ConcatModelBuilder (Reference Documentation)

_images/concat_model_builder.png

Model builder details:

  • Processes the individual observations with modality blocks (custom).

  • Joins the respective modality hidden representations via a concatenation block (fixed).

  • The resulting representation is then further processed by the hidden and recurrence block (preset).

Example 1: Feed Forward Models

In this example we utilize the ConcatModelBuilder to compose a feed forward actor-critic model processing two observations for predicting two actions and one critic (value) output.

Observation Space:

  • observation_inventory : a 16-dimensional feature vector

  • observation_screen : a 64 x 64 RGB image

Action Space:

  • action_move : a categorical action with four options deciding to move [UP, DOWN, LEFT, RIGHT]

  • action_use : a 16-dimensional multi-binary action deciding which item to use from inventory

The model config is defined as:

# @package model
_target_: maze.perception.models.template_model_composer.TemplateModelComposer

# specify distribution mapping
# (here we use a default distribution mapping)
distribution_mapper_config: []

# specifies the architecture of default models
model_builder:
  _target_: maze.perception.builders.ConcatModelBuilder

  # Specify up to which keys the embedding should be shared between actor and critic
  shared_embedding_keys: ~

  # specifies the modality type of each observation
  observation_modality_mapping:
    observation_inventory: feature
    observation_screen: image

  # specifies with which block to process a modality
  modality_config:
    # modality processing
    feature:
      block_type: maze.perception.blocks.DenseBlock
      block_params:
        hidden_units: [32, 32]
        non_lin: torch.nn.ReLU
    image:
      block_type: maze.perception.blocks.VGGConvolutionDenseBlock
      block_params:
        hidden_channels: [8, 16, 32]
        hidden_units: [32]
        non_lin: torch.nn.ReLU
    # preserved keys for the model builder
    hidden:
      block_type: maze.perception.blocks.DenseBlock
      block_params:
        hidden_units: [128]
        non_lin: torch.nn.ReLU
    recurrence: {}

# select policy type
policy:
  _target_: maze.perception.models.policies.ProbabilisticPolicyComposer

# select critic type
critic:
  _target_: maze.perception.models.critics.StateCriticComposer

Details:

  • Models are composed by the Maze TemplateModelComposer.

  • No specific action space and probability distribution overrides are specified.

  • The model is based on the ConcatModelBuilder architecture template.

  • No shared embedding is used.

  • Observation observation_inventory is mapped to the user specified custom modality feature.

  • Observation observation_screen is mapped to the user specified custom modality image.

  • Modality Config:

    • Modalities of type feature are processed with a DenseBlock.

    • Modalities of type image are processed with a VGGConvolutionDenseBlock.

    • The concatenated joint space is processed with another DenseBlock.

    • No recurrence is employed.

The resulting inference graphs for an actor-critic model are shown below:

_images/ff_concat_policy_graph.png _images/ff_concat_critic_graph.png

Example 2: Recurrent Models

In this example we utilize the ConcatModelBuilder to compose a recurrent actor-critic model for the previous example.

# @package model
_target_: maze.perception.models.template_model_composer.TemplateModelComposer

# specify distribution mapping
# (here we use a default distribution mapping)
distribution_mapper_config: []

# specifies the architecture of default models
model_builder:
  _target_: maze.perception.builders.ConcatModelBuilder

  # Specify up to which keys the embedding should be shared between actor and critic
  shared_embedding_keys: ~

  # specifies the modality type of each observation
  observation_modality_mapping:
    observation_inventory: feature
    observation_screen: image

  # specifies with which block to process a modality
  modality_config:
    # modality processing
    feature:
      block_type: maze.perception.blocks.DenseBlock
      block_params:
        hidden_units: [32, 32]
        non_lin: torch.nn.ReLU
    image:
      block_type: maze.perception.blocks.VGGConvolutionDenseBlock
      block_params:
        hidden_channels: [8, 16, 32]
        hidden_units: [32]
        non_lin: torch.nn.ReLU
    # preserved keys for the model builder
    hidden:
      block_type: maze.perception.blocks.DenseBlock
      block_params:
        hidden_units: [128]
        non_lin: torch.nn.ReLU
    recurrence:
      block_type: maze.perception.blocks.LSTMLastStepBlock
      block_params:
        hidden_size: 32
        num_layers: 1
        bidirectional: False
        non_lin: torch.nn.SELU

# select policy type
policy:
  _target_: maze.perception.models.policies.ProbabilisticPolicyComposer

# select critic type
critic:
  _target_: maze.perception.models.critics.StateCriticComposer

Details:

  • The main part of the model is identical to the example above.

  • However, the example adds an additional recurrent block (LSTMLastStepBlock) considering not only the present but also the k previous time steps for its action and value predictions.

The resulting inference graphs for a recurrent actor-critic model are shown below:

_images/rnn_concat_policy_graph.png _images/rnn_concat_critic_graph.png

Example 3: Single Observation and Action Models

Even though designed for more complex models which process multiple observations and predict multiple actions at the same time, you can of course also compose models for simpler use cases.

In this example we utilize the ConcatModelBuilder to compose an actor-critic model for OpenAI Gym’s CartPole Env. CartPole has an observation space with dimensionality four and a discrete action space with two options.

The model config is defined as:

# @package model
_target_: maze.perception.models.template_model_composer.TemplateModelComposer

# specify distribution mapping
# (here we use a default distribution mapping)
distribution_mapper_config: []

# specifies the architecture of default models
model_builder:
  _target_: maze.perception.builders.ConcatModelBuilder

  # Specify up to which keys the embedding should be shared between actor and critic
  shared_embedding_keys: ~

  # specifies the modality type of each observation
  observation_modality_mapping:
    observation: feature

  # specifies with which block to process a modality
  modality_config:
    # modality processing
    feature:
      block_type: maze.perception.blocks.DenseBlock
      block_params:
        hidden_units: [32, 32]
        non_lin: torch.nn.ReLU
    # preserved keys for the model builder
    hidden: {}
    recurrence: {}

# select policy type
policy:
  _target_: maze.perception.models.policies.ProbabilisticPolicyComposer

# select critic type
critic:
  _target_: maze.perception.models.critics.StateCriticComposer

The resulting inference graphs for an actor-critic model are shown below:

_images/cartpole_concat_policy_graph1.png _images/cartpole_concat_critic_graph1.png

Details:

  • When there is only one observation, as for the present example, concatenation acts simply as an identity mapping of the previous output tensor (in this case observation_DenseBlock).
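
Such a template model config can be selected for a training run via Hydra. As a hedged sketch, assuming the config above is registered as a Hydra model config named my_cartpole_template (a hypothetical name), training could be launched with:

$ maze-run -cn conf_train env.name=CartPole-v0 algorithm=a2c model=my_cartpole_template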

Example 4: Shared Embedding Feed Forward Model

In the case of high-dimensional input spaces it can be useful to share the embedding between the actor and critic network when training with an actor-critic algorithm. In this example we showcase how this can be done with the template model composer, building on Example 1. Everything stays the same, with one small exception: we now specify the shared embedding key in the model builder config to be ['latent'], as can be seen below.

The model config is defined as:

# @package model
_target_: maze.perception.models.template_model_composer.TemplateModelComposer

# specify distribution mapping
# (here we use a default distribution mapping)
distribution_mapper_config: []

# specifies the architecture of default models
model_builder:
  _target_: maze.perception.builders.ConcatModelBuilder

  # Specify up to which keys the embedding should be shared between actor and critic
  shared_embedding_keys: ['latent']

  # specifies the modality type of each observation
  observation_modality_mapping:
    observation_inventory: feature
    observation_screen: image

  # specifies with which block to process a modality
  modality_config:
    # modality processing
    feature:
      block_type: maze.perception.blocks.DenseBlock
      block_params:
        hidden_units: [32, 32]
        non_lin: torch.nn.ReLU
    image:
      block_type: maze.perception.blocks.VGGConvolutionDenseBlock
      block_params:
        hidden_channels: [8, 16, 32]
        hidden_units: [32]
        non_lin: torch.nn.ReLU
    # preserved keys for the model builder
    hidden:
      block_type: maze.perception.blocks.DenseBlock
      block_params:
        hidden_units: [128]
        non_lin: torch.nn.ReLU
    recurrence: {}

# select policy type
policy:
  _target_: maze.perception.models.policies.ProbabilisticPolicyComposer

# select critic type
critic:
  _target_: maze.perception.models.critics.StateCriticComposer

Now the output of the perception block ‘latent’ will be used as the input to the critic network. The resulting inference graphs for an actor-critic model are shown below:

_images/ff_shared_policy_graph.png _images/ff_shared_critic_graph.png

Where to Go Next

  • You can read up on our general introduction to the Perception Module.

  • Here we explain how to define and work with custom models in case the template models are not sufficient.

Working with Custom Models

The Maze custom model composer enables us to explicitly specify application specific models directly in Python. Models can be either written with Maze perception blocks or with plain PyTorch as long as they inherit from PyTorch’s nn.Module.

As such, models can be created easily, and even existing models from previous work or well-known papers can be reused with minor adjustments. However, we recommend creating models using the predefined perception blocks in order to speed up development as well as to take full advantage of features such as shape inference and graphical rendering of the models.

On this page we cover the features and general working principles. Afterwards we demonstrate the custom model composer with the examples below.

List of Features

The custom model composer supports the following features:

  • Specify complex models directly in Python.

  • Supports shape inference and shape checks for a given observation space when relying on Maze perception blocks.

  • Reuse existing PyTorch nn.Module models with minor modifications.

  • Stores a graphical rendering of the networks if the InferenceBlock is utilized.

  • Custom weight initialization and action head biasing.

  • Custom shared embedding between actor and critic.

_images/perception_custom_model_composer.png

The Custom Models Signature

The constraints we impose on any model used in conjunction with the custom model composer are threefold: first, the network class has to inherit from PyTorch’s nn.Module and implement the forward method. Second, a custom network class requires certain constructor arguments depending on the type of network (policy, state critic, …). And lastly, the forward method has to return a dictionary.

Policy Networks require the constructor arguments obs_shapes and action_logits_shapes. When models are built by the custom model composer, these two arguments are passed to the model constructor in addition to any other arguments specified. obs_shapes is a dictionary, mapping observation names to their corresponding shapes. Similarly, action_logits_shapes is a dictionary that maps action names to their corresponding action logits shapes. Both observation and action logits shapes are automatically inferred by the model composer.

  • inherit from nn.Module

  • constructor arguments: obs_shapes and action_logits_shapes

  • return type of forward method: Here the forward method has to return a dict, where the keys correspond to the actions of the environment.

State Critic Networks require only the constructor argument obs_shapes.

  • inherit from nn.Module

  • constructor arguments: obs_shapes

  • return type of forward method: The critic networks also have to return a dict, where the key is ‘value’.
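
To summarize this contract, the sketch below shows the bare minimum expected from custom networks (plain PyTorch; the observation key 'observation', the action key 'action' and the layer sizes are assumptions for illustration only; fully worked examples follow below).

from typing import Dict, Sequence

import torch
import torch.nn as nn


class MinimalPolicyNet(nn.Module):
    """Hypothetical skeleton of a custom policy network."""

    def __init__(self, obs_shapes: Dict[str, Sequence[int]], action_logits_shapes: Dict[str, Sequence[int]]):
        super().__init__()
        # obs_shapes and action_logits_shapes are injected by the custom model composer
        self.net = nn.Linear(obs_shapes['observation'][-1], action_logits_shapes['action'][-1])

    def forward(self, in_tensor_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        # the keys of the returned dict have to correspond to the actions of the environment
        return {'action': self.net(in_tensor_dict['observation'])}


class MinimalCriticNet(nn.Module):
    """Hypothetical skeleton of a custom state critic network."""

    def __init__(self, obs_shapes: Dict[str, Sequence[int]]):
        super().__init__()
        self.net = nn.Linear(obs_shapes['observation'][-1], 1)

    def forward(self, in_tensor_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        # the returned dict has to contain the key 'value'
        return {'value': self.net(in_tensor_dict['observation'])}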

Example 1: Simple Networks with Perception Blocks

Even though designed for more complex models that process multiple observations and predict multiple actions at the same time, you can of course also compose models for simpler use cases.

In this example we utilize the custom model composer in combination with the perception blocks to compose an actor-critic model for OpenAI Gym’s CartPole using a single dense block in each network. CartPole has an observation space with dimensionality four and a discrete action space with two options.

The policy model can then be defined as:

"""Shows how to use the custom model composer to build a custom policy network."""
from collections import OrderedDict
from typing import Dict, Union, Sequence, List

import numpy as np
import torch
import torch.nn as nn

from maze.perception.blocks.feed_forward.dense import DenseBlock
from maze.perception.blocks.inference import InferenceBlock
from maze.perception.blocks.output.linear import LinearOutputBlock
from maze.perception.weight_init import make_module_init_normc


class CustomCartpolePolicyNet(nn.Module):
    """Simple feed forward policy network.

    :param obs_shapes: The shapes of all observations as a dict.
    :param action_logits_shapes: The shapes of all actions as a dict structure.
    :param non_lin: The nonlinear activation to be used.
    :param hidden_units: A list of units per hidden layer.
    """

    def __init__(self, obs_shapes: Dict[str, Sequence[int]], action_logits_shapes: Dict[str, Sequence[int]],
                 non_lin: Union[str, type(nn.Module)], hidden_units: List[int]):
        super().__init__()

        # Maze relies on dictionaries to represent the inference graph
        self.perception_dict = OrderedDict()

        # build latent embedding block
        self.perception_dict['latent'] = DenseBlock(
            in_keys='observation', out_keys='latent', in_shapes=obs_shapes['observation'],
            hidden_units=hidden_units, non_lin=non_lin)

        # build action head
        self.perception_dict['action'] = LinearOutputBlock(
            in_keys='latent', out_keys='action', in_shapes=self.perception_dict['latent'].out_shapes(),
            output_units=int(np.prod(action_logits_shapes["action"])))

        # build inference block
        self.perception_net = InferenceBlock(
            in_keys='observation', out_keys='action', in_shapes=obs_shapes['observation'],
            perception_blocks=self.perception_dict)

        # apply weight init
        self.perception_net.apply(make_module_init_normc(1.0))
        self.perception_dict['action'].apply(make_module_init_normc(0.01))

    def forward(self, in_tensor_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        """Compute forward pass through the network.

        :param in_tensor_dict: Input tensor dict.
        :return: The computed output of the network.
        """
        return self.perception_net(in_tensor_dict)

And the critic model as:

"""Shows how to use the custom model composer to build a custom value network."""
from collections import OrderedDict
from typing import Dict, Union, Sequence, List

import torch
import torch.nn as nn

from maze.perception.blocks.feed_forward.dense import DenseBlock
from maze.perception.blocks.inference import InferenceBlock
from maze.perception.blocks.output.linear import LinearOutputBlock
from maze.perception.weight_init import make_module_init_normc


class CustomCartpoleCriticNet(nn.Module):
    """Simple feed forward critic network.

    :param obs_shapes: The shapes of all observations as a dict.
    :param non_lin: The nonlinear activation to be used.
    :param hidden_units: A list of units per hidden layer.
    """

    def __init__(self, obs_shapes: Dict[str, Sequence[int]], non_lin: Union[str, type(nn.Module)],
                 hidden_units: List[int]):
        super().__init__()

        # Maze relies on dictionaries to represent the inference graph
        self.perception_dict = OrderedDict()

        # build latent embedding block
        self.perception_dict['latent'] = DenseBlock(
            in_keys='observation', out_keys='latent', in_shapes=obs_shapes['observation'], hidden_units=hidden_units,
            non_lin=non_lin)

        # build value head
        self.perception_dict['value'] = LinearOutputBlock(
            in_keys='latent', out_keys='value', in_shapes=self.perception_dict['latent'].out_shapes(), output_units=1)

        # build inference block
        self.perception_net = InferenceBlock(
            in_keys='observation', out_keys='value', in_shapes=obs_shapes['observation'],
            perception_blocks=self.perception_dict)

        # apply weight init
        self.perception_net.apply(make_module_init_normc(1.0))
        self.perception_dict['value'].apply(make_module_init_normc(0.01))

    def forward(self, in_tensor_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        """Compute forward pass through the network.

        :param in_tensor_dict: Input tensor dict.
        :return: The computed output of the network.
        """
        return self.perception_net(in_tensor_dict)

An example config for the model composer could then look like this:

# @package model

# specify the custom model composer by reference
_target_: maze.perception.models.custom_model_composer.CustomModelComposer

# Specify distribution mapping
# (here we use a default distribution mapping)
distribution_mapper_config: []

policy:
  # first specify the policy type
  _target_: maze.perception.models.policies.ProbabilisticPolicyComposer
  # specify the policy network(s) we would like to use, by reference
  networks:
  - _target_: docs.source.policy_and_value_networks.code_snippets.custom_cartpole_policy_net.CustomCartpolePolicyNet
    # specify the parameters of our model
    non_lin: torch.nn.ReLU
    hidden_units: [16, 32]
  substeps_with_separate_agent_nets: []

critic:
  # first specify the critic type (here a state value critic)
  _target_: maze.perception.models.critics.StateCriticComposer
  # specify the critic network(s) we would like to use, by reference
  networks:
    - _target_: docs.source.policy_and_value_networks.code_snippets.custom_cartpole_critic_net.CustomCartpoleCriticNet
      # specify the parameters of our model
      non_lin: torch.nn.ReLU
      hidden_units: [16, 32]

Details:

  • Models are composed by the CustomModelComposer.

  • No specific action space and probability distribution overrides are specified.

  • It specifies a probabilistic policy, the policy network to use and its constructor arguments.

  • It specifies a state critic, the value network to use and its constructor arguments.

Given this config, the resulting inference graphs are shown below:

_images/perception_custom_cartpole_policy_network.png _images/perception_custom_cartpole_critic_network.png
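
For a quick sanity check the networks above can also be instantiated directly in Python. Below is a minimal sketch, assuming CartPole’s observation shape (4,), two action logits and an arbitrary batch size of 8 (normally the model composer injects these shapes for you):

import torch
import torch.nn as nn

policy_net = CustomCartpolePolicyNet(
    obs_shapes={'observation': (4,)}, action_logits_shapes={'action': (2,)},
    non_lin=nn.ReLU, hidden_units=[16, 32])
critic_net = CustomCartpoleCriticNet(
    obs_shapes={'observation': (4,)}, non_lin=nn.ReLU, hidden_units=[16, 32])

obs = {'observation': torch.randn(8, 4)}
print(policy_net(obs)['action'].shape)  # expected: torch.Size([8, 2])
print(critic_net(obs)['value'].shape)   # expected: torch.Size([8, 1])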

Example 2: Complex Networks with Perception Blocks

Now we consider the more complex example already used in the template model composer.

The observation space is defined as:

  • observation_screen : a 64 x 64 RGB image

  • observation_inventory : a 16-dimensional feature vector

The action space is defined as:

  • action_move : a categorical action with four options deciding to move [UP, DOWN, LEFT, RIGHT]

  • action_use : a 16-dimensional multi-binary action deciding which item to use from inventory

Since we will build a policy and a state critic network that should share the same low-level network structure, we can create a common base or latent space network:

"""Shows how to use the custom model composer to build a complex custom embedding networks."""
from collections import OrderedDict
from typing import Dict, Union, Sequence, List

import torch.nn as nn

from maze.perception.blocks.feed_forward.dense import DenseBlock
from maze.perception.blocks.general.concat import ConcatenationBlock
from maze.perception.blocks.joint_blocks.lstm_last_step import LSTMLastStepBlock
from maze.perception.blocks.joint_blocks.vgg_conv_dense import VGGConvolutionDenseBlock


class CustomComplexLatentNet:
    """Simple feed forward policy network.

    :param obs_shapes: The shapes of all observations as a dict.
    :param non_lin: The nonlinear activation to be used.
    :param hidden_units: A list of units per hidden layer.
    """

    def __init__(self, obs_shapes: Dict[str, Sequence[int]],
                 non_lin: Union[str, type(nn.Module)], hidden_units: List[int]):
        self.obs_shapes = obs_shapes

        # Maze relies on dictionaries to represent the inference graph
        self.perception_dict = OrderedDict()

        # build latent feature embedding block
        self.perception_dict['latent_inventory'] = DenseBlock(
            in_keys='observation_inventory', out_keys='latent_inventory', in_shapes=obs_shapes['observation_inventory'],
            hidden_units=[128], non_lin=non_lin)

        # build latent pixel embedding block
        self.perception_dict['latent_screen'] = VGGConvolutionDenseBlock(
            in_keys='observation_screen', out_keys='latent_screen', in_shapes=obs_shapes['observation_screen'],
            non_lin=non_lin, hidden_channels=[8, 16, 32], hidden_units=[32])

        # Concatenate latent features
        self.perception_dict['latent_concat'] = ConcatenationBlock(
            in_keys=['latent_inventory', 'latent_screen'], out_keys='latent_concat',
            in_shapes=self.perception_dict['latent_inventory'].out_shapes() +
            self.perception_dict['latent_screen'].out_shapes(), concat_dim=-1)

        # Add latent dense block
        self.perception_dict['latent_dense'] = DenseBlock(
            in_keys='latent_concat', out_keys='latent_dense', hidden_units=hidden_units, non_lin=non_lin,
            in_shapes=self.perception_dict['latent_concat'].out_shapes()
        )

        # Add recurrent block
        self.perception_dict['latent'] = LSTMLastStepBlock(
            in_keys='latent_dense', out_keys='latent', in_shapes=self.perception_dict['latent_dense'].out_shapes(),
            hidden_size=32, num_layers=1, bidirectional=False, non_lin=non_lin
        )

Given this base class we can now create the policy network:

"""Shows how to use the custom model composer to build a complex custom policy networks."""
from typing import Dict, Union, Sequence, List

import numpy as np
import torch
import torch.nn as nn

from docs.source.policy_and_value_networks.code_snippets.custom_complex_latent_net import \
    CustomComplexLatentNet
from maze.perception.blocks.inference import InferenceBlock
from maze.perception.blocks.output.linear import LinearOutputBlock
from maze.perception.weight_init import make_module_init_normc


class CustomComplexPolicyNet(nn.Module, CustomComplexLatentNet):
    """Simple feed forward policy network.

    :param obs_shapes: The shapes of all observations as a dict.
    :param action_logits_shapes: The shapes of all actions as a dict structure.
    :param non_lin: The nonlinear activation to be used.
    :param hidden_units: A list of units per hidden layer.
    """

    def __init__(self, obs_shapes: Dict[str, Sequence[int]], action_logits_shapes: Dict[str, Sequence[int]],
                 non_lin: Union[str, type(nn.Module)], hidden_units: List[int]):
        nn.Module.__init__(self)
        CustomComplexLatentNet.__init__(self, obs_shapes, non_lin, hidden_units)

        # build action heads
        for action_key, action_shape in action_logits_shapes.items():
            self.perception_dict[action_key] = LinearOutputBlock(
                in_keys='latent', out_keys=action_key, in_shapes=self.perception_dict['latent'].out_shapes(),
                output_units=int(np.prod(action_shape)))

        # build inference block
        in_keys = list(self.obs_shapes.keys())
        self.perception_net = InferenceBlock(
            in_keys=in_keys, out_keys=list(action_logits_shapes.keys()), perception_blocks=self.perception_dict,
            in_shapes=[self.obs_shapes[key] for key in in_keys])

        # apply weight init
        self.perception_net.apply(make_module_init_normc(1.0))
        for action_key in action_logits_shapes.keys():
            self.perception_dict[action_key].apply(make_module_init_normc(0.01))

    def forward(self, in_tensor_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        """Compute forward pass through the network.

        :param in_tensor_dict: Input tensor dict.
        :return: The computed output of the network.
        """
        return self.perception_net(in_tensor_dict)

… and the critic network:

"""Shows how to use the custom model composer to build a complex custom value networks."""
from typing import Dict, Union, Sequence, List

import torch
import torch.nn as nn

from docs.source.policy_and_value_networks.code_snippets.custom_complex_latent_net import \
    CustomComplexLatentNet
from maze.perception.blocks.inference import InferenceBlock
from maze.perception.blocks.output.linear import LinearOutputBlock
from maze.perception.weight_init import make_module_init_normc


class CustomComplexCriticNet(nn.Module, CustomComplexLatentNet):
    """Simple feed forward policy network.

    :param obs_shapes: The shapes of all observations as a dict.
    :param non_lin: The nonlinear activation to be used.
    :param hidden_units: A list of units per hidden layer.
    """

    def __init__(self, obs_shapes: Dict[str, Sequence[int]],
                 non_lin: Union[str, type(nn.Module)], hidden_units: List[int]):
        nn.Module.__init__(self)
        CustomComplexLatentNet.__init__(self, obs_shapes, non_lin, hidden_units)

        # build value head
        self.perception_dict['value'] = LinearOutputBlock(
            in_keys='latent', out_keys='value', in_shapes=self.perception_dict['latent'].out_shapes(),
            output_units=1)

        # build inference block
        in_keys = list(self.obs_shapes.keys())
        self.perception_net = InferenceBlock(
            in_keys=in_keys, out_keys='value', in_shapes=[self.obs_shapes[key] for key in in_keys],
            perception_blocks=self.perception_dict)

        # apply weight init
        self.perception_net.apply(make_module_init_normc(1.0))
        self.perception_dict['value'].apply(make_module_init_normc(0.01))

    def forward(self, in_tensor_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        """Compute forward pass through the network.

        :param in_tensor_dict: Input tensor dict.
        :return: The computed output of the network.
        """
        return self.perception_net(in_tensor_dict)

An example config for the model composer could then look like this:

# @package model

# specify the custom model composer by reference
_target_: maze.perception.models.custom_model_composer.CustomModelComposer

# Specify distribution mapping
# (here we use a default distribution mapping)
distribution_mapper_config: []

policy:
  _target_: maze.perception.models.policies.ProbabilisticPolicyComposer
  networks:
  # specify the policy network we would like to use, by reference
  - _target_: docs.source.policy_and_value_networks.code_snippets.custom_complex_policy_net.CustomComplexPolicyNet
    # specify the parameters of our model
    non_lin: torch.nn.ReLU
    hidden_units: [128]
  substeps_with_separate_agent_nets: []

critic:
  # first specify the critic type (single step in this example)
  _target_: maze.perception.models.critics.StateCriticComposer
  networks:
    # specify the critic we would like to use, by reference
    - _target_: docs.source.policy_and_value_networks.code_snippets.custom_complex_critic_net.CustomComplexCriticNet
      # specify the parameters of our model
      non_lin: torch.nn.ReLU
      hidden_units: [128]

The resulting inference graphs for a recurrent actor-critic model are shown below. Note that the models are identical except for the output layers due to the shared base model.

_images/perception_custom_complex_policy_network.png _images/perception_custom_complex_critic_network.png

Example 3: Custom Networks with (plain PyTorch) Python

Here, we take a look at how to create a custom model with plain PyTorch. As already mentioned, we still have to specify the constructor arguments obs_shapes and action_logits_shapes, but do not necessarily need to use them.

Important: Your models have to use dictionaries with torch.Tensors as values for both inputs and outputs.

For Gym’s CartPole the policy model could be defined like this:

"""Shows how to create a custom cartpole model using no maze perception components."""
from typing import Dict, Sequence

import torch
import torch.nn as nn
import torch.nn.functional as F


class CustomPlainCartpolePolicyNet(nn.Module):
    """Simple feed forward policy network.

    :param obs_shapes: The shapes of all observations as a dict.
    :param action_logits_shapes: The shapes of all actions as a dict structure.
    :param hidden_layer_0: The number of units in layer 0.
    :param hidden_layer_1: The number of units in layer 1.
    :param use_bias: Specify whether to use a bias in the linear layers.
    """
    def __init__(self, obs_shapes: Dict[str, Sequence[int]], action_logits_shapes: Dict[str, Sequence[int]],
                 hidden_layer_0: int, hidden_layer_1: int, use_bias: bool):
        nn.Module.__init__(self)

        self.observation_name = list(obs_shapes.keys())[0]
        self.action_name = list(action_logits_shapes.keys())[0]

        self.l0 = nn.Linear(4, hidden_layer_0, bias=use_bias)
        self.l1 = nn.Linear(hidden_layer_0, hidden_layer_1, bias=use_bias)
        self.l2 = nn.Linear(hidden_layer_1, 2, bias=use_bias)

    def reset_parameters(self) -> None:
        """Reset the parameters of the Model"""

        self.l0.reset_parameters()
        self.l1.reset_parameters()
        self.l2.reset_parameters()

    def forward(self, in_tensor_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        """Compute forward pass through the network.

        :param in_tensor_dict: Input tensor dict.
        :return: The computed output of the network.
        """
        # Retrieve the observation tensor from the input dict
        xx_tensor = in_tensor_dict[self.observation_name]

        # Compute the forward pass through the network
        xx_tensor = F.relu(self.l0(xx_tensor))
        xx_tensor = F.relu(self.l1(xx_tensor))
        xx_tensor = self.l2(xx_tensor)

        # Create the output dictionary with the computed model output
        out = dict({self.action_name: xx_tensor})
        return out

And the critic model as:

"""Shows how to create a custom cartpole model using no maze perception components."""
from typing import Dict, Sequence

import torch
import torch.nn as nn
import torch.nn.functional as F


class CustomPlainCartpoleCriticNet(nn.Module):
    """Simple feed forward critic network.

    :param obs_shapes: The shapes of all observations as a dict.
    :param hidden_layer_0: The number of units in layer 0.
    :param hidden_layer_1: The number of units in layer 1.
    :param use_bias: Specify whether to use a bias in the linear layers.
    """
    def __init__(self, obs_shapes: Dict[str, Sequence[int]],
                 hidden_layer_0: int, hidden_layer_1: int, use_bias: bool):
        nn.Module.__init__(self)

        self.observation_name = list(obs_shapes.keys())[0]

        self.l0 = nn.Linear(4, hidden_layer_0, bias=use_bias)
        self.l1 = nn.Linear(hidden_layer_0, hidden_layer_1, bias=use_bias)
        self.l2 = nn.Linear(hidden_layer_1, 1, bias=use_bias)

    def reset_parameters(self) -> None:
        """Reset the parameters of the Model"""

        self.l0.reset_parameters()
        self.l1.reset_parameters()
        self.l2.reset_parameters()

    def forward(self, in_tensor_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        """Compute forward pass through the network.

        :param in_tensor_dict: Input tensor dict.
        :return: The computed output of the network.
        """
        # Retrieve the observation tensor from the input dict
        xx_tensor = in_tensor_dict[self.observation_name]

        # Compute the forward pass through the network
        xx_tensor = F.relu(self.l0(xx_tensor))
        xx_tensor = F.relu(self.l1(xx_tensor))
        xx_tensor = self.l2(xx_tensor)

        # Create the output dictionary with the computed model output
        out = dict({'value': xx_tensor})
        return out

An example config for the model composer could then look like this:

# @package model

# specify the custom model composer by reference
_target_: maze.perception.models.custom_model_composer.CustomModelComposer

# Specify distribution mapping
# (here we use a default distribution mapping)
distribution_mapper_config: []

policy:
  # first specify the policy type
  _target_: maze.perception.models.policies.ProbabilisticPolicyComposer
  # specify the policy network(s) we would like to use, by reference
  networks:
  - _target_: docs.source.policy_and_value_networks.code_snippets.custom_plain_cartpole_policy_net.CustomPlainCartpolePolicyNet
    # specify the parameters of our model
    hidden_layer_0: 16
    hidden_layer_1: 32
    use_bias: True
  substeps_with_separate_agent_nets: []


critic:
  # first specify the critic type (here a state value critic)
  _target_: maze.perception.models.critics.StateCriticComposer
  # specify the critic network(s) we would like to use, by reference
  networks:
    - _target_: docs.source.policy_and_value_networks.code_snippets.custom_plain_cartpole_critic_net.CustomPlainCartpoleCriticNet
      # specify the parameters of our model
      hidden_layer_0: 16
      hidden_layer_1: 32
      use_bias: True

Note

Since we do not use the inference block in this example, no visual representation of the model can be rendered.

Example 4: Custom Shared embeddings with Perception Blocks

For this example we want to showcase the capabilities of shared embeddings, referring to Example 2 for the setup of the observations and actions. Now let’s consider the case where we would like to share only the observation_screen embedding.

Here the policy looks very similar to Example 2, using the same latent net. The only difference is that we specify an additional out_key when creating the inference block:

"""Shows how to use the custom model composer to build a complex custom policy networks with shared embedding."""
from typing import Dict, Union, Sequence, List

import numpy as np
import torch
import torch.nn as nn

from docs.source.policy_and_value_networks.code_snippets.custom_complex_latent_net import \
    CustomComplexLatentNet
from maze.perception.blocks.inference import InferenceBlock
from maze.perception.blocks.output.linear import LinearOutputBlock
from maze.perception.weight_init import make_module_init_normc


class CustomSharedComplexPolicyNet(nn.Module, CustomComplexLatentNet):
    """Simple feed forward policy network.

    :param obs_shapes: The shapes of all observations as a dict.
    :param action_logits_shapes: The shapes of all actions as a dict structure.
    :param non_lin: The nonlinear activation to be used.
    :param hidden_units: A list of units per hidden layer.
    """

    def __init__(self, obs_shapes: Dict[str, Sequence[int]], action_logits_shapes: Dict[str, Sequence[int]],
                 non_lin: Union[str, type(nn.Module)], hidden_units: List[int]):
        nn.Module.__init__(self)
        CustomComplexLatentNet.__init__(self, obs_shapes, non_lin, hidden_units)

        # build action heads
        for action_key, action_shape in action_logits_shapes.items():
            self.perception_dict[action_key] = LinearOutputBlock(
                in_keys='latent', out_keys=action_key, in_shapes=self.perception_dict['latent'].out_shapes(),
                output_units=int(np.prod(action_shape)))

        # build inference block
        in_keys = list(self.obs_shapes.keys())
        # Specifically add 'latent_screen' as an out_key to the network, so it will get returned when calling the
        #   forward method and can be reused by the critic network.
        out_keys = list(action_logits_shapes.keys()) + ['latent_screen']
        self.perception_net = InferenceBlock(
            in_keys=in_keys, out_keys=out_keys,
            perception_blocks=self.perception_dict,
            in_shapes=[self.obs_shapes[key] for key in in_keys])

        # apply weight init
        self.perception_net.apply(make_module_init_normc(1.0))
        for action_key in action_logits_shapes.keys():
            self.perception_dict[action_key].apply(make_module_init_normc(0.01))

    def forward(self, in_tensor_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        """Compute forward pass through the network.

        :param in_tensor_dict: Input tensor dict.
        :return: The computed output of the network.
        """
        return self.perception_net(in_tensor_dict)

For the critic, the specified additional out_key is already part of the obs_shapes dict and can therefore be used as such:

"""Shows how to use the custom model composer to build a complex custom value networks with shared embedding."""
from collections import OrderedDict
from typing import Dict, Union, Sequence, List

import torch
import torch.nn as nn

from maze.perception.blocks.feed_forward.dense import DenseBlock
from maze.perception.blocks.general.concat import ConcatenationBlock
from maze.perception.blocks.inference import InferenceBlock
from maze.perception.blocks.joint_blocks.lstm_last_step import LSTMLastStepBlock
from maze.perception.blocks.output.linear import LinearOutputBlock
from maze.perception.weight_init import make_module_init_normc


class CustomSharedComplexCriticNet(nn.Module):
    """Simple feed forward policy network.

    :param obs_shapes: The shapes of all observations as a dict.
    :param non_lin: The nonlinear activation to be used.
    :param hidden_units: A list of units per hidden layer.
    """

    def __init__(self, obs_shapes: Dict[str, Sequence[int]],
                 non_lin: Union[str, type(nn.Module)], hidden_units: List[int]):
        nn.Module.__init__(self)

        # Maze relies on dictionaries to represent the inference graph
        self.perception_dict = OrderedDict()

        # build latent feature embedding block
        self.perception_dict['latent_inventory'] = DenseBlock(
            in_keys='observation_inventory', out_keys='latent_inventory', in_shapes=obs_shapes['observation_inventory'],
            hidden_units=[128], non_lin=non_lin)

        # Concatenate latent features
        self.perception_dict['latent_concat'] = ConcatenationBlock(
            in_keys=['latent_inventory', 'latent_screen'], out_keys='latent_concat',
            in_shapes=self.perception_dict['latent_inventory'].out_shapes() +
                      [obs_shapes['latent_screen']], concat_dim=-1)

        # Add latent dense block
        self.perception_dict['latent_dense'] = DenseBlock(
            in_keys='latent_concat', out_keys='latent_dense', hidden_units=hidden_units, non_lin=non_lin,
            in_shapes=self.perception_dict['latent_concat'].out_shapes()
        )

        # Add recurrent block
        self.perception_dict['latent'] = LSTMLastStepBlock(
            in_keys='latent_dense', out_keys='latent', in_shapes=self.perception_dict['latent_dense'].out_shapes(),
            hidden_size=32, num_layers=1, bidirectional=False, non_lin=non_lin
        )

        # build value head
        self.perception_dict['value'] = LinearOutputBlock(
            in_keys='latent', out_keys='value', in_shapes=self.perception_dict['latent'].out_shapes(),
            output_units=1)

        # build inference block
        in_keys = list(obs_shapes.keys())
        self.perception_net = InferenceBlock(
            in_keys=in_keys, out_keys='value', in_shapes=[obs_shapes[key] for key in in_keys],
            perception_blocks=self.perception_dict)

        # apply weight init
        self.perception_net.apply(make_module_init_normc(1.0))
        self.perception_dict['value'].apply(make_module_init_normc(0.01))

    def forward(self, in_tensor_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        """Compute forward pass through the network.

        :param in_tensor_dict: Input tensor dict.
        :return: The computed output of the network.
        """
        return self.perception_net(in_tensor_dict)

The yaml file looks the same as for Example 2; we only specify different networks.

# @package model

# specify the custom model composer by reference
_target_: maze.perception.models.custom_model_composer.CustomModelComposer

# Specify distribution mapping
# (here we use a default distribution mapping)
distribution_mapper_config: []

policy:
  _target_: maze.perception.models.policies.ProbabilisticPolicyComposer
  networks:
  # specify the policy network we would like to use, by reference
  - _target_: docs.source.policy_and_value_networks.code_snippets.custom_shared_complex_policy_net.CustomSharedComplexPolicyNet
    # specify the parameters of our model
    non_lin: torch.nn.ReLU
    hidden_units: [128]
  substeps_with_separate_agent_nets: []

critic:
  # first specify the critic type (single step in this example)
  _target_: maze.perception.models.critics.StateCriticComposer
  networks:
    # specify the critic we would like to use, by reference
    - _target_: docs.source.policy_and_value_networks.code_snippets.custom_shared_complex_critic_net.CustomSharedComplexCriticNet
      # specify the parameters of our model
      non_lin: torch.nn.ReLU
      hidden_units: [128]

The resulting inference graphs for a recurrent shared actor-critic model are shown below.

_images/perception_custom_complex_shared_policy_graph.png _images/perception_custom_complex_shared_critic_graph.png

Where to Go Next

Maze Trainers

Trainers are the central components of the Maze framework when it comes to optimizing policies using different RL algorithms. To be more specific, Trainers and TrainingRunners are responsible for the following tasks:

  • manage the model types (actor networks, state-critics, state-action-critic, …),

  • manage agent environment interaction and trajectory data generation,

  • compute the loss (specific to the algorithm used),

  • update the weights in order to decrease the loss and increase the performance,

  • collect and log training statistics,

  • manage model checkpoints and the training process (e.g., early stopping).

The figure below provides an overview of the currently supported Trainers.

_images/maze_trainers_overview.png

This page gives a general (high-level) overview of the Trainers and corresponding algorithms supported by the Maze framework. For more details, especially on the implementation, please refer to the API documentation on Trainers. For more details on the training workflow and how to start training runs using the Hydra config system, please refer to the training section.

Supported Spaces

If not stated otherwise, Maze Trainers support dictionary spaces for both observations and actions.

If the environment you are working with does not yet interact via dictionary spaces simply wrap it with the built-in DictActionWrapper for actions and DictObservationWrapper for observations. In case of standard Gym environments just use the GymMazeEnv.
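
As a minimal sketch (assuming the import path below), wrapping a standard Gym environment looks like this:

from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv

# wraps CartPole such that observations and actions are exposed as dictionary spaces
env = GymMazeEnv(env="CartPole-v0")

obs = env.reset()             # obs is a dict, e.g. {"observation": ...}
print(env.observation_space)  # dictionary observation space
print(env.action_space)       # dictionary action space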

Advantage Actor-Critic (A2C)

A2C is a synchronous version of the originally proposed Asynchronous Advantage Actor-Critic (A3C). As a policy gradient method it maintains a probabilistic policy, computing action selection probabilities, as well as a critic, predicting the state value function. By setting the number of rollout steps as well as the number of parallel environments one can control the batch size used for updating the policy and value function in each iteration.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International conference on machine learning (pp. 1928-1937).

Example

$ maze-run -cn conf_train env.name=CartPole-v0 algorithm=a2c model=vector_obs critic=template_state
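
The effective batch size per update therefore equals n_rollout_steps times the number of parallel environments. A hedged variant of the command above that sets both explicitly (override names taken from the algorithm and runner configs below):

$ maze-run -cn conf_train env.name=CartPole-v0 algorithm=a2c model=vector_obs critic=template_state algorithm.n_rollout_steps=50 runner.concurrency=4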

Algorithm Parameters | A2CAlgorithmConfig

Default parameters (maze/conf/algorithm/a2c.yaml)

# @package algorithm

# number of epochs to train
n_epochs: 0

# number of updates per epoch
epoch_length: 25

# number of steps used for early stopping
patience: 15

# number of critic (value function) burn in epochs
critic_burn_in_epochs: 0

# Number of steps taken for each rollout
n_rollout_steps: 100

# learning rate
lr: 0.0005

# discounting factor
gamma: 0.98

# weight of policy loss
policy_loss_coef: 1.0

# weight of value loss
value_loss_coef: 0.5

# weight of entropy loss
entropy_coef: 0.00025

# The maximum allowed gradient norm during training
max_grad_norm: 0.0

# Either "cpu" or "cuda"
device: cpu

# bias vs. variance trade-off factor for the Generalized Advantage Estimator (GAE)
gae_lambda: 1.0

# Rollout evaluator (used for best model selection)
rollout_evaluator:
  _target_: maze.train.trainers.common.evaluators.rollout_evaluator.RolloutEvaluator

  # Run evaluation in deterministic mode (argmax-policy)
  deterministic: true

  # Number of evaluation trials
  n_episodes: 8

Runner Parameters | ACRunner

Default parameters (maze/conf/algorithm_runner/a2c-dev.yaml)

# @package runner
_target_: "maze.train.trainers.common.actor_critic.actor_critic_runners.ACDevRunner"

# model class used for policy and critic updates
trainer_class: maze.train.trainers.a2c.a2c_trainer.A2C

# Number of concurrently executed environments
concurrency: 0

# Number of concurrent evaluation envs
eval_concurrency: 0

Default parameters (maze/conf/algorithm_runner/a2c-local.yaml)

# @package runner
_target_: "maze.train.trainers.common.actor_critic.actor_critic_runners.ACLocalRunner"

# model class used for policy and critic updates
trainer_class: maze.train.trainers.a2c.a2c_trainer.A2C

# Number of concurrently executed environments
concurrency: 0

# Number of concurrent evaluation envs
eval_concurrency: 0

Proximal Policy Optimization (PPO)

The PPO algorithm belongs to the class of actor-critic style policy gradient methods. It optimizes a “surrogate” objective function adopted from trust region methods. As such, it alternates between generating trajectory data via agent rollouts from the environment and optimizing the objective function by means of a stochastic mini-batch gradient ascent.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
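
For reference, the clipped surrogate objective maximized by PPO (as introduced in the paper above) can be written as follows, where the clipping parameter \epsilon corresponds to clip_range in the config below:

L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \right) \right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}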

Example

$ maze-run -cn conf_train env.name=CartPole-v0 algorithm=ppo model=vector_obs critic=template_state

Algorithm Parameters | PPOAlgorithmConfig

Default parameters (maze/conf/algorithm/ppo.yaml)

# @package algorithm

# number of epochs to train
n_epochs: 0

# number of updates per epoch
epoch_length: 25

# number of steps used for early stopping
patience: 15

# number of critic (value function) burn in epochs
critic_burn_in_epochs: 0

# Number of steps taken for each rollout
n_rollout_steps: 100

# learning rate
lr: 0.00025

# discounting factor
gamma: 0.98

# bias vs. variance trade-off factor for the Generalized Advantage Estimator (GAE)
gae_lambda: 1.0

# weight of policy loss
policy_loss_coef: 1.0

# weight of value loss
value_loss_coef: 0.5

# weight of entropy loss
entropy_coef: 0.00025

# The maximum allowed gradient norm during training
max_grad_norm: 0.0

# Either "cpu" or "cuda"
device: cpu

# The batch size used for policy and value updates
batch_size: 100

# Number of epochs for policy and value optimization
n_optimization_epochs: 4

# Clipping parameter of surrogate loss
clip_range: 0.2

# Rollout evaluator (used for best model selection)
rollout_evaluator:
  _target_: maze.train.trainers.common.evaluators.rollout_evaluator.RolloutEvaluator

  # Run evaluation in deterministic mode (argmax-policy)
  deterministic: true

  # Number of evaluation trials
  n_episodes: 8

Runner Parameters | ACRunner

Default parameters (maze/conf/algorithm_runner/ppo-dev.yaml)

# @package runner
_target_: "maze.train.trainers.common.actor_critic.actor_critic_runners.ACDevRunner"

# model class used for policy and critic updates
trainer_class: maze.train.trainers.ppo.ppo_trainer.PPO

# Number of concurrently executed environments
concurrency: 0

# Number of concurrent evaluation envs
eval_concurrency: 0

Default parameters (maze/conf/algorithm_runner/ppo-local.yaml)

# @package runner
_target_: "maze.train.trainers.common.actor_critic.actor_critic_runners.ACLocalRunner"

# model class used for policy and critic updates
trainer_class: maze.train.trainers.ppo.ppo_trainer.PPO

# Number of concurrently executed environments
concurrency: 0

# Number of concurrent evaluation envs
eval_concurrency: 0

Importance Weighted Actor-Learner Architecture (IMPALA)

IMPALA is an RL algorithm able to scale to a very large number of machines. Multiple workers collect trajectories (sequences of states, actions and rewards), which are communicated to a learner responsible for updating the policy by utilizing stochastic mini-batch gradient descent and the proposed V-trace correction algorithm. By decoupling rollouts (interactions with the environment) and policy updates the algorithm is considered off-policy and asynchronous, making it very suitable for compute-intense environments.

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., Legg, S., & Kavukcuoglu, K. (2018). Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561.
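
For reference, the V-trace targets from the paper above are defined as follows, where the clipping threshold \bar{\rho} corresponds to vtrace_clip_rho_threshold in the config below (vtrace_clip_pg_rho_threshold clips the \rho_t used in the policy gradient term):

v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{\,t-s} \Big( \prod_{i=s}^{t-1} c_i \Big) \delta_t V,
\qquad \delta_t V = \rho_t \big( r_t + \gamma V(x_{t+1}) - V(x_t) \big),
\qquad \rho_t = \min\Big(\bar{\rho},\, \tfrac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\Big),
\quad c_i = \min\Big(\bar{c},\, \tfrac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\Big)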

Example

$ maze-run -cn conf_train env.name=CartPole-v0 algorithm=impala model=vector_obs critic=template_state

Algorithm Parameters | ImpalaAlgorithmConfig

Default parameters (maze/conf/algorithm/impala.yaml)

# @package algorithm

# Common Actor critic parameters

# number of epochs to train
n_epochs: 0

# number of updates per epoch
epoch_length: 25

# number of steps used for early stopping
patience: 15

# number of critic (value function) burn in epochs
critic_burn_in_epochs: 0

# number of rollout steps for each epoch substep
n_rollout_steps: 100

# learning rate
lr: 0.0005

# discount factor
gamma: 0.98

# coefficient of the policy used in the loss calculation
policy_loss_coef: 1.0

# coefficient of the value used in the loss calculation
value_loss_coef: 0.5

# coefficient of the entropy used in the loss calculation
entropy_coef: 0.01

# max grad norm for gradient clipping, ignored if value==0
max_grad_norm: 0

# Device of the learner (either cpu or cuda)
# Note that the actors collecting rollouts are always run on CPU.
device: "cpu"

# Impala specific parameters ----------------------

# this factor multiplied by the actors_batch_size gives the size of the queue for
# the actors' outputs collected by the learner. Therefore, all rollouts computed can be at most
# (queue_out_of_sync_factor + num_actors/actors_batch_size) out of sync with the learner policy
queue_out_of_sync_factor: 1

# number of actors to combine to one batch
actors_batch_size: 8

# number of actors to be run
num_actors: 8

# A scalar float32 tensor with the clipping threshold for importance weights
# (rho) when calculating the baseline targets (vs). rho^bar in the paper. If None, no clipping is applied.
vtrace_clip_rho_threshold: 1.0

# A scalar float32 tensor with the clipping threshold on rho_s in
# \rho_s \delta log \pi(a|x) (r + \gamma v_{s+1} - V(x_s)). If None, no clipping is
# applied.
vtrace_clip_pg_rho_threshold: 1.0

# Rollout evaluator (used for best model selection)
rollout_evaluator:
  _target_: maze.train.trainers.common.evaluators.rollout_evaluator.RolloutEvaluator

  # Run evaluation in deterministic mode (argmax-policy)
  deterministic: true

  # Number of evaluation trials
  n_episodes: 8

Runner Parameters | ImpalaRunner

Default parameters (maze/conf/algorithm_runner/impala-dev.yaml)

# @package runner
_target_: "maze.train.trainers.impala.impala_runners.ImpalaDevRunner"

# Number of concurrent evaluation envs
eval_concurrency: 0

Default parameters (maze/conf/algorithm_runner/impala-local.yaml)

# @package runner
_target_: "maze.train.trainers.impala.impala_runners.ImpalaLocalRunner"

# type of start method used for multiprocessing: 'forkserver', 'spawn', 'fork', 'dummy'
start_method: forkserver

# Number of concurrent evaluation envs
eval_concurrency: 0

Soft Actor-Critic (from Demonstrations) (SAC, SACfD)

SAC is an off-policy actor-critic algorithm based on the maximum entropy reinforcement learning framework, with the goal of maximizing expected reward while at the same time maximizing entropy. SAC exhibits high sample efficiency, is stable across different random seeds, and achieves competitive performance, especially for continuous control tasks. In contrast to A2C, PPO and IMPALA it utilizes a stochastic state-action critic.

Additionally, our implementation allows initializing the replay buffer from existing demonstrations (e.g., rollouts) instead of sampling the initial transitions with the given sampling policy (random per default). This variant is called Soft Actor-Critic from Demonstrations.

Haarnoja, T., Zhou, A., Abbeel, P., Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. arXiv preprint arXiv:1801.01290., Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., … & Levine, S. (2018). Soft Actor-Critic Algorithms and Applications. arXiv preprint arXiv:1812.05905., Christodoulou, P. (2019). Soft actor-critic for discrete action settings arXiv preprint arXiv:1910.07207.
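
For reference, the maximum entropy objective optimized by SAC (with the temperature \alpha corresponding to entropy_coef in the config below) can be written as:

J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \Big[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]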

Example SAC

$ maze-run -cn conf_train env.name=Pendulum-v0 algorithm=sac model=vector_obs critic=template_state_action

Example SACfD

$ maze-run env.name=LunarLander-v2 policy=lunar_lander_heuristics runner.n_episodes=1000
$ maze-run -cn conf_train env.name=LunarLander-v2 algorithm=sacfd model=flatten_concat critic=flatten_concat_state_action runner.initial_demonstration_trajectories.input_data=<absolute_experiment_path>/trajectory_data

Algorithm Parameters | SACAlgorithmConfig

Default parameters (maze/conf/algorithm/sac.yaml)

# @package algorithm

# Number of steps taken for each rollout
n_rollout_steps: 1

# Learning rate
lr: 0.001

# The entropy coefficient to use in the loss computation (called alpha in the original paper)
entropy_coef: 0.2

# Discounting factor
gamma: 0.99

# The maximum allowed gradient norm during training
max_grad_norm: 0.0

# Number of actors to combine to one batch
batch_size: 100

# Number of batches to update on in each iteration
num_batches_per_iter: 1

# Number of actors to be run
num_actors: 1

# Parameter weighting the soft update of the target network
tau: 0.005

# Specify in what intervals to update the target networks
target_update_interval: 1

# Either "cpu" or "cuda"
device: cpu

# Specify whether to learn the entropy coefficient or rather use the default one (entropy_coef) [called alpha in paper]
entropy_tuning: true

# Specify an optional multiplier for the target entropy. This value is multiplied with the default target entropy
# computation (this is called alpha tuning in the original paper):
#   discrete spaces: target_entropy = target_entropy_multiplier * (-0.98 * (-log(1 / |A|)))
#   continuous spaces: target_entropy = target_entropy_multiplier * (-dim(A)) (e.g., -6 for HalfCheetah-v1)
target_entropy_multiplier: 1.0

# Learning rate for entropy tuning
entropy_coef_lr: 0.0007

# The size of the replay buffer
replay_buffer_size: 1000000

# The initial buffer size, where transitions are sampled with the initial sampling policy
initial_buffer_size: 10000

# The policy used to initially fill the replay buffer
initial_sampling_policy:
  _target_: maze.core.agent.random_policy.RandomPolicy

# Number of rollouts collected from the actor in each iteration
rollouts_per_iteration: 1

# Specify whether all computed rollouts should be split into transitions before processing them
split_rollouts_into_transitions: true

# Number of epochs to train
n_epochs: 0

# Number of updates per epoch
epoch_length: 100

# Number of steps used for early stopping
patience: 50

# Rollout evaluator (used for best model selection)
rollout_evaluator:
  _target_: maze.train.trainers.common.evaluators.rollout_evaluator.RolloutEvaluator

  # Run evaluation in deterministic mode (argmax-policy)
  deterministic: true

  # Number of evaluation trials
  n_episodes: 8

Runner Parameters SAC | SACRunner

Default parameters (maze/conf/algorithm_runner/sac-dev.yaml)

# @package runner
_target_: "maze.train.trainers.sac.sac_runners.SACDevRunner"

# Number of concurrent evaluation envs
eval_concurrency: 0

# Specify the Dataset class used to load the trajectory data for training, otherwise the initial replay buffer is
#   sampled with the provided initial_sampling_policy
initial_demonstration_trajectories: ~

Default parameters (maze/conf/algorithm_runner/sac-local.yaml)

# @package runner
_target_: "maze.train.trainers.sac.sac_runners.SACDevRunner"

# Number of concurrent evaluation envs
eval_concurrency: 0

# Specify the Dataset class used to load the trajectory data for training, otherwise the initial replay buffer is
#   sampled with the provided initial_sampling_policy
initial_demonstration_trajectories: ~

Runner Parameters SACfD | SACRunner

Default parameters (maze/conf/algorithm_runner/sacfd-dev.yaml)

# @package runner
_target_: "maze.train.trainers.sac.sac_runners.SACDevRunner"

# Number of concurrent evaluation envs
eval_concurrency: 0

# Specify the dataset class used to load the trajectory data for training, otherwise the initial replay buffer is
# sampled with the provided initial_sampling_policy
initial_demonstration_trajectories:
  _target_: maze.core.trajectory_recording.datasets.in_memory_dataset.InMemoryDataset
  input_data: trajectory_data
  n_workers: 1
  deserialize_in_main_thread: false
  trajectory_processor:
    _target_: maze.core.trajectory_recording.datasets.trajectory_processor.IdentityWithNextObservationTrajectoryProcessor

Default parameters (maze/conf/algorithm_runner/sacfd-local.yaml)

# @package runner
_target_: "maze.train.trainers.sac.sac_runners.SACDevRunner"

# Number of concurrent evaluation envs
eval_concurrency: 0

# Specify the dataset class used to load the trajectory data for training, otherwise the initial replay buffer is
# sampled with the provided initial_sampling_policy
initial_demonstration_trajectories:
  _target_: maze.core.trajectory_recording.datasets.in_memory_dataset.InMemoryDataset
  input_data: trajectory_data
  n_workers: 5
  deserialize_in_main_thread: false
  trajectory_processor:
    _target_: maze.core.trajectory_recording.datasets.trajectory_processor.IdentityWithNextObservationTrajectoryProcessor

Behavioural Cloning (BC)

Behavioural cloning is a simple imitation learning algorithm that infers the behaviour of a “hidden” policy by imitating the actions it produced for given observations in a supervised learning setting. As such, it requires a set of example trajectories collected prior to training.

Hussein, A., Gaber, M. M., Elyan, E., & Jayne, C. (2017). Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2), 1-35.
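Conceptually, BC reduces to supervised learning on recorded observation-action pairs. The following PyTorch snippet is a minimal, framework-agnostic sketch of one such update (it is not the Maze BC trainer; the network and data tensors are placeholders, and a discrete action space is assumed):

import torch
from torch import nn, optim

def behavioural_cloning_step(policy_net: nn.Module,
                             optimizer: optim.Optimizer,
                             observations: torch.Tensor,
                             expert_actions: torch.Tensor) -> float:
    """One supervised update imitating expert actions (discrete action space assumed)."""
    logits = policy_net(observations)                            # predict action logits
    loss = nn.functional.cross_entropy(logits, expert_actions)   # match the expert's choices
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()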

Example: Imitation Learning and Fine-Tuning

Algorithm Parameters | BCAlgorithmConfig

Default parameters (maze/conf/algorithm/bc.yaml)

# @package algorithm

# Number of epochs to train for
n_epochs: 1000

# Optimizer used to update the policy
optimizer:
  _target_: torch.optim.Adam
  lr: 0.001

# Device to train on
device: cuda

# Batch size
batch_size: 100

# Number of iterations after which to run evaluation (in addition to evaluations at the end of
# each epoch, which are run automatically). If set to None, evaluations will run on epoch end only.
eval_every_k_iterations: 500

# Percentage of the data used for validation.
validation_percentage: 20

# Number of episodes to run during each evaluation rollout (set to 0 to evaluate using validation only)
n_eval_episodes: 8

# Entropy coefficient for policy optimization.
entropy_coef: 0.0

Runner Parameters | BCRunner

Default parameters (maze/conf/algorithm_runner/bc-dev.yaml)

# @package runner
_target_: maze.train.trainers.imitation.bc_runners.BCDevRunner

# Number of concurrent evaluation envs
eval_concurrency: 0

# Specify the Dataset class used to load the trajectory data for training
dataset:
  _target_: maze.core.trajectory_recording.datasets.in_memory_dataset.InMemoryDataset
  input_data: trajectory_data
  n_workers: 1
  deserialize_in_main_thread: false
  trajectory_processor:
    _target_: maze.core.trajectory_recording.datasets.trajectory_processor.IdentityTrajectoryProcessor

Default parameters (maze/conf/algorithm_runner/bc-local.yaml)

# @package runner
_target_: maze.train.trainers.imitation.bc_runners.BCLocalRunner

# Number of concurrent evaluation envs
eval_concurrency: 0

# Specify the Dataset class used to load the trajectory data for training
dataset:
  _target_: maze.core.trajectory_recording.datasets.in_memory_dataset.InMemoryDataset
  input_data: trajectory_data
  n_workers: 5
  deserialize_in_main_thread: false
  trajectory_processor:
    _target_: maze.core.trajectory_recording.datasets.trajectory_processor.IdentityTrajectoryProcessor

Evolutionary Strategies (ES)

Evolutionary strategies is a black-box optimization algorithm that utilizes direct policy search and can be parallelized very efficiently. Advantages of this method include invariance to action frequencies and to delayed rewards. Further, it tolerates extremely long time horizons, since it does not need to compute or approximate a temporally discounted value function. However, it is considered less sample efficient than actual RL algorithms.

Salimans, T., Ho, J., Chen, X., Sidor, S., & Sutskever, I. (2017). Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
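To make the idea of direct policy search concrete, here is a heavily simplified NumPy sketch of a single ES parameter update (mirrored sampling, rank transformation and the distributed noise table from the paper are omitted; reward_fn stands in for an episode rollout):

import numpy as np

def es_update(theta: np.ndarray, reward_fn, n_perturbations: int = 10,
              noise_stddev: float = 0.02, step_size: float = 0.01) -> np.ndarray:
    """One simplified ES iteration: perturb, evaluate, move along the reward-weighted noise."""
    noise = np.random.randn(n_perturbations, theta.size) * noise_stddev
    rewards = np.array([reward_fn(theta + eps) for eps in noise])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)      # normalize returns
    gradient = (rewards[:, None] * noise).mean(axis=0) / noise_stddev  # score-function estimate
    return theta + step_size * gradient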

Example

$ maze-run -cn conf_train env.name=CartPole-v0 algorithm=es model=vector_obs

Algorithm Parameters | ESAlgorithmConfig

Default parameters (maze/conf/algorithm/es.yaml)

# @package algorithm

# Minimum number of episode rollouts per training iteration (=epoch)
n_rollouts_per_update: 10

# Minimum number of cumulative env steps per training iteration (=epoch).
# The training iteration is only finished, once the given number of episodes
# AND the given number of steps has been reached. One of the two parameters
# can be set to 0.
n_timesteps_per_update: 0

# The number of epochs to train before termination. Pass 0 to train indefinitely
n_epochs: 0

# Limit the episode rollouts to a maximum number of steps. Set to 0 to disable this option.
max_steps: 0

# The optimizer to use to update the policy based on the sampled gradient.
optimizer:
  _target_: maze.train.trainers.es.optimizers.adam.Adam
  step_size: 0.01

# L2 weight regularization coefficient.
l2_penalty: 0.005

# The scaling factor of the random noise applied during training.
noise_stddev: 0.02

# Support for simulation logic or heuristics on top of a TorchPolicy.
policy_wrapper: ~

Runner Parameters | ESMasterRunner

Default parameters (maze/conf/algorithm_runner/es-dev.yaml)

# @package runner
_target_: "maze.train.trainers.es.ESDevRunner"

# Fixed number of evaluation runs per epoch.
n_eval_rollouts: 10

# Number of float values in the deterministically generated pseudo-random table
shared_noise_table_size: 100000000

Maze RLlib Trainer

Finally, the Maze framework also contains an RLlib trainer class. This special class of trainers wraps all necessary Maze components into RLlib-compatible objects so that Ray-RLlib can be reused to train Maze policies and critics. This enables us to train Maze models with Maze action distributions in Maze environments with almost all RLlib algorithms.

RLlib algorithms are currently not supported by RunContext.

Example and Details: Maze RLlib Runner

Where to Go Next

Maze RLlib Runner

The RLlib Runner allows you to use RLlib Trainers in combination with Maze models and environments. Ray-RLlib is one of the most popular RL frameworks (algorithm collections), both in the scientific community and in terms of practical relevance, and it comprises an extensive, well-tuned collection of RL training algorithms. To gain access to RLlib’s algorithm collection while retaining all practical Maze features, we introduce the Maze RLlib Module. It wraps Maze models (including our extensive Perception Module), Maze environments (including wrappers) as well as the customizable Maze action distributions. It further allows us to use the Maze Hydra command-line interface together with RLlib while benefiting from RLlib’s well-optimized algorithms.

This page gives an overview of the RLlib module and provides examples on how to apply it.

_images/maze_rllib_overview_simple.png

List of Features

  • Use Maze environments, models and action distributions in conjunction with RLlib algorithms.

  • Make full use of the Maze environment customization utils (wrappers, pre-processing, …).

  • Use the hydra cmd-line interface to start training runs.

  • Models trained with the Maze RLlib Runner are fully compatible with the remaining framework (except when using the default RLlib models).

Example 1: Training with Maze-RLlib and Hydra

Using RLlib algorithms with Maze and Hydra works analogously to starting training with native Maze Trainers. To train the CartPole environment with RLlib’s PPO, run:

$ maze-run -cn conf_rllib env.name=CartPole-v0 rllib/algorithm=ppo

Here, the -cn conf_rllib argument specifies that conf_rllib.yaml (available in maze-rllib) should be used as the root config file. It specifies how to use RLlib trainers within Maze. (For more on root configuration files, see Hydra overview.)

Example 2: Overwriting Training Parameters

Similar to native Maze trainers, the parametrization of RLlib training runs is also done via Hydra. The main parameters for customizing training are:

  • Environment (env configuration group), configuring which environment the training runs on. This stays the same as in maze-train.

  • Algorithm (rllib/algorithm configuration group), specifies the algorithm and its configuration (all supported algorithms).

  • Model (model configuration group), specifying how the models for policies and (optionally) critics should be assembled. This also stays the same as in maze-train.

  • Runner (rllib/runner configuration group), specifying how training is run (e.g. locally, in development mode). The runner is also the main object responsible for administering the whole training run.

To train with a different algorithm we simply have to specify the rllib/algorithm parameter:

$ maze-run -cn conf_rllib env.name=CartPole-v0 rllib/algorithm=a3c

Furthermore, we have full access to the algorithm hyperparameters defined by RLlib and can overwrite them. E.g., to change the learning rate and rollout fragment length, execute:

$ maze-run -cn conf_rllib env.name=CartPole-v0 rllib/algorithm=a3c \
  algorithm.config.lr=0.001 algorithm.config.rollout_fragment_length=50

Example 3: Training with RLlib’s Default Models

Finally, it is also possible to utilize the RLlib default model builder by specifying model=rllib. This will load the RLlib default model and parameters, which can again be customized via Hydra:

$ maze-run -cn conf_rllib env.name=CartPole-v0 model=rllib \
  model.fcnet_hiddens=[128,128] model.vf_share_layers=False

Supported Algorithms

The Bigger Picture

The figure below shows an overview of how the RLlib Module connects to the different Maze components in more detail:

_images/maze_rllib_overview.png

Good to Know

Tip

Using the argument rllib/runner=dev starts Ray in local mode, sets the number of workers to 1 by default and increases the log level (resulting in more information being printed). This is especially useful for debugging.

Tip

When watching the training progress of RLlib training runs with Tensorboard, make sure to start Tensorboard with --reload_multifile true, as both Maze and RLlib will dump an event log.

Where to Go Next

Policies, Critics and Agents

Depending on the domain and task we are working on, we rely on different trainers (algorithm classes) to appropriately address the problem at hand. This in turn implies different requirements for the respective models and the contained policy and value estimators.

The figure below provides a conceptual overview of the model classes contained in Maze and relates them to compatible algorithm classes and trainers.

_images/policy_critic_overview.png

Note that all policies and critics are compatible with Structured Environments.

Policies (Actors)

An agent holds one or more policies and acts (selects actions) according to these policies. Each policy consists of one or more policy networks. This might be required, for example, in (1) multi-agent RL settings where each agent acts according to its own distinct policy network or (2) when working with auto-regressive action spaces or multi-step environments.

In case of Policy Gradient Methods, such as the actor-critic learners A2C or PPO, we rely on a probabilistic policy defining a conditional action selection probability distribution \(\pi(a|s)\) given the current state \(s\).

In case of value-based methods, such as DQN, the Q-policy is defined via the state-action value function \(Q(s, a)\) (e.g., by selecting the action with the highest Q value: \(\mathrm{argmax}_a Q(s, a)\)).
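As a rough illustration of these two action-selection schemes (the network objects and their outputs below are assumptions for illustration, not Maze classes):

import torch

def select_action_policy_gradient(policy_net, observation: torch.Tensor) -> int:
    """Probabilistic policy: sample an action from pi(a|s)."""
    logits = policy_net(observation)
    return torch.distributions.Categorical(logits=logits).sample().item()

def select_action_value_based(q_net, observation: torch.Tensor) -> int:
    """Q-policy: greedily pick the action with the highest Q(s, a)."""
    q_values = q_net(observation)
    return torch.argmax(q_values).item()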

Value Functions (Critics)

Maze so far supports two different kinds of critics: a standard state critic, represented via a scalar value function \(V(S)\), and a state-action critic, represented either via a scalar state-action value function \(Q(S, A)\) or its vectorized equivalent \(Q(S)\), which predicts the state-action values for all actions at once.

Analogously to policies, each critic holds one or more value networks depending on the current usage scenario we are in (auto-regressive, multi-step, multi-agent, …). The table below provides an overview of the different critic styles.

State Critic \(V(S)\)

  • TorchStepStateCritic: each sub-step or actor gets its individual state critic.

  • TorchSharedStateCritic: one state critic is shared across all sub-steps or actors.

State-Action Critic \(Q(S, A)\)

  • TorchStepStateActionCritic: each sub-step or actor gets its individual state-action critic.

  • TorchSharedStateActionCritic: one state-action critic is shared across all sub-steps or actors.

Actor-Critics

To conveniently work with algorithms such as A2C, PPO, IMPALA or SAC, we provide a TorchActorCritic model unifying the different policy and critic types into one model.

Where to Go Next

Maze Environment Hierarchy

When working with an environment, it is desirable to maintain some modularity in order to be able to, for example, test different configurations of action and observation spaces, modify or record rollouts, or turn an existing flat environment into a structured one.

This page explains how Maze achieves such modularity by breaking down the Maze environment into smaller components and utilizing wrappers. It also provides a high-level overview of what you need to do to use a new or existing custom environment with Maze. (You can find guidance on that at the end of each section.)

For more references on the individual components or on how to write a new environment from scratch, see the Where to Go Next section at the end.

_images/observation_action_interfaces.png

The following sections describe the main components:

  • Core environment, which implements the main environment mechanics, and works with MazeState and MazeAction objects.

  • Observation- and ActionConversionInterfaces which turn MazeState and MazeAction objects (custom to the core environment) into actions and observations (instances of Gym-compatible spaces which can be fed into a model).

  • Maze env, which encapsulates the core environment and implements functionality common to all environments above it (e.g. manages observation and action conversion).

  • Wrappers, which add a great degree of flexibility by allowing you to encapsulate the environment and observe or modify its behavior.

  • Structured environment interface, which Maze uses to model more complex scenarios such as multi-step (auto-regressive), multi-agent or hierarchical settings.

Here, we explain what parts a Maze environment is composed of and how to apply wrappers.

Core Environment

Core environment implements the main mechanics and functionality of the environment. Its interface is compatible with the Gym environment interface with functions such as step and reset.

The step function of the core environment takes a MazeAction object and returns a MazeState object. There are no strict requirements on how these objects should look – their structure is dependent on the needs of the core environment. However, note that these objects should be serializable, so that they can be easily recorded as part of trajectory data and inspected later.

Besides the Gym interface, the core environment interface also contains a couple of hooks that make it easy to support various features of Maze, like recording trajectories of your rollouts and then replaying them in a Jupyter notebook. These methods include, e.g., get_renderer() and get_serializable_components(). You don’t have to use these if you don’t need them (e.g. just return an empty dictionary from get_serializable_components() if there are no additional components you would like to serialize with trajectory data) – but then, some features of Maze might not be available.

If you want to use a new or existing environment with Maze, the core environment is where you should start. Implement the core environment interface in your environment or encapsulate your environment in a core environment subclass.
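A schematic core-environment skeleton might look roughly like the snippet below. This is a sketch only, not the exact Maze CoreEnv base class (whose full set of abstract methods is documented in the reference), and the MazeState/MazeAction classes here are made-up placeholders:

from dataclasses import dataclass

@dataclass
class MyMazeState:            # placeholder MazeState: any serializable structure works
    inventory: list
    current_order: tuple

@dataclass
class MyMazeAction:           # placeholder MazeAction
    piece_idx: int
    cut: tuple

class MyCoreEnv:
    """Sketch of a core environment with Gym-like mechanics (not the exact Maze interface)."""

    def reset(self) -> MyMazeState:
        self.state = MyMazeState(inventory=[], current_order=(50, 30))
        return self.state

    def step(self, maze_action: MyMazeAction):
        # ... update self.state according to the MazeAction ...
        reward, done, info = 0.0, False, {}
        return self.state, reward, done, info

    def get_serializable_components(self) -> dict:
        # nothing extra to serialize with the trajectory data
        return {}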

Gym-Space Interfaces

Observation- and ActionConversionInterfaces translate MazeState and MazeAction objects (custom to the core environment) into actions and observations (instances of Gym-compatible spaces, i.e., usually a dictionary of numpy arrays which can be fed into a model) and vice versa.

It makes sense to extract this functionality into separate objects, as the format of actions and observations often needs to be swapped (to allow for different trained policies or heuristics). Treating space interfaces as separate objects encapsulates their configuration and separates it from the core environment functionality (which does not need to be changed when only, e.g., the format of the action space changes).

If you are creating a new Maze environment, you will need to implement at least one pair of interfaces – one for converting MazeStates into observations that your models can later consume, and another for converting the actions produced by the model into the MazeActions your environment works with.

For more information on the space interfaces and how to customize your environment with them, refer to Customizing Core and Maze Environments.
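A minimal sketch of such a pair of conversion objects is shown below. The method names and space layouts are assumptions for illustration (reusing the placeholder MyMazeAction from the core-environment sketch above); consult the ObservationConversionInterface and ActionConversionInterface reference for the exact signatures Maze expects:

import numpy as np
from gym import spaces

class MyObservationConversion:
    """Sketch: turn a MazeState into a dict of numpy arrays the model can consume."""

    def maze_to_space(self, maze_state) -> dict:
        return {"inventory": np.asarray(maze_state.inventory, dtype=np.float32)}

    def space(self) -> spaces.Dict:
        # sizes are arbitrary illustration values
        return spaces.Dict({"inventory": spaces.Box(low=0.0, high=100.0, shape=(10,))})


class MyActionConversion:
    """Sketch: turn the agent's dict action back into a MazeAction."""

    def space_to_maze(self, action: dict, maze_state) -> "MyMazeAction":
        return MyMazeAction(piece_idx=int(action["piece_idx"]), cut=tuple(action["cut"]))

    def space(self) -> spaces.Dict:
        return spaces.Dict({"piece_idx": spaces.Discrete(10),
                            "cut": spaces.Box(low=0.0, high=100.0, shape=(2,))})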

Maze Environment

The Maze environment encapsulates the core environment together with the space interfaces. Here, the functionality shared across all core environments is implemented – like management of the space interfaces, support for statistics and logging, and more.

The Maze environment is the smallest unit that an RL agent can interact with: it encapsulates the core functionality implemented by the core environment, the space interfaces that translate the MazeState and MazeAction so that the model can understand them, and support for other optional features of Maze that you can add (like statistics logging).

If you are creating a new environment, you will likely not need to think about the Maze environment class much, as it is mostly concerned with functionality shared across all Maze environments. You will still need to subclass it to have a distinct Maze environment class for your environment, but usually it is enough to override the initializer; there is no need to modify any of its other functionality.

Wrappers

(This section provides an overview. See also Wrappers for more details.)

Wrappers are a very flexible way to modify the behavior of an environment. As the name implies, a wrapper encapsulates the whole environment. This means that the wrapper has complete control over the behavior of the environment and can modify it as needed.

Note that another wrapper can also be applied to an already wrapped environment. In this case, each method call (such as step) will traverse the whole wrapper stack, from the outer-most wrapper down to the Maze env, with each wrapper being able to intercept and modify the call.

Maze provides superclasses for commonly used wrapper types:

  • ObservationWrapper can manipulate the observation before it reaches the agent. Observation wrappers are used, for example, for observation normalization or masking. Usually, this is the most common type of wrapper used.

  • RewardWrapper can manipulate the reward before it reaches the model.

  • ActionWrapper can manipulate the action the model produced before it is converted using ActionConversionInterface in Maze environment.

  • Wrapper is the common superclass of all the wrappers listed above. It can be subclassed directly if you need to provide some more elaborate functionality, like turning your flat environment into a structured multi-step one.

If you are creating a new Maze environment, wrappers are optional. Unless you have some very special needs, the wide variety of wrappers provided by Maze (like observation normalization wrapper or trajectory recording wrapper) should work with any Maze environment out of the box. However, you might need to implement a custom wrapper if you want to modify the behavior of your environment in some more customized manner, like turning your flat environment into a structured multi-step one.

For more information on wrappers and customization, see Wrappers.
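For instance, a simple observation wrapper could look like the sketch below. The import path and the override hook are assumptions (generic typing and initializer details are omitted); see the Wrappers documentation for the canonical way to subclass ObservationWrapper in Maze:

import numpy as np
from maze.core.wrappers.wrapper import ObservationWrapper  # path assumed; see the Wrappers reference

class ScaleInventoryWrapper(ObservationWrapper):
    """Sketch: scale one observation entry before it reaches the agent."""

    def observation(self, observation: dict) -> dict:
        observation = dict(observation)
        observation["inventory"] = np.asarray(observation["inventory"]) / 100.0
        return observation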

Structured Environments

This section provides a high-level overview of structured environments. See Control Flows with Structured Environments for more details and examples.

Maze uses the StructuredEnv concept to model more complex settings, such as multi-step (auto-regressive), multi-agent or hierarchical settings.

While such settings can indeed be quite complex, the StructuredEnv interface itself is rather simple under the hood. In summary, during each step in the environment:

  1. The agent needs to ask which policy should act next. The environment exposes this using the actor_id method.

  2. The agent then should evaluate the observation using the policy corresponding to the current actor ID, and issue the desired action using the step function in a usual Gym-like manner.

Note that the Actor ID, which identifies the currently active policy, is composed of (1) the sub-step key and (2) the index of the current actor in scope of this sub-step (as in some settings, there might be multiple actors per sub-step key).

Maze uses the StructuredEnv interface in all settings by default, and other Maze components like TorchPolicy support it (and make it convenient to work with) out of the box.
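Put together, an agent-side interaction loop with a structured environment can be sketched as follows. This is purely illustrative: the policy objects, the compute_action call and the tuple-unpacking of the actor ID are simplified assumptions (Maze's TorchPolicy and the rollout runners handle this for you):

# Illustrative interaction loop over a StructuredEnv (no statistics, wrappers or error handling).
policies = {"selection": selection_policy, "cutting": cutting_policy}    # placeholder policy objects

observation = env.reset()
done = False
while not done:
    sub_step_key, agent_idx = env.actor_id()                      # 1. which policy/agent acts next?
    action = policies[sub_step_key].compute_action(observation)   # 2. evaluate with that policy
    observation, reward, done, info = env.step(action)            #    and step the environment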

Where to Go Next

After understanding how Maze environment hierarchy works, you might want to:

Also, note that the classes described above (like Core environment and Maze environment) themselves implement a set of interfaces that facilitate some of Maze’s functions, like EventEnvMixin interfacing the Event system or RenderEnvMixin facilitating rendering. You will likely not need to work with these directly, and hence they are not described here in detail. However, if you need to know more about them, head to the reference documentation.

Maze Event System

The Maze event system is a convenient way to monitor notable events happening in the environment as the agent interacts with it. This page explains the motivation behind it, gives an overview of how it is used in Maze (pointing to other relevant sections), and briefly explains how it works under the hood.

Motivation

Standard metrics such as reward and step count provide a high-level overview of how an agent is doing in an environment, but don’t provide more detailed information on the actual behavior.

On the other hand, visualizing or otherwise inspecting the full environment state gives very detailed information on the behavior in some particular time frame, but is difficult to compare and aggregate across episodes or training runs.

In Maze, the event system fills the space in between – providing more information about environment mechanics than just watching the reward, while making it easy to log, understand, and compare this information across episodes and rollouts.

What is an event?

As the name suggests, an event is something notable that happens during the agent-environment interaction loop. For example, when the inventory is full in the example 2D cutting env, a piece will be discarded and the corresponding event will be fired:

self.inventory_events.piece_discarded(piece=(50, 30))

As can be seen above, events carry a descriptive name, encapsulate the details (like the dimensions of the discarded piece), and are part of a topic (like “inventory events”).

While there are some general events that apply to all environments (like reward-related events or KPIs), in general, environments declare their own topics and events as they see fit.

To understand how to declare and integrate custom events into your environment, see the adding events and KPIs tutorial.
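To give a flavor of what such a declaration looks like: an event topic is essentially an interface class whose (empty) methods define the event names and payloads. The sketch below is schematic; the event names are hypothetical and the decorators used for statistics aggregation are covered in the tutorial referenced above:

from abc import ABC

class InventoryEvents(ABC):
    """Schematic event topic for inventory-related events (illustrative only)."""

    def piece_discarded(self, piece: tuple):
        """Fired when a piece has to be discarded, e.g. because the inventory is full."""

    def piece_replenished(self):
        """Fired when a new raw piece is added to the inventory."""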

How are events used in Maze?

There are three main things events are used for throughout Maze:

  1. Reward aggregation. Reward aggregators declare which events they desire to observe, and then calculate the reward on top of them. This makes it possible to keep reward aggregators decoupled from the environment, which means they can be configured and changed easily. (Check out reward aggregation and the tutorial for more information.)

  2. Statistics and KPIs. Event declarations can be annotated using decorators which specify how they should be aggregated on different levels (i.e., step, episode, and epoch). The statistics system then aggregates the events into statistics during trainings and rollouts, and displays these statistics in Tensorboard and console. This makes it possible to understand the agent’s behavior much better than if only high-level statistics such as reward and step count were observed. (For more information, see how statistics are logged and calculated.)

  3. Raw event data logging. Events and their details are logged in CSV format, which makes them easy to access and analyze later via any custom tools. (While the CSV format should be suitable for most data-analysis tools out there, it is also possible to extend the logging functionality via custom writers if needed.)

For any other custom needs, it is possible to plug into the event system directly through the Pubsub or EventEnvMixin interfaces.

PubSub: Dispatching and Observing Events

Each core environment maintains its own Pubsub message broker (stands for publisher-subscriber). Using the broker, it is possible to register event topics (created as described in the tutorial), register subscribers (which will then collect the dispatched events), and dispatch events themselves.

# In a core env (which maintains a pubsub broker)

# Create a topic
inventory_events = self.pubsub.create_event_topic(InventoryEvents)

# Register a subscriber (can be a reward aggregator
# or any other class implementing the Subscriber interface)
self.pubsub.register_subscriber(my_subscriber)

# Dispatch an event
inventory_events.piece_discarded(piece=(50, 10))

Note that the subscriber must implement the Subscriber interface and declare which events it wants to be notified about. This pattern is used by RewardAggregators, and the tutorial on adding reward aggregation is also a good place to start for any other custom needs.

EventEnvMixin Interface: Querying Events

The core environment also records all events dispatched during the last time step and makes them accessible using the EventEnvMixin interface. If you only need to query events dispatched during the last time step, this option might be more lightweight than registering with the Pubsub message broker.

env.get_step_events()

To see the interface in action, you might want to check out the LogStatsWrapper, which uses this interface to query events for aggregation.
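For example, a quick way to inspect what happened during the last step (purely illustrative; the structure of the returned event records is documented in the EventEnvMixin reference):

# Iterate over all events dispatched during the last env step and print them.
for event_record in env.get_step_events():
    print(event_record)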

Where to Go Next

After understanding the main concepts of the event system, you might want to:

Configuration with Hydra

Here, we explain the configuration scheme of the Maze framework: how to configure your environment and other components using YAML files, run your experiments via the CLI, and customize the runs via CLI overrides.

The Maze framework utilizes the Hydra configuration framework. These pages aim to give you a quick overview of how Maze uses Hydra and what its capabilities are, so that you can get up to speed quickly without prior Hydra knowledge:

Hydra: Overview

The motivation behind using Hydra is primarily to:

  • Keep separate components (e.g., environment, policy) in individual YAML files which are easier to understand

  • Run multiple experiments with different components (like using two different environment configurations, or training with PPO vs. A2C) without duplicating the whole config file

  • Make components/values different from the defaults immediately visible (with, e.g., maze-run runner=sequential)

Below, the core concepts of Hydra as Maze uses it are described:

  • Introduction explains the core concepts of assembling configuration with Hydra

  • Config Root & Defaults explains how the root config file works and how default components are specified

  • Overrides show how you can easily customize the config parameters without duplicating the config file, and have Hydra assemble the config and log it for you

  • Output Directory shows how Hydra creates separate directories for your runs automatically. It is a bit separated from the previous concepts but still important for running your jobs.

  • Runner concept section explains how the Hydra config is handled by Maze to launch various kinds of jobs (like rollout or train runs) with different configurations

Introduction

Hydra is a configuration framework that, in short, allows you to:

  1. Break down your configuration into multiple smaller YAML files (representing individual components)

  2. Launch your job through CLI providing overrides for individual components or values and have Hydra assemble the config for you

Ad (1): For illustrative purposes, this is an example of what your Hydra config structure can look like:

_images/hydra_structure.png

Ad (2): With the structure above, you could then launch your jobs with specified components (again, this is only for illustrative purposes):

$ maze-run runner=parallel

Or, you can even override any individual value anywhere in the config like this:

$ maze-run runner=parallel runner.n_processes=10

You can also review the basic example and composition example at Hydra docs.

Configuration Root, Groups and Defaults

The starting place for a Hydra config is the root configuration file. It lists (1) the individual configuration groups that you would like to use along with their defaults, and (2) any other configuration that is universal. A simple root config file might look like this (all of these examples are snippets taken from maze config, shortened for brevity):

# These are the individual config components with their defaults
defaults:
  - runner: parallel
  - env: cartpole
  - wrappers: default
    optional: true
  # ...

# Other values that are universally applicable (still can be changed with overrides)
log_base_dir: outputs

# ...

The snippet runner: parallel tells Hydra to look for a file runner/parallel.yaml and transplant its contents under the runner: key. (If optional: true is specified, Hydra does not raise an error when such a config file cannot be found.)

Hence, if the runner/parallel.yaml file looks like this:

n_processes: 5
n_episodes: 50
# ...

the final assembled config would look like this:

runner:
  n_processes: 5
  n_episodes: 50
  # ...
env:
  # ...

Overrides

When running your job through a command line, you can customize individual bits of your configuration via command-line arguments.

As briefly demonstrated above, you can override individual defaults in the config file. For example, when running a Maze rollout, the default runner is parallel, but you could specify the sequential runner instead:

$ maze-run runner=sequential

Besides specifying the config components, you can also override individual values in the config:

$ maze-run runner=sequential runner.max_episode_steps=1000

There is also more advanced syntax for adding/removing config keys and other patterns – for this, you can consult Hydra docs regarding basic overrides and extended override syntax.

Output Directory

Hydra also by default handles the output directory for each job you run.

By default, outputs is used as the base output directory and a new subdirectory is created inside it for each run. Here, Hydra also logs the configuration for the current job in the .hydra subdirectory, so that you can always get back to it.

You can override the hydra output directory as follows:

$ maze-run hydra.run.dir=my_dir

More on the output directory setting can be found in Hydra docs: output/working directory and customizing working directory pattern.

Maze Runner Concept

In Maze, the maze-run command (that you have seen above already) is the single central utility for launching all sorts of jobs, like training or rollouts.

Under the hood, when you launch such a job, the following happens:

  1. Maze checks the runner part of the Hydra configuration that was passed through the command and instantiates a runner object (a subclass of Runner) from it.

    (The runner component of the configuration always specifies the Runner class to be instantiated, along with any other arguments it needs at initialization.)

  2. Maze then calls the run method on the instantiated runner and passes it the whole config, as obtained from Hydra.

This enables the maze-run command to keep a lot of variability without much coupling of the individual functionalities. For example, rollouts are run through subclasses of RolloutRunner and trainings through subclasses of TrainingRunner.

You are also free to create your own subclasses for rollouts, trainings or any completely different use cases.

Where to Go Next

After understanding the basics of how Maze uses Hydra, you might want to:

Hydra: Your Own Configuration Files

We encourage you to add custom config or experiment files in your own project. These will make it easy for you to launch different versions of your environments and agents with different parameters.

To be able to use custom configuration files, you first need to create your config module and add it to the Hydra search path. Then, you can add custom config components (Step 2a), experiment configs (Step 2b), or your own root config (Step 2c).

Step 1: Custom Config Module in Hydra Search Path

For this, first, create a module where your config will reside (let’s say your_project.conf) and place an __init__.py file in there.

Second, to make your project available to Hydra, make sure it is either installed using pip install -e ., or added to your Python path manually, using for example export PYTHONPATH="$PYTHONPATH:$PWD/" when in the project directory.

As a final step, you need to tell Hydra to look for your config files. This can be done either by specifying your config directory along with each maze-run command using the -cd flag:

$ maze-run -cd your_project/conf ...

Or, to avoid specifying this with every command, you can add your config module to the Hydra search path by creating the following Hydra plugin (substitute your_project.conf with your actual config module path):

# Inside your project in: hydra_plugins/add_custom_config_to_search_path.py

"""Hydra plugin to register additional config packages in the search path."""
from hydra.core.config_search_path import ConfigSearchPath
from hydra.plugins.search_path_plugin import SearchPathPlugin


class AddCustomConfigToSearchPathPlugin(SearchPathPlugin):
    """Hydra plugin to register additional config packages in the search path."""

    def manipulate_search_path(self, search_path: ConfigSearchPath) -> None:
        """Add custom config to search path (part of SearchPathPlugin interface)."""
        search_path.append("project", "pkg://your_project.conf")

Now, you can add additional root config files as well as individual components into your config package.

For more information on search path customization, check Config Search Path and SearchPathPlugins in Hydra docs.

Step 2a: Custom Config Components

If what you are after is only providing custom options for some of the components Maze configuration uses (e.g., a custom environment configuration), then it suffices to add these into the relevant directory in your config module and you are good to go.

For example, if you want a custom configuration for the Gym Car Racing env, you might do:

# In your_project/conf/env/car_racing.yaml:

# @package env
_target_: maze.core.wrappers.maze_gym_env_wrapper.GymMazeEnv
env: CarRacing-v0

Then, you might call maze-run with the env=car_racing override and it will load the configuration from your file.

Depending on your needs, you can mix-and-match your custom configurations with configurations provided by Maze (e.g. use a custom env configuration while using a wrappers or models configuration provided by Maze).

Step 2b: Experiment Config

Another convenient way to assemble and maintain different configurations of your experiments is Hydra’s built-in Experiment Configuration.

It allows you to customize experiments by only specifying the changes to the default (master) configuration. You can for example change the trainer to PPO, the learning rate to 0.0001 and additionally activate the vector_obs wrapper stack by providing the following experiment configuration:

conf/experiment/cartpole_ppo_wrappers.yaml
# @package _global_

# defaults to override
defaults:
  - override /algorithm: ppo
  - override /wrappers: vector_obs

# overrides
algorithm:
  lr: 0.0001

To start the experiment from this experiment config file, run:

$ maze-run -cn conf_train +experiment=cartpole_ppo_wrappers

For more details on experimenting we refer to the experiment configuration docs.

Step 2c: Custom Root Config

If you require even more customization, you will likely need to define your own root config. This is usually useful for custom projects, as it allows you to create custom defaults for the individual config groups.

We suggest you start by copying one of the root configs already available in Maze (like conf_rollout or conf_train, depending on what you need), and then adding more required keys or removing those that are not needed. However, it is also not difficult to start from scratch if you know what you need.

Once you create your root config file (let’s say your_project/conf/my_own_conf.yaml), it suffices to point Hydra to it via the argument -cn my_own_conf, so your command would look like this (for illustrative purposes):

$ maze-run -cn my_own_conf

Then, all the defaults and available components that Hydra will look for depend on what you specified in your new root config.

For an overview of root config, check out config root & defaults.

Step 3: Custom Runners (Optional)

If you want to launch different types of jobs than what Maze provides by default, like implementing a custom training algorithm or deployment scenario that you would like to run via the CLI, you will benefit from creating a custom Runner.

You can subclass the desired class in the runner hierarchy (like the TrainingRunner if you are implementing a new training scheme, or the general Runner for some more general concept). Then, just create a custom config file for the runner config group that configures your new class, and you are good to go.
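As a schematic example, a custom runner could look like the sketch below. It follows the run-method convention described in the Maze Runner Concept section above; the exact base class to subclass and any additional setup hooks should be taken from the Runner reference documentation:

# your_project/deployment_runner.py  (illustrative sketch, not a Maze-provided class)
from omegaconf import DictConfig
from maze.core.env.maze_env import MazeEnv
from maze.core.utils.factory import Factory

class DeploymentRunner:
    """Schematic runner: builds the env from the Hydra config and runs a fixed number of steps."""

    def __init__(self, n_steps: int = 100):
        self.n_steps = n_steps

    def run(self, cfg: DictConfig) -> None:
        env = Factory(MazeEnv).instantiate(cfg.env)   # instantiate the configured environment
        env.reset()
        for _ in range(self.n_steps):
            env.step(env.action_space.sample())       # placeholder for your deployment logic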

Where to Go Next

After understanding how custom configuration is done, you might want to:

Hydra: Advanced Concepts

This page features a collection of more advanced Maze and Hydra features that are used across the framework and power the configuration under the hood:

  • Factory, which Maze uses to turn configuration into instantiated objects, while allowing passing in already instantiated objects as well.

  • Interpolation, which allows you to reference parts of configuration from elsewhere.

  • Specializations, which allow you to load additional configuration files based on particular combinations of selected defaults.

Maze Factory

The Factory wraps around Hydra’s own instantiation functionality and adds features like type hinting and checking, collections, configuration structure checks, and the ability to take in already instantiated objects.

Using the factory, classes can accept ConfigType (or collections thereof, CollectionOfConfigType), which stands for either an already instantiated object, or a dictionary with configuration, which the factory will then use to build the instance.

A configuration dictionary consists of the _target_ attribute, along with any arguments that the instantiated class takes, e.g. (here denoted in YAML, as you will find it in many places across the framework):

_target_: maze.core.wrappers.maze_gym_env_wrapper.GymMazeEnv
env: CarRacing-v0

The factory then takes in the dictionary configuration (loaded from YAML using Hydra, or from anywhere else) and builds the object for you, checking that it is indeed of the expected type:

from maze.core.env.maze_env import MazeEnv
from maze.core.utils.factory import Factory

env = Factory(MazeEnv).instantiate({
    "_target_": "maze.core.wrappers.maze_gym_env_wrapper.GymMazeEnv",
    "env": "CarRacing-v0"
})

You can also pass in additional keyword arguments that the factory will then pass on to the constructor together with anything from the configuration dictionary:

from maze.core.env.maze_env import MazeEnv
from maze.core.utils.factory import Factory

env = Factory(MazeEnv).instantiate({
    "_target_": "maze.core.wrappers.maze_gym_env_wrapper.GymMazeEnv",
}, env="CarRacing-v0")

If you pass in an already instantiated object instead of a configuration dictionary, the instantiate method will only check that it is of the expected type and return it back. This allows components in Maze to be easily configurable both from YAML/dictionaries and by passing in already instantiated objects.
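For completeness, a minimal illustration of this behavior, building on the GymMazeEnv example above: the factory only verifies the type and hands the object back unchanged.

from maze.core.env.maze_env import MazeEnv
from maze.core.utils.factory import Factory
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv

already_instantiated = GymMazeEnv(env="CarRacing-v0")

# No construction happens here: the factory only checks the type and returns the object.
env = Factory(MazeEnv).instantiate(already_instantiated)
assert env is already_instantiated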

Interpolation

Hydra is based on OmegaConf and supports interpolation.

Interpolation allows us to reference and reuse a value defined elsewhere in the configuration, without repeating it. For example:

original:
  value: 1  # We want to reference this value elsewhere
some:
  other:
    structure: ${original.value}  # Reference
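If you want to see interpolation resolve outside of a maze-run, you can experiment with OmegaConf directly (Hydra builds on it); this tiny example mirrors the YAML above:

from omegaconf import OmegaConf

cfg = OmegaConf.create("""
original:
  value: 1
some:
  other:
    structure: ${original.value}
""")

print(cfg.some.other.structure)  # -> 1, resolved from original.value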

A (somewhat limited) form of interpolation is used also in specializations described below.

Specializations

Specializations are parts of config that depend on multiple components. For example, your wrapper configuration might depend on both the environment chosen (e.g., gym_pixel_env or gym_feature_env) and your model (e.g., default or rnn) – if using an RNN, you might want to include ObservationStackWrapper, but its configuration also depends on the environment used.

Then, specializations come to the rescue. In your root config file, you can include a specialization like this (for illustrative purposes):

defaults:
  - env: gym_pixel_env
  - model: default
  - env_model: ${defaults.1.env}-${defaults.2.model}
    optional: true

Then, when you run this configuration with env=gym_pixel_env and model=rnn, Hydra will look into the env_model directory for configuration named gym_pixel_env-rnn.yaml. This allows you to capture the dependencies between these two components easily without having to specify more overrides.

Specializations are explained in more detail in the Hydra docs.

Where to Go Next

After understanding advanced Hydra configuration, you might want to:

  • Hydra: Overview explains the core concepts of configuration assembly, overrides and Maze runners controlling the CLI jobs.

  • Hydra: Your Own Configuration Files shows how to get started with your own configuration in your custom projects.

  • Hydra: Advanced Concepts explains other components and Hydra features that power Maze configuration under the hood, such as the Maze factory, Hydra interpolations and specializations.

Environment Rendering

In cases where reviewing the statistics and event logs provided by the event system does not provide enough insight, rendering the environment state at a particular time step is helpful.

Maze supports two rendering modes:

  1. Rendering online during the rollout. This is possible simply by using the sequential rollout runner for a rollout and setting the rendering flag to true with the following overrides: runner=sequential runner.render=true.

  2. Rendering offline, in a Jupyter notebook, from trajectory data collected earlier. For environments which provide a Maze-compatible renderer, rollouts can be rendered and browsed retroactively. Review collecting and visualizing rollouts for more details. (Unfortunately, this mode is not yet supported for ordinary Gym envs – unless a custom Maze-compatible renderer is provided.)

Structured Environments

The basic reinforcement learning formulation assumes a single entity in an environment, enacting one action suggested by a single policy per step to fulfill exactly one task. We refer to this as a flat environment. A classic example of this is the cartpole balancing problem, in which a single entity attempts to fulfill the single task of balancing a cartpole. However, some problems incentivize or even require challenging these assumptions:

  1. Single entity: Plenty of real-world scenarios motivate taking more than one acting entity into account. E.g.: optimizing delivery with a fleet of vehicles involves emergent effects and interdependencies between individual vehicles, such as the availability and suitability of orders for any given vehicle depending on the proximity and activity of other vehicles.
    Treating them in isolation from each other is inefficient and detrimental to the learning process. While it is possible to have a single agent represent and coordinate all vehicles simultaneously, it can be more efficient to train multiple agents to facilitate collaborative behaviour for one vehicle each.
  2. One action suggested by a single policy: Some use cases, such as cutting raw material according to customer specifications with as little waste as possible, necessarily involve a well-defined sequence of actions. Stock-cutting involves (a) the selection of a piece of suitable size and (b) cutting it in an appropriate manner. We know that (a) is always followed by (b) and that the former is a necessary precondition for the latter.
    We can incorporate this information in our RL control loop to facilitate a faster learning process by enforcing the execution of policies in a certain order: e.g. select, then cut. This entails that while the action is still chosen by the policy, the policy itself is chosen by the environment. The sequential nature of such actions often lends itself to action masking to increase learning efficiency 1.
  3. Exactly one task: Occasionally, the problem we want to solve cannot be neatly formulated as a single task, but consists of a hierarchy of tasks. This is exemplified by pick-and-place robots. They solve a complex task, which is reflected by the associated hierarchy of goals: the overall goal requires (a) reaching the target object, (b) grasping the target object, (c) moving the target object to the target location and (d) placing the target object safely in the target location. Solving this task cannot be reduced to a single goal.

Maze addresses these problems by introducing StructuredEnv. We cover some of its applications and their broader context, including literature and examples, in a series of articles:

Flat Environments

Note

Recommended reads prior to this article:

All instantiable environments in Maze are subclasses of StructuredEnv. Structured environments are discussed in Control Flows with Structured Environments, which we recommend reading prior to this article. Flat environments in our terminology are those utilizing a single agent and a single policy, i.e. a single actor, and conducting one action per step. Within Maze, flat environments are a special case of structured environments.

An exemplary implementation of a flat environment for the stock cutting problem can be found here.

Control Flow

Let’s revisit a classic depiction of an RL control flow first:

_images/control_flow_simple.png

Simplified control flow within a flat scenario. The agent selects an action, the environment updates its state and computes the reward. There is no need to distinguish between different policies or agents since we only have one of each. actor_id() should always return the same value.

A more general framework however needs to be able to integrate multiple agents and policies into its control flow. Maze does this by implementing actors, which are abstractions introduced in the RL literature to represent one policy applied on or used by one agent. The figure above collapses the concepts of policy, agent and actor into a single entity for the sake of simplicity. The actual control flow for a flat environment in Maze is closer to this:

_images/control_flow_complex.png

More accurate control flow for a flat environment in Maze, showing how the actor mechanism integrates agent and policy. Dashed lines denote the exchange of information on demand, as opposed to passing it to or returning it from step().

A flat environment hence always utilizes the same actor, i.e. the same policy for the same agent. Due to the lack of other actors there is no need for the environment to ever update its active actor ID. The concept of actors is crucial to the flexibility of Maze, since it allows scaling up the number of agents, policies or both. This enables the application of RL to a wider range of use cases and makes it possible to exploit properties of the respective domains more efficiently.

Where to Go Next

Multi-Stepping

We define multi-stepping as the execution of more than one action (or sub-step) in a single step. This is motivated by problem settings in which a certain sequence of actions is known a priori. In such cases, incorporating this knowledge can significantly increase learning efficiency. The stock cutting problem poses an example: it is known, independently from the specifics of the environment’s state, that fulfilling a single customer order for a piece involves (a) picking a piece at least as big as the ordered item and (b) cutting it to the correct size.

While it is not trivial to decide which items to pick for which orders and how to cut them, the sequence of piece selection before cutting is constant - there is no advantage to letting our agent figure it out by itself. Maze permits incorporating this sort of domain knowledge by making it possible to select and execute more than one action in a single step. This is done by utilizing the actor mechanism to execute multiple policies in a fixed sequence.

In the case of the stock cutting problem two policies could be considered: “select” and “cut”. The piece selection action might be provided to the environment at the beginning of each step, after which the cutting policy - conditioned on the current state with the already selected piece - can be queried to produce an appropriate cutting action.

An implementation of a multi-stepping environment for the stock cutting problem can be found here.

Control Flow

In general, the control flow for multi-stepping environments involves at least two policies and one agent. It is easily possible, but not necessary, to include multiple agents in a multi-step scenario. The following image depicts a multi-step setup with one agent and an arbitrary number of sub-steps/policies.

_images/control_flow2.png

Control flow within a multi-stepping scenario assuming a single agent. The environment keeps track of the active step and adjusts its policy key (via actor_id()) accordingly. Dashed lines denote the exchange of information on demand, as opposed to passing it to or returning it from step().

When comparing this to the control flow depicted in the article on flat environments you’ll notice that here we consider several policies and therefore several actors - more specifically, in a setup with n sub-steps (or actions per step) we have at least n actors. Consequently the environment has to update its active actor ID, which is not necessary in flat environments.

Relation to Hierarchical RL

Hierarchical RL (HRL) describes a hierarchical formulation of reinforcement learning problems: tasks are broken down into (sequences of) subtasks, which are learned in a modular manner. Multi-stepping shares this property with HRL, since it also decomposes a task into a series of subtasks. Furthermore, the multi-stepping control flow bears strong similarity to the one for hierarchical RL - in fact, multi-stepping could be seen as a special kind of hierarchical RL with a fixed task sequence and a single level of hierarchy.

Relation to Auto-Regressive Action Distributions

Multi-stepping is closely related to auto-regressive action distributions (ARAD) as used in DeepMind’s Grandmaster level in StarCraft II using multi-agent reinforcement learning. Both ARADs and multi-stepping are motivated by a lack of temporal coherency in the sequence of selected actions: if there is some necessary, recurring order of actions, it should be identified as quickly as possible.

ARADs still execute one action per step, but condition it on the previous state and action instead of the state alone. This allows them to be more sensitive towards such recurring patterns of actions. Multi-stepping allows incorporating domain knowledge about the correct order of actions or tasks without having to rely on policies learning it auto-regressively, but depends on the environment to incorporate it. ARAD policies do not presuppose (and cannot make use of) any such prior knowledge.

ARADs are not explicitly implemented in Maze, but can be approximated. This can be done by including prior actions in the observations supplied to the agent, which should condition the used policy on those actions. If relevant domain knowledge is available, we recommend implementing multi-stepping instead.

Where to Go Next

Multi-Agent RL

Multi-agent reinforcement learning (MARL) describes a setup in which several collaborating or competing agents each suggest actions for at least one of an environment’s acting entities 1. This introduces the additional complexity of emergent effects between those agents. Some problems require, or at least benefit from, deviating from a single-agent formulation, such as the vehicle routing problem, (video) games like Starcraft, traffic coordination, power systems and smart grids, and many others.

Maze supports multi-agent learning via structured environments. In order to make a StructuredEnv compatible with such a setup, it needs to keep track of the activities of each agent internally. How this is done, and the sequence in which agents enact their actions, is entirely up to the environment. As is customary for a structured environment, it is required to provide the ID of the active actor via actor_id() (see here for more information on the distinction between actor and agent). There are no further prerequisites to using multiple agents with an environment.

Control Flow

It is easily possible, but not necessary, to include multiple policies in a multi-agent scenario. The control flow with multiple agents and a single policy can be summarized like this:

_images/control_flow1.png

Control flow within a multi-agent scenario assuming a single policy. Dashed lines denote the exchange of information on demand, as opposed to passing it to or returning it from step().

When comparing this to the control flow depicted in the article on flat environments you’ll notice that here we consider several agents and therefore several actors - more specifically, in a setup with n agents we have at least n actors. Consequently the environment has to update its active actor ID, which is not necessary in flat environments.

Where to Go Next


1

We use “acting entity” in this context in the sense of something that acts or is manipulated to act in order to solve a given problem. E.g.: In the case of the vehicle routing problem it is neither desired nor should it be possible for an agent to change the layout of the map or how orders are generated, since these factors constitute a part of the problem setting. Instead, the goal is to learn a vehicle routing behaviour that is optimal w.r.t. processing the generated orders - the vehicles are acting entities. In MARL settings it is customary to map one agent to one manipulable entity, hence the term “agent” itself is often used to refer to the manipulable entity it represents.

Hierarchical RL

Reinforcement learning is prone to scaling and generalization issues. With large action spaces, it takes a lot of time and resources for agents to learn the desired behaviour successfully. Hierarchical reinforcement learning (HRL) attempts to resolve this by decomposing tasks into a hierarchy of subtasks. This addresses the curse of dimensionality by mitigating or avoiding the exponential growth of the action space.

Beyond reducing the size of the action space, HRL also provides an opportunity for easier generalization. Through its modularization of tasks, learned policies for super-tasks may be reused even if the underlying sub-tasks change. This enables transfer learning between different problems in the same domain.

Note that the action space can also be reduced with action masking as used in e.g. StarCraft II: A New Challenge for Reinforcement Learning, which indicates the invalidity of certain actions in the observations provided to the agent. HRL and action masking can be used in combination. The latter doesn’t address the issue of generalization and transferability though. Whenever possible and sensible, we recommend using both.

Motivating Example

Consider a pick and place robot. It is supposed to move to an object, pick it up, move it to a different location and place it there. It consists of different segments connected via joints that enable free movement in three dimensions and a gripper able to grasp and hold the target object. The gripper may resemble a pair of tongs or be more complex, e.g. be built to resemble a human hand.

A naive approach would present all possible actions, i.e. rotating the arm segments and moving the gripper, in a single action space. If the robot’s arm segments can move in \(n\) and its gripper in \(m\) different ways, a flat action space would consist of \(n * m\) different actions.

The task at hand can be intuitively represented as a hierarchy however: The top-level task is composed of the task sequence of “move”, “grasp”, “move”, “place”. This corresponds to a top-level policy choosing one of three sub-policies enacting primitive actions, i.e. arm or gripper movements. This enables the reusability of individual (sub-)policies for other tasks in the same domain.

To reduce the dimensionality of the action space an HRL approach could omit gripper actions in the “move” action space and arm actions in the “grasp” and “place” action spaces. The total number of actions then amounts to the sum of the numbers of possible actions for all individual policies: \(3 + m + n + m\).
Depending on the active policy the agent only has to consider up to \(\max(3, n, m)\) actions at once, which is significantly less than the \(n * m\) actions in the flat case for any realistic values of \(n\) and \(m\). The complexity of problems with large action spaces can be reduced considerably by a hierarchical decomposition like this.
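For illustration, assume hypothetical values of \(n = 12\) arm movements and \(m = 8\) gripper movements: the flat formulation yields \(12 \cdot 8 = 96\) joint actions, whereas the hierarchical decomposition needs only \(3 + 12 + 8 + 8 = 31\) actions in total and confronts the agent with at most \(\max(3, 12, 8) = 12\) actions at any single decision point.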

Control Flow

_images/control_flow.png

Control flow within an HRL scenario assuming a single agent. The task hierarchy is built implicitly in step(). Dashed lines denote the exchange of information on demand, as opposed to passing it to or returning it from step().

The control flow for HRL scenarios doesn’t obviously reflect the hierarchical aspect. This is because the definition and execution of the task hierarchy happens implicitly in step(): the environment determines which task is currently active and which task should be active at the end of the current step. This allows for an arbitrarily flexible and complex task dependency graph. The possibility to implement a different ObservationConversionInterface and ActionConversionInterface for each policy makes it possible to tailor actions and observations to the respective task. This control flow bears strong similarity to the one for multi-stepping - in fact, multi-stepping could be seen as a special kind of hierarchical RL with a fixed task sequence and a single level of hierarchy.

Where to Go Next

Beyond Flat Environments with Actors

StructuredEnv bakes the concept of actors into its control flow.

An actor describes a specific policy that is applied on - or used by - a specific agent. Actors are uniquely identified by the agent’s ID and the policy’s key. From a more abstract perspective an actor describes which task should be done (the policy) for whom (the agent, respectively the acting entities it represents). In the case of the vehicle routing problem an agent might correspond to a vehicle and a policy might correspond to a task like “pick an order” or “drive to point X”. A StructuredEnv has exactly one active actor at any time. There can be an arbitrary number of actors, and they can be created and destroyed dynamically by the environment by specifying their ID or marking them as done, respectively. Their lifecycles are thus flexible: they don’t have to be available throughout the entirety of the environment’s lifecycle.

_images/struct_env_control_flow.png

Overview of control flow with structured environments. Note that the line denoting the communication of the active actor ID is dashed because it is not returned by step(), but instead queried via actor_id().

Decoupling actions from steps

The actor mechanism decouples actions from steps, thereby allowing environments to query actions for their actors on demand, not just after a step has been completed. The cardinality between involved actors and steps is therefore up to the environment - one actor can be active throughout multiple steps, one step can utilize several actors, both can occur in the same environment, or neither (i.e. exactly one actor per step).

The discussed stock cutting problem for example might have policies with the keys “selection” or “cutting”, both of which take place in a single step; the pick and place problem might use policies with the keys “reach”, “grasp”, “move” or “place”, all of which last one to several steps.

Support of multiple agents and policies

A multi-agent scenario can be realized by defining the corresponding actor IDs, taking the desired number of agents into account. Several actors can use the same policy, which infers the recommended actions for the respective agents. Note that it is only reasonable to add a new policy if the underlying process is distinct enough from the activities described by the available policies.

In the case of the vehicle routing problem, using separate policies for the activities of “fetch item” and “deliver item” is likely not warranted: even though they describe different phases of the environment lifecycle, they describe virtually the same activity. While Maze provides default policies, you are encouraged to write a customized policy that fits your use case better - see Policies, Critics and Agents for more information.

Selection of active actor

The environment determines the active actor based on its internal state. The current actor evaluates the observation provided by the environment and selects an appropriate action, i.e. every action is associated with a specific actor. This action updates the environment’s state, after which the environment reevaluates which actor should be active.
Since it is left to the environment to decide when which actor should be active, it is possible to chain, combine and nest policies - and therefore tasks - in an arbitrary manner.

Every StructuredEnv is required to implement actor_id(), which returns the ID of the currently active actor. An environment with a single actor, e.g. a flat Gym environment, may return a single-actor signature such as (0, 0). At any time there has to be exactly one active actor ID.

Policy-specific space conversion

Since different policies may benefit from or even require different preprocessing of their actions and/or observations (especially, but not exclusively, for action masking), Maze requires the specification of a corresponding ActionConversionInterface and ObservationConversionInterface class for each policy. This makes it possible to tailor actions and/or observations to the mode of operation of the relevant policy.


The actor concept and the mechanisms supporting it are thus capable of

  • representing an arbitrary number of agents;

  • identifying which policy should be applied for which agent via the provision of actor_id();

  • representing an arbitrary number of actors with flexible lifecycles that may differ from their environment’s;

  • supporting an arbitrary nesting of policies (and - in further abstraction - tasks);

  • selecting actions via the policy fitting the currently active actor;

  • preprocessing actions and observations w.r.t. the currently used actor/policy.

This makes it possible to bypass the three restrictions laid out at the outset.

Where to Go Next

Read about some of the patterns and capabilities possible with structured environments:

The underlying communication pathways are identical for all instances of StructuredEnv. Multi-stepping, multi-agent, hierarchical or other setups are all particular manifestations of structured environments and their actor mechanism. They are orthogonal to each other and can be used in any combination.


1

Action masking is used for many problems with large action spaces which would otherwise be intractable, e.g. StarCraft II: A New Challenge for Reinforcement Learning.

High-level API: RunContext

This page describes the RunContext, a high-level API for training, rollout and evaluation of Maze agents in plain Python.

Motivation

Maze utilizes Hydra to facilitate a powerful configuration mechanism, boosting developers’ flexibility in creating reinforcement learning projects. Hydra, however, is geared towards command line usage, so these benefits are not accessible when working with individual Maze components (like Trainer, Wrapper, MazeEnv, …) and composing them manually in Python.

E.g., it is not possible to generate components directly from the provided configuration modules. This would be quite useful, as it allows loading pre-configured (sets of) components. This can be exemplified by the pixel_obs wrapper configuration module, which defines several wrappers useful for the preprocessing, normalization and logging of pixel space observations. Via the CLI this can be loaded trivially via ... wrappers=pixel_obs ... - yet there is no obvious way to leverage Maze’s Hydra-based configuration system from within a Python script. This also affects other features, like the inability to instantiate objects from a YAML-/dict-based configuration object (which can be very convenient with increasing experiment or application complexity).

This motivates the introduction of RunContext, a high-level API for training, rollout and evaluation. When working with Maze from within a Python script (as opposed to via the CLI with maze-run) we highly recommend starting with RunContext: it requires very little configuration overhead to get things rolling, yet offers a lot of flexibility if you require additional configuration. While there might be cases where this is not sufficient, we expect them to be rare.

Comparison with the CLI (maze-run)

We designed RunContext to be largely congruent with the CLI, i.e. maze-run. It utilizes Hydra internally and offers the same base functionality, but differs in a couple of ways - RunContext

  • … is a recent addition and still lacks support for a number of capabilities: rolling out a policy is not fully supported yet, and neither are the RLlib integration and some of the more advanced Hydra features like multi-runs. These issues (particularly rollout support) are on our todo list, however, and will be implemented shortly.

  • … accepts (most) components specified as instantiated complex Python objects, configuration dictionaries or configuration module names. In contrast, the CLI accepts the specification of components as configuration module names or as primitive values. As of now this entails, however, that once instantiated Python objects are passed, the customary experiment configuration cannot be logged anymore due to a lack of knowledge about the corresponding configuration dictionary. This issue is on our roadmap.

  • … offers a few additional options for convenience’s sake, such as output suppression via silent=True or setting the working directory via run_dir='...'.

Usage

This section aims to convey the principal ideas and features of RunContext. For further explanation and a detailed discussion of the exposed interface as well as auxiliary components and utilities see here.

Initialization

As mentioned previously, the RunContext API is largely congruent with the maze-run CLI. Consequently, initialization can be done in a similar fashion. Below is one example of a particular training run configuration using the CLI, followed by the corresponding RunContext initializations with configuration module names and with a mix of configuration module names and complex Python objects.

maze-run -cn conf_train env.name=CartPole-v0 algorithm=a2c model=vector_obs critic=template_state
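A roughly equivalent initialization from within Python might look as follows. This is a sketch: the import path and the exact set of exposed keyword arguments are assumptions and may differ between Maze versions.

from maze.api.run_context import RunContext  # import path assumed
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv

# RunContext initialized purely with configuration module names,
# mirroring the CLI call above
rc = RunContext(
    algorithm="a2c",
    overrides={"env.name": "CartPole-v0"},
    model="vector_obs",
    critic="template_state",
)

# RunContext initialized with a mix of configuration module names
# and complex Python objects (here: an environment factory)
rc = RunContext(
    env=lambda: GymMazeEnv("CartPole-v0"),
    algorithm="a2c",
    model="vector_obs",
    critic="template_state",
)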

Environments cannot be passed in instantiated form, but instead as callable environment factories:

rc = RunContext(env=lambda: GymMazeEnv('CartPole-v0'))

As with the CLI, any attribute in the configuration hierarchy can be overridden, not just the explicitly exposed top-level attributes like env or algorithm. This can be achieved using the overrides dictionary as seen above for "env.name". It is also possible to pass complex values:

policy_composer_config = {
    '_target_': 'maze.perception.models.policies.ProbabilisticPolicyComposer',
    'networks': [{
        '_target_': 'maze.perception.models.built_in.flatten_concat.FlattenConcatPolicyNet',
        'non_lin': 'torch.nn.Tanh',
        'hidden_units': [256, 256]
    }],
    "substeps_with_separate_agent_nets": [],
    "agent_counts_dict": {0: 1}
}
rc = RunContext(overrides={"model.policy": policy_composer_config})

Note that by design, configuration module name resolution is not triggered for attributes in overrides. This is necessary for some of the explicitly exposed arguments, however. We strongly recommend passing an argument explicitly if it is explicitly exposed - otherwise a correct assembly of the underlying configuration structure cannot be guaranteed. E.g., if you want to pass an instantiated algorithm configuration like

alg_config = A2CAlgorithmConfig(
    n_epochs=1,
    epoch_length=25,
    deterministic_eval=False,
    eval_repeats=2,
    patience=15,
    critic_burn_in_epochs=0,
    n_rollout_steps=100,
    lr=0.0005,
    gamma=0.98,
    gae_lambda=1.0,
    policy_loss_coef=1.0,
    value_loss_coef=0.5,
    entropy_coef=0.00025,
    max_grad_norm=0.0,
    device='cpu'
)

then pass it via the explicitly exposed algorithm argument:

rc = RunContext(algorithm=alg_config)

Further examples of how to use Maze with both the CLI and the high-level API can be found here.

Training

Training is straightforward with an initialized RunContext:

rc.train()
# Or with a specified number of epochs:
rc.train(n_epochs=10)

train() passes on all accepted arguments to the instantiated trainer. At the very least the number of epochs to train can be specified; everything else depends on the arguments that the corresponding trainer exposes. See here for further information on trainers in Maze. If no arguments are specified, Maze uses the default values included in the loaded configuration.

Rollout

Rollouts are not supported directly yet, but can be implemented manually:

env_factory = lambda: GymMazeEnv('CartPole-v0')
rc = RunContext(env=env_factory)
rc.train()

# Run trained policy.
env = env_factory()
obs = env.reset()
for i in range(10):
    action = rc.compute_action(obs)
    obs, rewards, dones, info = env.step(action)

Evaluation

To evaluate a trained policy, use the integrated evaluation functionality.

rc = RunContext(env=lambda: GymMazeEnv('CartPole-v0'))
rc.train()
rc.evaluate()

Customizing Core and Maze Envs

Whenever simulations reach a certain level of complexity or (ideally) already exist, but have been developed for purposes other than the RL scenario, the Gym-style environment interfaces might not be sufficient anymore to meet all technical requirements (e.g., the state is too complex to be represented as a simple Gym-style numpy array). In the case of existing simulations, the RL use case probably was not taken into account at all and we have to deal with simulation specific interfaces and objects.

To cope with such situations Maze introduces a few additional concepts which are summarized in the figure below. Before we continue with some practical examples emphasizing why this structure is useful for environment customization and convenient experimentation, we first describe the concepts and components in a bit more detail. You can also find these components in the reference documentation.

_images/observation_action_interfaces1.png

Observation- and ActionConversionInterfaces:

Maze introduces MazeStates and MazeActions, extending observations and actions (represented as numerical arrays) to simulation specific generic objects. This grants more freedom in choosing appropriate environment-specific representations and separates the data model from the numerical representation, which in turn greatly simplifies the development and readability of environment and engineered baseline agent implementations (a small sketch follows the list below).

  • Action: the Gym-style, machine readable action.

  • MazeAction: the simulation specific representation of the action (e.g., an arbitrary Python object).

  • ActionConversionInterface: maps agent actions to environment (simulation) specific MazeActions and defines the respective Gym action space.

  • Observation: the Gym-style, machine readable observation (e.g., a numpy array).

  • MazeState: the simulation specific representation of the observation (e.g. an arbitrary Python object).

  • ObservationConversionInterface: maps simulation MazeStates to Gym-style observations and defines the respective Gym observation space.
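To make the mapping idea concrete, the sketch below shows how an observation conversion for a hypothetical inventory simulation could look. It deliberately does not reproduce the exact Maze interface - class and method names here are illustrative assumptions - and only covers the MazeState-to-observation direction together with the declaration of the corresponding Gym space.

"""Illustrative sketch of a MazeState-to-observation mapping (names and signatures are assumptions)."""
from typing import Dict

import gym
import numpy as np


class InventoryState:
    """Hypothetical, simulation specific MazeState: an arbitrary Python object."""

    def __init__(self, piece_sizes):
        self.piece_sizes = piece_sizes  # e.g. a list of (width, height) tuples


class InventoryObservationConversion:
    """Maps InventoryState objects to Gym-style dictionary observations."""

    def __init__(self, max_pieces_in_inventory: int):
        self.max_pieces = max_pieces_in_inventory

    def space(self) -> gym.spaces.Dict:
        """Declares the Gym observation space produced by this conversion."""
        return gym.spaces.Dict({
            "inventory": gym.spaces.Box(low=0.0, high=np.inf, shape=(self.max_pieces, 2), dtype=np.float32)
        })

    def maze_to_space(self, maze_state: InventoryState) -> Dict[str, np.ndarray]:
        """Converts the simulation specific state into a machine readable observation."""
        obs = np.zeros((self.max_pieces, 2), dtype=np.float32)
        for i, (width, height) in enumerate(maze_state.piece_sizes[:self.max_pieces]):
            obs[i] = (width, height)
        return {"inventory": obs}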

Core and Maze Environments:

The same distinction is carried out for environments.

  • CoreEnv: this is the central environment, which could also be seen as the simulation, forming the basis for actual, RL-trainable environments. CoreEnvs accept MazeAction objects as input and yield MazeState objects as response.

  • CoreEnv Config: configuration parameters for the CoreEnvironment (the simulation).

  • MazeEnv: wraps the CoreEnv as a Gym-style environment in a reusable form, by utilizing the interfaces (mappings) from the MazeState to the observation space and from the MazeAction to the action space.

List of Features

Introducing the concepts outlined above allows the following:

  • Implement and maintain observations and actions as arbitrarily complex, simulation specific objects (MazeStates and MazeActions). In many cases sticking to Gym spaces gets quite cumbersome and makes the development process unnecessarily complex.

  • Easily experiment with different observation and action spaces simply by switching the Observation- and ActionConversionInterface.

  • Train agents based on existing 3rd party simulations (environments) by implementing the Observation- and ActionConversionInterfaces (of course this also requires a Python API to be available).

  • Easy configuration of the CoreEnv (simulation).

Example: Core- and MazeEnv Configuration

The config snippet below shows an example environment configuration for the built-in cutting-2d environment.

# @package env
_target_: maze_envs.logistics.cutting_2d.env.maze_env.Cutting2DEnvironment

# parametrizes the core environment (simulation)
core_env:
  max_pieces_in_inventory: 1000
  raw_piece_size: [100, 100]
  demand_generator:
    _target_: mixed_periodic
    n_raw_pieces: 3
    m_demanded_pieces: 10
    rotate: True
  # defines how rewards are computed
  reward_aggregator:
    _target_: maze_envs.logistics.cutting_2d.reward.default.DefaultRewardAggregator

# defines the conversion of actions to executions
action_conversion:
  - _target_: maze_envs.logistics.cutting_2d.space_interfaces.action_conversion.dict.ActionConversion
    max_pieces_in_inventory: 1000

# defines the conversion of states to observations
observation_conversion:
  - _target_: maze_envs.logistics.cutting_2d.space_interfaces.observation_conversion.dict.ObservationConversion
    max_pieces_in_inventory: 1000
    raw_piece_size: [100, 100]

The config defines:

  • which MazeEnv to use,

  • the parametrization of the CoreEnv including reward computation,

  • how MazeStates are converted to observations and

  • how actions are converted to MazeActions.

All components together compose a concrete RL problem instance as a trainable environment. In particular, whenever you would like to experiment with specific aspects of your RL problem (e.g. tweak the observation space) you only have to exchange the respective part of your environment configuration.

Note

As showing concrete implementations of a CoreEnv or the Observation- and ActionConversionInterfaces is beyond the scope of this page we refer to the Maze - step by step tutorial for details.

Where to Go Next

Customizing / Shaping Rewards

In a reinforcement learning problem the overall goal is defined via an appropriate reward signal. In particular, reward is attributed to certain, problem specific key events and the current environment state. During the training process the agent then has to discover a policy (behaviour) that maximizes the cumulative future reward over time. In case of a meaningful reward signal such a policy will be able to successfully address the decision problem at hand.

_images/reward_aggregation.png

From a technical perspective, reward customization in Maze is based on the environment state in combination with the general event system (which also serves other purposes), and is implemented via RewardAggregators. In summary, after each step, the reward aggregator gets access to the environment state, along with all the events the environment dispatched during the step (e.g., a new item was replenished to inventory), and can then calculate arbitrary rewards based on these. This means it is possible to modify and shape the reward signal based on different events and their characteristics by plugging in different reward aggregators without further modifying the environment.

Below we show how to get started with reward customization by configuring the CoreEnv and by implementing a custom reward.

List of Features

Maze event-based reward computation allows the following:

  • Easy experimentation with different reward signals.

  • Implementation of custom rewards without the need to modify the original environment (simulation).

  • Computing simple rewards based on environment state, or using the full flexibility of observing all events from the last step.

  • Combining multiple different objectives into one multi-objective reward signal.

  • Computation of multiple rewards in the same env, each based on a different set of components (multi agent).

Configuring the CoreEnv

The following config snippet shows how to specify reward computation for a CoreEnv via the field reward_aggregator. You only have to set the reference path of the RewardAggregator and reward computation will be carried out accordingly in all experiments based on this config.

For further details on the remaining entries of this config you can read up on how to customize Core- and MazeEnvs.

# @package env
_target_: maze_envs.logistics.cutting_2d.env.maze_env.Cutting2DEnvironment

# parametrizes the core environment (simulation)
core_env:
  max_pieces_in_inventory: 1000
  raw_piece_size: [100, 100]
  demand_generator:
    _target_: mixed_periodic
    n_raw_pieces: 3
    m_demanded_pieces: 10
    rotate: True
  # defines how rewards are computed
  reward_aggregator:
    _target_: maze_envs.logistics.cutting_2d.reward.default.DefaultRewardAggregator

# defines the conversion of actions to executions
action_conversion:
  - _target_: maze_envs.logistics.cutting_2d.space_interfaces.action_conversion.dict.ActionConversion
    max_pieces_in_inventory: 1000

# defines the conversion of states to observations
observation_conversion:
  - _target_: maze_envs.logistics.cutting_2d.space_interfaces.observation_conversion.dict.ObservationConversion
    max_pieces_in_inventory: 1000
    raw_piece_size: [100, 100]

Implementing a Custom Reward

This section contains a concrete implementation of a reward aggregator for the built-in cutting environment (which bases its reward solely on the events from the last step, as that is more suitable than checking current environment state).

In summary, the reward aggregator first declares which events it is interested in (the get_interfaces method). At the end of the step, after all the events have been accumulated, the reward aggregator is asked to calculate the reward (the summarize_reward method). This is the core of the reward computation – you can see how the events are queried and the reward assembled based on their values.

"""Assigns negative reward for relying on raw pieces for delivering an order."""
from typing import List, Optional

from maze.core.annotations import override
from maze.core.env.maze_state import MazeStateType
from maze.core.events.pubsub import Subscriber
from maze_envs.logistics.cutting_2d.env.events import InventoryEvents
from maze.core.env.reward import RewardAggregatorInterface


class RawPieceUsageRewardAggregator(RewardAggregatorInterface):
    """
    Reward scheme for the 2D cutting env penalizing raw piece usage.

    :param reward_scale: Reward scaling factor.
    """
    def __init__(self, reward_scale: float):
        super().__init__()
        self.reward_scale = reward_scale

    @override(Subscriber)
    def get_interfaces(self) -> List:
        """
        Specification of the event interfaces this subscriber wants to receive events from.
        Every subscriber must implement this configuration method.

        :return: A list of interface classes.
        """
        return [InventoryEvents]

    @override(RewardAggregatorInterface)
    def summarize_reward(self, maze_state: Optional[MazeStateType] = None) -> float:
        """
        Summarize reward based on the orders and pieces to cut, and return it as a scalar.

        :param maze_state: Not used by this reward aggregator.
        :return: the summarized scalar reward.
        """

        # iterate replenishment events and assign reward accordingly
        reward = 0.0
        for _ in self.query_events(InventoryEvents.piece_replenished):
            reward -= 1.0

        # rescale reward with provided factor
        reward *= self.reward_scale

        return reward

When adding a new reward aggregator you (1) have to implement the RewardAggregatorInterface and (2) make sure that it is accessible within your Python path.

Besides that you only have to provide the reference path of the reward_aggregator to use:

reward_aggregator:
    _target_: my_project.custom_reward.RawPieceUsageRewardAggregator
    reward_scale: 0.1

Where to Go Next

  • Additional options for customizing environments can be found under the entry “Environment Customization” in the sidebar.

  • For further technical details we highly recommend reading up on the Maze event system.

  • To see another application of the event system you can read up on the Maze logging system.

Environment Wrappers

Environment wrappers are an elegant way to modify and customize environments for RL training and experimentation. As the name already suggests, they wrap an existing environment and allow modifying different parts of the agent-environment interaction loop, including observations, actions, the reward or any other internals of the environment itself.

_images/environment_wrappers.png

To gain access to the functionality of Maze environment wrappers you simply have to add a wrapper stack to your Hydra configuration. To get started just copy one of our Hydra config snippets below or use it directly within Python.

Note

Wrappers have already been introduced in OpenAI’s Gym and elegantly expose methods and attributes of all nested envs. However, wrapping destroys the class hierarchy, so querying the base classes is not straightforward. Maze environment wrappers fix the behaviour of isinstance() for arbitrarily nested wrappers.

List of Features

Maze environment wrappers allow the following:

Example 1: Customizing Environments with Wrappers

To make use of Maze environment wrappers just add a config snippet as listed below.

# @package wrappers
RandomResetWrapper:
  min_skip_steps: 0
  max_skip_steps: 100
maze.core.wrappers.time_limit_wrapper.TimeLimitWrapper:
  max_episode_steps: 1000

Details:

  • It applies the specified wrappers in the defined order from top to bottom.

  • Adds a RandomResetWrapper randomly skipping the first 0 to 100 frames

  • Adds a TimeLimitWrapper restricting the maximum temporal horizon of the environment

Example 2: Using Custom Wrappers

In case the built-in wrappers provided with Maze are not sufficient for your use case you can of course implement and add additional custom wrappers.

# @package wrappers
my_project.wrappers.custom_wrapper.CustomObserverWrapper:
  parameter_1: 0.5
  parameter_2: 1000

When adding a new environment wrapper you (1) have to implement the Wrapper interface and (2) make sure that it is accessible within your Python path. Besides that you only have to provide the reference path of the wrapper to use, plus any parameters the wrapper initializer needs.
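The following is a minimal sketch of what such a custom wrapper could look like, mirroring the config snippet above. The post-processing performed in step() (scaling the reward) is purely hypothetical, and depending on your Maze version further interface methods may need to be implemented; it illustrates the general pattern rather than a definitive implementation.

"""Hypothetical my_project/wrappers/custom_wrapper.py"""
from maze.core.env.maze_env import MazeEnv
from maze.core.wrappers.wrapper import Wrapper


class CustomObserverWrapper(Wrapper[MazeEnv]):
    """Scales every step reward by a constant factor (illustrative behaviour only)."""

    def __init__(self, env: MazeEnv, parameter_1: float, parameter_2: int):
        Wrapper.__init__(self, env)
        self.reward_scale = parameter_1
        self.max_steps_observed = parameter_2

    def step(self, action):
        """Forward the action to the wrapped env and post-process the reward."""
        obs, reward, done, info = self.env.step(action)  # self.env assumed to hold the wrapped env
        return obs, reward * self.reward_scale, done, info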

Example 3: Plain Python Configuration

If you are not working with the Maze command line interface but still want to use wrappers directly within Python you can start with the code snippet below.

"""Contains an example showing how to add wrappers."""
from maze.core.wrappers.random_reset_wrapper import RandomResetWrapper
from maze.core.wrappers.time_limit_wrapper import TimeLimitWrapper

# instantiate the environment
env = ...

# apply wrappers
env = RandomResetWrapper.wrap(env, min_skip_steps=0, max_skip_steps=100)
env = TimeLimitWrapper.wrap(env, max_episode_steps=1000)

Built-in Wrappers

Maze already comes with built-in environment wrappers. You can find a list and further details on the functionality of the respective wrappers in the reference documentation.

For the following wrappers we also provide a more extensive documentation:

Where to Go Next

Observation Pre-Processing

Sometimes it is required to pre-process or modify observations before passing them through our policy or value networks. This might for example be the conversion of a three channel RGB image to a single channel grayscale image, or the one-hot encoding of a categorical observation such as the current month into a feature vector of length 12. Maze supports observation pre-processing via the PreProcessingWrapper.

_images/pre_processing_overview.png

This means that to gain access to observation pre-processing and to the features listed below you simply have to add the PreProcessingWrapper to the wrapper stack in your Hydra configuration.

To get started you can also just copy one of our Hydra config snippets or use it directly from Python.

List of Features

Maze observation pre-processing supports:

  • Gym dictionary observation spaces

  • Individual pre-processors for all sub-observations of these dictionary spaces

  • Cascaded pre-processing pipelines for a single observation (e.g. first convert an image to grayscale before inserting an additional dimension from the left for CNN processing)

  • The option to keep both, the original as well as the pre-processed observation

  • Implicit update of affected observation spaces according to the pre-processor functionality

Example 1: Observation Specific Pre-Processors

This example adds pre-processing to two observations (rgb_image and categorical_feature) contained in a dictionary observation space.

# @package wrappers
maze.core.wrappers.observation_preprocessing.preprocessing_wrapper.PreProcessingWrapper:
    pre_processor_mapping:
        - observation: rgb_image
          _target_: maze.preprocessors.Rgb2GrayPreProcessor
          keep_original: true
          config:
            num_flatten_dims: 2
        - observation: categorical_feature
          _target_: maze.preprocessors.OneHotPreProcessor
          keep_original: false
          config: {}

Details:

  • Adds a gray scale converted version of observation rgb_image to the observation space but also keeps the original observation.

  • Replaces the observation categorical_feature with a one-hot encoded version and drops the original observation.

  • Observation space after pre-processing: {rgb_image, rgb_image-rgb2gray, categorical_feature-one_hot}

Example 2: Cascaded Pre-Processing

This example shows how to apply multiple pre-processors in sequence to a single observation.

# @package wrappers
maze.core.wrappers.observation_preprocessing.preprocessing_wrapper.PreProcessingWrapper:
  pre_processor_mapping:
    - observation: rgb_image
      _target_: maze.preprocessors.Rgb2GrayPreProcessor
      keep_original: false
      config:
        rgb_dim: -1
    - observation: rgb_image-rgb2gray
      _target_: maze.preprocessors.ResizeImgPreProcessor
      keep_original: false
      config:
        target_size: [96, 96]
        transpose: false
    - observation: rgb_image-rgb2gray-resize_img
      _target_: maze.preprocessors.UnSqueezePreProcessor
      keep_original: false
      config:
        dim: -3

Details:

  • Converts observation rgb_image into a gray scale image, then scales this gray scale image to a size of 96 x 96 pixels and finally inserts an additional dimension at index -3 to prepare the observation for CNN processing.

  • None of the intermediate observations is kept as we are only interested in the final result here.

  • Observation space after pre-processing: {rgb_image-rgb2gray-resize_img}.

Example 3: Using Custom Pre-Processors

In case the built-in pre-processors provided with Maze are not sufficient for your use case you can of course implement and add additional custom processors.

# @package wrappers
maze.core.wrappers.observation_preprocessing.preprocessing_wrapper.PreProcessingWrapper:
    pre_processor_mapping:
        - observation: rgb_image
          _target_: my_project.preprocessors.custom.CustomPreProcessor
          keep_original: true
          config:
            num_flatten_dims: 2

When adding a new pre-processor you (1) have to implement the PreProcessor interface and (2) make sure that it is accessible within your Python path. Besides that you only have to provide the reference path of the pre-processor to use.

Observations will be tagged with the filename of your custom preprocessor (e.g. rgb_image -> rgb_image-custom).

Example 4: Plain Python Configuration

If you are not working with the Maze command line interface but still want to reuse observation pre-processing directly within Python you can start with the code snippet below.

"""Contains an example showing how to use observation pre-processing directly from python."""
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
from maze.core.wrappers.observation_preprocessing.preprocessing_wrapper import PreProcessingWrapper

# this is the pre-processor config as a python dict
config = {
    "pre_processor_mapping": [
        {"observation": "observation",
         "_target_": "maze.preprocessors.Rgb2GrayPreProcessor",
         "keep_original": False,
         "config": {"rgb_dim": -1}},
    ]
}

# instantiate a maze environment
env = GymMazeEnv("CarRacing-v0")

# wrap the environment for observation pre-processing
env = PreProcessingWrapper.wrap(env, pre_processor_mapping=config["pre_processor_mapping"])

# after this step the training env yields pre-processed observations
pre_processed_obs = env.reset()

Built-in Pre-Processors

Maze already provides built-in pre-processors. You can find a list and further details on the functionality of the respective processors in the reference documentation.

Where to Go Next

Observation Normalization

For efficient RL training it is crucial that the inputs (e.g. observations) to our models (e.g. policy and value networks) follow a certain distribution and exhibit values within a certain range. To ensure this precondition, Maze provides general and customizable functionality for normalizing the observations returned by the respective environments via the ObservationNormalizationWrapper.

_images/observation_normalization_overview.png

This means that to gain access to observation normalization and to the features listed below you simply have to add the ObservationNormalizationWrapper to the wrapper stack in your Hydra configuration.

To get started you can also just copy one of our Hydra config snippets below or use it directly within Python.

List of Features

So far observation normalization supports:

As not all of the features listed above might be required right from the beginning you can find Hydra config examples with increasing complexity below.

Example 1: Normalization with Estimated Statistics

This example applies default observation normalization to all observations with statistics automatically estimated via sampling.

# @package wrappers
maze.core.wrappers.observation_normalization.observation_normalization_wrapper.ObservationNormalizationWrapper:
    # default behaviour
    default_strategy: maze.normalization_strategies.MeanZeroStdOneObservationNormalizationStrategy
    default_strategy_config:
        clip_range: [~, ~]
        axis: ~
    default_statistics: ~
    statistics_dump: statistics.pkl
    sampling_policy:
        _target_: maze.core.agent.random_policy.RandomPolicy
    exclude: ~
    manual_config: ~

Details:

  • Applies mean zero - standard deviation one normalization to all observations contained in the dictionary observation space

  • Does not clip observations after normalization

  • Does not compute individual normalization statistics along different axes of the observation vector / matrix

  • Dumps the normalization statistics to the file “statistics.pkl”

  • Estimates the required statistics from observations collected via random sampling

  • Does not exclude any observations from normalization

  • Does not provide any normalization statistics manually

Example 2: Normalization with Manual Statistics

In this example, we manually specify both the default normalization strategy and its corresponding default statistics. This is useful, e.g., when working with RGB pixel observation spaces. However, it requires knowing the normalization statistics beforehand.

# @package wrappers
maze.core.wrappers.observation_normalization.observation_normalization_wrapper.ObservationNormalizationWrapper:
    # default behaviour
    default_strategy: maze.normalization_strategies.RangeZeroOneObservationNormalizationStrategy
    default_strategy_config:
        clip_range: [0, 1]
        axis: ~
    default_statistics:
        min: 0
        max: 255
    statistics_dump: statistics.pkl
    sampling_policy:
        _target_: maze.core.agent.random_policy.RandomPolicy
    exclude: ~
    manual_config: ~

Details:

  • Adds range-zero-one normalization with manually set statistics to all observations

  • Clips the normalized observation to range [0, 1] in case something goes wrong. (As this example expects RGB pixel observations to have values between 0 and 255 this should not have an effect.)

  • Subtracts 0 from each value contained in the observation vector / matrix and then divides it by 255

  • The remaining settings do not have an effect here

Example 3: Custom Normalization and Excluding Observations

This advanced example shows how to utilize the full feature set of observation normalization. For explanations please see the comments and details below.

# @package wrappers
maze.core.wrappers.observation_normalization.observation_normalization_wrapper.ObservationNormalizationWrapper:
    # default behaviour
    default_strategy: maze.normalization_strategies.MeanZeroStdOneObservationNormalizationStrategy
    default_strategy_config:
        clip_range: [~, ~]
        axis: ~
    default_statistics: ~
    statistics_dump: statistics.pkl
    sampling_policy:
        _target_: maze.core.agent.random_policy.RandomPolicy
    # observation with key action_mask gets excluded from normalization
    exclude: [action_mask]
    manual_config:
        # observation pixel_image uses manually specified normalization statistics
        pixel_image:
          strategy: maze.normalization_strategies.RangeZeroOneObservationNormalizationStrategy
          strategy_config:
            clip_range: [0, 1]
            axis: ~
          statistics:
            min: 0
            max: 255
        # observation feature_vector estimates normalization statistics via sampling
        feature_vector:
          strategy: maze.normalization_strategies.MeanZeroStdOneObservationNormalizationStrategy
          strategy_config:
            clip_range: [-3, 3]
            # normalization statistics are computed along the first axis
            axis: [0]

Details:

  • The default behaviour for observations without manual config is identical to example 1

  • observation pixel_image: behaves as the default in example 2

  • observation feature_vector:

    • By setting axis to [0] in the strategy_config each element in the observation gets normalized with an element-wise mean and standard deviation.

    • Why? A feature_vector has shape (d,). After collecting N observations for computing the normalization statistics we arrive at a stacked feature_vector matrix with shape (N, d). By computing the normalization statistics along axis [0] we get normalization statistics with shape (d,) again, which can be applied in an element-wise fashion (see the short NumPy sketch after this list).

    • Additionally each element in the vector is clipped to range [-3, 3].

  • Note that even though a manual config is provided for some observations, you can still decide whether you would like to use predefined manual statistics or estimate them from sampled observations.
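The effect of computing statistics along axis [0] can be reproduced with a few lines of NumPy. This is an illustration of the math only, not the wrapper’s actual implementation:

import numpy as np

d = 10                              # hypothetical feature dimension
stacked = np.random.randn(1000, d)  # N=1000 collected feature_vector observations, shape (N, d)

mean = stacked.mean(axis=0)         # shape (d,): element-wise mean
std = stacked.std(axis=0)           # shape (d,): element-wise standard deviation

# element-wise normalization followed by clipping to [-3, 3]
normalized = np.clip((stacked - mean) / std, -3.0, 3.0)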

Example 4: Using Custom Normalization Strategies

In case the normalization strategies provided with Maze are not sufficient for your use case you can of course implement and add your own strategies.

# @package wrappers
maze.core.wrappers.observation_normalization.observation_normalization_wrapper.ObservationNormalizationWrapper:
    # default behaviour
    default_strategy: my_project.normalization_strategies.custom.CustomObservationNormalizationStrategy
    default_strategy_config:
        clip_range: [~, ~]
        axis: ~
    default_statistics: ~
    statistics_dump: statistics.pkl
    sampling_policy:
        _target_: maze.core.agent.random_policy.RandomPolicy
    exclude: ~
    manual_config: ~

When adding a new normalization strategy you (1) have to implement the ObservationNormalizationStrategy interface and (2) make sure that it is accessible within your Python path. Besides that you only have to provide the reference path of the normalization strategy to use.

Example 5: Plain Python Configuration

If you are not working with the Maze command line interface but still want to reuse observation normalization directly within Python you can start with the code snippet below. It shows how to:

  • instantiate an observation normalized environment

  • estimate normalization statistics via sampling

  • reuse the estimated statistics for normalization for subsequent tasks such as training or rollouts

"""Contains an example showing how to use observation normalization directly from python."""
from maze.core.agent.random_policy import RandomPolicy
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
from maze.core.wrappers.observation_normalization.observation_normalization_wrapper import \
    ObservationNormalizationWrapper
from maze.core.wrappers.observation_normalization.observation_normalization_utils import \
    obtain_normalization_statistics

# instantiate a maze environment
env = GymMazeEnv("CartPole-v0")

# this is the normalization config as a python dict
normalization_config = {
    "default_strategy": "maze.normalization_strategies.MeanZeroStdOneObservationNormalizationStrategy",
    "default_strategy_config": {"clip_range": (None, None), "axis": 0},
    "default_statistics": None,
    "statistics_dump": "statistics.pkl",
    "sampling_policy": RandomPolicy(env.action_spaces_dict),
    "exclude": None,
    "manual_config": None
}

# 1. PREPARATION: first we estimate normalization statistics
# ----------------------------------------------------------

# wrap the environment for observation normalization
env = ObservationNormalizationWrapper.wrap(env, **normalization_config)

# before we can start working with normalized observations
# we need to estimate the normalization statistics
normalization_statistics = obtain_normalization_statistics(env, n_samples=1000)

# 2. APPLICATION (training, rollout, deployment)
# ----------------------------------------------

# instantiate a maze environment
training_env = GymMazeEnv("CartPole-v0")
# wrap the environment for observation normalization
training_env = ObservationNormalizationWrapper.wrap(training_env, **normalization_config)

# reuse the estimated statistics in our training environment(s)
training_env.set_normalization_statistics(normalization_statistics)

# after this step the training env yields normalized observations
normalized_obs = training_env.reset()

Built-in Normalization Strategies

Normalization strategies specify how input observations are normalized.

Maze already comes with built-in normalization strategies. You can find a list and further details on the functionality of the respective strategies in the reference documentation.

The Bigger Picture

The figure below shows how observation normalization is embedded in the overall interaction loop and sets the involved components into context.

It is located in between the ObservationConversionInterface (which converts environment MazeStates into machine readable observations) and the agent.

_images/observation_normalization.png

According to the sampling_policy specified in the config the wrapper collects observations from the interaction loop and uses these to estimate the normalization statistics given the provided normalization strategies.

The statistics get dumped to the pickle file specified in the config for subsequent rollouts or deploying the agent.

If normalization statistics are known beforehand this stage can be skipped by simply providing the statistics manually in the wrapper config.

Where to Go Next

Tricks of the Trade

This page contains a short list of tips and best practices that have been quite useful in our work over the last couple of years and will hopefully also make it easier for you to train your agents. However, you should be aware that not every item below will work in each and every application scenario. Nonetheless, if you are stuck, most of them are certainly worth a try!

Note

Below you find a subjective and certainly not complete collection of RL tips and tricks that will hopefully continue to grow over time. If you stumble upon something crucial that is missing from the list and would like to share it with us and the RL community, do not hesitate to get in touch and discuss it with us!

Learning and Optimization

tick Action Masking

Use action masking whenever possible! This can be crucial as it has the potential to drastically reduce the exploration space of your problem, which usually leads to reduced learning time and better overall results. In some cases action masking also mitigates the need for reward shaping, as invalid actions are excluded from sampling and there is no need to penalize them with negative rewards anymore. If you want to learn more we recommend checking out the tutorial on structured environments and action masking.
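The core mechanic behind action masking can be illustrated independently of Maze: invalid actions get their logits pushed to negative infinity before sampling, so their probability becomes zero. A minimal PyTorch sketch, not tied to any particular Maze API:

import torch

logits = torch.tensor([1.2, 0.3, -0.5, 2.1])     # raw action scores from a policy head
valid = torch.tensor([True, False, True, True])  # False marks invalid actions

masked_logits = logits.masked_fill(~valid, float("-inf"))
probs = torch.softmax(masked_logits, dim=-1)     # invalid action receives probability 0
action = torch.distributions.Categorical(probs=probs).sample()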

tick Reward Scaling and Shaping

Make sure that your step rewards are in a reasonable range (e.g., [-1, 1]) not spanning various orders of magnitude. If these conditions are not fulfilled you might want to apply reward scaling or clipping (see RewardScalingWrapper, RewardClippingWrapper) or manually shape your reward.
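As a sketch of how this could look in code - note that the module paths and parameter names of the two wrappers below are assumptions and should be checked against the reference documentation:

from maze.core.wrappers.reward_scaling_wrapper import RewardScalingWrapper    # path assumed
from maze.core.wrappers.reward_clipping_wrapper import RewardClippingWrapper  # path assumed

env = ...  # any MazeEnv

# scale step rewards by a constant factor, then clip them to [-1, 1]
env = RewardScalingWrapper.wrap(env, scale=0.1)
env = RewardClippingWrapper.wrap(env, min_val=-1.0, max_val=1.0)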

tick Reward and Key Performance Indicator (KPI) Monitoring

When optimizing multi-target objectives (e.g., a weighted sum of sub-rewards) consider monitoring the contributing rewards on an individual basis. Even though the overall reward appears to not improve anymore, it might still be the case that the contributing sub-rewards change or fluctuate in the background. This indicates that the policy, and in turn the behaviour of your agent, is still changing. In such settings we recommend watching the learning progress by monitoring KPIs.

Models and Networks

tick Network Design

Design use case and task specific custom network architectures whenever required. In a straightforward case this might be a CNN when processing image observations, but it could also be a Graph Convolution Network (GCN) when working with graph or grid observations. To do so, you might want to check out the Perception Module, the built-in network building blocks as well as the section on how to work with custom models.

Further, you might want to consider behavioural cloning (BC) to design and tweak

  • the network architectures

  • the observations that are fed into these models

This requires that an imitation learning dataset fulfilling the pre-conditions for supervised learning is available. If so, incorporating BC into the model and observation design process can save a lot of time and compute as you are now training in a supervised learning setting. Intuition: If a network architecture, given the corresponding observations, is able to fit an offline trajectory dataset (without severe over-fitting) it might also be a good choice for actual RL training. If this is relevant to you, you can follow up on how to employ imitation learning with Maze.

tick Continuous Action Spaces

When facing bounded continuous action spaces, use Squashed Gaussian or Beta probability distributions for your action heads instead of an unbounded Gaussian. This avoids action clipping and limits the space of explorable actions to valid regions. In the section about distributions and action heads you can learn how to easily switch between different probability distributions using the DistributionMapper.

tick Action Head Biasing

If you would like to incorporate prior knowledge about the selection frequency of certain actions, you could consider biasing the output layers of these action heads towards the expected sampling distribution after randomly initializing the weights of your networks (e.g., compute_sigmoid_bias).
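For a binary action head this boils down to the inverse sigmoid (logit) of the desired selection probability. The sketch below illustrates the idea in plain PyTorch without using compute_sigmoid_bias itself:

import math

import torch


def sigmoid_bias_for_probability(p: float) -> float:
    """Returns the bias b for which sigmoid(b) == p (assuming roughly zero-centered pre-activations)."""
    return math.log(p / (1.0 - p))


# hypothetical binary action head that should fire roughly 10% of the time after random init
head = torch.nn.Linear(in_features=128, out_features=1)
torch.nn.init.constant_(head.bias, sigmoid_bias_for_probability(0.1))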

Observations

tick Observation Normalization

For efficient RL training it is crucial that the inputs (e.g. observations) to our models (e.g. policy and value networks) follow a certain distribution and exhibit values within certain ranges. To ensure this precondition, consider normalizing your observations before actual training by either:

  • manually specifying normalization statistics (e.g, divide by 255 for uint8 RGB image observations)

  • computing statistics from observations sampled by interacting with the environment

As this is a recurring, boilerplate code heavy task, Maze already provides built-in customizable functionality for normalizing the observations.

tick Observation Pre-Processing

When feeding categorical observations to your models, consider converting them to their one-hot encoded vectorized counterparts. This representation is better suited for neural network processing and a common practice, for example, in Natural Language Processing (NLP). In Maze you can achieve this via observation pre-processing and the OneHotPreProcessor.
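For example, a categorical “current month” observation taking values 0 to 11 becomes a length-12 vector. The snippet below shows the encoding in plain NumPy for illustration; in Maze the OneHotPreProcessor takes care of this for you:

import numpy as np

month = 4                                # categorical observation in {0, ..., 11}
one_hot = np.zeros(12, dtype=np.float32)
one_hot[month] = 1.0                     # -> [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]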

Cheat Sheet

Run a rollout to test an environment with random action sampling:

maze-run -cn conf_rollout env.name=CartPole-v1 policy=random_policy

Run a rollout and render the state of the environment:

maze-run -cn conf_rollout env.name=CartPole-v1 policy=random_policy \
runner=sequential runner.render=true

Train a policy with evolutionary strategies (ES):

maze-run -cn conf_train env.name=CartPole-v1 algorithm=es model=vector_obs

Train a policy with an actor-critic trainer such as A2C:

maze-run -cn conf_train env.name=CartPole-v1 algorithm=a2c \
model=vector_obs critic=template_state

Resume training from a previous model state:

maze-run -cn conf_train env.name=CartPole-v1 algorithm=a2c \
model=vector_obs critic=template_state input_dir=outputs/<experiment-dir>

Run a rollout of a policy, trained with the command above:

maze-run -cn conf_rollout env.name=CartPole-v1 model=vector_obs \
policy=torch_policy input_dir=outputs/<experiment-dir>

Integrating an Existing Gym Environment

Maze supports a seamless integration of existing OpenAI Gym environments. This holds for already registered, built-in Gym environments but also for any other custom environment following the Gym environment interface.

To get full Maze feature support for Gym environments we first have to transform them into Maze environments. This page shows how this is easily accomplished via the GymMazeEnv.

_images/gym_env_wrapper.png

In short, a Gym environment is transformed into a MazeEnv by wrapping it with the GymMazeEnv. Under the hood the GymMazeEnv automatically:

  1. Transforms the Gym environment into a GymCoreEnv.

  2. Transforms the observation and action spaces into dictionary spaces via the GymObservationConversion and GymActionConversion interfaces.

  3. Packs the GymCoreEnv into a MazeEnv which is fully compatible with all other Maze components and modules.

To get a better understanding of the overall structure please see the Maze environment hierarchy.

Instantiating a Gym Environment as a Maze Environment

The config snippet below shows how to instantiate an existing, already registered Gym environment as a GymMazeEnv referenced by its environment name (here CartPole-v0).

# @package env
_target_: maze.core.wrappers.maze_gym_env_wrapper.make_gym_maze_env
name: CartPole-v0

To achieve the same result directly with plain Python you can start with the code snippet below.

from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
env = GymMazeEnv(env="CartPole-v0")

In case your custom Gym environment is not yet registered with Gym, you can also explicitly instantiate the environment before passing it to the GymMazeEnv.

from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
custom_gym_env = CustomGymEnv()
env = GymMazeEnv(env=custom_gym_env)

Test your own Gym Environment with Maze

If you already have a project set up and would like to test Maze with your own environment the quickest way to get started is to:

First, make sure that your project is either installed or available in your PYTHONPATH.

Second, add an environment factory function similar to the one shown in the snippet below to your project (e.g., my_project/env_factory.py).

from maze.core.env.maze_env import MazeEnv
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv

def make_env(name: str) -> MazeEnv:
    custom_gym_env = CustomGymEnv()
    return GymMazeEnv(custom_gym_env)

That’s all we need to do. You can now start training an agent for your environment by running:

$ maze-run -cn conf_train env._target_=my_project.env_factory.make_env

This basically updates the original gym_env config via Hydra overrides.

Note that the argument name is unused so far but is required to adhere to the gym_env config signature. When creating your own config files you can of course tailor this signature to your needs.

Where to Go Next

Structured Environments and Action Masking

This tutorial provides a step by step guide explaining how to implement a decision problem as a structured environment and how to train an agent for such a StructuredEnv with a structured Maze Trainer. The examples are again based on the online version of the Guillotine 2D Cutting Stock Problem which is a perfect fit for introducing the underlying concepts.

In particular, we will see how to evolve the performance of an RL agent by going through the following stages:

  1. Flat Gym-style environment with vanilla feed forward models

  2. Structured environment (e.g., with hierarchical sub-steps) with task specific policy networks

  3. Structured environment (e.g., with hierarchical sub-steps) with masked task specific policy networks

_images/tb_reward_detail.png

Before diving into this tutorial we recommend familiarizing yourself with Control Flows with Structured Environments and the basic Maze - step by step tutorial.

The remainder of this tutorial is structured as follows:

Turning a “flat” MazeEnv into a StructuredEnv

In this part of the tutorial we will learn how to reformulate an RL problem in order to turn it from a “flat” Gym-style environment into a structured environment.

The complete code for this part of the tutorial can be found here

# relevant files
- cutting_2d
    - main.py
    - env
        - struct_env.py

Analyzing the Problem Structure

Before we start implementing the structured environment let’s first revisit the cutting 2D problem. In particular, we turn our attention to the joint action space consisting of the following components:

  • Action \(a_0\): cutting piece selection (decides which piece from inventory to use for cutting)

  • Action \(a_1\): cutting orientation selection (decides on the orientation of the cut)

  • Action \(a_2\): cutting order selection (decides which cut to take first; x or y)

_images/cutting_parameters.png

Analysis of Action Space and Problem:

  • We are facing a combinatorial action space with \(O(N \cdot 2 \cdot 2)\) possible actions the agent has to choose from in each step. \(N\) is the maximum number of pieces stored in the inventory.

  • Sampling from this joint action space might result in invalid cutting configurations. This is because the three sub-actions are treated as independent from each other, which for the problem at hand is obviously not the case.

  • It would be much more intuitive to sample the sub-actions sequentially and conditioned on each other. (E.g., it seems to be easier to decide on the cutting order and orientation once we know the piece we will cut from.)
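
To make the analysis above more tangible, the following sketch (assuming N = 200 inventory slots, as in the test script later in this tutorial) builds the flat joint action space with gym and samples from it. Each of the three components is drawn independently, which is exactly what makes invalid combinations possible:

import gym

N = 200  # maximum number of pieces in the inventory
flat_action_space = gym.spaces.Dict({
    "piece_idx": gym.spaces.Discrete(N),
    "cut_rotation": gym.spaces.Discrete(2),
    "cut_order": gym.spaces.Discrete(2),
})

# independent sampling of the three sub-actions; nothing prevents e.g. a
# rotation that does not fit the sampled piece
action = flat_action_space.sample()
print(action, "->", N * 2 * 2, "possible joint actions per step")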

Implementing the Structured Environment

We now address the issues discovered in the previous section and re-formulate the cutting 2D problem as a StructuredEnv with the following two sub-steps:

  • Select cutting piece from inventory given inventory state and customer order.

  • Select cutting configuration (cutting order and cutting orientation) given the customer order and the inventory piece selected in the previous sub-step.

This can also be described with the modified agent-environment interaction loop shown in the figure below. Note that both the observation and the action space differ between the selection and the cutting sub-step. For the present example, reward is only granted once the cutting sub-step (i.e., the second sub-step) is complete.

_images/sub_step_interaction.png

Note

Conceptually structured environments and conditional sub-steps are related to auto-regressive action spaces where subsequent actions are sampled conditioned on their predecessors. [e.g. DeepMind (2019), “Grandmaster level in StarCraft II using multi-agent reinforcement learning.”]

The code for the StructuredCutting2DEnvironment below implements exactly this interaction pattern.

env/struct_env.py
from copy import deepcopy
from typing import Dict, Any, Union, Tuple, Optional, List

import gym
import numpy as np
from maze.core.env.maze_action import MazeActionType
from maze.core.env.maze_env import MazeEnv
from maze.core.env.maze_state import MazeStateType
from maze.core.env.structured_env import StructuredEnv, ActorID
from maze.core.env.structured_env_spaces_mixin import StructuredEnvSpacesMixin
from maze.core.wrappers.wrapper import Wrapper
from .maze_env import maze_env_factory


class StructuredCutting2DEnvironment(Wrapper[MazeEnv], StructuredEnv, StructuredEnvSpacesMixin):
    """Structured environment version of the cutting 2D environment.
    The environment alternates between the two sub-steps:

    - Select cutting piece
    - Select cutting configuration (cutting order and cutting orientation)

    :param maze_env: The "flat" cutting 2D environment to wrap.
    """

    def __init__(self, maze_env: MazeEnv):
        Wrapper.__init__(self, maze_env)

        # define sub-step action spaces
        self._action_spaces_dict = {
            0: gym.spaces.Dict({"piece_idx": maze_env.action_space["piece_idx"]}),
            1: gym.spaces.Dict({"cut_rotation": maze_env.action_space["cut_rotation"],
                                "cut_order": maze_env.action_space["cut_order"]})
        }

        # define sub-step observation spaces
        flat_space = maze_env.observation_space
        self._observation_spaces_dict = {
            0: flat_space,
            1: gym.spaces.Dict({"selected_piece": flat_space["ordered_piece"],
                                "ordered_piece": flat_space["ordered_piece"]})
        }

        self._flat_obs = None
        self._action_0 = None
        self._sub_step_key = 0
        self._last_reward = None  # Last reward obtained from the underlying environment

    def step(self, action):
        """Generic step function alternating between the two sub-steps.
        :return: obs, rew, done, info
        """
        # sub-step: Select cutting piece
        if self._sub_step_key == 0:
            sub_step_result = self._selection_step(action)
        # sub-step: Select cutting configuration
        elif self._sub_step_key == 1:
            sub_step_result = self._cutting_step(action)
        else:
            raise ValueError("Sub-step id {} not allowed for this environment!".format(self._sub_step_key))

        # alternate step index
        self._sub_step_key = np.mod(self._sub_step_key + 1, 2)

        return sub_step_result

    def reset(self) -> Any:
        """Resets the environment and returns the initial state.
        :return: The initial state after resetting.
        """
        self._flat_obs = self.env.reset()
        self._flat_obs["ordered_piece"] = self._flat_obs["ordered_piece"]

        self._sub_step_key = 0
        return self._obs_selection_step(self._flat_obs)

    @staticmethod
    def _obs_selection_step(flat_obs: Dict[str, np.array]) -> Dict[str, np.array]:
        """Formats initial observation / observation available for the first sub-step."""
        return deepcopy(flat_obs)

    @staticmethod
    def _obs_cutting_step(flat_obs: Dict[str, np.array], selected_piece_idx: int) -> Dict[str, np.array]:
        """Formats observation available for the second sub-step."""
        return {"selected_piece": flat_obs["inventory"][selected_piece_idx],
                "ordered_piece": flat_obs["ordered_piece"]}

    def _selection_step(self, action: Dict[str, int]) -> Tuple[Dict[str, np.ndarray], float, bool, Dict]:
        """Cutting piece selection step."""
        self._action_0 = action
        obs = self._obs_cutting_step(self._flat_obs, action["piece_idx"])
        return obs, 0.0, False, {}

    def _cutting_step(self, action: Dict[str, int]) -> Tuple[Dict[str, np.ndarray], float, bool, Dict]:
        """Cutting rotation and cutting order selection step."""
        flat_action = {"piece_idx": self._action_0["piece_idx"],
                       "cut_rotation": action["cut_rotation"],
                       "cut_order": action["cut_order"]}

        self._flat_obs, self._last_reward, done, info = self.env.step(flat_action)
        self._flat_obs["ordered_piece"] = self._flat_obs["ordered_piece"]

        return self._obs_selection_step(self._flat_obs), self._last_reward, done, info

    def actor_id(self) -> ActorID:
        """Returns the currently executed actor along with the policy id. The id is unique only with
        respect to the policies (every policy has its own actor 0).
        Note that identities of done actors can not be reused in the same rollout.

        :return: The current actor, as tuple (policy id, actor number).
        """
        return ActorID(step_key=self._sub_step_key, agent_id=0)

    def get_actor_rewards(self) -> Optional[np.ndarray]:
        """Returns rewards attributed to individual actors after the step has been done. This is necessary,
        as after the first sub-step (i.e., piece selection), the full reward is not yet available, so zero
        reward is returned instead. The second (= last) sub-step then returns joint reward for all (both) actors.

        With this method, we can attribute parts of the reward to the individual actors, which is useful for example
        if each has its own separate critic.

        In this case, we attribute half of the reward to each actor.
        """
        return np.array([self._last_reward / 2.0] * 2)

    @property
    def agent_counts_dict(self) -> Dict[Union[str, int], int]:
        """Returns the count of agents for individual sub-steps (or -1 for dynamic agent count).

        This env has two sub-steps (0 and 1), in each of which one agent gets to act. Hence, we return
        {0: 1, 1: 1}.
        """
        return {0: 1, 1: 1}

    def is_actor_done(self) -> bool:
        """Returns True if the just stepped actor is done, which is different to the done flag of the environment."""
        return False

    @property
    def action_space(self) -> gym.spaces.Dict:
        """Implementation of :class:`~maze.core.env.structured_env_spaces_mixin.StructuredEnvSpacesMixin` interface."""
        return self._action_spaces_dict[self._sub_step_key]

    @property
    def observation_space(self) -> gym.spaces.Dict:
        """Implementation of :class:`~maze.core.env.structured_env_spaces_mixin.StructuredEnvSpacesMixin` interface."""
        return self._observation_spaces_dict[self._sub_step_key]

    @property
    def action_spaces_dict(self) -> Dict[Union[int, str], gym.spaces.Dict]:
        """Implementation of :class:`~maze.core.env.structured_env_spaces_mixin.StructuredEnvSpacesMixin` interface."""
        return self._action_spaces_dict

    @property
    def observation_spaces_dict(self) -> Dict[Union[int, str], gym.spaces.Dict]:
        """Implementation of :class:`~maze.core.env.structured_env_spaces_mixin.StructuredEnvSpacesMixin` interface."""
        return self._observation_spaces_dict

    def seed(self, seed: int = None) -> None:
        """Sets the seed for this environment's random number generator(s).
        :param seed: the seed integer initializing the random number generator.
        """
        self.env.seed(seed)

    def close(self) -> None:
        """Performs any necessary cleanup."""
        self.env.close()

    def get_observation_and_action_dicts(self, maze_state: MazeStateType, maze_action: MazeActionType,
                                         first_step_in_episode: bool) \
            -> Tuple[Optional[Dict[Union[int, str], Any]], Optional[Dict[Union[int, str], Any]]]:
        """Convert the flat action and MazeAction from Maze env into the structured ones.

        Note that both MazeState and MazeAction needs to be supplied together, otherwise actions/observations for the
        individual sub-steps cannot be produced.
        """
        assert maze_state is not None and maze_action is not None,\
            "This wrapper needs both MazeState and MazeAction for the conversion (as there are multiple sub-steps)."
        observation_dict, action_dict = self.env.get_observation_and_action_dicts(maze_state, maze_action,
                                                                                  first_step_in_episode)
        assert len(observation_dict.items()) == 1 and len(action_dict.items()) == 1, "wrapped env should be single-step"

        flat_action = list(action_dict.values())[0]
        flat_obs = list(observation_dict.values())[0]

        flat_obs["ordered_piece"] = flat_obs["ordered_piece"]

        obs_dict = {
            0: self._obs_selection_step(flat_obs),
            1: self._obs_cutting_step(flat_obs, flat_action["piece_idx"])
        }

        act_dict = {
            0: {k: flat_action[k] for k in ["piece_idx"]},
            1: {k: flat_action[k] for k in ["cut_rotation", "cut_order"]}
        }

        return obs_dict, act_dict


def struct_env_factory(max_pieces_in_inventory: int, raw_piece_size: Tuple[int, int],
                       static_demand: List[Tuple[int, int]]) -> StructuredCutting2DEnvironment:
    """Convenience factory function that compiles a trainable structured environment.
    (for argument details see: Cutting2DEnvironment)
    """

    # init maze environment including observation and action interfaces
    env = maze_env_factory(max_pieces_in_inventory=max_pieces_in_inventory,
                           raw_piece_size=raw_piece_size,
                           static_demand=static_demand)

    # convert flat to structured environment
    return StructuredCutting2DEnvironment(env)

Test Script

The following snippet first instantiates the structured environment and then performs one cycle of the structured agent-environment interaction loop.

main.py
""" Test script CoreEnv """
from tutorial_maze_env.part06_struct_env.env.struct_env import struct_env_factory


def main():
    # init maze environment including observation and action interfaces
    struct_env = struct_env_factory(max_pieces_in_inventory=200,
                                    raw_piece_size=(100, 100),
                                    static_demand=[(30, 15)])

    # reset env
    obs_step1 = struct_env.reset()

    print("action_space 1:     ", struct_env.action_space)
    print("observation_space 1:", struct_env.observation_space)
    print("observation 1:      ", obs_step1.keys())

    # take first env step
    action_1 = struct_env.action_space.sample()
    obs_step2, rew, done, info = struct_env.step(action=action_1)

    print("action_space 2:     ", struct_env.action_space)
    print("observation_space 2:", struct_env.observation_space)
    print("observation 2:      ", obs_step2.keys())

    # take second env step
    action_2 = struct_env.action_space.sample()
    obs_step1, rew, done, info = struct_env.step(action=action_2)


if __name__ == "__main__":
    """ main """
    main()

Running the script will print the following output. Note that the observation and action spaces alternate from sub-step to sub-step.

action_space 1:      Dict(piece_idx:Discrete(200))
observation_space 1: Dict(inventory:Box(200, 2), inventory_size:Box(1,), ordered_piece:Box(2,))
observation 1:       dict_keys(['inventory', 'inventory_size', 'ordered_piece'])
action_space 2:      Dict(cut_order:Discrete(2), cut_rotation:Discrete(2))
observation_space 2: Dict(ordered_piece:Box(2,), selected_piece:Box(2,))
observation 2:       dict_keys(['selected_piece', 'ordered_piece'])

In the next part of this tutorial we will train an agent on this structured environment.

Training the Structured Environment

In this part of the tutorial we will learn how to train an agent with a Maze trainer that directly supports Structured Environments. We will also design a policy network architecture matching the task at hand.

The complete code for this part of the tutorial can be found here

# relevant files
- cutting_2d
    - conf
        - env
            - tutorial_cutting_2d_flat.yaml
            - tutorial_cutting_2d_struct.yaml
        - model
            - tutorial_cutting_2d_flat.yaml
            - tutorial_cutting_2d_struct.yaml
        - wrappers
            - tutorial_cutting_2d.yaml
    - models
        - actor.py
        - critic.py

A Simple Problem Setting

To emphasize the effects of action masking throughout this tutorial, we devise a simple problem instance of the cutting 2D environment with the following properties:

_images/simple_problem_sketch.png

Given the raw piece size and the items in the static demand (which appear in alternating fashion), we can cut six customer orders from one raw inventory piece. When limiting the episode length to 180 time steps, fulfilling all orders requires 180 / 6 = 30 raw pieces, so the optimal number of new raw pieces taken from inventory is 31 (30 + 1, because the environment adds a new piece whenever the current one has been cut).
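
The optimum stated above follows directly from the episode length and the number of orders that fit into one raw piece; a quick back-of-the-envelope check:

# back-of-the-envelope check of the optimum stated above
orders_per_episode = 180     # episode length in time steps (one customer order per step)
orders_per_raw_piece = 6     # six customer orders fit into one raw inventory piece
raw_pieces_needed = orders_per_episode // orders_per_raw_piece  # 30
optimum = raw_pieces_needed + 1  # +1: a fresh raw piece is added once the current one has been cut
print(optimum)  # 31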

Task-Specific Actor-Critic Model

For this advanced tutorial we make use of Maze custom models to compose an actor-critic architecture that is geared towards the respective sub-tasks. Our structured environment requires two policies, one for piece selection and one for cutting parametrization. For each of the two sub-step policies we also build a distinct state critic (see StepStateCriticComposer for details). Note that it would also be possible to employ a SharedStateCriticComposer to compute the advantages for both policies.

The images below show the four network architectures (click to view in large). For further details on how to build the models we refer to the accompanying repository and the section on how to work with custom models.

Piece Selection Policy
Cutting Policy
Piece Selection Critic
Cutting Critic
_images/policy_0.png
_images/policy_1.png
_images/critic_0.png
_images/critic_1.png

Some notes on the models:

  • The selection policy takes the current inventory and the ordered piece as input and predicts a selection probability (piece_idx) for each inventory option.

  • The cutting policy takes the ordered piece and the selected piece (previous step) as input and predicts cutting rotation and cutting order.

  • The critic models have an analogous structure but predict the state-value instead of action logits.

Multi-Step Training

Given the models designed in the previous section we are now ready to train our first agent on a Structured Environment. We already mentioned that Maze trainers directly support the training of Structured Environments such as the StructuredCutting2DEnvironment implemented in the previous part of this tutorial.

To start training a cutting policy with the PPO trainer, run:

maze-run -cn conf_train env=tutorial_cutting_2d_struct wrappers=tutorial_cutting_2d \
model=tutorial_cutting_2d_struct algorithm=ppo

As usual, we can watch the training progress with Tensorboard.

tensorboard --logdir outputs
_images/tb_struct_reward.png

We can see that the reward slowly approaches the optimum. Note that this agent already performs much better than the vanilla Gym-style model we employed in the introductory tutorial (compare the evolution of rewards above).

However, the event logs also reveal that the agent initially samples many invalid actions (e.g., invalid_cut and invalid_piece_selected). This is sample inefficient and slows down the learning progress.

Next, we will further improve the agent by avoiding sampling of these invalid choices via action masking.

Adding Step-Conditional Action Masking

In this part of the tutorial we will learn how to substantially increase the sample efficiency of our agents by adding sub-step conditional action masking to the structured environment.

The complete code for this part of the tutorial can be found here

# relevant files
- cutting_2d
    - main.py
    - env
        - struct_env_masked.py

In particular, we will add two different masks:

  • Inventory_mask: allows selecting only those cutting pieces from inventory slots that actually hold a piece large enough to fulfill the customer order.

  • Rotation_mask: allows specifying only valid cutting rotations (i.e., rotations for which the ordered piece fits into the cutting piece selected from inventory). Note that providing this mask is only possible once the cutting piece has been selected in the first sub-step - hence the name step-conditional masking.

The figure below provides a sketch of the two masks.

_images/cutting2d_masking.png

Only the first two inventory pieces are able to fit the customer order. The four rightmost inventory slots do not hold a piece at all and are therefore also masked. When the piece is rotated by 90° for cutting, the customer order would no longer fit into the selected inventory piece, which is why we can simply mask this option.

Masked Structured Environment

One way to incorporate the two masks in our structured environment is to simply inherit from the initial version and extend it by the following changes:

  • Add the two masks to the observation spaces (e.g., inventory_mask and cutting_mask)

  • Compute the actual mask for the two sub-steps in the respective functions (e.g., _obs_selection_step and _obs_cutting_step).

env/struct_env_masked.py
from copy import deepcopy
from typing import Dict, List, Tuple

import gym
import numpy as np
from tutorial_maze_env.part06_struct_env.env.maze_env import maze_env_factory
from tutorial_maze_env.part06_struct_env.env.struct_env import StructuredCutting2DEnvironment
from maze.core.env.maze_env import MazeEnv


class MaskedStructuredCutting2DEnvironment(StructuredCutting2DEnvironment):
    """Structured environment version of the cutting 2D environment.
    The environment alternates between the two sub-steps:

    - Select cutting piece
    - Select cutting configuration (cutting order and cutting orientation)

    :param maze_env: The "flat" cutting 2D environment to wrap.
    """

    def __init__(self, maze_env: MazeEnv):
        super().__init__(maze_env)

        # add masks to observation spaces
        max_inventory = self.observation_conversion.max_pieces_in_inventory
        self._observation_spaces_dict[0].spaces["inventory_mask"] = \
            gym.spaces.Box(low=np.float32(0), high=np.float32(1), shape=(max_inventory,), dtype=np.float32)

        self._observation_spaces_dict[1].spaces["cutting_mask"] = \
            gym.spaces.Box(low=np.float32(0), high=np.float32(1), shape=(2,), dtype=np.float32)

    @staticmethod
    def _obs_selection_step(flat_obs: Dict[str, np.array]) -> Dict[str, np.array]:
        """Formats initial observation / observation available for the first sub-step."""
        observation = deepcopy(flat_obs)

        # prepare inventory mask
        sorted_order = np.sort(observation["ordered_piece"].flatten())
        sorted_inventory = np.sort(observation["inventory"], axis=1)

        observation["inventory_mask"] = np.all(observation["inventory"] > 0, axis=1).astype(np.float32)
        for i in np.nonzero(observation["inventory_mask"])[0]:
            # exclude pieces which do not fit
            observation["inventory_mask"][i] = np.all(sorted_order <= sorted_inventory[i])

        return observation

    @staticmethod
    def _obs_cutting_step(flat_obs: Dict[str, np.array], selected_piece_idx: int) -> Dict[str, np.array]:
        """Formats observation available for the second sub-step."""

        selected_piece = flat_obs["inventory"][selected_piece_idx]
        ordered_piece = flat_obs["ordered_piece"]

        # prepare cutting action mask
        cutting_mask = np.zeros((2,), dtype=np.float32)

        selected_piece = selected_piece.squeeze()
        if np.all(flat_obs["ordered_piece"] <= selected_piece):
            cutting_mask[0] = 1.0

        if np.all(flat_obs["ordered_piece"][::-1] <= selected_piece):
            cutting_mask[1] = 1.0

        return {"selected_piece": selected_piece,
                "ordered_piece": ordered_piece,
                "cutting_mask": cutting_mask}


def struct_env_factory(max_pieces_in_inventory: int, raw_piece_size: Tuple[int, int],
                       static_demand: List[Tuple[int, int]]) -> MaskedStructuredCutting2DEnvironment:
    """Convenience factory function that compiles a trainable structured environment.
    (for argument details see: Cutting2DEnvironment)
    """

    # init maze environment including observation and action interfaces
    env = maze_env_factory(max_pieces_in_inventory=max_pieces_in_inventory,
                           raw_piece_size=raw_piece_size,
                           static_demand=static_demand)

    # convert flat to structured environment
    return MaskedStructuredCutting2DEnvironment(env)

Test Script

When re-running the main script of the previous section with the masked version of the structured environment we now get the following output:

action_space 1:      Dict(piece_idx:Discrete(200))
observation_space 1: Dict(inventory:Box(200, 2), inventory_size:Box(1,), ordered_piece:Box(2,), inventory_mask:Box(200,))
observation 1:       dict_keys(['inventory', 'inventory_size', 'ordered_piece', 'inventory_mask'])
action_space 2:      Dict(cut_order:Discrete(2), cut_rotation:Discrete(2))
observation_space 2: Dict(ordered_piece:Box(2,), selected_piece:Box(2,), cutting_mask:Box(2,))
observation 2:       dict_keys(['selected_piece', 'ordered_piece', 'cutting_mask'])

As expected, both masks are contained in the respective observations and spaces. In the next section we will utilize these masks to enhance the sample efficiency of our trainers.

Training with Action Masking

In this part of the tutorial we will retrain the environment with step-conditional action masking activated and benchmark it with the initial, unmasked version.

The complete code for this part of the tutorial can be found here

# relevant files
- cutting_2d
    - conf
        - env
            - tutorial_cutting_2d_flat.yaml
            - tutorial_cutting_2d_struct.yaml
            - tutorial_cutting_2d_struct_masked.yaml
        - model
            - tutorial_cutting_2d_flat.yaml
            - tutorial_cutting_2d_struct.yaml
            - tutorial_cutting_2d_struct_masked.yaml
        - wrappers
            - tutorial_cutting_2d.yaml
    - models
        - actor.py
        - critic.py

Masked Policy Models

Before we can retrain the masked version of the structured environment we first need to specify how the masks are employed within the models. For this purpose we extend the two policy models with an ActionMaskingBlock applied to the respective logits. The resulting models are shown below:

Masked Piece
Selection Policy
Masked Cutting Policy
Piece Selection Critic
Cutting Critic
_images/policy_01.png
_images/policy_11.png
_images/critic_01.png
_images/critic_11.png
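
Conceptually, masking action logits boils down to pushing the logits of invalid options towards minus infinity before the softmax, so that they receive (near-)zero probability and are never sampled. The following minimal PyTorch sketch only illustrates this idea; it is not the Maze ActionMaskingBlock implementation itself:

import torch

def mask_logits(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # entries with mask == 0 get a very large negative logit and thus ~zero probability
    return torch.where(mask > 0, logits, torch.full_like(logits, -1e9))

logits = torch.tensor([1.2, 0.3, -0.5, 2.0])
inventory_mask = torch.tensor([1.0, 1.0, 0.0, 0.0])
probs = torch.softmax(mask_logits(logits, inventory_mask), dim=-1)
print(probs)  # probability mass only on the first two (valid) options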

Retraining with Masking

maze-run -cn conf_train env=tutorial_cutting_2d_struct_masked wrappers=tutorial_cutting_2d \
model=tutorial_cutting_2d_struct_masked algorithm=ppo

Once training has finished we can again inspect the progress with Tensorboard. To get a better feeling for the effect of action masking we benchmark the following versions of the environment:

  • Flat Gym-style environment with vanilla feed forward models (red)

  • Structured Environment (e.g., with hierarchical sub-steps) with task specific policy networks (orange)

  • Structured Environment (e.g., with hierarchical sub-steps) with masked, task specific policy networks (blue)

_images/tb_reward.png

First of all, we can observe a massive increase in learning speed when activating action masking. In fact, the reward of the masked model starts at a much higher initial value. We can also observe a substantial improvement when switching from the vanilla feed forward Gym-style example (red) to the structured environment using task-specific custom models (orange).

In Depth Inspection of Learning Progress

In this section we make use of the Maze Event Logging System to learn more about the learning progress and behaviour of the respective versions.

_images/tb_events.png
  • When looking at the cutting events we see that the agent utilizing action masking performs only valid cutting attempts right from the beginning of the training process. Not having to learn via reward shaping which cuts are actually possible allows it to focus on learning how to cut efficiently. The two other versions, in contrast, first have to learn exactly that.

  • The same observation holds for the piece selection policy where again a lot of invalid attempts take place for the two unmasked versions.

  • Finally, when looking at the inventory statistics we can see that the masked agent keeps very few pieces in inventory (pieces in inventory) which is why it never has to discard any piece (pieces discarded) that might be required to fulfill upcoming customer orders.

Turning a “flat” MazeEnv into a StructuredEnv
We will reformulate the problem from a “flat” Gym-style environment into a structured environment.

Training the Structured Environment
We will train the structured environment with a Maze Trainer.

Adding Step-Conditional Action Masking
We will learn how to substantially increase the sample efficiency by adding step-conditional action masking.

Training with Action Masking
We will retrain the structured environment with step-conditional action masking activated and benchmark it against the initial, unmasked version.

Combining Maze with other RL Frameworks

This tutorial explains how to use general Maze features in combination with existing RL frameworks. In particular, we will apply observation normalization before optimizing a policy with the stable-baselines3 A2C trainer. When adding new features to Maze we put a strong emphasis on reusability, so that you can use as many of these features as possible while still sticking to the optimization framework you are most comfortable or familiar with.

Since RLlib already has a dedicated spot within Maze we rely on stable-baselines3 for this tutorial. However, it is important to note that the examples below will also work with any other Python-based RL framework compatible with Gym environments.

We provide two different versions showing how to arrive at an observation-normalized environment. The first one is written in plain Python, while the second reproduces the Python example with a Hydra configuration.

Note

Although this tutorial explains how to reuse observation normalization, there is of course no limitation to this single feature. If you find this useful, we recommend browsing through the Environment Customization section in the sidebar.

Reusing Environment Customization Features

The basis for this tutorial is the official getting started snippet of stable-baselines3, showing how to train and run A2C on a CartPole environment. We added a few comments to make things a bit more explicit.

If you would like to run this example yourself make sure to install stable-baselines3 first.

"""
Getting started example from:
https://stable-baselines3.readthedocs.io/en/master/guide/quickstart.html
"""

import gym
from stable_baselines3 import A2C

# ENV INSTANTIATION
# -----------------
env = gym.make('CartPole-v0')

# TRAINING AND ROLLOUT
# --------------------

model = A2C('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)

obs = env.reset()
for i in range(1000):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()

Below you find exactly the same example but with an observation normalized environment. The following modifications compared to the example above are required:

  • Instantiate a GymMazeEnv instead of a standard Gym environment

  • Wrap the environment with the ObservationNormalizationWrapper

  • Estimate normalization statistics from actual environment interactions

As you might already have experienced, re-coding these steps for different environments and experiments can get quite cumbersome. Note that the wrapper also dumps the estimated statistics to a file (statistics.pkl) so they can be reused later for agent deployment.

"""
Contains an example showing how to train
an observation normalized maze environment with stable-baselines.
"""

from maze.core.agent.random_policy import RandomPolicy
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
from maze.core.wrappers.no_dict_spaces_wrapper import NoDictSpacesWrapper
from maze.core.wrappers.observation_normalization.observation_normalization_utils import \
    obtain_normalization_statistics
from maze.core.wrappers.observation_normalization.observation_normalization_wrapper import \
    ObservationNormalizationWrapper

from stable_baselines3 import A2C

# ENV INSTANTIATION: a GymMazeEnv instead of a gym.Env
# ----------------------------------------------------
env = GymMazeEnv('CartPole-v0')

# OBSERVATION NORMALIZATION
# -------------------------

# we wrap the environment with the ObservationNormalizationWrapper
# (you can find details on this in the section on observation normalization)
env = ObservationNormalizationWrapper(
    env=env,
    default_strategy="maze.normalization_strategies.MeanZeroStdOneObservationNormalizationStrategy",
    default_strategy_config={"clip_range": (None, None), "axis": 0},
    default_statistics=None, statistics_dump="statistics.pkl",
    sampling_policy=RandomPolicy(env.action_spaces_dict),
    exclude=None, manual_config=None)

# next we estimate the normalization statistics by
# (1) collecting observations by randomly sampling 1000 transitions from the environment
# (2) computing the statistics according to the defined normalization strategy
normalization_statistics = obtain_normalization_statistics(env, n_samples=1000)
env.set_normalization_statistics(normalization_statistics)

# after this step all observations returned by the environment will be normalized

# stable-baselines does not support dict spaces so we have to remove them
env = NoDictSpacesWrapper(env)

# TRAINING AND ROLLOUT (remains unchanged)
# ----------------------------------------

model = A2C('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)

obs = env.reset()
for i in range(1000):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()

Reusing the Hydra Configuration System

This example is identical to the previous one, but instead of instantiating everything directly in Python it utilizes the Hydra configuration system.

"""
Contains an example showing how to train an observation normalized maze environment
instantiated from a hydra config with stable-baselines.
"""

from maze.core.utils.config_utils import make_env_from_hydra
from maze.core.wrappers.no_dict_spaces_wrapper import NoDictSpacesWrapper
from maze.core.wrappers.observation_normalization.observation_normalization_utils import \
    obtain_normalization_statistics

from stable_baselines3 import A2C

# ENV INSTANTIATION: from hydra config file
# -----------------------------------------
env = make_env_from_hydra("conf")

# OBSERVATION NORMALIZATION
# -------------------------

# next we estimate the normalization statistics by
# (1) collecting observations by randomly sampling 1000 transitions from the environment
# (2) computing the statistics according to the defined normalization strategy
normalization_statistics = obtain_normalization_statistics(env, n_samples=1000)
env.set_normalization_statistics(normalization_statistics)

# stable-baselines does not support dict spaces so we have to remove them
env = NoDictSpacesWrapper(env)

# TRAINING AND ROLLOUT (remains unchanged)
# ----------------------------------------

model = A2C('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)

obs = env.reset()
for i in range(1000):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()

This is the corresponding Hydra config:

# @package _global_

# defines environment to instantiate
env:
  _target_: maze.core.wrappers.maze_gym_env_wrapper.GymMazeEnv
  env: "CartPole-v0"

# defines wrappers to apply
wrappers:
  # Observation Normalization Wrapper
  maze.core.wrappers.observation_normalization.observation_normalization_wrapper.ObservationNormalizationWrapper:
    default_strategy: maze.normalization_strategies.MeanZeroStdOneObservationNormalizationStrategy
    default_strategy_config:
      clip_range: [~, ~]
      axis: 0
    default_statistics: ~
    statistics_dump: statistics.pkl
    sampling_policy:
      _target_: maze.core.agent.random_policy.RandomPolicy
    exclude: ~
    manual_config: ~

Where to Go Next

Plain Python Training Example (high-level)

This tutorial demonstrates how to train an A2C agent with Maze in plain Python utilizing RunContext. In the process it introduces and explains some of Maze's most important components and concepts, going from a plain vanilla setup to an increasingly customized configuration.

This is complementary to the article on low-level training in plain Python, which guides through the same setup (but without RunContext support).

Environment Setup

We will first prepare our environment for use with Maze. In order to use Maze’s parallelization capabilities, it is necessary to define a factory function that returns a MazeEnv of your environment. This is easily done for Gym environments:

def cartpole_env_factory():
    """ Env factory for the cartpole MazeEnv """
    # Registered gym environments can be instantiated first and then provided to GymMazeEnv:
    cartpole_env = gym.make("CartPole-v0")
    maze_env = GymMazeEnv(env=cartpole_env)

    # Another possibility is to supply the gym env string to GymMazeEnv directly:
    # maze_env = GymMazeEnv(env="CartPole-v0")

    return maze_env

env = cartpole_env_factory()

If you have your own environment (that is not a gym.Env) you must transform it into a MazeEnv yourself, as is shown here, and have your factory return that. If it is a custom Gym env, it can be wrapped with GymMazeEnv as shown above.

Algorithm Setup

We use A2C for this example. The algorithm_config for A2C can be found here. The hyperparameters will be supplied to Maze with an algorithm-dependent AlgorithmConfig object. The one for A2C is A2CAlgorithmConfig. We will use the default parameters, which can also be found here.

algorithm_config = A2CAlgorithmConfig(
    n_epochs=5,
    epoch_length=25,
    patience=15,
    critic_burn_in_epochs=0,
    n_rollout_steps=100,
    lr=0.0005,
    gamma=0.98,
    gae_lambda=1.0,
    policy_loss_coef=1.0,
    value_loss_coef=0.5,
    entropy_coef=0.00025,
    max_grad_norm=0.0,
    device='cpu',
    rollout_evaluator=RolloutEvaluator(
        eval_env=SequentialVectorEnv([cartpole_env_factory]),
        n_episodes=1,
        model_selection=None,
        deterministic=True
    )
)

Having defined our environment and configured our algorithm we’re ready to train:

rc = maze.api.run_context.RunContext(env=cartpole_env_factory, algorithm=algorithm_config)
rc.train()
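
After training, the trained policy is accessible via rc.policy. As a quick sanity check we can roll it out manually; this is a minimal sketch assuming the standard Maze Policy.compute_action interface:

# minimal manual rollout with the trained policy (sketch)
env = cartpole_env_factory()
obs = env.reset()
done = False
while not done:
    action = rc.policy.compute_action(obs, maze_state=None, env=None, actor_id=None, deterministic=True)
    obs, reward, done, info = env.step(action)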

Custom Model Setup

However, it can be advisable to create customized networks taking full advantage of the available data. For this reason Maze supports plugging in customized policy and value networks.

Hence, our goal is to train an agent with A2C using customized policy and critic networks:

rc = maze.api.run_context.RunContext(
    env=cartpole_env_factory,
    algorithm=algorithm_config,
    policy=...,
    critic=...
)
rc.train()

Here we will pay special attention to the format required by Maze. When creating your own models, it is important to know three things:

  1. Maze works with dictionaries throughout, which means that arguments for the constructor and the input and return values of the forward method are dicts with user-defined keys. In a nutshell, instances of MazeEnv can have different steps indicating the currently active task. Each step is associated with a policy, so an environment with different steps can also have different policies. By default, environments have only step 0. The required format for models is explained in more detail here.

  2. Policy networks and value network constructors have required arguments: for policy nets, these are obs_shapes and action_logit_shapes; for value nets, it is obs_shapes.

  3. Policies and critics are not passed directly, but via composer objects - i.e. classes of type BasePolicyComposer or CriticComposerInterface, respectively. Such composer classes are able to generate policy instances.

Policy Customization

To instantiate e.g. a ProbabilisticPolicyComposer, we require the following arguments:

  1. The policy network.

  2. A specification of the probability distribution as an instance of DistributionMapper.

  3. Dictionaries describing the action and observation spaces.

  4. The numbers of agents active in the corresponding steps.

  5. The IDs of substeps in which agents do not share the same networks.

Policy Network. First, let us create the policy network as a simple linear mapping network with the required constraints:

class CartpolePolicyNet(nn.Module):
    """ Simple linear policy net for demonstration purposes. """
    def __init__(self, obs_shapes: Dict[str, Sequence[int]], action_logit_shapes: Dict[str, Sequence[int]]):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(
                in_features=obs_shapes['observation'][0],
                out_features=action_logit_shapes['action'][0]
            )
        )

    def forward(self, x_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        # Since x_dict has to be a dictionary in Maze, we extract the input for the network.
        x = x_dict['observation']

        # Do the forward pass.
        logits = self.net(x)

        # Since the return value has to be a dict again, put the
        # forward pass result into a dict with the correct key.
        logits_dict = {'action': logits}

        return logits_dict

# Instantiate our custom policy net.
policy_net = CartpolePolicyNet(
    obs_shapes={'observation': env.observation_space.spaces['observation'].shape},
    action_logit_shapes={'action': (env.action_space.spaces['action'].n,)}
)
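
To see the dict-in/dict-out convention from point 1 in action, we can push a batch of dummy observations through the custom policy network (this sketch assumes the env and policy_net defined above):

import torch

# batch of 8 dummy CartPole observations, keyed as the network expects
dummy_obs = {'observation': torch.zeros(8, env.observation_space.spaces['observation'].shape[0])}
logits_dict = policy_net(dummy_obs)
print(logits_dict['action'].shape)  # torch.Size([8, 2]) -- one logit per discrete action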

Optionally, we can wrap our policy network with a TorchModelBlock, which applies shape normalization (see ShapeNormalizationBlock):

policy_net = TorchModelBlock(
    in_keys='observation',
    out_keys='action',
    in_shapes=env.observation_space.spaces['observation'].shape,
    in_num_dims=[2],
    out_num_dims=2,
    net=policy_net
)

Since Maze offers the capability of supporting multiple actors, we need to map each policy_net to its corresponding actor ID. As we have only one, this mapping is trivial:

policy_networks = [policy_net]  # Alternative: {0: policy_net}

Policy Distribution. Initializing the proper probability distribution for the policy is rather easy with Maze. Simply provide the DistributionMapper with the environment's action space and you automatically get the proper distribution to use.

distribution_mapper = DistributionMapper(action_space=env.action_space, distribution_mapper_config={})

Optionally, you can specify a different distribution with the distribution_mapper_config argument. Using a CategoricalProbabilityDistribution for a discrete action space would be done with

distribution_mapper = DistributionMapper(
    action_space=action_space,
    distribution_mapper_config=[{
        "action_space": gym.spaces.Discrete,
        "distribution": "maze.distributions.categorical.CategoricalProbabilityDistribution"}])

Since the standard distribution taken by Maze for a discrete action space is a Categorical distribution anyway (as can be seen here), both definitions of distribution_mapper have the same result. For more information about the DistributionMapper, see Action Spaces and Distributions.


Policy Composer. The remaining arguments (action and observation space dictionaries, numbers of agents per step, ID of substeps with non-shared networks) are trivial in our case, as they can easily be derived from an instance of our environment. We can thus now set up a policy composer with our custom policy:

policy_composer = ProbabilisticPolicyComposer(
    action_spaces_dict=env.action_spaces_dict,
    observation_spaces_dict=env.observation_spaces_dict,
    distribution_mapper=distribution_mapper,
    networks=policy_networks,
    # We have only one agent and network, thus this is an empty list.
    substeps_with_separate_agent_nets=[],
    # We have only one step and one agent.
    agent_counts_dict={0: 1}
)

Once we have our policy composer, we are ready to train.

rc = maze.api.run_context.RunContext(
    env=cartpole_env_factory,
    algorithm=algorithm_config,
    policy=policy_composer
)
rc.train()

Critic Customization

Customizing the critic can be done quite similarly to the policy customization, the main difference being that we do not need a probability distribution.

First we define our value network.

class CartpoleValueNet(nn.Module):
    """ Simple linear value net for demonstration purposes. """
    def __init__(self, obs_shapes: Dict[str, Sequence[int]]):
        super().__init__()
        self.value_net = nn.Sequential(nn.Linear(in_features=obs_shapes['observation'][0], out_features=1))

    def forward(self, x_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        """ Forward method. """
        # The same as for the policy can be said about the value
        # net: Inputs and outputs have to be dicts.
        x = x_dict['observation']

        value = self.value_net(x)

        value_dict = {'value': value}
        return value_dict

We instantiate our value network and wrap it in a TorchModelBlock as done for the policy network.

value_networks = {
    0: TorchModelBlock(
        in_keys='observation', out_keys='value',
        in_shapes=env.observation_space.spaces['observation'].shape,
        in_num_dims=[2],
        out_num_dims=2,
        net=CartpoleValueNet(obs_shapes={'observation': env.observation_space.spaces['observation'].shape})
    )
}

Instantiating the Critic. This step is analogous to the instantiation of the policy above. In Maze, critics can have different forms (see Value Functions (Critics)). Here, we use a simple shared critic. Shared means that the same critic will be used for all sub-steps (in a multi-step setting) and all actors. Since we only have one actor in this example and are in a one-step setting, the TorchSharedStateCritic reduces to a vanilla StateCritic (aka a state-dependent value function).

critic_composer = SharedStateCriticComposer(
    observation_spaces_dict=env.observation_spaces_dict,
    agent_counts_dict={0: 1},
    networks=value_networks,
    stack_observations=True
)

Training

Having instantiated customized policy and critic composers we can train our model:

rc = run_context.RunContext(
    env=cartpole_env_factory,
    algorithm=algorithm_config,
    policy=policy_composer,
    critic=critic_composer
)
rc.train()

Distributed Training

If we want to train in a distributed manner, it is sufficient to pick the appropriate runner. For now, we might want to parallelize by distributing our environments over several processes. This can be done by utilizing local runners, whose utilization is straightforward:

algorithm_config.rollout_evaluator.eval_env = SubprocVectorEnv([cartpole_env_factory])
rc = run_context.RunContext(
    env=cartpole_env_factory,
    algorithm=algorithm_config,
    policy=policy_composer,
    critic=critic_composer,
    runner="local"
)
rc.train(n_epochs=1)

Evaluation

We can evaluate our performance with a RolloutEvaluator. In order for this to work with our environment, we wrap it with a LogStatsWrapper to ensure it has the logging capabilities required by the RolloutEvaluator.

evaluator = RolloutEvaluator(
    eval_env=LogStatsWrapper.wrap(cartpole_env_factory(), logging_prefix="eval"),
    n_episodes=3,
    model_selection=None
)
evaluator.evaluate(rc.policy)

Full Python Code

Here is the code without documentation for easier copy-pasting:

"""
Training and rollout of a policy in plain Python.
"""

from typing import Sequence, Dict

import gym
import torch
import torch.nn as nn

from maze.api.utils import RunMode
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
from maze.train.parallelization.vector_env.subproc_vector_env import SubprocVectorEnv

from maze.utils.log_stats_utils import setup_logging

from maze.core.agent.torch_actor_critic import TorchActorCritic
from maze.train.trainers.a2c.a2c_trainer import A2C
from maze.train.trainers.common.model_selection.best_model_selection import BestModelSelection
from maze.train.parallelization.vector_env.sequential_vector_env import SequentialVectorEnv
from maze.train.trainers.common.evaluators.rollout_evaluator import RolloutEvaluator
from maze.core.wrappers.log_stats_wrapper import LogStatsWrapper
from maze.perception.models.critics.shared_state_critic_composer import SharedStateCriticComposer
from maze.train.trainers.a2c.a2c_algorithm_config import A2CAlgorithmConfig
from maze.api import run_context
from maze.distributions.distribution_mapper import DistributionMapper
from maze.perception.blocks.general.torch_model_block import TorchModelBlock
from maze.perception.models.policies import ProbabilisticPolicyComposer


def cartpole_env_factory() -> GymMazeEnv:
    """ Env factory for the cartpole MazeEnv """
    # Registered gym environments can be instantiated first and then provided to GymMazeEnv:
    cartpole_env = gym.make("CartPole-v0")
    maze_env = GymMazeEnv(env=cartpole_env)

    # Another possibility is to supply the gym env string to GymMazeEnv directly:
    # maze_env = GymMazeEnv(env="CartPole-v0")

    return maze_env


class CartpolePolicyNet(nn.Module):
    """ Simple linear policy net for demonstration purposes. """

    def __init__(self, obs_shapes: Dict[str, Sequence[int]], action_logit_shapes: Dict[str, Sequence[int]]):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(
                in_features=obs_shapes['observation'][0],
                out_features=action_logit_shapes['action'][0]
            )
        )

    def forward(self, x_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        # Since x_dict has to be a dictionary in Maze, we extract the input for the network.
        x = x_dict['observation']

        # Do the forward pass.
        logits = self.net(x)

        # Since the return value has to be a dict again, put the forward pass result into a dict with the
        # correct key.
        logits_dict = {'action': logits}

        return logits_dict


class CartpoleValueNet(nn.Module):
    """ Simple linear value net for demonstration purposes. """
    def __init__(self, obs_shapes: Dict[str, Sequence[int]]):
        super().__init__()
        self.value_net = nn.Sequential(nn.Linear(in_features=obs_shapes['observation'][0], out_features=1))

    def forward(self, x_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        """ Forward method. """
        # The same as for the policy can be said about the value net. Inputs and outputs have to be dicts.
        x = x_dict['observation']

        value = self.value_net(x)

        value_dict = {'value': value}
        return value_dict


def train(n_epochs: int) -> int:
    """
    Trains agent in pure Python.

    :param n_epochs: Number of epochs to train.

    :return: 0 if successful.

    """

    # Environment setup
    # -----------------

    env = cartpole_env_factory()

    # Algorithm setup
    # ---------------

    algorithm_config = A2CAlgorithmConfig(
        n_epochs=5,
        epoch_length=25,
        patience=15,
        critic_burn_in_epochs=0,
        n_rollout_steps=100,
        lr=0.0005,
        gamma=0.98,
        gae_lambda=1.0,
        policy_loss_coef=1.0,
        value_loss_coef=0.5,
        entropy_coef=0.00025,
        max_grad_norm=0.0,
        device='cpu',
        rollout_evaluator=RolloutEvaluator(
            eval_env=SequentialVectorEnv([cartpole_env_factory]),
            n_episodes=1,
            model_selection=None,
            deterministic=True
        )
    )

    # Custom model setup
    # ------------------

    # Policy customization
    # ^^^^^^^^^^^^^^^^^^^^

    # Policy network.
    policy_net = CartpolePolicyNet(
        obs_shapes={'observation': env.observation_space.spaces['observation'].shape},
        action_logit_shapes={'action': (env.action_space.spaces['action'].n,)}
    )
    policy_networks = [policy_net]

    # Policy distribution.
    distribution_mapper = DistributionMapper(action_space=env.action_space, distribution_mapper_config={})

    # Policy composer.
    policy_composer = ProbabilisticPolicyComposer(
        action_spaces_dict=env.action_spaces_dict,
        observation_spaces_dict=env.observation_spaces_dict,
        # Derive distribution from environment's action space.
        distribution_mapper=distribution_mapper,
        networks=policy_networks,
        # We have only one agent and network, thus this is an empty list.
        substeps_with_separate_agent_nets=[],
        # We have only one step and one agent.
        agent_counts_dict={0: 1}
    )

    # Critic customization
    # ^^^^^^^^^^^^^^^^^^^^

    # Value networks.
    value_networks = {
        0: TorchModelBlock(
            in_keys='observation', out_keys='value',
            in_shapes=env.observation_space.spaces['observation'].shape,
            in_num_dims=[2],
            out_num_dims=2,
            net=CartpoleValueNet({'observation': env.observation_space.spaces['observation'].shape})
        )
    }

    # Critic composer.
    critic_composer = SharedStateCriticComposer(
        observation_spaces_dict=env.observation_spaces_dict,
        agent_counts_dict={0: 1},
        networks=value_networks,
        stack_observations=True
    )

    # Training
    # ^^^^^^^^

    rc = run_context.RunContext(
        env=cartpole_env_factory,
        algorithm=algorithm_config,
        policy=policy_composer,
        critic=critic_composer,
        runner="dev"
    )
    rc.train(n_epochs=n_epochs)

    # Distributed training
    # ^^^^^^^^^^^^^^^^^^^^

    algorithm_config.rollout_evaluator.eval_env = SubprocVectorEnv([cartpole_env_factory])
    rc = run_context.RunContext(
        env=cartpole_env_factory,
        algorithm=algorithm_config,
        policy=policy_composer,
        critic=critic_composer,
        runner="local"
    )
    rc.train(n_epochs=n_epochs)

    # Evaluation
    # ^^^^^^^^^^

    print("-----------------")
    evaluator = RolloutEvaluator(
        eval_env=LogStatsWrapper.wrap(cartpole_env_factory(), logging_prefix="eval"),
        n_episodes=1,
        model_selection=None
    )
    evaluator.evaluate(rc.policy)

    return 0


if __name__ == '__main__':
    train(n_epochs=1)

Plain Python Training Example (low-level)

This tutorial demonstrates how to train an A2C agent with Maze in plain Python without utilizing RunContext. In the process it introduces and explains some of Maze's most important components and concepts.

This is complementary to the article on high-level training in plain Python, which guides through the same setup (but with RunContext support).

Environment Setup

We will first prepare our environment for use with Maze. In order to use Maze’s parallelization capabilities, it is necessary to define a factory function that returns a MazeEnv of your environment. This is easily done for Gym environments:

def cartpole_env_factory():
    """ Env factory for the cartpole MazeEnv """
    # Registered gym environments can be instantiated first and then provided to GymMazeEnv:
    cartpole_env = gym.make("CartPole-v0")
    maze_env = GymMazeEnv(env=cartpole_env)

    # Another possibility is to supply the gym env string to GymMazeEnv directly:
    # maze_env = GymMazeEnv(env="CartPole-v0")

    return maze_env

If you have your own environment (that is not a gym.Env) you must transform it into a MazeEnv yourself, as is shown here, and have your factory return that. If it is a custom Gym env, it can be wrapped with GymMazeEnv as shown above.

We instantiate one environment. This will be used for convenient access to observation and action spaces later.

env = cartpole_env_factory()
observation_space = env.observation_space
action_space = env.action_space

Model Setup

Now that the environment setup is done, let us develop the policy and value networks that will be used. We will pay special attention to emphasize the format required by Maze. When creating your own models, it is important to know two things:

  1. Maze works with dictionaries throughout, which means that arguments for the constructor and the input and return values of the forward method are dicts with user-defined keys.

  2. Policy networks and value network constructors have required arguments: for policy nets, these are obs_shapes and action_logit_shapes; for value nets, it is obs_shapes.

The required format is explained in more detail here. With this in mind, let us create a simple linear mapping network with the required constraints:

class CartpolePolicyNet(nn.Module):
    """ Simple linear policy net for demonstration purposes. """
    def __init__(self, obs_shapes: Sequence[int], action_logit_shapes: Sequence[int]):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features=obs_shapes[0], out_features=action_logit_shapes[0])
        )

    def forward(self, x_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        # Since x_dict has to be a dictionary in Maze, we extract the input for the network.
        x = x_dict['observation']

        # Do the forward pass.
        logits = self.net(x)

        # Since the return value has to be a dict again, put the
        # forward pass result into a dict with the  correct key.
        logits_dict = {'action': logits}

        return logits_dict

# Instantiate our custom policy net.
policy_net = CartpolePolicyNet(
    obs_shapes=env.observation_space.spaces['observation'].shape,
    action_logit_shapes=(env.action_space.spaces['action'].n,)
)

and

class CartpoleValueNet(nn.Module):
    """ Simple linear value net for demonstration purposes. """
    def __init__(self, obs_shapes: Sequence[int]):
        super().__init__()
        self.value_net = nn.Sequential(nn.Linear(in_features=obs_shapes[0], out_features=1))


    def forward(self, x_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        """ Forward method. """
        # The same as for the policy can be said about the value
        # net: Inputs and outputs have to be dicts.
        x = x_dict['observation']

        value = self.value_net(x)

        value_dict = {'value': value}
        return value_dict

Policy Setup

For a policy, we need a parametrization for the policy (provided by the policy network) and a probability distribution we can sample from. We will subsequently define and instantiate each of these.

Policy Network

Instantiate a policy with the correct shapes of observation and action spaces.

policy_net = CartpolePolicyNet(
    obs_shapes=observation_space.spaces['observation'].shape,
    action_logit_shapes=(action_space.spaces['action'].n,))

We can use one of Maze's capabilities, shape normalization (see ShapeNormalizationBlock), with these models by wrapping them with a TorchModelBlock.

maze_wrapped_policy_net = TorchModelBlock(
    in_keys='observation', out_keys='action',
    in_shapes=observation_space.spaces['observation'].shape, in_num_dims=[2],
    out_num_dims=2, net=policy_net)

Since Maze offers the capability of supporting multiple actors, we need to map each policy_net to its corresponding actor ID. As we have only one policy, this is a trivial mapping:

policy_networks = {0: maze_wrapped_policy_net}
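
In a hypothetical multi-step setting, this dictionary would simply hold one wrapped policy network per sub-step ID. For illustration only, the single CartPole net is reused for both keys below; a real multi-step environment would define one network per sub-step:

# Illustration only: in a two-sub-step setting each sub-step ID maps to its own wrapped net.
policy_networks_multi_step = {0: maze_wrapped_policy_net, 1: maze_wrapped_policy_net}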

Policy Distribution

Initializing the proper probability distribution for the policy is rather easy with Maze. Simply provide the DistributionMapper with the action space and you automatically get the proper distribution to use.

distribution_mapper = DistributionMapper(action_space=action_space, distribution_mapper_config={})

Optionally, you can specify a different distribution with the distribution_mapper_config argument. Using a Categorical distribution for a discrete action space would be done with

distribution_mapper = DistributionMapper(
    action_space=action_space,
    distribution_mapper_config=[{
        "action_space": gym.spaces.Discrete,
        "distribution": "maze.distributions.categorical.CategoricalProbabilityDistribution"}])

Since the standard distribution taken by Maze for a discrete action space is a Categorical distribution anyway (as can be seen here), both definitions of the distribution_mapper have the same result. For more information about the DistributionMapper, see Action Spaces and Distributions.
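
Instead of hard-coding the number of action logits from the action space, the required logits shape can also be queried from the DistributionMapper itself. This is a minimal sketch, assuming your Maze version exposes the required_logits_shape helper described in Action Spaces and Distributions:

# Query the logits shape required for the 'action' head (assumed helper method;
# see the Action Spaces and Distributions reference for your Maze version).
action_logit_shapes = {'action': distribution_mapper.required_logits_shape('action')}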

Instantiating the Policy

We have both necessary ingredients to define a policy: a parametrization, given by the policy network, and a distribution. With these, we can instantiate a policy. This is done with the TorchPolicy class:

torch_policy = TorchPolicy(networks=policy_networks,
                           distribution_mapper=distribution_mapper,
                           device='cpu')

Critic Setup

The setup of a critic (or value function) is similar to the setup of a policy, the main difference being that we do not need a probability distribution.

Value Network

value_net = CartpoleValueNet(obs_shapes=observation_space.spaces['observation'].shape)

maze_wrapped_value_net = TorchModelBlock(
    in_keys='observation', out_keys='value',
    in_shapes=observation_space.spaces['observation'].shape, in_num_dims=[2],
    out_num_dims=2, net=value_net)

value_networks = {0: maze_wrapped_value_net}

Instantiating the Critic

This step is analogous to the instantiation of the policy above. In Maze, critics can have different forms (see Value Functions (Critics)). Here, we use a simple shared critic. Shared means that the same critic will be used for all sub-steps (in a multi-step setting) and all actors. Since we only have one actor in this example and are in a one-step setting, the TorchSharedStateCritic reduces to a vanilla StateCritic (aka a state-dependent value function).

torch_critic = TorchSharedStateCritic(networks=value_networks, obs_spaces_dict=env.observation_spaces_dict,
                                      device='cpu', stack_observations=False)

Initializing the ActorCritic Model.

In Maze, policies and critics are encapsulated by an ActorCritic model. Details about this can be found in Actor-Critics. We will use A2C to train the cartpole env. The correct ActorCritic model to use for A2C is the TorchActorCritic:

actor_critic_model = TorchActorCritic(policy=torch_policy, critic=torch_critic, device='cpu')

Trainer Setup

The last steps will be the instantiations of the algorithm and corresponding trainer. We use A2C for this example. The algorithm_config for A2C can be found here. The hyperparameters will be supplied to Maze with an algorithm-dependent AlgorithmConfig object. The one for A2C is A2CAlgorithmConfig. We will use the default parameters, which can also be found here.

algorithm_config = A2CAlgorithmConfig(
    n_epochs=5,
    epoch_length=25,
    patience=15,
    critic_burn_in_epochs=0,
    n_rollout_steps=100,
    lr=0.0005,
    gamma=0.98,
    gae_lambda=1.0,
    policy_loss_coef=1.0,
    value_loss_coef=0.5,
    entropy_coef=0.00025,
    max_grad_norm=0.0,
    device='cpu',
    rollout_evaluator=RolloutEvaluator(
        eval_env=SequentialVectorEnv([cartpole_env_factory]),
        n_episodes=1,
        model_selection=None,
        deterministic=True
    )
)

In order to use the distributed trainers, we create vector environments (i.e., multiple environment instances encapsulated to be stepped simultaneously) using the environment factory function:

train_envs = SequentialVectorEnv(
    [cartpole_env_factory for _ in range(2)], logging_prefix="train")
eval_envs = SequentialVectorEnv(
    [cartpole_env_factory for _ in range(2)], logging_prefix="eval")

(In this case, we create sequential vector environments, i.e. all environment instances are located in the main process and stepped sequentially. When we are ready to scale the training, we might want to use e.g. sub-process distributed vector environments.)
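
When scaling up, the sequential vector env can be swapped for a sub-process based one. A hedged sketch, assuming a SubprocVectorEnv class located next to SequentialVectorEnv in maze.train.parallelization.vector_env (check the parallelization docs of your Maze version):

# Assumed class and module path; consult the Maze parallelization documentation.
from maze.train.parallelization.vector_env.subproc_vector_env import SubprocVectorEnv

train_envs = SubprocVectorEnv(
    [cartpole_env_factory for _ in range(4)], logging_prefix="train")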

For this example, we want to save the parameters of the best model in terms of mean achieved reward. This is done with the BestModelSelection class, an instance of which will be provided to the trainer.

model_selection = BestModelSelection(dump_file="params.pt", model=actor_critic_model)

We can now instantiate an A2C trainer:

a2c_trainer = A2C(
    rollout_generator=RolloutGenerator(train_envs),
    evaluator=algorithm_config.rollout_evaluator,
    algorithm_config=algorithm_config,
    model=actor_critic_model,
    model_selection=model_selection
)

Train the Agent

Before starting the training, we will enable logging by calling

log_dir = '.'
setup_logging(job_config=None, log_dir=log_dir)

Now, we can train the agent.

a2c_trainer.train()

To get an out-of-sample estimate of our performance, evaluate on the evaluation envs:

a2c_trainer.evaluate(deterministic=False, repeats=1)

Full Python Code

Here is the code without documentation for easier copy-pasting:

""" Rollout of a policy in plain Python. """

from typing import Dict, Sequence

import gym
import torch
import torch.nn as nn

from maze.core.agent.torch_actor_critic import TorchActorCritic
from maze.core.agent.torch_policy import TorchPolicy
from maze.core.agent.torch_state_critic import TorchSharedStateCritic
from maze.core.rollout.rollout_generator import RolloutGenerator
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
from maze.distributions.distribution_mapper import DistributionMapper
from maze.perception.blocks.general.torch_model_block import TorchModelBlock
from maze.train.parallelization.vector_env.sequential_vector_env import SequentialVectorEnv
from maze.train.trainers.a2c.a2c_algorithm_config import A2CAlgorithmConfig
from maze.train.trainers.a2c.a2c_trainer import A2C
from maze.train.trainers.common.evaluators.rollout_evaluator import RolloutEvaluator
from maze.train.trainers.common.model_selection.best_model_selection import BestModelSelection
from maze.utils.log_stats_utils import setup_logging


# Environment Setup
# =================

# Environment Factory
# -------------------
# Define environment factory
def cartpole_env_factory():
    """ Env factory for the cartpole MazeEnv """
    # Registered gym environments can be instantiated first and then provided to GymMazeEnv:
    cartpole_env = gym.make("CartPole-v0")
    maze_env = GymMazeEnv(env=cartpole_env)

    # Another possibility is to supply the gym env string to GymMazeEnv directly:
    maze_env = GymMazeEnv(env="CartPole-v0")

    return maze_env


# Model Setup
# ===========
# Policy Network
# --------------
class CartpolePolicyNet(nn.Module):
    """ Simple linear policy net for demonstration purposes. """

    def __init__(self, obs_shapes: Dict[str, Sequence[int]], action_logit_shapes: Dict[str, Sequence[int]]):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features=obs_shapes['observation'][0],
                      out_features=action_logit_shapes['action'][0])
        )

    def forward(self, x_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        # Since x_dict has to be a dictionary in Maze, we extract the input for the network.
        x = x_dict['observation']

        # Do the forward pass.
        logits = self.net(x)

        # Since the return value has to be a dict again, put the forward pass result into a dict with the
        # correct key.
        logits_dict = {'action': logits}

        return logits_dict


# Value Network
# -------------
class CartpoleValueNet(nn.Module):
    """ Simple linear value net for demonstration purposes. """

    def __init__(self, obs_shapes: Dict[str, Sequence[int]]):
        super().__init__()
        self.value_net = nn.Sequential(nn.Linear(in_features=obs_shapes['observation'][0], out_features=1))

    def forward(self, x_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        """ Forward method. """
        # The same as for the policy can be said about the value net. Inputs and outputs have to be dicts.
        x = x_dict['observation']

        value = self.value_net(x)

        value_dict = {'value': value}
        return value_dict


def train(n_epochs):
    # Instantiate one environment. This will be used for convenient access to observation
    # and action spaces.
    env = cartpole_env_factory()
    observation_space = env.observation_space
    action_space = env.action_space

    # Policy Setup
    # ------------

    # Policy Network
    # ^^^^^^^^^^^^^^
    # Instantiate policy with the correct shapes of observation and action spaces.
    policy_net = CartpolePolicyNet(
        obs_shapes={'observation': observation_space.spaces['observation'].shape},
        action_logit_shapes={'action': (action_space.spaces['action'].n,)})

    maze_wrapped_policy_net = TorchModelBlock(
        in_keys='observation', out_keys='action',
        in_shapes=observation_space.spaces['observation'].shape, in_num_dims=[2],
        out_num_dims=2, net=policy_net)

    policy_networks = {0: maze_wrapped_policy_net}

    # Policy Distribution
    # ^^^^^^^^^^^^^^^^^^^
    distribution_mapper = DistributionMapper(
        action_space=action_space,
        distribution_mapper_config={})

    # Optionally, you can specify a different distribution with the distribution_mapper_config argument. Using a
    # Categorical distribution for a discrete action space would be done via
    distribution_mapper = DistributionMapper(
        action_space=action_space,
        distribution_mapper_config=[{
            "action_space": gym.spaces.Discrete,
            "distribution": "maze.distributions.categorical.CategoricalProbabilityDistribution"}])

    # Instantiating the Policy
    # ^^^^^^^^^^^^^^^^^^^^^^^^
    torch_policy = TorchPolicy(networks=policy_networks, distribution_mapper=distribution_mapper, device='cpu')

    # Value Function Setup
    # --------------------

    # Value Network
    # ^^^^^^^^^^^^^
    value_net = CartpoleValueNet(obs_shapes={'observation': observation_space.spaces['observation'].shape})

    maze_wrapped_value_net = TorchModelBlock(
        in_keys='observation', out_keys='value',
        in_shapes=observation_space.spaces['observation'].shape, in_num_dims=[2],
        out_num_dims=2, net=value_net)

    value_networks = {0: maze_wrapped_value_net}

    # Instantiate the Value Function
    # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    torch_critic = TorchSharedStateCritic(networks=value_networks, obs_spaces_dict=env.observation_spaces_dict,
                                          device='cpu', stack_observations=False)

    # Initializing the ActorCritic Model.
    # -----------------------------------
    actor_critic_model = TorchActorCritic(policy=torch_policy, critic=torch_critic, device='cpu')

    # Instantiating the Trainer
    # =========================

    algorithm_config = A2CAlgorithmConfig(
        n_epochs=n_epochs,
        epoch_length=25,
        patience=15,
        critic_burn_in_epochs=0,
        n_rollout_steps=100,
        lr=0.0005,
        gamma=0.98,
        gae_lambda=1.0,
        policy_loss_coef=1.0,
        value_loss_coef=0.5,
        entropy_coef=0.00025,
        max_grad_norm=0.0,
        device='cpu',
        rollout_evaluator=RolloutEvaluator(
            eval_env=SequentialVectorEnv([cartpole_env_factory]),
            n_episodes=1,
            model_selection=None,
            deterministic=True
        )
    )

    # Distributed Environments
    # ------------------------
    # In order to use the distributed trainers, the previously created env factory is supplied to one of Maze's
    # distribution classes:
    train_envs = SequentialVectorEnv([cartpole_env_factory for _ in range(2)], logging_prefix="train")
    eval_envs = SequentialVectorEnv([cartpole_env_factory for _ in range(2)], logging_prefix="eval")

    # Initialize best model selection.
    model_selection = BestModelSelection(dump_file="params.pt", model=actor_critic_model)

    a2c_trainer = A2C(rollout_generator=RolloutGenerator(train_envs),
                      evaluator=algorithm_config.rollout_evaluator,
                      algorithm_config=algorithm_config,
                      model=actor_critic_model,
                      model_selection=model_selection)

    # Train the Agent
    # ===============
    # Before starting the training, we will enable logging by calling
    log_dir = '.'
    setup_logging(job_config=None, log_dir=log_dir)

    # Now, we can train the agent.
    a2c_trainer.train()

    return 0


if __name__ == '__main__':
    train(n_epochs=5)

Tensorboard and Command Line Logging

This page gives a brief overview of the Tensorboard and command line logging facilities of Maze. We will show examples based on the cutting-2D Maze environment to make things a bit more interesting.

To understand the underlying concepts we recommend reading the sections on event and KPI logging as well as on the Maze event system.

Tensorboard Logging

To watch the training progress with Tensorboard, start it by running:

tensorboard --logdir outputs/

and view it with your browser at http://localhost:6006/.

You will get an output similar to the one shown in the image below.

_images/tb_collapsed.png

To keep everything organized and to avoid having to browse through tons of pages, we group the contained items into semantically connected sections:

  • Since Maze allows you to use different environments for training and evaluation, each logging section has a train_ or eval_ prefix to show if the corresponding stats were logged as part of the training or the evaluation environment.

  • The BaseEnvEvents sections (i.e., eval_BaseEnvEvents and train_BaseEnvEvents) contain general statistics such as rewards or step counts. These sections are always present, independent of the environment used.

  • Other sections are specific to the environment used. In the example above, these are the CuttingEvents and the InventoryEvents.

  • Finally, there is one additional section containing stats of the trainer used, called train_NameOfTrainerEvents. It contains statistics such as the policy loss, gradient norm or value loss. This section is not present for the evaluation environment.

The gallery below shows some additional useful examples and features of the Maze Tensorboard log (click the images to view them in full size):

  • Logging of component-specific events in the SCALARS tab (useful for understanding the environment): _images/tb_events1.png

  • Logging of the training command and the complete hydra job config in the TEXT tab (useful for reproducing experiments): _images/tb_text.png

  • Logging of action sampling distributions in the IMAGES tab (useful for understanding the agent’s behaviour): _images/tb_images.png

  • Logging of observation distributions in the DISTRIBUTIONS and HISTOGRAMS tab (useful for analysing observations): _images/tb_distributions.png

Command Line Logging

Whenever you start a training run you will also get a command line output similar to the one shown below. Analogously to the Tensorboard log, Maze distinguishes between train and eval outputs and groups the items into semantically connected output blocks.

 step|path                                                                        |               value
=====|============================================================================|====================
    1|train     MultiStepActorCritic..time_rollout          ······················|               1.091
    1|train     MultiStepActorCritic..learning_rate         ······················|               0.000
    1|train     MultiStepActorCritic..policy_loss           0                     |              -0.000
    1|train     MultiStepActorCritic..policy_grad_norm      0                     |               0.001
    1|train     MultiStepActorCritic..policy_entropy        0                     |               1.593
    1|train     MultiStepActorCritic..policy_loss           1                     |              -0.000
    1|train     MultiStepActorCritic..policy_grad_norm      1                     |               0.008
    1|train     MultiStepActorCritic..policy_entropy        1                     |               0.295
    1|train     MultiStepActorCritic..critic_value          0                     |              -0.199
    1|train     MultiStepActorCritic..critic_value_loss     0                     |             116.708
    1|train     MultiStepActorCritic..critic_grad_norm      0                     |               0.500
    1|train     MultiStepActorCritic..time_update           ······················|               1.642
    1|train     DiscreteActionEvents  action                substep_0/piece_idx   |  [len:4000, μ:54.8]
    1|train     BaseEnvEvents         reward                median_step_count     |             200.000
    1|train     BaseEnvEvents         reward                mean_step_count       |             200.000
    1|train     BaseEnvEvents         reward                total_step_count      |            4000.000
    1|train     BaseEnvEvents         reward                total_episode_count   |              20.000
    1|train     BaseEnvEvents         reward                episode_count         |              20.000
    1|train     BaseEnvEvents         reward                std                   |               1.465
    1|train     BaseEnvEvents         reward                mean                  |             -71.950
    1|train     BaseEnvEvents         reward                min                   |             -75.000
    1|train     BaseEnvEvents         reward                max                   |             -70.000
    1|train     DiscreteActionEvents  action                substep_1/order       |   [len:4000, μ:0.5]
    1|train     DiscreteActionEvents  action                substep_1/rotation    |   [len:4000, μ:0.5]
    1|train     InventoryEvents       piece_replenished     mean_episode_total    |              71.950
    1|train     InventoryEvents       pieces_in_inventory   step_max              |             163.000
    1|train     InventoryEvents       pieces_in_inventory   step_mean             |              69.946
    1|train     CuttingEvents         valid_cut             mean_episode_total    |             200.000
    1|train     BaseEnvEvents         kpi                   max/raw_piece_usage_..|               0.375
    1|train     BaseEnvEvents         kpi                   min/raw_piece_usage_..|               0.350
    1|train     BaseEnvEvents         kpi                   std/raw_piece_usage_..|               0.007
    1|train     BaseEnvEvents         kpi                   mean/raw_piece_usage..|               0.360
Time required for epoch: 19.43s
Update epoch - 1
 step|path                                                                        |               value
=====|============================================================================|====================
    2|eval      DiscreteActionEvents  action                substep_0/piece_idx   |   [len:800, μ:53.2]
    2|eval      BaseEnvEvents         reward                median_step_count     |             200.000
    2|eval      BaseEnvEvents         reward                mean_step_count       |             200.000
    2|eval      BaseEnvEvents         reward                total_step_count      |            1600.000
    2|eval      BaseEnvEvents         reward                total_episode_count   |               8.000
    2|eval      BaseEnvEvents         reward                episode_count         |               4.000
    2|eval      BaseEnvEvents         reward                std                   |               1.414
    2|eval      BaseEnvEvents         reward                mean                  |             -71.000
    2|eval      BaseEnvEvents         reward                min                   |             -73.000
    2|eval      BaseEnvEvents         reward                max                   |             -69.000
    2|eval      DiscreteActionEvents  action                substep_1/order       |    [len:800, μ:0.5]
    2|eval      DiscreteActionEvents  action                substep_1/rotation    |    [len:800, μ:0.5]
    2|eval      InventoryEvents       piece_replenished     mean_episode_total    |              71.000
    2|eval      InventoryEvents       pieces_in_inventory   step_max              |             145.000
    2|eval      InventoryEvents       pieces_in_inventory   step_mean             |              68.031
    2|eval      CuttingEvents         valid_cut             mean_episode_total    |             200.000
    2|eval      BaseEnvEvents         kpi                   max/raw_piece_usage_..|               0.365
    2|eval      BaseEnvEvents         kpi                   min/raw_piece_usage_..|               0.345
    2|eval      BaseEnvEvents         kpi                   std/raw_piece_usage_..|               0.007
    2|eval      BaseEnvEvents         kpi                   mean/raw_piece_usage..|               0.355


Event and KPI Logging

Monitoring only standard metrics such as reward or episode step count is not always sufficiently informative about the agent’s behaviour and the problem at hand. To tackle this issue and to enable better inspection and logging tools for both agents and environments, we introduce an event and key performance indicator (KPI) logging system. It is based on the more general event system and allows us to log and monitor environment-specific metrics.

The figure below shows a conceptual overview of the logging system. In the remainder of this page we will go through the components in more detail.

_images/logging_overview.png

Events

In this section we describe the event logging system from a usage perspective. To understand how this is embedded in the broader context of a Maze environment, we refer to the environments and KPI section of our step-by-step tutorial as well as the dedicated section on the underlying event system.

In general, events can be defined for any component involved in the RL process (e.g., environments, agents, …). They get fired by the respective component whenever they occur during the agent-environment interaction loop. For logging, events are collected and aggregated via the LogStatsWrapper.

To provide full flexibility, Maze allows you to customize which statistics are computed at which stage of the aggregation process via event decorators (step, episode, epoch). The code snippet below contains an example for an event called invalid_piece_selected, borrowed from the cutting 2D tutorial.

class CuttingEvents(ABC):
    """Events related to the cutting process."""

    @define_epoch_stats(np.mean, output_name="mean_episode_total")
    @define_episode_stats(sum)
    @define_step_stats(len)
    def invalid_piece_selected(self):
        """An invalid piece is selected for cutting."""

The snippet defines the following statistics aggregation hierarchy:

Step Statistics [@define_step_stats(len)]: in each environment step events \(e_i\) are collected as lists of events \(\{e_i\}\). The function len associated with the decorator counts how often such an event occurred in the current step \(Stats_{Step}=|\{e_i\}|\) (e.g., length of invalid_piece_selected event list).

Episode Statistics [@define_episode_stats(sum)]: defines how the \(S\) step statistics should be aggregated to episode statistics (e.g., by simply summing them up: \(Stats_{Episode}=\sum^S Stats_{Step}\)).

Epoch Statistics [@define_epoch_stats(np.mean, output_name="mean_episode_total")]: a training epoch consists of N episodes. This stage defines how these N episode statistics are averaged to epoch statistics (e.g., the mean of the contained episodes: \(Stats_{Epoch}=(\sum^N Stats_{Episode})/N\)).
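
To make the aggregation hierarchy concrete, here is a small plain-Python illustration of what the three stages compute for a hypothetical run (the numbers are made up and no Maze API is involved):

import numpy as np

# Hypothetical invalid_piece_selected events fired in three environment steps:
events_per_step = [['e', 'e'], [], ['e', 'e', 'e']]

# Step statistics: len counts the events fired in each step -> [2, 0, 3]
step_stats = [len(events) for events in events_per_step]

# Episode statistics: sum aggregates the step statistics of one episode -> 5
episode_stat = sum(step_stats)

# Epoch statistics: np.mean averages the episode statistics of N episodes (here N=3) -> 5.0
epoch_stat = float(np.mean([episode_stat, 4, 6]))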

The figure below provides a visual summary of the entire event statistics aggregation hierarchy as well as its relation to KPIs, which will be explained in the next section. In Tensorboard and on the command line these events are then logged in dedicated sections (e.g., as CuttingEvents).

_images/logging_hierarchy.png

Key Performance Indicators (KPIs)

In applied RL settings the reward is not always the target metric we aim to optimize from an economic perspective. Sometimes rewards are heavily shaped to get the agent to learn the right behaviour, which makes them hard for humans to interpret. For such cases Maze supports computing and logging additional Key Performance Indicators (KPIs) along with the reward via the KpiCalculator implemented as part of the CoreEnv (like the reward, KPIs are logged as BaseEnvEvents).

In contrast to events, KPIs are computed in aggregated form at the end of an episode, triggered by the reset() method of the LogStatsWrapper. This is why we can compute them in a normalized fashion (e.g., divided by the total number of steps in an episode). Conceptually, KPIs live on the same level as episode statistics in the logging hierarchy (see figure above).
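
As a plain-Python illustration of the idea (this is not the actual KpiCalculator interface, which is covered in the tutorial), a KPI such as a raw piece usage rate could be derived from the aggregated episode events and normalized by the episode length:

# Hypothetical episode aggregates collected via the event system:
raw_pieces_used = 72        # e.g., number of piece_replenished events in the episode
episode_step_count = 200    # total number of steps in the episode

# Normalized KPI, computed once per episode when the environment is reset:
raw_piece_usage_rate = raw_pieces_used / episode_step_count  # -> 0.36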

For further details on how to implement a concrete KPI calculator we refer to the KPI section of our tutorial.

Plain Python Configuration

When working with the CLI and Hydra configs, all components necessary for logging are automatically instantiated under the hood. In case you would like to test or run your logging setup directly from Python, you can start with the snippet below.

from docs.tutorial_maze_env.part04_events.env.maze_env import maze_env_factory
from maze.utils.log_stats_utils import SimpleStatsLoggingSetup
from maze.core.wrappers.log_stats_wrapper import LogStatsWrapper

# init maze environment
env = maze_env_factory(max_pieces_in_inventory=200, raw_piece_size=[100, 100],
                       static_demand=(30, 15))

# wrap environment with logging wrapper
env = LogStatsWrapper(env, logging_prefix="main")

# register a console writer and connect the writer to the statistics logging system
with SimpleStatsLoggingSetup(env):
    # reset environment and run interaction loop
    obs = env.reset()
    for i in range(15):
        action = env.action_space.sample()
        obs, reward, done, info = env.step(action)

To get access to event and KPI logging we need to wrap the environment with the LogStatsWrapper. To simplify the statistics logging setup we rely on the SimpleStatsLoggingSetup helper class.

When running the script you will get an output as shown below. Note that statistics of both events and KPIs are printed along with the default reward and action statistics.

 step|path                                                                      |               value
=====|==========================================================================|====================
    1|main    DiscreteActionEvents  action                substep_0/order       |     [len:15, μ:0.5]
    1|main    DiscreteActionEvents  action                substep_0/piece_idx   |    [len:15, μ:82.3]
    1|main    DiscreteActionEvents  action                substep_0/rotation    |     [len:15, μ:0.7]
    1|main    BaseEnvEvents         reward                median_step_count     |              15.000
    1|main    BaseEnvEvents         reward                mean_step_count       |              15.000
    1|main    BaseEnvEvents         reward                total_step_count      |              15.000
    1|main    BaseEnvEvents         reward                total_episode_count   |               1.000
    1|main    BaseEnvEvents         reward                episode_count         |               1.000
    1|main    BaseEnvEvents         reward                std                   |               0.000
    1|main    BaseEnvEvents         reward                mean                  |             -29.000
    1|main    BaseEnvEvents         reward                min                   |             -29.000
    1|main    BaseEnvEvents         reward                max                   |             -29.000
    1|main    InventoryEvents       piece_replenished     mean_episode_total    |               3.000
    1|main    InventoryEvents       pieces_in_inventory   step_max              |             200.000
    1|main    InventoryEvents       pieces_in_inventory   step_mean             |             200.000
    1|main    CuttingEvents         invalid_cut           mean_episode_total    |              14.000
    1|main    InventoryEvents       piece_discarded       mean_episode_total    |               2.000
    1|main    CuttingEvents         valid_cut             mean_episode_total    |               1.000
    1|main    BaseEnvEvents         kpi                   max/raw_piece_usage_..|               0.000
    1|main    BaseEnvEvents         kpi                   min/raw_piece_usage_..|               0.000
    1|main    BaseEnvEvents         kpi                   std/raw_piece_usage_..|               0.000
    1|main    BaseEnvEvents         kpi                   mean/raw_piece_usage..|               0.000


Action Distribution Visualization

It is often extremely useful to watch the evolution of an agent’s sampling behaviour throughout the training process. Looking at the action sampling distribution often provides a first intuition about the agent’s behaviour without the need to look at individual rollouts.

Most importantly, however, it is a great debugging tool that immediately reveals whether:

  • the weights of the policy collapsed during training (e.g., the agent always samples the same actions even though this does not make sense for the environment at hand).

  • observations are properly normalized and the weights of the policy are initialized accordingly to result in a healthy initial sampling behaviour of the untrained model (e.g., each discrete action is taken a similar number of times when starting training).

  • biasing the weights of the policy output layer results in the expected sampling behaviour (e.g., initially sampling an action twice as often as the remaining ones).

  • the agent actually starts learning (i.e., the sampling distribution changes throughout the training epochs).

To activate action logging you only have to add the MazeEnvMonitoringWrapper to your environment wrapper stack in your yaml config:

# @package wrappers
maze.core.wrappers.monitoring_wrapper.MazeEnvMonitoringWrapper:
    observation_logging: false
    action_logging: true
    reward_logging: false
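
If you are working with plain Python instead of the yaml config, the same wrapper can be applied directly, mirroring the observation logging snippet shown later in this document:

from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
from maze.core.wrappers.monitoring_wrapper import MazeEnvMonitoringWrapper

env = GymMazeEnv(env="CartPole-v0")
env = MazeEnvMonitoringWrapper.wrap(env, observation_logging=False, action_logging=True, reward_logging=False)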

Action sampling distributions are then visualized on a per-epoch basis in the IMAGES tab of Tensorboard. By using the slider above the images you can step through the training epochs and see how the sampling distribution evolves over time.

Discrete and Multi Binary Actions

Each action space has a dedicated visualization assigned. Discrete and multi-binary action spaces are visualized via histograms. The example below shows an action sampling distribution for the discrete version of LunarLander-v2. The indices on the x-axis correspond to the available actions:

  • Action \(a_0\) - do nothing

  • Action \(a_1\) - fire left orientation engine

  • Action \(a_2\) - fire main engine

  • Action \(a_3\) - fire right orientation engine

_images/tb_discrete_action.png

We can see that action \(a_2\) (fire main engine) is taken most often, which is reasonable for this environment.

Continuous Actions

Continuous actions (Box spaces) are visualized via violin plots. The example below shows an action sampling distribution for LunarLanderContinuous-v2. The indices on the x-axis correspond to the available actions:

  • Action \(a_1\) - controls the main engine:

    • \(a_1 \in [-1, 0]\): off

    • \(a_1 \in (0, 1]\): throttle from 50% to 100% power (the engine can’t work with less than 50% power).

  • Action \(a_2\) - controls the orientation engines:

    • \(a_2 \in [-1.0, -0.5]\): fire left engine

    • \(a_2 \in [0.5, 1.0]\): fire right engine

    • \(a_2 \in (-0.5, 0.5)\): off

_images/tb_continuous_action.png

For the first action, corresponding to the main engine, values closer to 1.0 are sampled more often, which is similar to the discrete case above.


Observation Logging

Maze provides the following options to monitor and inspect the observations presented to your policy and value networks throughout the training process:

Warning

Observation visualization and logging are supported as opt-in features via dedicated wrappers. We recommend using them only for debugging and inspection purposes. Once everything is on track and training works as expected, we suggest removing (deactivating) the wrappers, especially when dealing with environments with large observations. If you forget to remove them, training might get slow and the memory consumption of Tensorboard might explode!

Observation Distribution Visualization

Watching the evolution of distributions and value ranges of observations is especially useful for debugging your experiments and training runs as it reveals if:

  • observations stay within an expected value range.

  • observation normalization is applied correctly.

  • observations drift as the agent’s behaviour evolves throughout training.

To activate observation logging you only have to add the MazeEnvMonitoringWrapper to your environment wrapper stack in your yaml config:

# @package wrappers
maze.core.wrappers.monitoring_wrapper.MazeEnvMonitoringWrapper:
    observation_logging: true
    action_logging: false
    reward_logging: false

If you are using plain Python you can start with the code snippet below.

from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
from maze.core.wrappers.monitoring_wrapper import MazeEnvMonitoringWrapper

env = GymMazeEnv(env="CartPole-v0")
env = MazeEnvMonitoringWrapper.wrap(env, observation_logging=True, action_logging=False, reward_logging=False)

For both cases observations will be logged and distribution plots will be added to Tensorboard.

Maze visualizes observations on a per-epoch basis in the DISTRIBUTIONS and HISTOGRAMS tab of Tensorboard. By using the slider above the graphs you can step through the training epochs and see how the observation distribution evolves over time.

Below you see an example for both versions (just click the figure to view it in large).

_images/tb_obs_distributions.png _images/tb_obs_histogram.png

Note that two different versions of the observation distribution are logged:

  • observation_original: distribution of the original observation returned by the environment.

  • observation_processed: distribution of the observation after processing (e.g. pre-processing or normalization).

This is useful to verify if the applied observation processing steps yield the expected result.

Observation Visualization

Maze additionally provides the option to directly visualize observations presented to your policy and value networks as images in Tensorboard.

To activate observation visualization you only have to add the ObservationVisualizationWrapper to your environment wrapper stack in your yaml config:

# @package wrappers
maze.core.wrappers.observation_visualization_wrapper.ObservationVisualizationWrapper:
    plot_function: my_project.visualization_functions.plot_1c_image_stack

and provide a reference to a custom plotting function (here, plot_1c_image_stack).

my_project.visualization_functions.plot_1c_image_stack.py
from typing import List, Optional, Tuple

import numpy as np
import matplotlib.pyplot as plt


def plot_1c_image_stack(value: List[np.ndarray], groups: Tuple[str, str], **kwargs) -> Optional[plt.Figure]:
    """Plots a stack of single channel images with shape [N_STACK x H x W] using imshow.

    :param value: A list of image stacks.
    :param groups: A tuple containing step key and observation name.
    :param kwargs: Additional plotting relevant arguments.
    :return: The matplotlib figure, or None if the observation is not handled by this function.
    """

    # extract step key and observation name to enter appropriate plotting branch
    step_key, obs_name = groups

    fig = None
    # check which observation of the dict-space to visualize
    if step_key == 'step_key_0' and obs_name == 'observation-rgb2gray-resize_img':

        # randomly select one observation
        idx = np.random.randint(0, len(value))
        obs = value[idx]
        assert obs.ndim == 3
        n_channels = obs.shape[0]
        min_val, max_val = np.min(obs), np.max(obs)

        # plot the observation
        fig = plt.figure(figsize=(max(5, 5 * n_channels), 5))
        for i, img in enumerate(obs):
            plt.subplot(1, n_channels, i+1)
            plt.imshow(img, interpolation="nearest", vmin=min_val, vmax=max_val, cmap="magma")
            plt.colorbar()

    return fig

The function above visualizes the observation observation-rgb2gray-resize_img (a single-channel image stack) as a subplot containing three individual images:

_images/tb_obs_visualization.png
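
To sanity-check such a plot function outside of Tensorboard, it can be called directly with random data. A minimal sketch, assuming the step key and observation name match the branch above:

# Fake list of single-channel image stacks (8 observations, 3 stacked frames of 84x84 each).
value = [np.random.rand(3, 84, 84) for _ in range(8)]
fig = plot_1c_image_stack(value=value, groups=('step_key_0', 'observation-rgb2gray-resize_img'))
fig.savefig('obs_visualization_check.png')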


Runner Concept

In Maze, Runners are the entities responsible for launching and administering any job you start from the command line (like training or rollouts). They interpret the configuration and make sure the appropriate elements (models, trainers, etc.) are created, configured, and launched.

For a more detailed description of the runner concept, see Hydra overview. If you need to write custom runners for your project, see the documentation for custom configuration.
