# Combining Maze with other RL Frameworks

This tutorial explains how to use general Maze features in combination with existing RL frameworks. In particular, we will apply observation normalization before optimizing a policy with the stable-baselines3 A2C trainer. When adding new features to Maze, we put a strong emphasis on reusability, so that you can use as many of these features as possible while sticking to the optimization framework you are most comfortable or familiar with.

Since RLlib already has a dedicated spot within Maze, we rely on stable-baselines3 for this tutorial. However, it is important to note that the examples below will also work with any other Python-based RL framework that is compatible with Gym environments.

We provide two versions showing how to arrive at an observation-normalized environment. The first is written in plain Python, while the second reproduces the Python example with a Hydra configuration.

Note

Although this tutorial explains how to reuse observation normalization, you are of course not limited to this one feature. If you find this useful, we recommend browsing the Environment Customization section in the sidebar.

## Reusing Environment Customization Features

The basis for this tutorial is the official getting started snippet of stable-baselines3, which shows how to train and run A2C on a CartPole environment. We added a few comments to make things more explicit.

If you would like to run this example yourself, make sure to install stable-baselines3 first.

"""
Getting started example from:
"""

import gym
from stable_baselines3 import A2C

# ENV INSTANTIATION
# -----------------
env = gym.make('CartPole-v0')

# TRAINING AND ROLLOUT
# --------------------

model = A2C('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)

obs = env.reset()
for i in range(1000):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()


Below you find exactly the same example but with an observation normalized environment. The following modifications compared to the example above are required:

• Instantiate a GymMazeEnv instead of a standard Gym environment

• Wrap the environment with the ObservationNormalizationWrapper

• Estimate normalization statistics from actual environment interactions

As you might already have experienced, re-coding these steps for different environments and experiments can get quite cumbersome, which is why the wrapper takes care of them for you. The wrapper also dumps the estimated statistics to a file (statistics.pkl) so that they can be reused later for agent deployment.

"""
Contains an example showing how to train
an observation normalized maze environment with stable-baselines.
"""

from maze.core.agent.random_policy import RandomPolicy
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
from maze.core.wrappers.no_dict_spaces_wrapper import NoDictSpacesWrapper
from maze.core.wrappers.observation_normalization.observation_normalization_utils import \
    obtain_normalization_statistics
from maze.core.wrappers.observation_normalization.observation_normalization_wrapper import \
    ObservationNormalizationWrapper

from stable_baselines3 import A2C

# ENV INSTANTIATION: a GymMazeEnv instead of a gym.Env
# ----------------------------------------------------
env = GymMazeEnv('CartPole-v0')

# OBSERVATION NORMALIZATION
# -------------------------

# we wrap the environment with the ObservationNormalizationWrapper
# (you can find details on this in the section on observation normalization)
env = ObservationNormalizationWrapper(
    env=env,
    default_strategy="maze.normalization_strategies.MeanZeroStdOneObservationNormalizationStrategy",
    default_strategy_config={"clip_range": (None, None), "axis": 0},
    default_statistics=None, statistics_dump="statistics.pkl",
    sampling_policy=RandomPolicy(env.action_spaces_dict),
    exclude=None, manual_config=None)

# next we estimate the normalization statistics by
# (1) collecting observations by randomly sampling 1000 transitions from the environment
# (2) computing the statistics according to the defined normalization strategy
normalization_statistics = obtain_normalization_statistics(env, n_samples=1000)
env.set_normalization_statistics(normalization_statistics)

# after this step all observations returned by the environment will be normalized

# stable-baselines does not support dict spaces so we have to remove them
env = NoDictSpacesWrapper(env)

# TRAINING AND ROLLOUT (remains unchanged)
# ----------------------------------------

model = A2C('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)

obs = env.reset()
for i in range(1000):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()
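
To make the statistics estimation step more tangible, here is a minimal standard-library sketch of a mean-zero, std-one normalization. It is not Maze's implementation: the helper names `estimate_statistics` and `normalize`, the dictionary layout of the statistics, and the synthetic samples are assumptions made for illustration only. It also assumes that axis 0 is the sample axis, so statistics are computed per feature.

```python
import math
import pickle
import random


def estimate_statistics(observations):
    """Compute per-feature mean and standard deviation over the sample axis."""
    n_samples = len(observations)
    n_features = len(observations[0])
    means = [sum(obs[i] for obs in observations) / n_samples
             for i in range(n_features)]
    stds = [math.sqrt(sum((obs[i] - means[i]) ** 2 for obs in observations) / n_samples)
            for i in range(n_features)]
    return {"mean": means, "std": stds}


def normalize(observation, statistics, eps=1e-8):
    """Shift each feature to zero mean and scale it to unit standard deviation."""
    return [(value - mean) / (std + eps)
            for value, mean, std in zip(observation, statistics["mean"], statistics["std"])]


# Hypothetical stand-in for 1000 observations collected with a random policy.
rng = random.Random(0)
samples = [[rng.gauss(5.0, 2.0), rng.gauss(-1.0, 0.5)] for _ in range(1000)]

statistics = estimate_statistics(samples)

# Round-trip through pickle to mimic dumping the statistics and reloading them later.
# The layout used here is an assumption and not the format of Maze's statistics file.
restored = pickle.loads(pickle.dumps(statistics))

normalized = [normalize(obs, restored) for obs in samples]
```

Estimating the statistics once and storing them means that training and deployment apply exactly the same transformation to every observation.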


## Reusing the Hydra Configuration System

This example is identical to the previous one, but instead of instantiating everything directly in Python it uses the Hydra configuration system.

"""
Contains an example showing how to train an observation normalized maze environment
instantiated from a hydra config with stable-baselines.
"""

from maze.core.utils.config_utils import make_env_from_hydra
from maze.core.wrappers.no_dict_spaces_wrapper import NoDictSpacesWrapper
from maze.core.wrappers.observation_normalization.observation_normalization_utils import \
    obtain_normalization_statistics

from stable_baselines3 import A2C

# ENV INSTANTIATION: from hydra config file
# -----------------------------------------
env = make_env_from_hydra("conf")

# OBSERVATION NORMALIZATION
# -------------------------

# next we estimate the normalization statistics by
# (1) collecting observations by randomly sampling 1000 transitions from the environment
# (2) computing the statistics according to the defined normalization strategy
normalization_statistics = obtain_normalization_statistics(env, n_samples=1000)
env.set_normalization_statistics(normalization_statistics)

# stable-baselines does not support dict spaces so we have to remove them
env = NoDictSpacesWrapper(env)

# TRAINING AND ROLLOUT (remains unchanged)
# ----------------------------------------

model = A2C('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)

obs = env.reset()
for i in range(1000):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
        obs = env.reset()


This is the corresponding Hydra config:

# @package _global_

# defines environment to instantiate
env:
  _target_: maze.core.wrappers.maze_gym_env_wrapper.GymMazeEnv
  env: "CartPole-v0"

# defines wrappers to apply
wrappers:
  # Observation Normalization Wrapper
  maze.core.wrappers.observation_normalization.observation_normalization_wrapper.ObservationNormalizationWrapper:
    default_strategy: maze.normalization_strategies.MeanZeroStdOneObservationNormalizationStrategy
    default_strategy_config:
      clip_range: [~, ~]
      axis: 0
    default_statistics: ~
    statistics_dump: statistics.pkl
    sampling_policy:
      _target_: maze.core.agent.random_policy.RandomPolicy
    exclude: ~
    manual_config: ~
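
The `_target_` keys in the config above name the class to instantiate, and the remaining keys of the same mapping act as constructor arguments. The following is a rough standard-library sketch of that idea, not Hydra's or Maze's actual implementation (which also handles nested configs, interpolation and more). The `instantiate` helper and the example config are hypothetical, and the config deliberately points at a standard-library class so that the snippet runs without Maze installed.

```python
import importlib


def instantiate(config):
    """Build the object named by ``_target_``, passing all other keys as keyword arguments."""
    kwargs = {key: value for key, value in config.items() if key != "_target_"}
    # Split "package.module.ClassName" into the module path and the attribute name.
    module_path, _, attr_name = config["_target_"].rpartition(".")
    factory = getattr(importlib.import_module(module_path), attr_name)
    return factory(**kwargs)


# Hypothetical config using a standard-library class instead of a Maze class.
config = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
value = instantiate(config)
```

Because the class path lives in the config, swapping the environment or a wrapper parameter only requires editing the YAML file, while the training script stays unchanged.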