RewardAggregatorInterface

class maze.core.env.reward.RewardAggregatorInterface

Event aggregation object for reward customization and shaping.

clone_from(reward_aggregator: maze.core.env.reward.RewardAggregatorInterface) → None

Clones the state of the provided reward aggregator.

Parameters

reward_aggregator – The reward aggregator to clone from.

get_interfaces() → List[Type[abc.ABC]]

(overrides Subscriber)

Declare which events this reward aggregator should be notified about.

Often, the current state of the environment does not provide enough information to calculate a reward. In such cases, the reward aggregator collects events from the environment (e.g., was a new piece replenished during cutting? Did the agent attempt an invalid cut?). This method declares which events this aggregator should collect.

By default, this returns an empty list (in simpler cases, the maze state alone is enough and no events are needed).

For more complex scenarios, override this method and specify which interfaces are needed.

Returns

A list of event interface classes to listen to
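
As a minimal sketch of overriding this method: the CuttingEvents interface and the CuttingRewardAggregator class below are hypothetical and only illustrate the pattern of declaring event interfaces; they are not part of the Maze API:

    from abc import ABC
    from typing import List, Type

    from maze.core.env.reward import RewardAggregatorInterface


    class CuttingEvents(ABC):
        """Hypothetical event interface of a cutting environment (illustration only)."""

        def invalid_cut(self):
            """Dispatched when the agent attempts an invalid cut."""

        def piece_replenished(self):
            """Dispatched when a new raw piece is replenished."""


    class CuttingRewardAggregator(RewardAggregatorInterface):
        """Sketch of a custom aggregator that listens to cutting-related events."""

        def get_interfaces(self) -> List[Type[ABC]]:
            # Notify this aggregator about all events dispatched via CuttingEvents.
            return [CuttingEvents]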

abstract summarize_reward(maze_state: Optional[Any] = None) → Union[float, numpy.ndarray]

Summarize the reward for this step. Expected to be called once per structured step.

Calculates the reward either from the current maze state of the environment (provided as an argument), or by querying the events dispatched by the environment during the last step. The former is simpler, the latter more flexible, as the environment state does not always carry all the necessary information about what took place during the last step.

In most scenarios, the reward is returned either as a scalar or as a numpy array with one entry per actor that acted in the last step. A scalar is useful in single-actor scenarios, or when per-actor reward cannot easily be attributed and a shared scalar reward makes more sense. An array is useful where per-actor reward makes sense, such as in a multi-agent setting.

Parameters

maze_state – Current state of the environment.

Returns

Reward for the last structured step. In most cases, either a scalar or an array with an item for each actor active during the last step.
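
Continuing the hypothetical cutting sketch above, an event-based implementation could look roughly as follows. The penalty weights are arbitrary, CuttingEvents is the illustrative interface defined earlier, and query_events is assumed to return the events collected during the last structured step, as inherited from the Subscriber base class:

    from abc import ABC
    from typing import Any, List, Optional, Type, Union

    import numpy as np

    from maze.core.env.reward import RewardAggregatorInterface


    class CuttingRewardAggregator(RewardAggregatorInterface):
        """Sketch: fold cutting events into a single scalar reward (single-actor case)."""

        def get_interfaces(self) -> List[Type[ABC]]:
            return [CuttingEvents]  # hypothetical interface from the sketch above

        def summarize_reward(self, maze_state: Optional[Any] = None) -> Union[float, np.ndarray]:
            # Query the events collected during the last structured step
            # (query_events is assumed to be provided by the Subscriber base class).
            invalid_cuts = list(self.query_events(CuttingEvents.invalid_cut))
            replenishments = list(self.query_events(CuttingEvents.piece_replenished))

            # Arbitrary penalty weights, chosen only for illustration.
            return -2.0 * len(invalid_cuts) - 1.0 * len(replenishments)

In a multi-agent variant, the same method would instead return a numpy array with one entry per actor that acted during the last step.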