Policy

class maze.core.agent.policy.Policy

Structured policy class designed to work with structured environments (see StructuredEnv).

It encapsulates policies and queries them for actions according to the provided policy ID.

abstract compute_action(observation: Dict[str, numpy.ndarray], maze_state: Optional[Any], env: Optional[maze.core.env.base_env.BaseEnv], actor_id: Optional[maze.core.env.structured_env.ActorID] = None, deterministic: bool = False) → Dict[str, Union[int, numpy.ndarray]]

Query the policy that corresponds to the given actor ID for an action.

Parameters
  • observation – Current observation of the environment

  • maze_state – Current state representation of the environment (only provided if needs_state() returns True)

  • env – The environment instance (only provided if needs_env() returns True)

  • actor_id – ID of the actor to query policy for (does not have to be provided if there is only one actor and one policy in this environment)

  • deterministic – Specify if the action should be computed deterministically

Returns

Next action to take
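
As an illustration, below is a minimal sketch of a concrete subclass: a uniform random policy over a single discrete sub-action. The class name UniformRandomPolicy, the num_actions constructor argument, and the "action" dict key are assumptions made for this example, not part of the Maze API; only the overridden method signatures come from this page. The sketch also already overrides compute_top_action_candidates, needs_state and seed, which are documented below.

    from typing import Any, Dict, Optional, Sequence, Tuple, Union

    import numpy as np

    from maze.core.agent.policy import Policy
    from maze.core.env.base_env import BaseEnv
    from maze.core.env.structured_env import ActorID


    class UniformRandomPolicy(Policy):
        """Illustrative sketch: samples a single discrete sub-action uniformly.

        The class name and the "action" dict key are assumptions for this example.
        """

        def __init__(self, num_actions: int):
            self.num_actions = num_actions
            self.rng = np.random.RandomState()

        def seed(self, seed: int) -> None:
            # Re-initialize the policy's random state from the given seed.
            self.rng = np.random.RandomState(seed)

        def needs_state(self) -> bool:
            # This policy acts on observations only, so no MazeState is required.
            return False

        def compute_action(self,
                           observation: Dict[str, np.ndarray],
                           maze_state: Optional[Any] = None,
                           env: Optional[BaseEnv] = None,
                           actor_id: Optional[ActorID] = None,
                           deterministic: bool = False) -> Dict[str, Union[int, np.ndarray]]:
            if deterministic:
                # Arbitrary deterministic choice for the sketch: always action 0.
                return {"action": 0}
            return {"action": int(self.rng.randint(self.num_actions))}

        def compute_top_action_candidates(self,
                                          observation: Dict[str, np.ndarray],
                                          num_candidates: Optional[int] = None,
                                          maze_state: Optional[Any] = None,
                                          env: Optional[BaseEnv] = None,
                                          actor_id: Optional[ActorID] = None
                                          ) -> Tuple[Sequence[Dict[str, Union[int, np.ndarray]]], Sequence[float]]:
            # Under a uniform policy every action shares the same score, so the
            # "top" candidates are simply the first n actions with equal probability.
            n = self.num_actions if num_candidates is None else min(num_candidates, self.num_actions)
            candidates = [{"action": a} for a in range(n)]
            scores = [1.0 / self.num_actions] * n
            return candidates, scores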

abstract compute_top_action_candidates(observation: Dict[str, numpy.ndarray], num_candidates: Optional[int], maze_state: Optional[Any], env: Optional[maze.core.env.base_env.BaseEnv], actor_id: Optional[maze.core.env.structured_env.ActorID] = None) → Tuple[Sequence[Dict[str, Union[int, numpy.ndarray]]], Sequence[float]]

Get the top num_candidates actions together with the scores (e.g., probabilities or Q-values) that led to the decision.

Parameters
  • observation – Current observation of the environment

  • num_candidates – The number of actions that should be returned. If None all candidates are returned.

  • maze_state – Current state representation of the environment (only provided if needs_state() returns True)

  • env – The environment instance (only provided if needs_env() returns True)

  • actor_id – ID of actor to query policy for (does not have to be provided if policies dict contains only 1 policy)

Returns

A tuple of two sequences, where the first sequence contains the possible actions and the second the associated scores (e.g., probabilities or Q-values).
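
A hypothetical usage sketch, continuing the UniformRandomPolicy example above (the observation dict and its "features" key are made up for illustration; in practice the observation comes from the environment):

    policy = UniformRandomPolicy(num_actions=4)
    policy.seed(1234)

    observation = {"features": np.zeros(8, dtype=np.float32)}

    # Single action query (maze_state/env omitted: needs_state()/needs_env() are False).
    action = policy.compute_action(observation, maze_state=None, env=None)

    # Top-2 candidates with their scores (uniform probabilities here).
    candidates, scores = policy.compute_top_action_candidates(
        observation, num_candidates=2, maze_state=None, env=None)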

needs_env() → bool

Similar to needs_state(), the policy implementation declares whether it operates solely on observations (needs_env returns False) or whether it also requires the env object in order to compute the action.

Requiring the env should be regarded as an anti-pattern, but it is supported for special cases like the MCTS policy, which requires cloning support from the environment.

Returns

By default, policies return False.

abstract needs_state() → bool

The policy implementation declares whether it operates solely on observations (needs_state returns False) or whether it also requires the state object in order to compute the action.

Note that requiring the state object comes with performance implications, especially in multi-node distributed workloads, where both objects would need to be transferred over the network.
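
As a sketch, a hypothetical state-based policy would opt in to receiving the MazeState (and, for the special cases mentioned above, the env) as follows; the remaining abstract methods are elided, so this fragment cannot be instantiated as-is:

    class StateBasedPolicy(Policy):
        """Illustrative fragment: opts in to maze_state (and env) delivery."""

        def needs_state(self) -> bool:
            # Request the MazeState; mind the transfer cost in
            # multi-node distributed rollouts noted above.
            return True

        def needs_env(self) -> bool:
            # Returning True is reserved for special cases such as
            # MCTS-style policies that need to clone the environment.
            return True

        # compute_action, compute_top_action_candidates and seed
        # omitted for brevity.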

abstract seed(seed: int) → None

Seed the policy. The given seed should be used to initialize any random number generator or random state used to sample from distributions, as well as any other source of randomness in the policy. This ensures that the policy behaves reproducibly when explicitly seeded.

Parameters

seed – The seed to use for all random state objects within the policy.
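
For instance, two identically seeded instances of the UniformRandomPolicy sketch above should produce identical action sequences:

    p1, p2 = UniformRandomPolicy(4), UniformRandomPolicy(4)
    p1.seed(42)
    p2.seed(42)

    obs = {"features": np.zeros(8, dtype=np.float32)}
    actions_1 = [p1.compute_action(obs, None, None) for _ in range(5)]
    actions_2 = [p2.compute_action(obs, None, None) for _ in range(5)]
    assert actions_1 == actions_2  # reproducible given the same seed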