PPOAlgorithmConfig

class maze.train.trainers.ppo.ppo_algorithm_config.PPOAlgorithmConfig(n_epochs: int, epoch_length: int, patience: int, critic_burn_in_epochs: int, n_rollout_steps: int, lr: float, gamma: float, gae_lambda: float, policy_loss_coef: float, value_loss_coef: float, entropy_coef: float, max_grad_norm: float, device: str, batch_size: int, n_optimization_epochs: int, clip_range: float, rollout_evaluator: RolloutEvaluator, n_training_seeds: int)

Algorithm parameters for the multi-step PPO model.

batch_size: int: The batch size used for policy and value updates

clip_range: float: Clipping parameter of the surrogate loss
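
For orientation, the clipped surrogate objective of PPO, which clip_range parameterizes, can be sketched in plain Python as below. This is an illustrative sketch and is not taken from Maze's implementation; the function name and its arguments are hypothetical.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_range: float) -> float:
    """Per-sample PPO clipped surrogate objective (a quantity to maximize).

    ratio is pi_new(a|s) / pi_old(a|s). All names here are illustrative
    assumptions, not part of the Maze API.
    """
    # Restrict the probability ratio to [1 - clip_range, 1 + clip_range].
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)
```

A larger clip_range lets the updated policy move further from the policy that collected the rollout before the objective stops rewarding the change.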

critic_burn_in_epochs: int: Number of critic (value function) burn-in epochs

device: str: Either "cpu" or "cuda"

entropy_coef: float: Weight of the entropy loss

epoch_length: int: Number of updates per epoch

gae_lambda: float: Bias vs. variance trade-off factor for GAE

gamma: float: Discount factor
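
A minimal sketch of generalized advantage estimation (GAE), showing where gamma and gae_lambda enter. It is not taken from Maze's source; compute_gae and its arguments are hypothetical names, and handling of episode termination is omitted for brevity.

```python
from typing import List


def compute_gae(rewards: List[float], values: List[float], last_value: float,
                gamma: float, gae_lambda: float) -> List[float]:
    """Return GAE advantages for one rollout without episode boundaries.

    values[t] is the critic estimate at step t and last_value is the estimate
    for the state after the final step (illustrative names, an assumption).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk backwards so each step can reuse the accumulated future terms.
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of TD errors with factor gamma * gae_lambda.
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With gae_lambda = 0 the estimate reduces to the one-step TD error (lower variance, more bias); with gae_lambda = 1 it becomes the discounted return minus the value baseline (less bias, more variance).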

lr: float: Learning rate

max_grad_norm: float: The maximum allowed gradient norm during training
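
Gradient-norm clipping of this kind is commonly implemented by rescaling all gradients when their global L2 norm exceeds the threshold. The sketch below illustrates the idea on a flat list of floats; clip_by_global_norm is a hypothetical helper, and how the Maze trainer treats special values of max_grad_norm is not stated here.

```python
import math
from typing import List


def clip_by_global_norm(grads: List[float], max_grad_norm: float) -> List[float]:
    """Rescale grads so that their L2 norm does not exceed max_grad_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Leave the gradients untouched when they are already within the limit.
    if total_norm <= max_grad_norm or total_norm == 0.0:
        return list(grads)
    scale = max_grad_norm / total_norm
    return [g * scale for g in grads]
```

In PyTorch the same operation is available as torch.nn.utils.clip_grad_norm_, which works in place on parameter gradients.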

n_epochs: int: Number of epochs to train

n_optimization_epochs: int: Number of epochs for policy and value optimization
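
The interplay of batch_size and n_optimization_epochs can be pictured as several shuffled passes over the same rollout data. The sketch below assumes this common PPO scheme; iterate_minibatches, n_samples and seed are hypothetical names, and the actual batching logic in Maze may differ.

```python
import random
from typing import Iterator, List


def iterate_minibatches(n_samples: int, batch_size: int,
                        n_optimization_epochs: int,
                        seed: int = 0) -> Iterator[List[int]]:
    """Yield index minibatches covering the rollout once per optimization epoch."""
    rng = random.Random(seed)
    for _ in range(n_optimization_epochs):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # assumed: a fresh sample order in every pass
        for start in range(0, n_samples, batch_size):
            # The final batch of a pass may be smaller than batch_size.
            yield indices[start:start + batch_size]
```

For 10 collected samples, batch_size = 4 and n_optimization_epochs = 2, this produces six minibatches (sizes 4, 4, 2 in each pass).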

n_rollout_steps: int: Number of steps taken for each rollout

n_training_seeds: int: Number of seeds to generate for seeding the environment, unless a list of explicit seeds is passed.

patience: int: Number of steps used for early stopping
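
Patience-based early stopping usually means that training ends once the best evaluation score has not improved for a given number of consecutive checks. The sketch below assumes that one "step" corresponds to one evaluation; should_stop and eval_scores are hypothetical names and do not describe Maze's internal logic.

```python
from typing import List


def should_stop(eval_scores: List[float], patience: int) -> bool:
    """Return True if the best score is at least `patience` evaluations old."""
    if not eval_scores:
        return False
    # Index of the first occurrence of the best score seen so far;
    # later ties are not counted as an improvement.
    best_index = max(range(len(eval_scores)), key=eval_scores.__getitem__)
    evaluations_since_best = len(eval_scores) - 1 - best_index
    return evaluations_since_best >= patience
```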

policy_loss_coef: float: Weight of the policy loss

rollout_evaluator: RolloutEvaluator: Rollout evaluator.

value_loss_coef: float: Weight of the value loss
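
The three loss weights (policy_loss_coef, value_loss_coef, entropy_coef) are typically combined into a single scalar objective. The sketch below uses the common convention of subtracting the weighted entropy as a bonus; this sign convention and the function name total_loss are assumptions, not confirmed details of the Maze trainer.

```python
def total_loss(policy_loss: float, value_loss: float, entropy: float,
               policy_loss_coef: float, value_loss_coef: float,
               entropy_coef: float) -> float:
    """Weighted sum of the PPO loss terms (to be minimized).

    Entropy is subtracted so that higher policy entropy lowers the loss,
    which encourages exploration (assumed convention).
    """
    return (policy_loss_coef * policy_loss
            + value_loss_coef * value_loss
            - entropy_coef * entropy)
```

For example, with policy_loss = 1.0, value_loss = 2.0, entropy = 0.5 and weights 1.0, 0.5 and 0.1, the combined loss is 1.0 + 1.0 - 0.05 = 1.95.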