SACAlgorithmConfig

class maze.train.trainers.sac.sac_algorithm_config.SACAlgorithmConfig(n_epochs: int, n_rollout_steps: int, lr: float, entropy_coef: float, gamma: float, max_grad_norm: float, num_actors: int, batch_size: int, num_batches_per_iter: int, tau: float, target_update_interval: int, device: str, entropy_tuning: bool, target_entropy_multiplier: float, entropy_coef_lr: float, split_rollouts_into_transitions: bool, replay_buffer_size: int, initial_buffer_size: int, initial_sampling_policy: Union[maze.core.agent.policy.Policy, None, Mapping[str, Any], Any], rollouts_per_iteration: int, epoch_length: int, patience: int, rollout_evaluator: maze.train.trainers.common.evaluators.rollout_evaluator.RolloutEvaluator)

Algorithm parameters for SAC.
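A minimal instantiation sketch using the constructor signature above. All values are purely illustrative (not recommended defaults), and the rollout evaluator is assumed to be constructed elsewhere:

    from maze.train.trainers.sac.sac_algorithm_config import SACAlgorithmConfig

    # Illustrative values only; tune them for your environment.
    algorithm_config = SACAlgorithmConfig(
        n_epochs=100,
        n_rollout_steps=1,
        lr=3e-4,
        entropy_coef=0.2,
        gamma=0.99,
        max_grad_norm=0.0,            # 0 disables gradient clipping
        num_actors=2,
        batch_size=256,
        num_batches_per_iter=1,
        tau=0.005,
        target_update_interval=1,
        device="cpu",
        entropy_tuning=True,
        target_entropy_multiplier=1.0,
        entropy_coef_lr=3e-4,
        split_rollouts_into_transitions=True,
        replay_buffer_size=1_000_000,
        initial_buffer_size=10_000,
        initial_sampling_policy=None,            # e.g. a random sampling policy, or None
        rollouts_per_iteration=1,
        epoch_length=100,
        patience=10,
        rollout_evaluator=my_rollout_evaluator,  # assumed to be built elsewhere
    )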

batch_size: int

batch size to be sampled from the buffer

device: str

device the learner should work on (either cpu or cuda)

entropy_coef: float

entropy coefficient to use if entropy tuning is set to false (called alpha in the original paper)

entropy_coef_lr: float

Learning rate for entropy tuning

entropy_tuning: bool

Specify whether to tune the entropy coefficient used in the return computation or use a static value (called alpha tuning in the original paper)
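When entropy tuning is enabled, SAC implementations typically learn the coefficient by minimizing an alpha loss against a target entropy. A generic sketch of that update (not the exact Maze implementation; names and values are illustrative):

    import torch

    entropy_coef_lr = 3e-4  # illustrative value
    log_alpha = torch.zeros(1, requires_grad=True)
    alpha_optimizer = torch.optim.Adam([log_alpha], lr=entropy_coef_lr)

    def update_entropy_coef(log_prob: torch.Tensor, target_entropy: float) -> float:
        """One gradient step on the entropy coefficient (alpha)."""
        alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
        alpha_optimizer.zero_grad()
        alpha_loss.backward()
        alpha_optimizer.step()
        return log_alpha.exp().item()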

epoch_length: int

number of updates per epoch

gamma: float

discount factor

initial_buffer_size: int

The initial buffer size, where transitions are sampled with the initial sampling policy

initial_sampling_policy: Union[maze.core.agent.policy.Policy, None, Mapping[str, Any], Any]

The policy used to initially fill the replay buffer

lr: float

learning rate

max_grad_norm: float

max grad norm for gradient clipping, ignored if value==0
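A sketch of how such a setting is commonly applied in PyTorch, assuming the "ignored if 0" convention described above (a generic illustration, not the exact Maze implementation):

    import torch

    def clip_gradients(model: torch.nn.Module, max_grad_norm: float) -> None:
        """Clip gradient norms only when a positive max_grad_norm is configured."""
        if max_grad_norm > 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)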

n_epochs: int

number of epochs to train

n_rollout_steps: int

number of rollout steps in each epoch substep

num_actors: int

number of actors to be run

num_batches_per_iter: int

Number of batches to update on in each iteration

patience: int

number of steps used for early stopping

replay_buffer_size: int

The size of the replay buffer

rollout_evaluator: maze.train.trainers.common.evaluators.rollout_evaluator.RolloutEvaluator

Rollout evaluator.

rollouts_per_iteration: int

Number of rollouts collected from the actor in each iteration

split_rollouts_into_transitions: bool

Specify whether all computed rollouts should be split into transitions before processing them
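Conceptually, splitting a rollout into transitions turns one trajectory into independent (observation, action, reward, next observation, done) tuples that can be stored individually in the replay buffer. A generic illustration (the data layout here is hypothetical, not Maze's record types):

    from typing import Any, Dict, List

    def split_rollout_into_transitions(observations: List[Any],
                                       actions: List[Any],
                                       rewards: List[float],
                                       dones: List[bool]) -> List[Dict[str, Any]]:
        """Split a trajectory of T steps (with T+1 observations) into T single-step transitions."""
        return [
            {
                "observation": observations[t],
                "action": actions[t],
                "reward": rewards[t],
                "next_observation": observations[t + 1],
                "done": dones[t],
            }
            for t in range(len(actions))
        ]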

target_entropy_multiplier: float

Specify an optional multiplier for the target entropy. This value is multiplied with the default target entropy computation (called alpha tuning in the paper), as shown in the sketch after this list:

  • discrete spaces: target_entropy = target_entropy_multiplier * (-0.98 * (-log(1 / cardinality(A))))

  • continuous spaces: target_entropy = target_entropy_multiplier * (-dim(A)) (e.g., -6 for HalfCheetah-v1)
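A sketch of the two target entropy computations listed above (function names are illustrative; this mirrors the documented formulas rather than the Maze source):

    import math

    def discrete_target_entropy(num_actions: int, multiplier: float) -> float:
        """Target entropy for discrete action spaces, scaled by the multiplier."""
        return multiplier * (-0.98 * (-math.log(1.0 / num_actions)))

    def continuous_target_entropy(action_dim: int, multiplier: float) -> float:
        """Target entropy for continuous action spaces, e.g. -6 for HalfCheetah-v1."""
        return multiplier * (-float(action_dim))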

target_update_interval: int

Specify at what interval to update the target networks

tau: float

Parameter weighting the soft update of the target network
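A sketch of the tau-weighted soft (Polyak) update in PyTorch (a generic illustration, not the exact Maze implementation):

    import torch

    @torch.no_grad()
    def soft_update(target: torch.nn.Module, online: torch.nn.Module, tau: float) -> None:
        """Move target network parameters a small step (tau) towards the online network."""
        for target_param, online_param in zip(target.parameters(), online.parameters()):
            target_param.copy_(tau * online_param + (1.0 - tau) * target_param)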