class maze.train.trainers.sac.sac_algorithm_config.SACAlgorithmConfig(n_epochs: int, n_rollout_steps: int, lr: float, entropy_coef: float, gamma: float, max_grad_norm: float, num_actors: int, batch_size: int, num_batches_per_iter: int, tau: float, target_update_interval: int, device: str, entropy_tuning: bool, target_entropy_multiplier: float, entropy_coef_lr: float, split_rollouts_into_transitions: bool, replay_buffer_size: int, initial_buffer_size: int, initial_sampling_policy: Union[maze.core.agent.policy.Policy, None, Mapping[str, Any], Any], rollouts_per_iteration: int, epoch_length: int, patience: int, rollout_evaluator: maze.train.trainers.common.evaluators.rollout_evaluator.RolloutEvaluator)

Algorithm parameters for SAC.

batch_size: int

batch size to be sampled from the buffer

device: str

device the learner should work on (either cpu or cuda)

entropy_coef: float

entropy coefficient to use if entropy tuning is set to false (called alpha in the original paper)

entropy_coef_lr: float

Learning rate for entropy tuning

entropy_tuning: bool

Specify whether to tune the entropy in the return computation or use a static value (called alpha tuning in the original paper)
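When entropy tuning is enabled, the (log of the) entropy coefficient is itself optimized so that the policy's entropy tracks the target entropy. The sketch below illustrates the standard SAC alpha update rule with a hand-written gradient step over plain floats; the function name is hypothetical and this is not Maze's actual implementation, which operates on tensors via an optimizer.

```python
def entropy_coef_update(log_alpha, log_probs, target_entropy, lr):
    """One illustrative gradient step on the log entropy coefficient.

    Standard SAC alpha loss: -log_alpha * mean(log_prob + target_entropy),
    whose gradient w.r.t. log_alpha is -mean(log_prob + target_entropy).
    If policy entropy falls below the target, log_alpha (and thus alpha)
    increases, pushing the policy back toward higher entropy.
    """
    grad = -sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    return log_alpha - lr * grad
```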

epoch_length: int

number of updates per epoch

gamma: float

discount factor

initial_buffer_size: int

The initial buffer size, filled with transitions sampled with the initial sampling policy

initial_sampling_policy: Union[maze.core.agent.policy.Policy, None, Mapping[str, Any], Any]

The policy used to initially fill the replay buffer

lr: float

learning rate

max_grad_norm: float

max gradient norm for gradient clipping; ignored if the value is 0
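Global-norm gradient clipping rescales all gradients so their joint L2 norm does not exceed the threshold. A minimal sketch over plain floats, with the "0 disables clipping" convention described above; the function name is hypothetical, and a PyTorch-based trainer would typically use torch.nn.utils.clip_grad_norm_ instead.

```python
import math

def clip_grad_norm(grads, max_grad_norm):
    """Scale gradients so their global L2 norm is at most max_grad_norm.

    A max_grad_norm of 0 disables clipping and returns the gradients
    unchanged, matching the behavior documented above.
    """
    if max_grad_norm == 0:
        return grads
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_grad_norm:
        return grads
    scale = max_grad_norm / total_norm
    return [g * scale for g in grads]
```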

n_epochs: int

number of epochs to train

n_rollout_steps: int

number of rollout steps in each epoch substep

num_actors: int

number of actors to be run

num_batches_per_iter: int

Number of batches to update on in each iteration

patience: int

number of steps used for early stopping

replay_buffer_size: int

The size of the replay buffer

rollout_evaluator: maze.train.trainers.common.evaluators.rollout_evaluator.RolloutEvaluator

Rollout evaluator.

rollouts_per_iteration: int

Number of rollouts collected from the actor in each iteration

split_rollouts_into_transitions: bool

Specify whether all computed rollouts should be split into transitions before processing them

target_entropy_multiplier: float

Specify an optional multiplier for the target entropy. This value is multiplied by the default target entropy computation (called alpha tuning in the paper):

  • discrete spaces: target_entropy = target_entropy_multiplier * (-0.98 * (-log(1 / cardinality(A))))

  • continuous spaces: target_entropy = target_entropy_multiplier * (-dim(A)) (e.g., -6 for HalfCheetah-v1)
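The two formulas above can be written out directly. The sketch below follows the computations exactly as documented here (including the signs as stated); the function names are hypothetical helpers, not part of the Maze API.

```python
import math

def discrete_target_entropy(cardinality, multiplier=1.0):
    # target_entropy = multiplier * (-0.98 * (-log(1 / |A|))),
    # as stated for discrete action spaces above
    return multiplier * (-0.98 * (-math.log(1.0 / cardinality)))

def continuous_target_entropy(action_dim, multiplier=1.0):
    # target_entropy = multiplier * (-dim(A)),
    # e.g. -6 for HalfCheetah-v1 (6-dimensional action space)
    return multiplier * (-action_dim)
```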

target_update_interval: int

Specify the interval at which to update the target networks

tau: float

Parameter weighting the soft update of the target network
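The soft (Polyak) update blends the online network's parameters into the target network: target ← tau * online + (1 - tau) * target, applied every target_update_interval steps. A minimal sketch over plain lists of floats; the function name is hypothetical, and the real trainer applies the same rule to network parameter tensors.

```python
def soft_update(target_params, online_params, tau):
    """Polyak soft update of target parameters toward online parameters.

    With small tau (e.g. 0.005) the target network trails the online
    network slowly, stabilizing the Q-learning targets.
    """
    return [tau * o + (1.0 - tau) * t
            for t, o in zip(target_params, online_params)]
```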