Trainers and Training Runners

This page contains the reference documentation for trainers and training runners.

General

These are general interfaces, classes and utility functions for trainers and training runners:

Trainer

Interface for trainers.

TrainingRunner

Base class for training runner implementations.

TrainConfig

Top-level configuration structure.

ModelConfig

Model configuration structure.

AlgorithmConfig

Base class for all specific algorithm configurations.

ModelSelectionBase

Base class for model selection strategies.

BestModelSelection

Best model selection strategy.

Evaluator

Abstract interface for policy evaluation.

MultiEvaluator

Evaluates the given policy using multiple different evaluators (run in sequence).

RolloutEvaluator

Evaluates a given policy by rolling it out and collecting the mean reward.

ValueTransform

Value transformation.

ReduceScaleValueTransform

Scale reduction value transform according to Pohlen et al. (2018).
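For orientation, the scale reduction described by Pohlen et al. (2018) is h(x) = sign(x) * (sqrt(|x| + 1) - 1) + epsilon * x. The following is a minimal standalone sketch of that formula and its closed-form inverse. The function names `reduce_scale` and `expand_scale` and the default `epsilon=1e-2` are assumptions made for illustration; they are not the ReduceScaleValueTransform API.

```python
import math


def reduce_scale(x: float, epsilon: float = 1e-2) -> float:
    """Hypothetical helper: h(x) = sign(x) * (sqrt(|x| + 1) - 1) + epsilon * x."""
    return math.copysign(math.sqrt(abs(x) + 1.0) - 1.0, x) + epsilon * x


def expand_scale(y: float, epsilon: float = 1e-2) -> float:
    """Hypothetical helper: closed-form inverse of reduce_scale."""
    # Solve epsilon * u^2 + u - (1 + epsilon + |y|) = 0 for u = sqrt(|x| + 1).
    root = math.sqrt(1.0 + 4.0 * epsilon * (abs(y) + 1.0 + epsilon))
    u = (root - 1.0) / (2.0 * epsilon)
    return math.copysign(u * u - 1.0, y)
```

Applying `expand_scale(reduce_scale(x))` should recover `x` up to floating-point error, which is a convenient sanity check when changing `epsilon`.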

support_to_scalar

Convert support vector to scalar by probability weighted interpolation.

scalar_to_support

Converts tensor of scalars into probability support vectors corresponding to the provided range.
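The two conversions above can be illustrated with a small list-based sketch. It assumes integer, unit-spaced support atoms between `lo` and `hi` and works on a single scalar instead of a tensor; the names `scalar_to_support_sketch` and `support_to_scalar_sketch` are hypothetical and do not reflect the signatures of the library functions.

```python
import math
from typing import List


def scalar_to_support_sketch(x: float, lo: int, hi: int) -> List[float]:
    """Spread x over the two neighbouring integer atoms in [lo, hi]."""
    x = min(max(x, lo), hi)               # clamp into the supported range
    probs = [0.0] * (hi - lo + 1)
    lower = math.floor(x)
    upper_weight = x - lower              # share assigned to the upper atom
    probs[lower - lo] += 1.0 - upper_weight
    if upper_weight > 0.0:
        probs[lower - lo + 1] += upper_weight
    return probs


def support_to_scalar_sketch(probs: List[float], lo: int) -> float:
    """Probability-weighted sum of the atom values lo, lo + 1, ..."""
    return sum(p * (lo + i) for i, p in enumerate(probs))
```

For example, with `lo=-2` and `hi=2` the scalar 1.25 is encoded as weight 0.75 on atom 1 and weight 0.25 on atom 2, and decoding that vector returns 1.25 again.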

BaseReplayBuffer

Abstract interface for all replay buffer implementations.

UniformReplayBuffer

Replay buffer for off-policy learning.
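As a rough illustration of the idea behind a uniform replay buffer, the sketch below keeps a fixed-capacity ring of transitions and samples from it uniformly with replacement. The class name `UniformBufferSketch`, its methods and the `seed` argument are assumptions for this example only and are not the UniformReplayBuffer interface.

```python
import random
from typing import Any, List


class UniformBufferSketch:
    """Hypothetical fixed-capacity buffer with uniform sampling."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self.capacity = capacity
        self.items: List[Any] = []
        self.next_index = 0                 # slot that is overwritten next
        self.rng = random.Random(seed)

    def add(self, transition: Any) -> None:
        if len(self.items) < self.capacity:
            self.items.append(transition)
        else:
            self.items[self.next_index] = transition  # overwrite the oldest entry
        self.next_index = (self.next_index + 1) % self.capacity

    def sample(self, batch_size: int) -> List[Any]:
        # Uniform sampling with replacement over the stored transitions.
        return [self.rng.choice(self.items) for _ in range(batch_size)]
```

Once the buffer is full, each new transition replaces the oldest one, so the sampled data tracks the most recent `capacity` transitions.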

Trainers

These are interfaces, classes and utility functions for built-in trainers:

Actor-Critics (AC)

ACRunner

Abstract base class of AC runners.

ACDevRunner

Runner for single-threaded training, based on SequentialVectorEnv.

ACLocalRunner

Runner for locally distributed training, based on SubprocVectorEnv.

ActorCritic

Base class for actor critic trainers.

ActorCriticEvents

Event interface, defining statistics emitted by the A2CTrainer.

A2C

Advantage Actor Critic.

A2CAlgorithmConfig

Algorithm parameters for multi-step A2C model.

PPO

Proximal Policy Optimization trainer.

PPOAlgorithmConfig

Algorithm parameters for multi-step PPO model.

IMPALA

Multi-step advantage actor critic.

ImpalaAlgorithmConfig

Algorithm parameters for Impala.

ImpalaEvents

Events specific to the IMPALA algorithm, used to record and analyse its behaviour in more detail.

ImpalaRunner

Common superclass for IMPALA runners, implementing the main training controls.

ImpalaDevRunner

Runner for single-threaded training, based on SequentialVectorEnv.

ImpalaLocalRunner

Runner for locally distributed training, based on SubprocVectorEnv.

log_probs_from_logits_and_actions_and_spaces

Computes action log-probs from policy logits, actions and action_spaces.

from_logits

V-trace for softmax policies.

from_importance_weights

V-trace from log importance weights.
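To make the V-trace computation concrete, here is a minimal list-based sketch of the recursion from the IMPALA paper for a single trajectory. The function name `vtrace_sketch`, its argument names and the clipping defaults are assumptions for illustration; the library function operates on tensors and its signature may differ.

```python
import math
from typing import List, Tuple


def vtrace_sketch(log_rhos: List[float], discounts: List[float],
                  rewards: List[float], values: List[float],
                  bootstrap_value: float, clip_rho: float = 1.0,
                  clip_c: float = 1.0) -> Tuple[List[float], List[float]]:
    """Return V-trace targets vs and policy gradient advantages."""
    n = len(rewards)
    rhos = [math.exp(lr) for lr in log_rhos]
    clipped_rhos = [min(clip_rho, r) for r in rhos]   # truncated importance weights
    cs = [min(clip_c, r) for r in rhos]               # truncated trace coefficients
    next_values = values[1:] + [bootstrap_value]
    deltas = [clipped_rhos[t] * (rewards[t] + discounts[t] * next_values[t] - values[t])
              for t in range(n)]

    # Backward recursion: vs_t - V_t = delta_t + gamma_t * c_t * (vs_{t+1} - V_{t+1}).
    acc = 0.0
    vs_minus_v = [0.0] * n
    for t in reversed(range(n)):
        acc = deltas[t] + discounts[t] * cs[t] * acc
        vs_minus_v[t] = acc

    vs = [values[t] + vs_minus_v[t] for t in range(n)]
    next_vs = vs[1:] + [bootstrap_value]
    pg_advantages = [clipped_rhos[t] * (rewards[t] + discounts[t] * next_vs[t] - values[t])
                     for t in range(n)]
    return vs, pg_advantages
```

In the on-policy case (all `log_rhos` equal to zero) the targets reduce to ordinary n-step returns, which makes a simple consistency check.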

get_log_rhos

Computes the log_rhos used in the V-trace calculation from the selected log_probs of the behavior and target policies for multi-discrete actions.
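A minimal sketch of this step, assuming the action components are independent so that the joint log-probability is the sum over components: the log importance weight per time step is then the summed difference between target and behaviour log-probs. The name `log_rhos_sketch` and the dict-of-lists layout are hypothetical and chosen only for this example.

```python
from typing import Dict, List


def log_rhos_sketch(target_log_probs: Dict[str, List[float]],
                    behaviour_log_probs: Dict[str, List[float]]) -> List[float]:
    """Sum (target - behaviour) log-probs over all action components."""
    n_steps = len(next(iter(target_log_probs.values())))
    log_rhos = [0.0] * n_steps
    for name, target in target_log_probs.items():
        behaviour = behaviour_log_probs[name]
        for t in range(n_steps):
            # log rho_t = log pi(a_t | x_t) - log mu(a_t | x_t), summed over components
            log_rhos[t] += target[t] - behaviour[t]
    return log_rhos
```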

SAC

Multi-step soft actor critic.

SACAlgorithmConfig

Algorithm parameters for SAC.

SACEvents

Events specific to the SAC algorithm, used to record and analyse its behaviour in more detail.

SACRunner

Common superclass for SAC runners, implementing the main training controls.

SACDevRunner

Runner for single-threaded training, based on SequentialVectorEnv.

Evolutionary Strategies (ES)

ESTrainer

Trainer class for OpenAI Evolution Strategies.

ESAlgorithmConfig

Algorithm parameters for evolution strategies model.

ESEvents

Event interface, defining statistics emitted by the ESTrainer.

ESMasterRunner

Base class of ES training master runners (serves as basis for dev and other runners).

ESDevRunner

Runner config for single-threaded training, based on ESDummyDistributedRollouts.

SharedNoiseTable

A fixed length vector of deterministically generated pseudo-random floats.
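The idea of a shared noise table can be sketched as follows: every worker builds the same table from the same seed, so a perturbation can be communicated as an index into the table instead of a full noise vector. The class name `NoiseTableSketch`, its methods and the default seed are assumptions for illustration, not the SharedNoiseTable interface.

```python
import random
from typing import List


class NoiseTableSketch:
    """Hypothetical fixed-length table of seeded Gaussian noise."""

    def __init__(self, count: int, seed: int = 123) -> None:
        rng = random.Random(seed)           # same seed gives the same table
        self.noise: List[float] = [rng.gauss(0.0, 1.0) for _ in range(count)]

    def get(self, index: int, dim: int) -> List[float]:
        """Return the dim-sized noise slice starting at index."""
        return self.noise[index:index + dim]

    def sample_index(self, rng: random.Random, dim: int) -> int:
        """Draw a start index such that a full dim-sized slice fits."""
        return rng.randint(0, len(self.noise) - dim)
```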

Optimizer

Abstract base class of an optimizer to be used with ES.

SGD

Stochastic gradient descent with momentum.

Adam

Adam optimizer.

ESRolloutResult

Result structure for distributed rollouts.

ESDummyDistributedRollouts

Implementation of the ES distribution by running the rollouts synchronously in the same process.

ESDistributedRollouts

Abstract base class of ES rollout distribution.

ESAbortException

This exception is raised if the current rollout is intentionally aborted.

ESRolloutWorkerWrapper

The rollout generation is bound to a single worker environment by implementing it as a Wrapper class.

get_flat_parameters

Get the parameters of all sub-policies as a single flat vector.

set_flat_parameters

Overwrite the parameters of all sub-policies by a single flat vector.
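The flatten and restore round trip behind these two functions can be sketched with plain lists. Here each sub-policy is represented as a list of parameter blocks keyed by a policy name, and sub-policies are visited in sorted key order; this layout and the names `get_flat_sketch` and `set_flat_sketch` are assumptions for illustration, whereas the library functions work on the actual policy networks.

```python
from typing import Dict, List

Policies = Dict[str, List[List[float]]]


def get_flat_sketch(policies: Policies) -> List[float]:
    """Concatenate all parameter blocks of all sub-policies."""
    return [value
            for name in sorted(policies)
            for block in policies[name]
            for value in block]


def set_flat_sketch(policies: Policies, flat: List[float]) -> None:
    """Write a flat vector back into the parameter blocks in place."""
    total = sum(len(block) for blocks in policies.values() for block in blocks)
    if total != len(flat):
        raise ValueError("flat vector length does not match parameter count")
    offset = 0
    for name in sorted(policies):
        for block in policies[name]:
            block[:] = flat[offset:offset + len(block)]
            offset += len(block)
```

Using a fixed iteration order in both directions is what makes the two operations inverse to each other.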

Imitation Learning (IL) and Learning from Demonstrations (LfD)

ImitationEvents

Event interface defining statistics emitted by the imitation learning trainers.

BCRunner

Dev runner for imitation learning.

BCTrainer

Trainer for behavioral cloning.

BCAlgorithmConfig

Algorithm parameters for behavioral cloning.

BCValidationEvaluator

Evaluates a given policy on validation data.

BCLoss

Loss function for behavioral cloning.

Utilities

stack_numpy_dict_list

Stack list of dictionaries holding numpy arrays as values.

unstack_numpy_list_dict

Inverse of stack_numpy_dict_list().
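The stacking and unstacking of dictionaries can be illustrated with a short NumPy sketch. It assumes all dictionaries share the same keys and that the arrays under each key have matching shapes; the names `stack_dicts_sketch` and `unstack_dict_sketch` are hypothetical and the library functions may accept further arguments.

```python
from typing import Dict, List

import numpy as np


def stack_dicts_sketch(dict_list: List[Dict[str, np.ndarray]]) -> Dict[str, np.ndarray]:
    """Stack the arrays of each key along a new leading axis."""
    return {key: np.stack([d[key] for d in dict_list]) for key in dict_list[0]}


def unstack_dict_sketch(stacked: Dict[str, np.ndarray]) -> List[Dict[str, np.ndarray]]:
    """Split the leading axis back into a list of dictionaries."""
    n_items = len(next(iter(stacked.values())))
    return [{key: value[i] for key, value in stacked.items()} for i in range(n_items)]
```

Stacking three dictionaries whose `"obs"` entries have shape `(4,)` yields a single dictionary whose `"obs"` entry has shape `(3, 4)`.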

compute_gradient_norm

Computes the cumulative gradient norm of all provided parameters.
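One plausible reading of a cumulative gradient norm is the global L2 norm, that is the square root of the summed squared gradient entries over all parameters. The sketch below shows that reading with plain Python sequences; both the choice of the L2 norm and the name `global_grad_norm_sketch` are assumptions, and the library function operates on torch parameters.

```python
import math
from typing import Iterable, Sequence


def global_grad_norm_sketch(gradients: Iterable[Sequence[float]]) -> float:
    """Global L2 norm over the gradients of all parameters (assumed definition)."""
    total = 0.0
    for grad in gradients:
        total += sum(g * g for g in grad)   # accumulate squared entries
    return math.sqrt(total)
```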

stack_torch_dict_list

Stack list of dictionaries holding torch tensors as values.