Agents Module

Module containing the different categories of agents.

Module contents

The rl_agents.agents module contains the RL agent classes and utilities. It covers the multi-armed bandit (MAB) variants, tabular methods, and deep RL.

MAB Submodule

The rl_agents.agents.mab submodule includes:

  • Epsilon Greedy
  • Decreasing Epsilon Greedy
  • UCBs: UCB, UCB1, UCB2
  • Softmax
  • Pursuit

class EpsilonGreedy(n_arms, epsilon)

Bases: rl_agents.agents.mab.base.BaseMAB

Epsilon-Greedy agent.

The agent uses the epsilon-greedy approach to solve the multi-armed bandit (MAB) problem.

The parameter \(\epsilon\) controls the exploration-exploitation trade-off. With probability \(\epsilon\) the agent selects a random action; otherwise it selects the action with the highest average reward.

Parameters:
n_arms : int

Number of actions (arms) of the MAB.

epsilon : float

Probability of selecting a random action.

Attributes

means (numpy.array(float, ndim=1)) Vector containing the average reward of each arm.
trials (numpy.array(float, ndim=1)) Vector containing the number of trials made to each arm.

learn(self, a_idx, reward)

Make the EpsilonGreedy agent learn from the interaction.

The agent learns from its previous choice and the reward received for that action by updating the means and the trials.

Parameters:
reward : float

Reward received from the system after taking action a_idx.

a_idx : int

Index of the arm pulled (action taken).
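
One way to picture the update of means and trials is an incremental average. The sketch below is an illustration only, not the library's source: the helper name update_mean is hypothetical, and plain Python lists stand in for the numpy arrays listed under Attributes.

```python
def update_mean(means, trials, a_idx, reward):
    """Update the running average reward of arm a_idx in place.

    Hypothetical helper, not part of rl_agents.
    """
    trials[a_idx] += 1
    # Incremental mean: new_mean = old_mean + (reward - old_mean) / n
    means[a_idx] += (reward - means[a_idx]) / trials[a_idx]


means = [0.0, 0.0, 0.0]
trials = [0, 0, 0]
for r in (1.0, 0.0, 1.0, 1.0):
    update_mean(means, trials, a_idx=1, reward=r)

# Arm 1 was pulled four times with three rewards of 1.0,
# so its average is approximately 0.75; the other arms are unchanged.
```

The incremental form avoids storing the full reward history for each arm.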

predict(self)

Predict next action.

With probability \(\epsilon\) the agent selects a random arm. With probability \(1 - \epsilon\) the agent selects the arm that has the best average reward.

Returns:
int

Index of chosen action.
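
The selection rule described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the library's implementation: the function name epsilon_greedy_choice is hypothetical, and breaking ties in favor of the lowest index is an assumption.

```python
import random


def epsilon_greedy_choice(means, epsilon, rng=random):
    """Return an arm index chosen by an epsilon-greedy rule.

    Hypothetical helper, not part of rl_agents.
    """
    if rng.random() < epsilon:
        # Explore: pick any arm uniformly at random.
        return rng.randrange(len(means))
    # Exploit: pick the arm with the highest average reward
    # (ties go to the lowest index, an assumption of this sketch).
    return max(range(len(means)), key=lambda i: means[i])


arm = epsilon_greedy_choice([0.1, 0.9, 0.3], epsilon=0.1)
```

With epsilon = 0 the rule is purely greedy, and with epsilon = 1 it is purely random.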

class DecayEpsilon(n_arms, max_epsilon, decay)

Bases: rl_agents.agents.mab.base.BaseMAB

Agent that follows an epsilon-decreasing policy.

The agent uses the epsilon-greedy approach to solve the multi-armed bandit (MAB) problem, but with a decaying epsilon.

The parameter \(\epsilon\) controls the exploration-exploitation trade-off. With probability \(\epsilon\) the agent selects a random action; otherwise it selects the action with the highest average reward. After each interaction, epsilon is updated as epsilon = epsilon * decay.

Parameters:
n_arms : int

Number of actions (arms) of the MAB.

max_epsilon : float

Initial epsilon.

decay : float

Decay of the epsilon.

Attributes

epsilon (float) Current epsilon of the agent, updated at each interaction as epsilon = epsilon * decay.
means (numpy.array(float, ndim=1)) Vector containing the average reward of each arm.
trials (numpy.array(float, ndim=1)) Vector containing the number of trials made to each arm.

learn(self, a_idx, reward)

Make the DecayEpsilon agent learn from the interaction.

The DecayEpsilon agent learns from its previous choice and the reward received for that action by updating the means and the trials.

Parameters:
reward : float

Reward received from the system after taking action a_idx.

a_idx : int

Index of the arm pulled (action taken).

predict(self)

Predict next action and update epsilon.

With probability \(\epsilon\) the agent selects a random arm. With probability \(1 - \epsilon\) the agent selects the arm that has the best average reward. The call also updates epsilon as epsilon = epsilon * decay.

Returns:
int

Index of chosen action.
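
To see how quickly exploration fades, the decay rule can be unrolled over a few steps. The sketch below assumes epsilon starts at max_epsilon and is multiplied by decay once per step; the function name epsilon_schedule is hypothetical and not part of the library.

```python
def epsilon_schedule(max_epsilon, decay, n_steps):
    """List the epsilon used at each of the first n_steps interactions.

    Hypothetical helper; assumes one multiplicative decay per step.
    """
    epsilon = max_epsilon
    values = []
    for _ in range(n_steps):
        values.append(epsilon)
        epsilon *= decay  # epsilon = epsilon * decay
    return values


print(epsilon_schedule(1.0, 0.5, 4))  # prints [1.0, 0.5, 0.25, 0.125]
```

Under this assumption the epsilon at step k is max_epsilon * decay**k, so a decay close to 1 keeps the agent exploring for longer.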

class UCB(n_arms, p)

Bases: rl_agents.agents.mab.base.BaseMAB

MAB agent following an Upper Confidence Bound (UCB) policy.

The UCB agent selects the action that maximizes the function given by:

\[f(i) = \mu_i + U_i,\]

where \(\mu_i\) is the average reward of arm \(i\), and \(U_i\) is given by:

\[U_i = \sqrt{\frac{-\log{p}}{2 N_i} },\]

where \(N_i\) is the number of pulls made to arm \(i\).
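
The bound \(U_i\) can be evaluated directly from the formula above. The sketch below is only an illustration: the function name ucb_bound is hypothetical, and it assumes 0 < p <= 1 and at least one pull of the arm.

```python
import math


def ucb_bound(p, n_pulls):
    """Evaluate U_i = sqrt(-log(p) / (2 * N_i)) for a single arm.

    Hypothetical helper; assumes 0 < p <= 1 and n_pulls >= 1.
    """
    return math.sqrt(-math.log(p) / (2 * n_pulls))


# The bound shrinks as an arm is pulled more often.
few_pulls = ucb_bound(0.05, 1)
many_pulls = ucb_bound(0.05, 100)
```

A smaller p widens the bound, so rarely pulled arms receive a larger exploration bonus.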

Parameters:
n_arms : int

Number of actions (arms) of the MAB.

p : float

Probability of the true value being above the estimate plus the bound.

Attributes

means (numpy.array(float, ndim=1)) Vector containing the average reward of each arm.
trials (numpy.array(float, ndim=1)) Vector containing the number of trials made to each arm.
bounds (numpy.array(float, ndim=1)) Vector containing the upper bounds of each arm.
t (int) Total trial counter.

learn(self, a_idx, reward)

Learn from the interaction.

Update the means, the bounds and the trials.

Parameters:
reward : float

Reward received from the system after taking action a_idx.

a_idx : int

Index of the arm pulled (action taken).

predict(self)

Predict next action.

Pulls each arm once, then chooses the arm with the highest mean + bound, i.e. the arm that maximizes \(\mu_i + U_i\).

Returns:
int

Index of chosen action.
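
The two-phase selection described for predict can be sketched as follows. This is an illustration under assumptions rather than the library's code: the function name ucb_choice is hypothetical, plain lists stand in for the numpy arrays, and untried arms are pulled in index order.

```python
import math


def ucb_choice(means, trials, p):
    """Pick an arm following the UCB rule described above.

    Hypothetical helper, not part of rl_agents.
    """
    # Phase 1: pull every arm once (lowest untried index first).
    for i, n in enumerate(trials):
        if n == 0:
            return i
    # Phase 2: maximize mean + bound, with bound = sqrt(-log(p) / (2 * N_i)).
    scores = [m + math.sqrt(-math.log(p) / (2 * n))
              for m, n in zip(means, trials)]
    return max(range(len(scores)), key=scores.__getitem__)


arm = ucb_choice(means=[0.5, 0.0], trials=[3, 0], p=0.05)  # arm 1 is untried
```

Once all arms have been tried, an arm with few pulls can still win over one with an equal mean because its bound is larger.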

class UCB1(n_arms, c=4)

Bases: rl_agents.agents.mab.base.BaseMAB

MAB agent following the UCB1 policy.

Parameters:
n_arms : int

Number of actions (arms) of the MAB.

c : type

Description of parameter c.

Attributes

means (numpy.array(float, ndim=1)) Vector containing the average reward of each arm.
trials (numpy.array(float, ndim=1)) Vector containing the number of trials made to each arm.
bounds (numpy.array(float, ndim=1)) Vector containing the upper bounds of each arm.
t (int) Total trial counter.
n_arms
c

learn(self, a_idx, reward)

Learn from the interaction.

Parameters:
a_idx : int

Index of the arm pulled (action taken).

reward : float

Reward received from the system after taking action a_idx.

Returns:
type

Description of returned object.

predict(self)

Predict next action.

Returns:
int

Index of chosen action.

class UCB2(n_arms, alpha)

Bases: rl_agents.agents.mab.base.BaseMAB

MAB agent following the UCB2 policy.

Parameters:
n_arms : int

Number of actions (arms) of the MAB.

alpha : type

Description of parameter alpha.

Attributes

means (numpy.array(float, ndim=1)) Vector containing the average reward of each arm.
trials (numpy.array(float, ndim=1)) Vector containing the number of trials made to each arm.
bounds (numpy.array(float, ndim=1)) Vector containing the upper bounds of each arm.
rj (type) Description of attribute rj.
t (int) Total trial counter.
counter (type) Description of attribute counter.
current (type) Description of attribute current.
n_arms
alpha

learn(self, a_idx, reward)

Learn from the interaction.

Parameters:
a_idx : int

Index of the arm pulled (action taken).

reward : float

Reward received from the system after taking action a_idx.

Returns:
type

Description of returned object.

predict(self)

Predict next action.

Returns:
int

Index of chosen action.

class Softmax(n_arms, temperature)

Bases: rl_agents.agents.mab.base.BaseMAB

MAB agent following a softmax policy.

Parameters:
n_arms : int

Number of actions (arms) of the MAB.

temperature : type

Description of parameter temperature.

Attributes

means (numpy.array(float, ndim=1)) Vector containing the average reward of each arm.
p_arms (type) Description of attribute p_arms.
trials (numpy.array(float, ndim=1)) Vector containing the number of trials made to each arm.
n_arms
temperature

learn(self, a_idx, reward)

Learn from the interaction.

Parameters:
a_idx : int

Index of the arm pulled (action taken).

reward : float

Reward received from the system after taking action a_idx.

Returns:
type

Description of returned object.

predict(self)

Predict next action.

Returns:
int

Index of chosen action.

class Pursuit(n_arms, beta)

Bases: rl_agents.agents.mab.base.BaseMAB

MAB agent following a pursuit policy.

Parameters:
n_arms : int

Number of actions (arms) of the MAB.

beta : type

Description of parameter beta.

Attributes

means (numpy.array(float, ndim=1)) Vector containing the average reward of each arm.
p_arms (type) Description of attribute p_arms.
trials (numpy.array(float, ndim=1)) Vector containing the number of trials made to each arm.
n_arms
beta

learn(self, a_idx, reward)

Learn from the interaction.

Parameters:
reward : float

Reward received from the system after taking action a_idx.

a_idx : int

Index of the arm pulled (action taken).

Returns:
type

Description of returned object.

predict(self)

Predict next action.

Returns:
int

Index of chosen action.