Agents Module

Module containing the different categories of agents.

Module contents

The rl_agents.agents module provides the RL agent classes and utilities. It covers multi-armed bandit (MAB) variants, tabular methods, and deep RL.

MAB Submodule

The rl_agents.agents.mab submodule includes:

Epsilon Greedy
Decreasing Epsilon Greedy
UCBs: UCB, UCB1, UCB2
Softmax
Pursuit

class EpsilonGreedy(n_arms, epsilon)

Bases: rl_agents.agents.mab.base.BaseMAB

Epsilon-Greedy agent.

The agent uses the epsilon-greedy approach to solve the multi-armed bandit problem.

The parameter \(\epsilon\) is used for the exploration-exploitation trade-off. With probability \(\epsilon\) the agent selects a random action, otherwise it selects the action that has the best average reward.

Parameters:
n_arms : int
    Number of actions (arms) of the MAB.
epsilon : float
    Probability of selecting a random action.

Attributes

means	(numpy.array(float, ndim=1)) Vector containing the average reward of each arm.
trials	(numpy.array(float, ndim=1)) Vector containing the number of times each arm has been pulled.

learn(self, a_idx, reward)

Make the EpsilonGreedy agent learn from the interaction.

The EpsilonGreedy agent learns from its previous choice and the reward received for that action. Updates the means and the trials.

Parameters:
a_idx : int
    Index of the arm pulled (action taken).
reward : float
    Reward received from the system after taking action a_idx.
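The update of the means and the trials can be done incrementally, without storing past rewards. The following is a minimal sketch under that assumption; `update_mean` is a hypothetical helper written for illustration and is not part of rl_agents.

```python
import numpy as np

def update_mean(means, trials, a_idx, reward):
    """Hypothetical helper: update the running average of arm a_idx in place."""
    trials[a_idx] += 1
    # Incremental average: new_mean = old_mean + (reward - old_mean) / n
    means[a_idx] += (reward - means[a_idx]) / trials[a_idx]

means = np.zeros(3)   # average reward of each arm
trials = np.zeros(3)  # number of pulls of each arm

# Arm 1 is pulled three times with rewards 1, 0, 1.
for r in (1.0, 0.0, 1.0):
    update_mean(means, trials, a_idx=1, reward=r)
```

After the three pulls, the average stored for arm 1 is 2/3 and the other arms are unchanged.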

predict(self)

Predict next action.

With probability \(\epsilon\) the agent selects a random arm. With probability \(1 - \epsilon\) the agent selects the arm that has the best average reward.

Returns:	int Index of chosen action.
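The selection rule described above can be sketched as a standalone function. This is an illustration of the epsilon-greedy rule, not the library's implementation; the name `epsilon_greedy_choice` and the explicit `rng` argument are assumptions made for the example.

```python
import numpy as np

def epsilon_greedy_choice(means, epsilon, rng):
    """Hypothetical helper: pick an arm index with the epsilon-greedy rule."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all arms.
        return int(rng.integers(len(means)))
    # Exploit: choose the arm with the best average reward.
    return int(np.argmax(means))

rng = np.random.default_rng(0)
means = np.array([0.2, 0.8, 0.5])

# epsilon = 0 always exploits, epsilon = 1 always explores.
greedy = epsilon_greedy_choice(means, epsilon=0.0, rng=rng)
choices = [epsilon_greedy_choice(means, epsilon=1.0, rng=rng) for _ in range(200)]
```

With `epsilon=0.0` the function always returns the index of the best mean; with `epsilon=1.0` every arm is sampled uniformly.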

class DecayEpsilon(n_arms, max_epsilon, decay)

Bases: rl_agents.agents.mab.base.BaseMAB

Agent that follows an epsilon-decreasing policy.

The agent uses the epsilon-greedy approach to solve the multi-armed bandit problem, but decays epsilon over time.

The parameter \(\epsilon\) is used for the exploration-exploitation trade-off. With probability \(\epsilon\) the agent selects a random action, otherwise it selects the action that has the best average reward. After each interaction, epsilon is updated as epsilon = epsilon * decay.

Parameters:
n_arms : int
    Number of actions (arms) of the MAB.
max_epsilon : float
    Initial value of epsilon.
decay : float
    Multiplicative factor applied to epsilon after each interaction.

Attributes

epsilon	(float) Current epsilon of the agent. Updated after each interaction as epsilon = epsilon * decay.
means	(numpy.array(float, ndim=1)) Vector containing the average reward of each arm.
trials	(numpy.array(float, ndim=1)) Vector containing the number of times each arm has been pulled.

learn(self, a_idx, reward)

Make the DecayEpsilon agent learn from the interaction.

The DecayEpsilon agent learns from its previous choice and the reward received for that action. Updates the means and the trials.

Parameters:
a_idx : int
    Index of the arm pulled (action taken).
reward : float
    Reward received from the system after taking action a_idx.

predict(self)

Predict next action and update epsilon.

With probability \(\epsilon\) the agent selects a random arm. With probability \(1 - \epsilon\) the agent selects the arm that has the best average reward.

Returns:	int Index of chosen action.
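The effect of the multiplicative decay can be seen by listing the successive values of epsilon. The sketch below assumes the current epsilon is used for the choice and then multiplied by decay; whether the library applies the update before or after the choice is not stated here. `epsilon_schedule` is a hypothetical helper, not part of rl_agents.

```python
def epsilon_schedule(max_epsilon, decay, n_steps):
    """Hypothetical helper: epsilon used at each of n_steps successive predictions."""
    epsilons = []
    epsilon = max_epsilon
    for _ in range(n_steps):
        epsilons.append(epsilon)  # value used for this prediction (assumed order)
        epsilon *= decay          # epsilon = epsilon * decay
    return epsilons

schedule = epsilon_schedule(max_epsilon=1.0, decay=0.5, n_steps=4)
```

Under this assumption, epsilon after k updates equals max_epsilon * decay ** k, so a decay close to 1 keeps exploration high for longer.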

class UCB(n_arms, p)

Bases: rl_agents.agents.mab.base.BaseMAB

MAB agent following an Upper Confidence Bound (UCB) policy.

The UCB agent selects the action that maximizes the function given by:

\[f(i) = \mu_i + U_i,\]

where \(\mu_i\) is the average reward of arm \(i\), and \(U_i\) is given by:

\[U_i = \sqrt{\frac{-\log{p}}{2 N_i} },\]

where \(N_i\) is the number of pulls made to arm \(i\).
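One way to read this bound is through Hoeffding's inequality. The derivation below assumes rewards lie in \([0, 1]\), which this reference does not state, and writes \(\mu_i^{*}\) for the true mean reward of arm \(i\) (a symbol introduced here for illustration). Under that assumption:

\[P\left(\mu_i^{*} > \mu_i + U_i\right) \le e^{-2 N_i U_i^2}.\]

Setting the right-hand side equal to \(p\) and solving for \(U_i\) recovers the expression above:

\[e^{-2 N_i U_i^2} = p \quad\Longrightarrow\quad U_i = \sqrt{\frac{-\log{p}}{2 N_i} }.\]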

Parameters:
n_arms : int
    Number of actions (arms) of the MAB.
p : float
    Probability of the true value being above the estimate plus the bound.

Attributes

means	(numpy.array(float, ndim=1)) Vector containing the average reward of each arm.
trials	(numpy.array(float, ndim=1)) Vector containing the number of times each arm has been pulled.
bounds	(numpy.array(float, ndim=1)) Vector containing the upper bounds of each arm.
t	(int) Total trial counter.

learn(self, a_idx, reward)

Learn from the interaction.

Update the means, the bounds and the trials.

Parameters:
a_idx : int
    Index of the arm pulled (action taken).
reward : float
    Reward received from the system after taking action a_idx.

predict(self)

Predict next action.

Pulls each arm once, then chooses the arm that gives the best mean + bound.

Returns:	int Index of chosen action.
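The prediction rule can be sketched with NumPy as follows. This is an illustration of the formula \(f(i) = \mu_i + U_i\) given above, not the library's code; the function name `ucb_choice` and the choice of the first untried arm during the initial round are assumptions.

```python
import numpy as np

def ucb_choice(means, trials, p):
    """Hypothetical helper: pick the arm maximizing mean + confidence bound."""
    # Pull each arm once before using the bound (avoids division by zero).
    untried = np.flatnonzero(trials == 0)
    if untried.size > 0:
        return int(untried[0])
    # U_i = sqrt(-log(p) / (2 * N_i))
    bounds = np.sqrt(-np.log(p) / (2.0 * trials))
    return int(np.argmax(means + bounds))

means = np.array([0.5, 0.4])
trials = np.array([100.0, 1.0])
choice = ucb_choice(means, trials, p=0.05)
```

Here arm 1 has the lower average but has been pulled only once, so its bound is much larger and it is the arm selected.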

class UCB1(n_arms, c=4)

Bases: rl_agents.agents.mab.base.BaseMAB

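MAB agent following the UCB1 policy.

Parameters:
n_arms : int
    Number of actions (arms) of the MAB.
c : float, optional
    Constant used in the upper confidence bound. Defaults to 4.

Attributes

means	(numpy.array(float, ndim=1)) Vector containing the average reward of each arm.
trials	(numpy.array(float, ndim=1)) Vector containing the number of times each arm has been pulled.
bounds	(numpy.array(float, ndim=1)) Vector containing the upper bounds of each arm.
t	(int) Total trial counter.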
n_arms
c

learn(self, a_idx, reward)

Learn from the interaction.

Parameters:
a_idx : int
    Index of the arm pulled (action taken).
reward : float
    Reward received from the system after taking action a_idx.

predict(self)

Predict next action.

Returns:	int Index of chosen action.

class UCB2(n_arms, alpha)

Bases: rl_agents.agents.mab.base.BaseMAB

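MAB agent following the UCB2 policy.

Parameters:
n_arms : int
    Number of actions (arms) of the MAB.
alpha : float
    Parameter \(\alpha\) of the UCB2 policy.

Attributes

means	(numpy.array(float, ndim=1)) Vector containing the average reward of each arm.
trials	(numpy.array(float, ndim=1)) Vector containing the number of times each arm has been pulled.
bounds	(numpy.array(float, ndim=1)) Vector containing the upper bounds of each arm.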
rj	(type) Description of attribute rj.
t	(int) Total trial counter.
counter	(type) Description of attribute counter.
current	(type) Description of attribute current.
n_arms
alpha

learn(self, a_idx, reward)

Learn from the interaction.

Parameters:
a_idx : int
    Index of the arm pulled (action taken).
reward : float
    Reward received from the system after taking action a_idx.

predict(self)

Predict next action.

Returns:	int Index of chosen action.

class Softmax(n_arms, temperature)

Bases: rl_agents.agents.mab.base.BaseMAB

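MAB agent following a softmax policy.

Parameters:
n_arms : int
    Number of actions (arms) of the MAB.
temperature : float
    Temperature of the softmax distribution.

Attributes

means	(numpy.array(float, ndim=1)) Vector containing the average reward of each arm.
p_arms	(numpy.array(float, ndim=1)) Vector containing the selection probability of each arm.
trials	(numpy.array(float, ndim=1)) Vector containing the number of times each arm has been pulled.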
n_arms
temperature

learn(self, a_idx, reward)

Learn from the interaction.

Parameters:
a_idx : int
    Index of the arm pulled (action taken).
reward : float
    Reward received from the system after taking action a_idx.

predict(self)

Predict next action.

Returns:	int Index of chosen action.

class Pursuit(n_arms, beta)

Bases: rl_agents.agents.mab.base.BaseMAB

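MAB agent following a pursuit policy.

Parameters:
n_arms : int
    Number of actions (arms) of the MAB.
beta : float
    Parameter \(\beta\) of the pursuit policy.

Attributes

means	(numpy.array(float, ndim=1)) Vector containing the average reward of each arm.
p_arms	(numpy.array(float, ndim=1)) Vector containing the selection probability of each arm.
trials	(numpy.array(float, ndim=1)) Vector containing the number of times each arm has been pulled.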
n_arms
beta

learn(self, a_idx, reward)

Learn from the interaction.

Parameters:
a_idx : int
    Index of the arm pulled (action taken).
reward : float
    Reward received from the system after taking action a_idx.

predict(self)

Predict next action.

Returns:	int Index of chosen action.