Agents Module
Module containing the different categories of agents.
Module contents
The rl_agents.agents module contains the RL agent classes and
utilities. It covers the MAB variants, tabular methods, and
Deep RL.
MAB Submodule
The rl_agents.agents.mab submodule includes:
- Epsilon Greedy
- Decreasing Epsilon Greedy
- UCBs: UCB, UCB1, UCB2
- Softmax
- Pursuit
class EpsilonGreedy(n_arms, epsilon)
Bases: rl_agents.agents.mab.base.BaseMAB
Epsilon-Greedy agent.
The agent uses the epsilon-greedy approach to solve the Multi-Armed bandit problem.
The parameter \(\epsilon\) is used for the exploration-exploitation trade-off. With probability \(\epsilon\) the agent selects a random action, otherwise it selects the action that has the best average reward.
Parameters: - n_arms : int
Number of actions (arms) of the MAB.
- epsilon : float
Probability of selecting a random action.
Attributes
- means : numpy.array(float, ndim=1)
Vector containing the average reward of each arm.
- trials : numpy.array(float, ndim=1)
Vector containing the number of trials made to each arm.
learn(self, a_idx, reward)
Make the EpsilonGreedy agent learn from the interaction.
The EpsilonGreedy agent learns from its previous choice and the reward received from this action. Updates the means and the trials.
Parameters: - reward : float
Reward received from the system after taking action a_idx.
- a_idx : int
Index of the arm pulled (action taken).
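The selection rule and the learn update described above can be sketched as follows. This is a minimal illustration, not the rl_agents source: the class name EpsilonGreedySketch, the choose method, the seed argument, and the incremental-mean formula are assumptions made for the example.

```python
import numpy as np


class EpsilonGreedySketch:
    """Illustrative stand-in; not the rl_agents implementation."""

    def __init__(self, n_arms, epsilon, seed=None):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.means = np.zeros(n_arms)   # average reward of each arm
        self.trials = np.zeros(n_arms)  # number of pulls of each arm
        # Assumption: a local random generator, seeded for reproducibility.
        self._rng = np.random.default_rng(seed)

    def choose(self):
        """Hypothetical method: pick the index of the arm to pull."""
        # Explore with probability epsilon, otherwise exploit the best mean.
        if self._rng.random() < self.epsilon:
            return int(self._rng.integers(self.n_arms))
        return int(np.argmax(self.means))

    def learn(self, a_idx, reward):
        """Update trials and the running mean of the pulled arm."""
        self.trials[a_idx] += 1
        # Incremental mean: new_mean = old_mean + (reward - old_mean) / n
        self.means[a_idx] += (reward - self.means[a_idx]) / self.trials[a_idx]
```

The incremental form avoids storing the full reward history for each arm; only means and trials are kept, matching the attributes listed above.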
class DecayEpsilon(n_arms, max_epsilon, decay)
Bases: rl_agents.agents.mab.base.BaseMAB
Agent that follows an epsilon-decreasing policy.
The agent uses the epsilon-greedy approach to solve the Multi-Armed bandit problem, but decays epsilon over time.
The parameter \(\epsilon\) is used for the exploration-exploitation trade-off. With probability \(\epsilon\) the agent selects a random action, otherwise it selects the action that has the best average reward. After each interaction the epsilon is updated as epsilon = epsilon * decay.
Parameters: - n_arms : int
Number of actions (arms) of the MAB.
- max_epsilon : float
Initial epsilon.
- decay : float
Multiplicative decay factor applied to epsilon after each interaction.
Attributes
- epsilon : float
Epsilon of the agent. Constantly updated as epsilon = epsilon * decay.
- means : numpy.array(float, ndim=1)
Vector containing the average reward of each arm.
- trials : numpy.array(float, ndim=1)
Vector containing the number of trials made to each arm.
learn(self, a_idx, reward)
Make the DecayEpsilon agent learn from the interaction.
The MAB agent learns from its previous choice and the reward received from this action. Updates the means and the trials.
Parameters: - reward : float
Reward received from the system after taking action a_idx.
- a_idx : int
Index of the arm pulled (action taken).
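To make the update epsilon = epsilon * decay concrete, the sketch below lists the epsilon in effect at each interaction. The helper name epsilon_schedule is hypothetical and is not part of rl_agents; it only replays the update rule quoted above, starting from max_epsilon.

```python
def epsilon_schedule(max_epsilon, decay, n_steps):
    """Return the epsilon in effect at each of n_steps interactions."""
    epsilons = []
    epsilon = max_epsilon
    for _ in range(n_steps):
        epsilons.append(epsilon)
        epsilon = epsilon * decay  # update rule quoted in the text
    return epsilons


# With max_epsilon=1.0 and decay=0.5 the exploration rate halves each step.
print(epsilon_schedule(1.0, 0.5, 4))  # [1.0, 0.5, 0.25, 0.125]
```

Under this rule the epsilon at interaction t equals max_epsilon * decay**t, so a decay between 0 and 1 shifts the agent from exploration toward exploitation.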
class UCB(n_arms, p)
Bases: rl_agents.agents.mab.base.BaseMAB
MAB agent following an Upper Confidence Bound policy.
The UCB selects the action that maximizes the function given by:
\[f(i) = \mu_i + U_i,\]
where \(\mu_i\) is the average reward of arm \(i\), and \(U_i\) is given by:
\[U_i = \sqrt{\frac{-\log{p}}{2 N_i}},\]
where \(N_i\) is the number of pulls made to arm \(i\).
Parameters: - n_arms : int
Number of actions (arms) of the MAB.
- p : float
Probability of the true value being above the estimate plus the bound.
Attributes
- means : numpy.array(float, ndim=1)
Vector containing the average reward of each arm.
- trials : numpy.array(float, ndim=1)
Vector containing the number of trials made to each arm.
- bounds : numpy.array(float, ndim=1)
Vector containing the upper bounds of each arm.
- t : int
Total trial counter.
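The score \(f(i) = \mu_i + U_i\) defined above can be computed as in the following sketch. The function name ucb_scores is hypothetical, and giving arms with zero pulls an infinite bound is an assumption of this example; the documentation does not say how the library handles unpulled arms.

```python
import numpy as np


def ucb_scores(means, trials, p):
    """Return f(i) = mu_i + U_i with U_i = sqrt(-log(p) / (2 * N_i))."""
    means = np.asarray(means, dtype=float)
    trials = np.asarray(trials, dtype=float)
    # Assumption: arms that were never pulled get an infinite bound,
    # so they are tried before any pulled arm is preferred.
    bounds = np.full(means.shape, np.inf)
    pulled = trials > 0
    bounds[pulled] = np.sqrt(-np.log(p) / (2.0 * trials[pulled]))
    return means + bounds


# The arm with fewer pulls has the wider bound and can win despite a lower mean.
scores = ucb_scores(means=[0.5, 0.4], trials=[10, 1], p=0.05)
best_arm = int(np.argmax(scores))
```

A smaller p widens every bound, which makes the agent explore rarely pulled arms for longer.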
class UCB1(n_arms, c=4)
Bases: rl_agents.agents.mab.base.BaseMAB
Short summary.
Parameters: - n_arms : int
Number of actions (arms) of the MAB.
- c : type
Description of parameter c.
Attributes
- means : type
Description of attribute means.
- trials : type
Description of attribute trials.
- bounds : type
Description of attribute bounds.
- t : type
Description of attribute t.
- n_arms
- c
class UCB2(n_arms, alpha)
Bases: rl_agents.agents.mab.base.BaseMAB
Short summary.
Parameters: - n_arms : int
Number of actions (arms) of the MAB.
- alpha : type
Description of parameter alpha.
Attributes
- means : type
Description of attribute means.
- trials : type
Description of attribute trials.
- bounds : type
Description of attribute bounds.
- rj : type
Description of attribute rj.
- t : type
Description of attribute t.
- counter : type
Description of attribute counter.
- current : type
Description of attribute current.
- n_arms
- alpha
class Softmax(n_arms, temperature)
Bases: rl_agents.agents.mab.base.BaseMAB
Short summary.
Parameters: - n_arms : int
Number of actions (arms) of the MAB.
- temperature : type
Description of parameter temperature.
Attributes
- means : type
Description of attribute means.
- p_arms : type
Description of attribute p_arms.
- trials : type
Description of attribute trials.
- n_arms
- temperature
class Pursuit(n_arms, beta)
Bases: rl_agents.agents.mab.base.BaseMAB
Short summary.
Parameters: - n_arms : int
Number of actions (arms) of the MAB.
- beta : type
Description of parameter beta.
Attributes
- means : type
Description of attribute means.
- p_arms : type
Description of attribute p_arms.
- trials : type
Description of attribute trials.
- n_arms
- beta