A comprehensive Python library implementing a variety of contextual and non-contextual multi-armed bandit algorithms—including LinUCB, Epsilon-Greedy, Upper Confidence Bound (UCB), Thompson Sampling, KernelUCB, NeuralLinearBandit, and DecisionTreeBandit—designed for reinforcement learning applications that require decision-making under uncertainty with or without contextual information.
Contextual Algorithms:
- LinUCB: Balances exploration and exploitation using linear regression with upper confidence bounds.
- Epsilon-Greedy: Explores randomly with probability epsilon and exploits the best-known option otherwise.
- KernelUCB: Uses kernel methods to capture non-linear relationships in the context space.
- NeuralLinearBandit: Combines neural networks for feature extraction with linear models for prediction.
- DecisionTreeBandit: Employs decision trees to model complex relationships between context and rewards.
Non-Contextual Algorithms:
- Upper Confidence Bound (UCB): Selects arms based on upper confidence bounds of estimated rewards.
- Thompson Sampling: Uses Bayesian methods to balance exploration and exploitation.
Install from PyPI
pip install contextual-bandits-algos
Clone the Repository
git clone https://github.com/singhsidhukuldeep/contextual-bandits.git
cd contextual-bandits
Install the Requirements
pip install -r requirements.txt
Install the Package
pip install .
Run the Examples
python examples/example_linucb.py
python examples/example_epsilon_greedy.py
python examples/example_ucb.py
python examples/example_thompson_sampling.py
python examples/example_kernelucb.py
python examples/example_neurallinear.py
python examples/example_tree_bandit.py
To run multiple algorithms in a single script:
python examples/example_usage.py
Run All Example *.py Files
cd examples
for f in *.py; do echo "$f"; python "$f"; done
Below are detailed descriptions of each algorithm, indicating whether it is contextual or non-contextual and how it works, along with examples demonstrating usage.
LinUCB
- Type: Contextual
- Description: The LinUCB algorithm is a contextual bandit algorithm that uses linear regression to predict the expected reward for each arm given the current context. It balances exploration and exploitation by adding an upper confidence bound to the estimated rewards.
- Model: Assumes that the reward is a linear function of the context features.
- Exploration: Incorporates uncertainty in the estimation by adding a confidence interval (scaled by `alpha`).
- Exploitation: Chooses the arm with the highest upper confidence bound.
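To make the mechanics concrete, here is a minimal from-scratch sketch of the disjoint LinUCB selection and update rules. It is illustrative only; the class name, constructor arguments, and methods are not this library's API (see examples/example_linucb.py for actual usage).

```python
import numpy as np

class LinUCBSketch:
    """Disjoint LinUCB: one ridge-regression model per arm plus a UCB bonus."""

    def __init__(self, n_arms, n_features, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(n_features) for _ in range(n_arms)]    # per-arm Gram matrix + identity
        self.b = [np.zeros(n_features) for _ in range(n_arms)]  # per-arm reward-weighted contexts

    def select_arm(self, context):
        context = np.asarray(context, dtype=float)
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                                        # ridge estimate of arm weights
            bonus = self.alpha * np.sqrt(context @ A_inv @ context)  # confidence width
            scores.append(theta @ context + bonus)                   # upper confidence bound
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        context = np.asarray(context, dtype=float)
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context
```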
Epsilon-Greedy
- Type: Contextual
- Description: The Epsilon-Greedy algorithm selects a random arm with probability `epsilon` (exploration) and the best-known arm with probability `1 - epsilon` (exploitation). It uses the context to predict rewards for each arm.
- Model: Uses linear models or other estimators to predict rewards based on context.
- Exploration: With probability `epsilon`, selects a random arm.
- Exploitation: With probability `1 - epsilon`, selects the arm with the highest predicted reward.
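A minimal sketch of the contextual epsilon-greedy loop is shown below, using per-arm regularized least-squares estimates as the reward model. The names and parameters are illustrative, not the library's API (see examples/example_epsilon_greedy.py).

```python
import numpy as np

class EpsilonGreedySketch:
    """Contextual epsilon-greedy with per-arm regularized least-squares estimates."""

    def __init__(self, n_arms, n_features, epsilon=0.1):
        self.epsilon = epsilon
        self.A = [np.eye(n_features) for _ in range(n_arms)]
        self.b = [np.zeros(n_features) for _ in range(n_arms)]

    def select_arm(self, context):
        if np.random.random() < self.epsilon:                  # explore: pick a random arm
            return np.random.randint(len(self.A))
        context = np.asarray(context, dtype=float)
        preds = [np.linalg.solve(A, b) @ context for A, b in zip(self.A, self.b)]
        return int(np.argmax(preds))                           # exploit: best predicted reward

    def update(self, arm, context, reward):
        context = np.asarray(context, dtype=float)
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context
```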
Upper Confidence Bound (UCB)
- Type: Non-Contextual
- Description: The UCB algorithm selects arms based on upper confidence bounds of the estimated rewards, without considering any context. It is suitable when no contextual information is available.
- Model: Estimates the average reward for each arm.
- Exploration: Adds a confidence term to the average reward to explore less-tried arms.
- Exploitation: Chooses the arm with the highest upper confidence bound.
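The classic UCB1 rule can be sketched in a few lines; this is an illustrative implementation under the standard UCB1 bonus, not the library's API (see examples/example_ucb.py).

```python
import math

class UCB1Sketch:
    """UCB1: running mean per arm plus a sqrt(2 * ln(t) / n) confidence term."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms     # pulls per arm
        self.values = [0.0] * n_arms   # running mean reward per arm
        self.t = 0                     # total number of pulls

    def select_arm(self):
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm             # try every arm once before using the bound
        ucbs = [v + math.sqrt(2 * math.log(self.t) / n)
                for v, n in zip(self.values, self.counts)]
        return max(range(len(ucbs)), key=ucbs.__getitem__)

    def update(self, arm, reward):
        self.t += 1
        self.counts[arm] += 1
        # incremental update of the running mean
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```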
Thompson Sampling
- Type: Non-Contextual
- Description: Thompson Sampling is a Bayesian algorithm that selects arms based on samples drawn from the posterior distributions of each arm's reward probability.
- Model: Assumes Bernoulli-distributed rewards for each arm.
- Exploration & Exploitation: Balances both by sampling from the posterior distributions.
- Updates: Updates the posterior distributions based on observed rewards.
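For reference, here is a minimal sketch of Beta-Bernoulli Thompson Sampling; the class and method names are illustrative, not the library's API (see examples/example_thompson_sampling.py).

```python
import numpy as np

class ThompsonSamplingSketch:
    """Beta-Bernoulli Thompson Sampling: sample each arm's posterior, play the max."""

    def __init__(self, n_arms):
        self.successes = np.ones(n_arms)   # Beta(1, 1) uniform prior
        self.failures = np.ones(n_arms)

    def select_arm(self):
        samples = np.random.beta(self.successes, self.failures)  # one draw per arm
        return int(np.argmax(samples))

    def update(self, arm, reward):
        # reward is assumed to be 0 or 1 (Bernoulli)
        self.successes[arm] += reward
        self.failures[arm] += 1 - reward
```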
KernelUCB
- Type: Contextual
- Description: KernelUCB uses kernel methods to capture non-linear relationships between contexts and rewards. It extends the UCB algorithm to a kernelized context space.
- Model: Uses a kernel function (e.g., RBF kernel) to compute similarity between contexts.
- Exploration: Adds an exploration term based on the uncertainty in the kernel space.
- Exploitation: Predicts the expected reward using kernel regression.
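Below is a simplified sketch of a per-arm kernelized UCB rule using an RBF kernel and kernel ridge regression. It is illustrative only and recomputes the kernel matrix on every step for clarity; names and parameters are not the library's API (see examples/example_kernelucb.py).

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel measuring similarity between two context vectors."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

class KernelUCBSketch:
    """Per-arm kernel ridge regression with a UCB bonus from the predictive variance."""

    def __init__(self, n_arms, alpha=1.0, lam=1.0):
        self.alpha, self.lam = alpha, lam
        self.X = [[] for _ in range(n_arms)]   # observed contexts per arm
        self.y = [[] for _ in range(n_arms)]   # observed rewards per arm

    def select_arm(self, context):
        scores = []
        for X, y in zip(self.X, self.y):
            if not X:
                scores.append(np.inf)          # unexplored arm: force a first pull
                continue
            K = np.array([[rbf_kernel(a, b) for b in X] for a in X])
            k = np.array([rbf_kernel(x, context) for x in X])
            K_inv = np.linalg.inv(K + self.lam * np.eye(len(X)))
            mean = k @ K_inv @ np.array(y)                      # kernel regression estimate
            var = rbf_kernel(context, context) - k @ K_inv @ k  # predictive variance
            scores.append(mean + self.alpha * np.sqrt(max(var, 0.0)))
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        self.X[arm].append(np.asarray(context, dtype=float))
        self.y[arm].append(float(reward))
```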
NeuralLinearBandit
- Type: Contextual
- Description: NeuralLinearBandit uses a neural network to learn a representation of the context and then applies linear regression on the learned features.
- Model: Combines a neural network for feature extraction with a linear model for reward prediction.
- Exploration: Adds an exploration bonus based on the uncertainty of the linear model.
- Exploitation: Uses the predicted rewards from the linear model.
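The sketch below is a heavily simplified stand-in for the neural-linear idea: a frozen random one-hidden-layer feature map replaces the trained network, and a LinUCB-style linear head runs on the resulting features. A real neural-linear bandit trains the network on observed data; names and parameters here are illustrative, not the library's API (see examples/example_neurallinear.py).

```python
import numpy as np

class NeuralLinearSketch:
    """Simplified neural-linear bandit: a frozen random one-hidden-layer feature map
    stands in for a trained network; a LinUCB-style linear head runs on the features."""

    def __init__(self, n_arms, n_features, n_hidden=32, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(n_hidden, n_features))   # frozen "network" weights
        self.alpha = alpha
        self.A = [np.eye(n_hidden) for _ in range(n_arms)]
        self.b = [np.zeros(n_hidden) for _ in range(n_arms)]

    def _features(self, context):
        # In a real neural-linear bandit this representation is learned from data.
        return np.tanh(self.W @ np.asarray(context, dtype=float))

    def select_arm(self, context):
        z = self._features(context)
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ z + self.alpha * np.sqrt(z @ A_inv @ z))
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        z = self._features(context)
        self.A[arm] += np.outer(z, z)
        self.b[arm] += reward * z
```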
DecisionTreeBandit
- Type: Contextual
- Description: The DecisionTreeBandit algorithm uses decision trees to model the relationship between context and rewards, allowing it to capture non-linear patterns.
- Model: Fits a decision tree regressor for each arm based on the observed contexts and rewards.
- Exploration: Relies on the decision tree's predictions; may require additional mechanisms for exploration.
- Exploitation: Selects the arm with the highest predicted reward.
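As an illustration, the sketch below fits one scikit-learn DecisionTreeRegressor per arm and uses epsilon-greedy action selection as the additional exploration mechanism mentioned above. It assumes scikit-learn is available and is not the library's API (see examples/example_tree_bandit.py).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class DecisionTreeBanditSketch:
    """One decision-tree regressor per arm; epsilon-greedy supplies the exploration
    that the trees themselves do not provide."""

    def __init__(self, n_arms, epsilon=0.1, max_depth=5):
        self.epsilon = epsilon
        self.models = [DecisionTreeRegressor(max_depth=max_depth) for _ in range(n_arms)]
        self.X = [[] for _ in range(n_arms)]   # observed contexts per arm
        self.y = [[] for _ in range(n_arms)]   # observed rewards per arm

    def select_arm(self, context):
        if np.random.random() < self.epsilon:
            return np.random.randint(len(self.models))          # explore
        preds = []
        for model, X in zip(self.models, self.X):
            # An arm with no data yet gets an optimistic score so it is tried at least once.
            preds.append(model.predict([list(context)])[0] if X else np.inf)
        return int(np.argmax(preds))                            # exploit

    def update(self, arm, context, reward):
        self.X[arm].append(list(context))
        self.y[arm].append(float(reward))
        self.models[arm].fit(self.X[arm], self.y[arm])          # refit on all data for this arm
```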