Commit fd642bf

deleted hello stuff, added files

lordmikerahl committed Mar 27, 2024
1 parent 7d3d0ef commit fd642bf
Showing 7 changed files with 1,373 additions and 7 deletions.
362 changes: 362 additions & 0 deletions baybe_hack.ipynb
@@ -0,0 +1,362 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"from baybe.targets import NumericalTarget\n",
"from baybe.objective import Objective\n",
"\n",
"from baybe.parameters import NumericalDiscreteParameter, NumericalContinuousParameter\n",
"from baybe.searchspace import SearchSpace\n",
"\n",
"from baybe.recommenders import RandomRecommender, SequentialGreedyRecommender, NaiveHybridRecommender\n",
"from baybe.surrogates import GaussianProcessSurrogate\n",
"\n",
"from baybe.strategies import TwoPhaseStrategy\n",
"from baybe import Campaign"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setting the objectives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The DESIRABILITY mode enables the combination multiple targets via scalarization into a single value.\n",
"\n",
"See MATCH mode, instead of MAX/MIN + For more details on transformation functions: \n",
"https://emdgroup.github.io/baybe/userguide/targets.html"
]
},
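{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a side note (not used in this campaign): a minimal sketch of a MATCH-mode target, which rewards values close to the centre of the given bounds. The name, bounds, and transformation below are purely illustrative:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only (illustrative name and bounds; not part of the campaign below):\n",
"# a MATCH-mode target rewards values close to the centre of its bounds window.\n",
"example_match_target = NumericalTarget(\n",
"    name=\"example_match_target\",\n",
"    mode=\"MATCH\",\n",
"    bounds=(-10, 10),\n",
"    transformation=\"BELL\",  # MATCH offers \"TRIANGULAR\" and \"BELL\"; MAX/MIN only offer \"LINEAR\"\n",
")"
]
},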
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Set targets/objectives, efficiency?"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"\"\"\"\n",
"overpotential = NumericalTarget(\n",
" name=\"overpotential\", \n",
" mode=\"MAX\", \n",
" bounds=(-400, 0),\n",
" transformation=\"LINEAR\" # optional, will be applied if bounds are not None, LINEAR only one available for MAX/MIN\n",
" ) \n",
"\n",
"overpotential_slope = NumericalTarget(\n",
" name=\"overpotential_slope\", \n",
" mode=\"MAX\", \n",
" bounds=(-0.05, 0.05),\n",
" transformation=\"LINEAR\" # optional, will be applied if bounds are not None, LINEAR only one available for MAX/MIN\n",
" )\n",
"\n",
"objective = Objective(\n",
" mode=\"DESIRABILITY\",\n",
" targets=[overpotential, overpotential_slope],\n",
" weights=[1.0, 1.0], # optional, by default all weights are equal\n",
" combine_func=\"GEOM_MEAN\", # optional, geometric mean is the default\n",
")\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Search Space"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"parameters = [\n",
"NumericalDiscreteParameter(\n",
" name=\"Time (h)\",\n",
" values=np.arange(6, 25, 1) # Assuming time below 6 hours is discarded\n",
"),\n",
"NumericalDiscreteParameter(\n",
" name=\"pH\",\n",
" values=np.arange(-1, 15.1, 0.1)\n",
" ), \n",
"NumericalContinuousParameter( # Set this as continuous, the values seem quite small?\n",
" name=\"Inhibitor Concentration (M)\",\n",
" bounds=(0, 0.02)\n",
" ),\n",
"NumericalDiscreteParameter(\n",
" name=\"Salt Concentration (M)\",\n",
" values=np.arange(0, 2.01, 0.01),\n",
" )\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Substance parameter**\n",
"\n",
"Instead of values, this parameter accepts data in form of a dictionary. The items correspond to pairs of labels and SMILES. SMILES are string-based representations of molecular structures. Based on these, BayBE can assign each label a set of molecular descriptors as encoding.\n",
"\n",
"For instance, a parameter corresponding to a choice of solvents can be initialized with:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from baybe.parameters import SubstanceParameter\n",
"\n",
"SubstanceParameter(\n",
" name=\"Solvent\",\n",
" data={\n",
" \"Water\": \"O\",\n",
" \"1-Octanol\": \"CCCCCCCCO\",\n",
" \"Toluene\": \"CC1=CC=CC=C1\",\n",
" },\n",
" encoding=\"MORDRED\", # optional\n",
" decorrelate=0.7, # optional\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"The encoding option defines what kind of descriptors are calculated:\n",
"\n",
"MORDRED: 2D descriptors from the Mordred package. Since the original package is now unmaintained, baybe requires the community replacement mordredcommunity\n",
"\n",
"RDKIT: 2D descriptors from the RDKit package\n",
"\n",
"MORGAN_FP: Morgan fingerprints calculated with RDKit (1024 bits, radius 4)\n",
"\n",
"These calculations will typically result in 500 to 1500 numbers per molecule. **To avoid detrimental effects on the surrogate model fit, we reduce the number of descriptors via decorrelation before using them.** For instance, the decorrelate option in the example above specifies that only descriptors with a correlation lower than 0.7 to any other descriptor will be kept. This usually reduces the number of descriptors to 10-50, depending on the specific items in data.\n",
"\n",
"**WARNING:**\n",
"The descriptors calculated for a SubstanceParameter were developed to describe small molecules and are not suitable for other substances. If you deal with large molecules like polymers or arbitrary substance mixtures, we recommend to provide your own descriptors via the CustomParameter."
]
},
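{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch (not part of this campaign) of the same solvent parameter with a different encoding, just to show where the encoding and decorrelate options go:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: the same solvent data as above, switched to Morgan fingerprints.\n",
"SubstanceParameter(\n",
"    name=\"Solvent\",\n",
"    data={\n",
"        \"Water\": \"O\",\n",
"        \"1-Octanol\": \"CCCCCCCCO\",\n",
"        \"Toluene\": \"CC1=CC=CC=C1\",\n",
"    },\n",
"    encoding=\"MORGAN_FP\",  # the other options are \"MORDRED\" (used above) and \"RDKIT\"\n",
"    decorrelate=True,  # True applies the default correlation threshold; a float sets it explicitly\n",
")"
]
},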
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The encoding concept introduced above is generalized by the CustomParameter. Here, the user is expected to provide their own descriptors for the encoding.\n",
"\n",
"Take, for instance, a parameter that corresponds to the choice of a polymer. Polymers are not well represented by the small molecule descriptors utilized in the SubstanceParameter. Still, one could provide experimental measurements or common metrics used to classify polymers:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from baybe.parameters import CustomDiscreteParameter\n",
"\n",
"# Create or import new dataframe containing custom descriptors\n",
"\n",
"descriptors = pd.DataFrame(\n",
" {\n",
" \"Glass_Transition_TempC\": [20, -71, -39],\n",
" \"Weight_kDalton\": [120, 32, 241],\n",
" },\n",
" index=[\"Polymer A\", \"Polymer B\", \"Polymer C\"], # put labels in the index\n",
")\n",
"\n",
"CustomDiscreteParameter(\n",
" name=\"Polymer\",\n",
" data=descriptors,\n",
" decorrelate=True, # optional, uses default correlation threshold = 0.7?\n",
")\n",
"\n",
"# Add this to the parameters list afterwards"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"searchspace = SearchSpace.from_product(parameters)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Recommenders"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The **SequentialGreedyRecommender** is a powerful recommender that leverages BoTorch optimization functions to perform sequential Greedy optimization. It can be applied for discrete, continuous and hybrid sarch spaces. It is an implementation of the BoTorch optimization functions for discrete, continuous and mixed spaces. **It is important to note that this recommender performs a brute-force search when applied in hybrid search spaces, as it optimizes the continuous part of the space while exhaustively searching choices in the discrete subspace.** You can customize this behavior to only sample a certain percentage of the discrete subspace via the sample_percentage attribute and to choose different sampling strategies via the hybrid_sampler attribute. \n",
"\n",
"e.g.\n",
"strategy = TwoPhaseStrategy(recommender=SequentialGreedyRecommender(hybrid_sampler=\"Farthest\", sampling_percentage=0.3))\n",
"\n",
"The **NaiveHybridRecommender** can be applied to all search spaces, but is intended to be used in hybrid spaces. This recommender **combines individual recommenders for the continuous and the discrete subspaces. It independently optimizes each subspace and consolidates the best results to generate a candidate for the original hybrid space.** "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For implementing fully customized surrogate models e.g. from sklearn or PyTorch, see:\n",
"https://emdgroup.github.io/baybe/examples/Custom_Surrogates/Custom_Surrogates.html\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\lordm\\Desktop\\Projects\\baybe\\.venv\\lib\\site-packages\\baybe\\recommenders\\bayesian.py:492: UserWarning: The value of 'allow_recommending_already_measured' differs from what is specified in the discrete recommender. The value of the discrete recommender will be ignored.\n",
" warnings.warn(\n"
]
}
],
"source": [
"available_surr_models = [\n",
" \"GaussianProcessSurrogate\", \n",
" \"BayesianLinearSurrogate\",\n",
" \"MeanPredictionSurrogate\",\n",
" \"NGBoostSurrogate\",\n",
" \"RandomForestSurrogate\"\n",
"]\n",
"\n",
"available_acq_functions = [\n",
" \"qPI\", # q-Probability Of Improvement\n",
" \"qEI\", # q-Expected Improvement\n",
" \"qUCB\", # q-upper confidence bound with beta of 1.0\n",
"]\n",
"\n",
"# Defaults anyway\n",
"SURROGATE_MODEL = GaussianProcessSurrogate()\n",
"ACQ_FUNCTION = \"qEI\" # q-Expected Improvement, only q-fuctions are available for batch_size > 1\n",
"\n",
"seq_greedy_recommender = SequentialGreedyRecommender(\n",
" surrogate_model=SURROGATE_MODEL,\n",
" acquisition_function_cls=ACQ_FUNCTION,\n",
" hybrid_sampler=\"Farthest\", # find more details in the documentation\n",
" sampling_percentage=0.3, # should be relatively low\n",
" allow_repeated_recommendations=False,\n",
" allow_recommending_already_measured=False,\n",
" )\n",
"\n",
"hybrid_recommender = NaiveHybridRecommender(\n",
" allow_repeated_recommendations=False,\n",
" allow_recommending_already_measured=False\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Campaign Strategy"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"strategy = TwoPhaseStrategy(\n",
" initial_recommender = RandomRecommender(), # Initial recommender, if no training data is available\n",
" # Other initial recommenders don't seem to work for my hybrid search space/set of parameters\n",
" # Doesn't matter since I already have training data\n",
" recommender = seq_greedy_recommender, # Bayesian model-based optimization\n",
" # recommender = hybrid_recommender,\n",
" switch_after=1 # Switch to the model-based recommender after 1 batch or iteration (so the initial training data)\n",
")\n",
"\n",
"campaign = Campaign(searchspace, objective, strategy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import and read modified Excel file as dataframe? - Now containing only specific columns as training data - as in possibly this example: \n",
"\n",
"https://emdgroup.github.io/baybe/examples/Backtesting/full_initial_data.html\n",
"\n",
"\n",
"https://emdgroup.github.io/baybe/examples/Backtesting/full_lookup.html"
]
},
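{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of that workflow, assuming a hypothetical file training_data.xlsx whose columns match the parameter and target names used above (file name and batch size are placeholders):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: the file name, column layout and batch size are assumptions.\n",
"# The Excel columns are expected to match the parameter and target names defined above.\n",
"training_data = pd.read_excel(\"training_data.xlsx\")\n",
"\n",
"# Feed the existing measurements to the campaign ...\n",
"campaign.add_measurements(training_data)\n",
"\n",
"# ... and ask for the next batch of experiments.\n",
"# Depending on the installed BayBE version, the argument is called batch_size or batch_quantity.\n",
"recommendations = campaign.recommend(batch_size=5)\n",
"recommendations"
]
},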
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### For transfer learning see: \n",
"\n",
"https://emdgroup.github.io/baybe/userguide/transfer_learning\n",
"\n",
"&\n",
"\n",
"https://emdgroup.github.io/baybe/examples/Transfer_Learning/basic_transfer_learning.html"
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 2
}