fill in documentation in NLP notebook
Anmol-Srivastava committed Nov 8, 2023
1 parent 5af2ce7 commit 84f8204
Showing 1 changed file with 93 additions and 41 deletions.
134 changes: 93 additions & 41 deletions docs/source/examples/nlp/wilds_datasets.ipynb
@@ -4,25 +4,22 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"__This code is experimental. Some notable issues__\n",
"- transforms are very slow on even moderate batch sizes\n",
"- detection scheme design is not finalized"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Overview \n",
"## Overview\n",
"\n",
"This example demonstrates an experimental NLP-based drift detection algorithm. It uses the \"Civil Comments\" dataset ([link](https://github.com/p-lambda/wilds/blob/main/wilds/datasets/civilcomments_dataset.py) to a Python loading script with additional details/links) from the `wilds` library, which contains online comments meant to be used in toxicity classification problems.\n",
"\n",
"This notebook is a work in progress. Eventually, the contents will demonstrate an NLP-based drift detection algorithm in action, but until the feature is developed, it shows the loading and use of two datasets to be used in the examples:\n",
"This example and the experimental modules often pull directly and indirectly from [`alibi-detect`](https://github.com/SeldonIO/alibi-detect/tree/master) and its own [example(s)](https://docs.seldon.io/projects/alibi-detect/en/stable/examples/cd_text_imdb.html).\n",
"\n",
"- Civil Comments dataset: online comments to be used in toxicity classification problems \n",
"- Amazon Reviews dataset: amazon reviews to be used in a variety of NLP problems\n",
"## Notes\n",
"\n",
"The data is accessed by using the `wilds` library, which contains several such datasets and wraps them in an API as shown below. \n",
"This code is experimental, and has notable issues:\n",
"- transform functions are very slow, on even moderate batch sizes\n",
"- detector design is not generalized, and may not work on streaming problems, or with data representations of different types/shapes\n",
"- some warnings below are not addressed\n",
"\n",
"#### Imports"
"## Imports\n",
"\n",
"Code (transforms, alarm, detector) is pulled from the experimental module in `menelaus`, which is live but not fully tested. Note that commented code shows `wilds` modules being used to access and save the dataset to disk, but are excluded to save time. The example hence assumes the dataset is locally available."
]
},
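A hypothetical sketch of the data access described above, since the actual import/loading cell is collapsed in this diff. The `wilds.get_dataset` call is the library's documented entry point; the on-disk path and the `comment_text` column name are assumptions, not confirmed by this notebook.

```python
# Hypothetical sketch only -- not the collapsed notebook cell. It shows how the
# commented-out `wilds` calls mentioned above might download the Civil Comments
# data and keep the raw comment strings.
import pandas as pd
from wilds import get_dataset

# one-time download (large and slow); data lands under the given root directory
dataset = get_dataset(dataset="civilcomments", download=True, root_dir="data")

# read the saved file back and keep only the comment text as a plain list
df = pd.read_csv("data/civilcomments_v1.0/all_data.csv")  # assumed path
dataset_civil = df["comment_text"].tolist()
```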
{
@@ -52,9 +49,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Load Data\n",
"## Load Data\n",
"\n",
"Note that initially, the large data files need to be downloaded first. Later examples may assume the data is already stored to disk."
"Since some of the experimental modules are not very performant, the dataset is loaded and then limited to the first 300 data points (comments), which are split into three sequential batches of 100."
]
},
{
@@ -72,18 +69,71 @@
"batch3 = dataset_civil[200:300]"
]
},
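The collapsed lines above presumably build the first two batches the same way as the visible `batch3` line; a short sketch under that assumption:

```python
# Sketch of the batching described above; batch1/batch2 are assumed to mirror
# the visible batch3 line. Three sequential batches of 100 comments each.
batch1 = dataset_civil[0:100]
batch2 = dataset_civil[100:200]
batch3 = dataset_civil[200:300]
```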
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transforms Pipeline\n",
"\n",
"The major step is to initialize the transform functions that will be applied to the comments, to turn them into detector-compatible representations. \n",
"\n",
"First, the comments must be tokenized:\n",
"- set up an `AutoTokenizer` model from the `transformers` library with a convenience function, by specifying the desired model name and other arguments\n",
"- the convenience function lets the configured tokenizer be called repeatedly, using batch 1 as the training data\n",
"\n",
"Then, the tokens must be made into embeddings:\n",
"- an initial transform function uses a `transformers` model to extract embeddings from given tokens\n",
"- the subsequent transform function reduces the dimension via an `UntrainedAutoEncoder` to a manageable size"
]
},
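Before the actual setup below, a conceptual sketch of the embedding step. This is illustrative only and assumes it mirrors `alibi-detect`-style hidden-state extraction rather than the exact `menelaus` code: pull the hidden states of the last 8 encoder layers from a `transformers` model and average them into one fixed-size vector per comment.

```python
# Conceptual sketch of hidden-state embedding extraction (not the menelaus code).
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

tok = AutoTokenizer.from_pretrained("bert-base-cased")
model = TFAutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)

inputs = tok(["an example comment"], padding=True, return_tensors="tf")
outputs = model(**inputs)

last_8 = outputs.hidden_states[-8:]               # tuple of (1, seq_len, 768) tensors
stacked = tf.stack(last_8, axis=0)                # (8, 1, seq_len, 768)
embedding = tf.reduce_mean(stacked, axis=[0, 2])  # average over layers and tokens -> (1, 768)
print(embedding.shape)
```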
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\ASRIVASTAVA\\Documents\\repos\\menelaus\\venv\\lib\\site-packages\\transformers\\tokenization_utils_base.py:2418: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).\n",
" warnings.warn(\n",
"Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight']\n",
" warnings.warn(\n"
]
}
],
"source": [
"# tokens \n",
"tokenizer = auto_tokenize(model_name='bert-base-cased', pad_to_max_length=True, return_tensors='tf')\n",
"tokens = tokenizer(data=batch1)\n",
"\n",
"# embedding (TODO abstract this layers line)\n",
"layers = [-_ for _ in range(1, 8 + 1)]\n",
"embedder = extract_embedding(model_name='bert-base-cased', embedding_type='hidden_state', layers=layers)\n",
"\n",
"# dimension reduction via Untrained AutoEncoder\n",
"uae_reduce = uae_reduce_dimension(enc_dim=32)"
]
},
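For intuition, a sketch of what the untrained-autoencoder reduction above amounts to (mirroring the idea in `alibi-detect`; the layer sizes are arbitrary and this is not the `menelaus` implementation): a randomly initialized encoder that maps each embedding down to `enc_dim=32`, with no training performed.

```python
# Illustrative sketch of an "untrained autoencoder" reducer (not the menelaus code).
import numpy as np
import tensorflow as tf

enc_dim = 32
encoder = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(768,)),  # assumed embedding width
    tf.keras.layers.Dense(256, activation=tf.nn.relu),
    tf.keras.layers.Dense(enc_dim),                  # randomly initialized, never trained
])

fake_embeddings = np.random.randn(100, 768).astype("float32")  # stand-in for real embeddings
reduced = encoder(fake_embeddings)
print(reduced.shape)  # (100, 32)
```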
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Detector Setup\n",
"\n",
"Next a detector is setup. First, a `KolmogorovSmirnovAlarm` is initialized with default settings. When the amount of columns (which reject the null KS test hypothesis) exceeds the default ratio (0.25), this alarm will indicate drift has occurred. \n",
"\n",
"Then the detector is constructed. It is given the initialized alarm, and the ordered list of transforms configured above. The detector is then made to step through each available batch, and its state is printed as output. Note that the first batch establishes the reference data, the second establishes the test data, and the third will require recalibration (test is combined into reference) if drift is detected."
]
},
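To make the alarm rule concrete, a minimal sketch of column-wise KS testing (illustrative only; the function and parameter names are assumptions, not the `menelaus` API):

```python
# Minimal sketch of the column-wise KS alarm logic described above (not the menelaus code).
import numpy as np
from scipy import stats

def ks_alarm(reference, test, alpha=0.05, ratio_threshold=0.25):
    """Alarm when the share of columns rejecting the KS null exceeds the threshold."""
    n_cols = reference.shape[1]
    rejections = sum(
        stats.ks_2samp(reference[:, j], test[:, j]).pvalue < alpha
        for j in range(n_cols)
    )
    return (rejections / n_cols) > ratio_threshold

rng = np.random.default_rng(0)
ref = rng.normal(size=(100, 32))
same = rng.normal(size=(100, 32))              # same distribution
drifted = rng.normal(loc=1.0, size=(100, 32))  # mean shift in every column
print(ks_alarm(ref, same))     # False (with overwhelming probability)
print(ks_alarm(ref, drifted))  # True
```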
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']\n",
"- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).\n",
"All the weights of TFBertModel were initialized from the PyTorch model.\n",
@@ -94,16 +144,16 @@
"name": "stdout",
"output_type": "stream",
"text": [
"State after initial batch: baseline\n"
"\n",
"State after initial batch: baseline\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"c:\\Users\\ASRIVASTAVA\\Documents\\repos\\menelaus\\venv\\lib\\site-packages\\transformers\\tokenization_utils_base.py:2418: FutureWarning: The `pad_to_max_length` argument is deprecated and will be removed in a future version, use `padding=True` or `padding='longest'` to pad to the longest sequence in the batch, or use `padding='max_length'` to pad to a max length. In this case, you can give a specific length with `max_length` (e.g. `max_length=45`) or leave max_length to None to pad to the maximal input size of the model (e.g. 512 for Bert).\n",
" warnings.warn(\n",
"Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight']\n",
"Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']\n",
"- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).\n",
"All the weights of TFBertModel were initialized from the PyTorch model.\n",
@@ -114,14 +164,16 @@
"name": "stdout",
"output_type": "stream",
"text": [
"State after test batch alarm\n"
"\n",
"State after test batch: alarm\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight']\n",
"Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']\n",
"- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).\n",
"All the weights of TFBertModel were initialized from the PyTorch model.\n",
@@ -132,35 +184,35 @@
"name": "stdout",
"output_type": "stream",
"text": [
"State after new batch, recalibration alarm\n"
"\n",
"State after new batch, recalibration: alarm\n",
"\n"
]
}
],
"source": [
"# tokens \n",
"tokenizer = auto_tokenize(model_name='bert-base-cased', pad_to_max_length=True, return_tensors='tf')\n",
"tokens = tokenizer(data=batch1)\n",
"\n",
"# embedding (TODO abstract this layers line)\n",
"layers = [-_ for _ in range(1, 8 + 1)]\n",
"embedder = extract_embedding(model_name='bert-base-cased', embedding_type='hidden_state', layers=layers)\n",
"\n",
"# dimension reduction via Untrained AutoEncoder\n",
"uae_reduce = uae_reduce_dimension(enc_dim=32)\n",
"\n",
"# detector + set reference\n",
"ks_alarm = KolmogorovSmirnovAlarm()\n",
"detector = Detector(alarm=ks_alarm, transforms=[tokenizer, embedder, uae_reduce])\n",
"detector.step(batch1)\n",
"print(f\"State after initial batch: {detector.state}\")\n",
"print(f\"\\nState after initial batch: {detector.state}\\n\")\n",
"\n",
"# detector + add test (copy reference) \n",
"detector.step(batch2)\n",
"print(f\"State after test batch {detector.state}\")\n",
"print(f\"\\nState after test batch: {detector.state}\\n\")\n",
"\n",
"# recalibrate and re-evaluate (XXX - all batches must be same length)\n",
"detector.step(batch3)\n",
"print(f\"State after new batch, recalibration {detector.state}\")"
"print(f\"\\nState after new batch, recalibration: {detector.state}\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Final Notes\n",
"\n",
"We can see the baseline state after processing the initial batch, an alarm raised after observing test data, and then another alarm signal after a new test batch is observed and the reference is internally recalibrated."
]
}
],
