Draft of the EnsembleMetric page
pantonante committed Jan 4, 2024
1 parent eb71e6e commit 391a22f
Showing 3 changed files with 191 additions and 95 deletions.
140 changes: 69 additions & 71 deletions docs/astro.config.mjs
@@ -9,75 +9,73 @@ export default defineConfig({
remarkPlugins: [remarkMath],
rehypePlugins: [rehypeMathjax],
},
integrations: [
starlight({
title: 'Continuous Eval',
tableOfContents: { minHeadingLevel: 2, maxHeadingLevel: 4, },
customCss: [
// Relative path to your custom CSS file
'./src/styles/custom.css',
],
social: {
github: 'https://github.com/relari-ai/continuous-eval/tree/main',
},
sidebar: [
{
label: 'Getting Started',
items: [
// Each item here is one entry in the navigation menu.
{ label: 'Introduction', link: '/getting-started/introduction/'},
{ label: 'Installation', link: '/getting-started/installation/' },
{ label: 'Quick Start', link: '/getting-started/quickstart/' },
],
},
{
label: 'Metrics',
items: [
{ label: 'Overview', link: '/metrics/overview/' },
{
label: 'Retrieval',
autogenerate: { directory: '/metrics/Retrieval/' }
},
{
label: 'Generation',
items: [
{
label: 'Deterministic',
autogenerate: { directory: '/metrics/Generation/Deterministic/' }
},
{
label: 'Semantic',
items: [
{ label: 'DeBERTa Answer Scores', link: '/metrics/generation/semantic/deberta_answer_scores/' },
{ label: 'BERT Answer Similarity', link: '/metrics/generation/semantic/bert_answer_similarity/' },
{ label: 'BERT Answer Relevance', link: '/metrics/generation/semantic/bert_answer_relevance/' },
]
},
{
label: 'LLM-Based',
autogenerate: { directory: '/metrics/Generation/LLM-Based/' }
},
]
},
],
},
{
label: 'Datasets',
autogenerate: { directory: 'evaluators' },
},
{
label: 'Metric Ensembling',
items: [
{ label: 'Conformal Prediction', link: '/classification/conformal_prediction/' },
{ label: 'Classification', link: '/classification/classifier/' },
],
},

{
label: 'Examples',
autogenerate: { directory: 'examples' },
},
],
}),
],
integrations: [
starlight({
title: 'Continuous Eval',
tableOfContents: { minHeadingLevel: 2, maxHeadingLevel: 4, },
customCss: [
// Relative path to your custom CSS file
'./src/styles/custom.css',
],
social: {
github: 'https://github.com/relari-ai/continuous-eval/tree/main',
},
sidebar: [
{
label: 'Getting Started',
items: [
// Each item here is one entry in the navigation menu.
{ label: 'Introduction', link: '/getting-started/introduction/' },
{ label: 'Installation', link: '/getting-started/installation/' },
{ label: 'Quick Start', link: '/getting-started/quickstart/' },
],
},
{
label: 'Metrics',
items: [
{ label: 'Overview', link: '/metrics/overview/' },
{
label: 'Retrieval',
autogenerate: { directory: '/metrics/Retrieval/' }
},
{
label: 'Generation',
items: [
{
label: 'Deterministic',
autogenerate: { directory: '/metrics/Generation/Deterministic/' }
},
{
label: 'Semantic',
items: [
{ label: 'DeBERTa Answer Scores', link: '/metrics/generation/semantic/deberta_answer_scores/' },
{ label: 'BERT Answer Similarity', link: '/metrics/generation/semantic/bert_answer_similarity/' },
{ label: 'BERT Answer Relevance', link: '/metrics/generation/semantic/bert_answer_relevance/' },
]
},
{
label: 'LLM-Based',
autogenerate: { directory: '/metrics/Generation/LLM-Based/' }
},
]
},
],
},
{
label: 'Datasets',
autogenerate: { directory: 'evaluators' },
},
{
label: 'Metric Ensembling',
items: [
{ label: 'Classification', link: '/classification/classifier/' },
],
},
{
label: 'Examples',
autogenerate: { directory: 'examples' },
},
],
}),
],
});
131 changes: 122 additions & 9 deletions docs/src/content/docs/classification/classifier.md
@@ -1,10 +1,59 @@
---
title: Classifier
description: Use metric to predict human evaluator.
title: Metric Ensembling
sidebar:
order: 2
---


The aim of ensembling different metrics to predict the human label is to combine the strengths and balance out the weaknesses of individual metrics, ultimately leading to more accurate, robust, and reliable predictions.

Each metric might capture different aspects of the data or be sensitive to different patterns, so when we combine them, we often get a more comprehensive view.

## What is Conformal Prediction?

Conformal Prediction is a statistical technique that quantifies the confidence level of a prediction.
In this case, we are trying to predict whether the answer is correct (or faithful).
With conformal prediction, instead of just saying “yes” (or “no”), the model tells us “the answer is correct with probability at least 90%”.
In essence, conformal prediction doesn’t just give you an answer; it tells you how confident you can be in that answer.
If the model is uncertain, conformal prediction will tell you it’s “undecided”.
For the undecided data points, we ask a more powerful model, GPT-4, to judge their correctness.
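
To make the idea concrete, here is a minimal sketch of split conformal prediction for a binary label. It is not the library's implementation, and the calibration probabilities, labels, and `alpha` below are made up for illustration.

```python
import numpy as np

def conformal_prediction_set(calib_probs, calib_labels, test_probs, alpha=0.1):
    # Nonconformity score: 1 - probability assigned to the true class
    scores = 1.0 - calib_probs[np.arange(len(calib_labels)), calib_labels]
    n = len(scores)
    # Conformal quantile with the finite-sample correction (clipped to 1)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q_hat = np.quantile(scores, q_level, method="higher")
    # A class enters the prediction set if its nonconformity score is below the threshold
    return (1.0 - test_probs) <= q_hat  # shape (N, 2); a row with both True is "undecided"

calib_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.45, 0.55], [0.35, 0.65]])  # toy data
calib_labels = np.array([0, 1, 0, 1])
test_probs = np.array([[0.5, 0.5], [0.05, 0.95]])
print(conformal_prediction_set(calib_probs, calib_labels, test_probs))
# [[ True  True]   <- undecided: both labels are plausible at this confidence level
#  [False  True]]  <- confidently classified as positive
```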

## Metric Ensembling

The `EnsembleMetric` class helps you ensemble multiple metrics to predict a ground-truth label, such as human labels.
The class leverages the conformal prediction technique to compute reliable predictions together with a measure of confidence in each of them.

Parameters:

- `training: XYData`: the training data; it should contain `training.X` (the metric outputs, also referred to as _features_) and `training.Y` (the ground-truth labels)
- `calibration: XYData`: same structure as `training`, but used to calibrate the conformal predictor
- `alpha: float`: significance level, defaults to 0.1. The significance level is the probability that the correct label is not included in the prediction set, serving as a measure of the confidence or reliability of the prediction. For example, if `alpha` is 0.1, the prediction set contains the correct label with probability at least 0.9 (see the illustration below).
- `random_state: Optional[int]`: random seed, defaults to None
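
As a rough illustration of what the significance level controls (the test-set size below is hypothetical, not taken from the library):

```python
alpha = 0.1    # significance level passed to the ensemble
n_test = 200   # hypothetical number of test samples

# The conformal guarantee is marginal: on average, at most about
# alpha * n_test prediction sets will miss the correct label.
expected_misses = alpha * n_test
print(f"Roughly {expected_misses:.0f} of {n_test} prediction sets may miss the correct label")
```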

The `EnsembleMetric` class has the following methods:

- `predict(self, X: pd.DataFrame, judicator: Optional[Callable] = None)`: takes as input a dataframe of metric outputs and returns the predictions

The `predict` method returns two NumPy arrays:

- `y_hat`: a binary (1/0) vector with the best-effort predictions of the ensemble
- `y_set`: a binary array of shape (N, 2) in which the first column is 1 if, at the significance level set by `alpha`, the sample can be classified as negative, and the second column is 1 if it can be classified as positive.

The set prediction (`y_set`) can have both columns set to 1, meaning that the ensemble is undecided.
This happens when the particular choice of metrics in the ensemble is not informative enough or the required confidence level (set by `alpha`) is too high.
In such cases, the `predict` method calls the `judicator` function (if not `None`) to make the final decision.

The `judicator` function takes as input the index of the sample where the predictor is undecided and must return a boolean value (True/False) indicating the final decision.
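
For example, assuming `y_set` is a NumPy array of shape (N, 2) as described above, undecided samples can be spotted as follows; the values and the `always_incorrect` judicator are purely illustrative.

```python
import numpy as np

# Hypothetical output for three samples: decided negative, decided positive, undecided
y_set = np.array([[1, 0], [0, 1], [1, 1]])

undecided = np.all(y_set, axis=1)  # both labels admissible -> the ensemble abstains
print(undecided)                   # [False False  True]
print(f"Undecided: {undecided.sum()} of {len(y_set)}")

# A trivial judicator that resolves every undecided sample as "incorrect"
def always_incorrect(idx: int) -> bool:
    return False
```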

### Example

In this example we want to use deterministic and semantic metrics to predict the correctness of the answers (as evaluated by a human annotator).
When these metrics alone are not sufficient to produce a confident prediction, we use an LLM to make the final decision.

First, we compute the deterministic and semantic metrics:

```python
time import perf_counter
from time import perf_counter

import numpy as np
import pandas as pd
@@ -29,36 +78,100 @@ evaluator = GenerationEvaluator(
DebertaAnswerScores(),
],
)
evaluator.run(batch_size=100)
results = evaluator.run(batch_size=100)
toc = perf_counter()
print(f"Evaluation completed in {toc - tic:.2f}s")
print(set(evaluator.aggregated_results))

evaluator.save("det_sem.jsonl") # Save for future use...
```

We now split the samples into train, test, and calibration sets and train the classifier.
Note that we use only `"token_overlap_recall"`, `"deberta_answer_entailment"`, and `"deberta_answer_contradiction"` as features to train the classifier.

# Now let's use the results to train a classifier to predict the correctness of the answers (as evaluated by a human annotator)
X = pd.DataFrame(evaluator.results)
```python

# X is the set of metrics we want to use to predict the correctness of the answer
X = pd.DataFrame(results)

# y is the ground truth label
y = dataset["annotation"].map({"correct": 1, "incorrect": 0}).astype(int).to_numpy()

# We split the dataset into train, test, and calibration sets
datasplit = DataSplit(
X=X,
y=y,
dataset=dataset,
split_ratios=SplitRatios(train=0.6, test=0.2, calibration=0.2),
features=[
"token_recall",
"token_overlap_recall",
"deberta_answer_entailment",
"deberta_answer_contradiction",
],
oversample=True,
)

# We use the train and calibration sets to train the classifier
clf = EnsembleMetric(training=datasplit.train, calibration=datasplit.calibration)
```

Finally, we run the classifier and evaluate the results:

# We then use the test set to evaluate the classifier
```python
tic = perf_counter()
y_hat, y_set = clf.predict(datasplit.test.X)
toc = perf_counter()
print(f"Prediction completed in {(toc - tic)*1000:.2f}ms")

# Evaluation
num_undecided = np.sum(np.all(y_set, axis=1))
print(eval_prediction(datasplit.test.y, y_hat))
print(f"Undecided: {num_undecided} ({num_undecided/len(y_set):.2%})")
```

The output would be something like:

```text
Prediction completed in 2.36ms
{'precision': 0.9627329192546584, 'recall': 0.824468085106383, 'f1': 0.8882521489971348, 'accuracy': 0.8340425531914893}
Undecided: 61 (25.96%)
```

#### Using a judicator

Let's assume we want to use the LLM to make the final decision when the classifier is undecided.
We can define a `judicator` function that takes as input the index of the sample where the classifier is undecided and returns a boolean value (True/False) indicating the final decision.

```python
# The LLM-based metric used to resolve the samples on which the classifier is undecided
llm_metric = LLMBasedAnswerCorrectness(LLMFactory("gpt-4-1106-preview"))


def judicator(idx):
# The judicator receives the index of the example in the test set where the classifier is undecided
# and in this case, since we are computing the correctness of the sample,
# it returns True if the example is correct and False otherwise
datum = datasplit.test_full.X.iloc[idx].to_dict()
return llm_metric.calculate(**datum)["LLM_based_answer_correctness"] >= 3
```

To use the judicator we simply pass it to the `predict` method:

```python
tic = perf_counter()
y_hat, y_set = clf.predict(datasplit.test.X, judicator=judicator)
toc = perf_counter()

print(f"Prediction completed in {toc - tic:.2f}s")
num_undecided = np.sum(np.all(y_set, axis=1))
print(eval_prediction(datasplit.test.y, y_hat))
print(f"Undecided: {num_undecided} ({num_undecided/len(y_set):.2%})")
```

The output would be something like:

```text
Prediction completed in 245.73s
{'precision': 0.9818181818181818, 'recall': 0.8617021276595744, 'f1': 0.9178470254957507, 'accuracy': 0.8765957446808511}
Undecided: 0 (0.00%)
```

Here the `predict` function called the LLM for the _25.96%_ of cases where the classifier was undecided.
The classifier is no longer undecided and performance improved, but the prediction time increased from _2.36ms_ to _245.73s_.
15 changes: 0 additions & 15 deletions docs/src/content/docs/classification/conformal_prediction.md

This file was deleted.
