Commit

update readme
AndreFCruz committed Jul 4, 2024
1 parent e44ed71 commit 7d0551d
Showing 4 changed files with 326 additions and 55 deletions.
6 changes: 2 additions & 4 deletions README.md
@@ -116,13 +116,11 @@ By evaluating LLMs on tabular classification tasks, we can use standard feature

You can do so yourself by calling `folktexts.cli.eval_feature_importance` (add `--help` for a full list of options).

Here's an example for the Llama3-70B-Instruct model on the ACSIncome task:
Here's an example for the Llama3-70B-Instruct model on the ACSIncome task (*warning: takes 24h on an Nvidia H100*):
```
python -m folktexts.cli.eval_feature_importance --model 'meta-llama/Meta-Llama-3-70B-Instruct' --task ACSIncome --subsampling 0.1
```

Here are the plotted results:
![feat-imp_llama3-70b.png](feat-imp_llama3-70b.png)
![feat-imp_llama3-70b.png](docs/_static/feat-imp_meta-llama--Meta-Llama-3-70B-Instruct.png)

This script uses sklearn's [`permutation_importance`](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html#sklearn.inspection.permutation_importance) to assess which features contribute the most to the ROC AUC metric (other metrics can be assessed using the `--scorer [scorer]` parameter).
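
For reference, here is a minimal sketch of the underlying computation, using a stand-in scikit-learn classifier on synthetic data (folktexts wraps the LLM in its own sklearn-compatible classifier, which is not shown here): each feature is shuffled in turn and the resulting drop in the chosen score is recorded.

```python
# Minimal sketch of permutation importance with scikit-learn; the
# RandomForestClassifier is a stand-in for the LLM-backed classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in ROC AUC on held-out data.
result = permutation_importance(
    clf, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0)

for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f}"
          f" +/- {result.importances_std[idx]:.3f}")
```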

(File could not be displayed.)
4 changes: 1 addition & 3 deletions folktexts/benchmark.py
@@ -75,8 +75,6 @@ def __hash__(self) -> int:
class CalibrationBenchmark:
"""A benchmark class for measuring and evaluating LLM calibration."""

DEFAULT_BENCHMARK_METRIC = "ece"

"""
Standardized configurations for the ACS data to use for benchmarking.
"""
@@ -260,7 +258,7 @@ def run(self, results_root_dir: str | Path, fit_threshold: int | bool = 0) -> fl
# Save results to disk
self.save_results()

return self._results[self.DEFAULT_BENCHMARK_METRIC]
return self._results

def plot_results(self, *, show_plots: bool = True):
"""Render evaluation plots and save to disk.
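
Note that, with the change above, `CalibrationBenchmark.run` now returns the full results dictionary instead of the single default metric. A hedged sketch of how calling code might read it, assuming `benchmark` is an already-constructed `CalibrationBenchmark` instance and that the results still include an `"ece"` entry:

```python
# Sketch only: constructing the CalibrationBenchmark is not shown in this diff,
# so `benchmark` is assumed to be an existing instance.
results = benchmark.run(results_root_dir="results", fit_threshold=100)

# run() previously returned just the default metric (ECE); it now returns the
# whole results mapping, so individual metrics are looked up by key.
ece = results["ece"]  # assumes an "ece" key, matching the removed default
print(f"ECE: {ece:.4f}")
```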
371 changes: 323 additions & 48 deletions notebooks/parse-feature-importance-results.ipynb

Large diffs are not rendered by default.
