minor readme update
AndreFCruz committed Jul 4, 2024
1 parent 1d1c8bf commit 203182f
Showing 4 changed files with 344 additions and 55 deletions.
24 changes: 20 additions & 4 deletions README.md
@@ -7,10 +7,8 @@
 ![PyPI - License](https://img.shields.io/pypi/l/folktexts)
 ![Python compatibility](https://badgen.net/pypi/python/folktexts)
 
-
-Folktexts is a python package to compute and evaluate classification risk scores
-using large language models.
-It enables using any transformers model as a classifier for tabular data tasks.
+Folktexts is a python package to evaluate statistical properties of LLMs as classifiers.
+It enables computing and evaluating classification _risk scores_ for tabular prediction tasks using LLMs.
 
 Several benchmark tasks are provided based on data from the American Community Survey.
 Namely, each prediction task from the popular
@@ -23,6 +21,7 @@ Package documentation can be found [here](https://socialfoundations.github.io/fo
 - [Installing](#installing)
 - [Basic setup](#basic-setup)
 - [Example usage](#example-usage)
+- [Evaluating feature importance](#evaluating-feature-importance)
 - [Benchmark options](#benchmark-options)
 - [License and terms of use](#license-and-terms-of-use)
 
@@ -111,6 +110,23 @@ clf.predict(dataset)
 LLMClassifier (maybe the above code is fine for this), the benchmark, and
 creating a custom ACS prediction task -->
 
+## Evaluating feature importance
+
+By evaluating LLMs on tabular classification tasks, we can use standard feature importance methods to assess which features the model uses to compute risk scores.
+
+You can do so yourself by calling `folktexts.cli.eval_feature_importance` (add `--help` for a full list of options).
+
+Here's an example for the Llama3-70B-Instruct model on the ACSIncome task (*warning: takes 24h on an Nvidia H100*):
+```
+python -m folktexts.cli.eval_feature_importance --model 'meta-llama/Meta-Llama-3-70B-Instruct' --task ACSIncome --subsampling 0.1
+```
+<div style="text-align: center;">
+	<img src="docs/_static/feat-imp_meta-llama--Meta-Llama-3-70B-Instruct.png" alt="feature importance on llama3 70b it" width="50%">
+</div>
+
+This script uses sklearn's [`permutation_importance`](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html#sklearn.inspection.permutation_importance) to assess which features contribute most to the ROC AUC metric (other metrics can be selected with the `--scorer [scorer]` parameter).
+
+
 ## Benchmark options
 
 ```
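The new README section boils down to a call to scikit-learn's permutation importance. Below is a minimal sketch of that underlying computation, not the package's exact implementation; it assumes `clf` is a fitted sklearn-compatible classifier (e.g., a folktexts `LLMClassifier`) and that `X_test`/`y_test` are a held-out pandas DataFrame and label Series:

```
from sklearn.inspection import permutation_importance

# Shuffle each feature column in turn and measure the resulting drop in
# ROC AUC; features whose shuffling hurts the score most are the ones the
# model relies on. `clf`, `X_test`, and `y_test` are assumed to exist.
result = permutation_importance(
    clf, X_test, y_test,
    scoring="roc_auc",  # the script's default metric; `--scorer` swaps it out
    n_repeats=5,        # shuffle each feature several times to average out noise
    random_state=42,
)

# List features from most to least important.
for idx in result.importances_mean.argsort()[::-1]:
    print(f"{X_test.columns[idx]}: "
          f"{result.importances_mean[idx]:.3f} ± {result.importances_std[idx]:.3f}")
```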
docs/_static/feat-imp_meta-llama--Meta-Llama-3-70B-Instruct.png (binary image file; not rendered)
4 changes: 1 addition & 3 deletions folktexts/benchmark.py
@@ -75,8 +75,6 @@ def __hash__(self) -> int:
 class CalibrationBenchmark:
     """A benchmark class for measuring and evaluating LLM calibration."""
 
-    DEFAULT_BENCHMARK_METRIC = "ece"
-
     """
     Standardized configurations for the ACS data to use for benchmarking.
     """
@@ -260,7 +258,7 @@ def run(self, results_root_dir: str | Path, fit_threshold: int | bool = 0) -> fl
         # Save results to disk
         self.save_results()
 
-        return self._results[self.DEFAULT_BENCHMARK_METRIC]
+        return self._results
 
     def plot_results(self, *, show_plots: bool = True):
         """Render evaluation plots and save to disk.
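Note the behavioral change in the second hunk: `run()` now returns the full results dict rather than the single default metric. A hedged sketch of a call site under the new return type, assuming `bench` is an already-constructed `CalibrationBenchmark` and that the results dict still carries an `"ece"` entry (as the removed `DEFAULT_BENCHMARK_METRIC` suggests):

```
# `bench` is assumed to be a constructed CalibrationBenchmark instance.
results = bench.run(results_root_dir="results")

# run() used to return a single float (the default "ece" metric); callers
# now index whichever metrics they need from the returned dict.
print(f"ECE: {results['ece']:.4f}")
```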
371 changes: 323 additions & 48 deletions notebooks/parse-feature-importance-results.ipynb

Large diffs are not rendered by default.
