COMPAS fairness analysis end-to-end example #52

1 change: 1 addition & 0 deletions Project.toml
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,7 @@ Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
StatsPlots = "f3b207a7-027a-5e70-b257-86293d7955fd"
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
VegaLite = "112f6efa-9a02-5b7d-90c0-432ed331239a"
XGBoost = "009559a3-9522-5dbb-924b-0b6ed2b22bb9"

[compat]
1 change: 1 addition & 0 deletions _layout/head.html
Original file line number Diff line number Diff line change
Expand Up @@ -79,6 +79,7 @@
<li class="pure-menu-item "><a href="/end-to-end/HouseKingCounty/" class="pure-menu-link"><span style="padding-right:0.5rem;">•</span> King County Houses</a></li>
<li class="pure-menu-item "><a href="/end-to-end/airfoil" class="pure-menu-link"><span style="padding-right:0.5rem;">•</span> Airfoil </a></li>
<li class="pure-menu-item "><a href="/end-to-end/boston-lgbm" class="pure-menu-link"><span style="padding-right:0.5rem;">•</span> Boston (lgbm) </a></li>
<li class="pure-menu-item "><a href="/end-to-end/COMPAS" class="pure-menu-link"><span style="padding-right:0.5rem;">•</span> COMPAS </a></li>
</ul>
</ul>
<!-- END OF LIST OF MENU ITEMS -->
135 changes: 135 additions & 0 deletions _literate/EX-COMPAS.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,135 @@
# This fairness analysis of the COMPAS dataset is adapted in part from the [COMPAS analysis by Aequitas](https://dssg.github.io/aequitas/examples/compas_demo.html).

# ## Introduction to fairness and bias analysis
#
# Recent work in the machine learning community has raised concerns that algorithmic decision-making systems can carry unintended bias and treat individuals unfairly. Although many bias metrics and fairness definitions have been proposed in recent years, the community has not reached a consensus on which of them should be used, and there have been very few empirical analyses of real-world problems using the proposed metrics.

# ## COMPAS Dataset
#
# Northpointe’s COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is one of the most widely used risk assessment tools in the criminal justice system, guiding decisions such as how to set bail. The ProPublica dataset represents two years of COMPAS predictions from Broward County, FL.

# ## Getting started

using DataFrames, MLJ, CSV, VegaLite
using HTTP

MLJ.color_off() # hide

req = HTTP.get("https://raw.githubusercontent.com/dssg/aequitas/master/examples/data/compas_for_aequitas.csv")

df = CSV.read(req.body, DataFrame) # recent CSV.jl releases require the sink type as the second argument
df[1:5, :] |> pretty

#

schema(df)

#

df = coerce(df, Textual=>OrderedFactor)
df = coerce(df, :score=>Count)
schema(df)

# ## Levels of recidivism

df |>
    @vlplot(
        :bar,
        width=50,
        height=50,
        column="race:o",
        y={"count()", axis={title="count", grid=false}},
        x={"label_value:n", axis={title=""}},
        color={"label_value:n", scale={range=["#675193", "#ca8861"]}},
        spacing=10,
        config={
            view={stroke=:transparent},
            axis={domainWidth=1}
        }
    ) |> save(joinpath(@OUTPUT,"COMPAS_plot1.svg"))

# \figalt{Levels of recidivism}{COMPAS_plot1.svg}

# ## Model Training
#
# We now train an `AdaBoostClassifier` to predict `label_value`. The COMPAS dataset contains more columns, but for simplicity this tutorial trains on only four of them: `entity_id`, `race`, `sex` and `age_cat`.


# ## Data Preprocessing
#
# We unpack the dataframe into target and features and convert the target labels to categorical. We then encode the features with the `OneHotEncoder` transformer provided by MLJ.

y, X = unpack(df, ==(:label_value), col -> true);

y = categorical(y);

X = X[:, [:entity_id, :race, :sex, :age_cat]] # select columns with explicit row/column indexing
X = coerce(X, Count=>Continuous);

X = transform(fit!(machine(OneHotEncoder(), X)), X);

train, test = partition(eachindex(y), 0.7, shuffle=true);

schema(X)
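
# As a rough illustration of what one-hot encoding does, each level of a categorical column becomes its own 0/1 indicator column. This is a toy plain-Julia sketch, not MLJ's implementation; `toy_sex` and `toy_onehot` are hypothetical names introduced only for this example.

toy_sex = ["Male", "Female", "Male"]

## one indicator vector per distinct level (illustrative helper, not part of MLJ)
toy_onehot(v) = Dict(l => Float64.(v .== l) for l in sort(unique(v)))

toy_onehot(toy_sex)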

#

aboost = @load AdaBoostClassifier pkg=ScikitLearn
aboost_m = machine(aboost, X, y);
fit!(aboost_m, rows=train);
pred_aboost = MLJ.predict(aboost_m, rows=test);

# Each value in `pred_aboost` is a `UnivariateFinite` distribution holding the predicted probability of each label. To simplify the discussion, we convert `pred_aboost` to a plain array in which the label with the higher probability is chosen.

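y_pred = Array{Int64, 1}(undef, length(pred_aboost)); # size taken from the predictions, not hard-coded

for i in 1:length(pred_aboost)
    ## predict 0 when the probability stored for the first class exceeds 0.5, otherwise 1
    y_pred[i] = pred_aboost[i].prob_given_class[1] > 0.5 ? 0 : 1
end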
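
# The same thresholding rule can be seen on plain numbers. This is a toy sketch; `toy_p0` and `toy_labels` are hypothetical names, with `toy_p0` standing in for the predicted probability of label 0. Note that a tie at exactly 0.5 falls to label 1 under this rule.

toy_p0 = [0.9, 0.4, 0.5]
toy_labels = [p > 0.5 ? 0 : 1 for p in toy_p0]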

# Now we build a DataFrame of the test rows and add a new column holding the predictions the model made.

df_test = df[test, :]

insertcols!(df_test, 2, :pred=>y_pred);

schema(df_test)

# ## Plot of the count of predicted labels for each value of race

df_test |>
    @vlplot(
        :bar,
        width=50,
        height=50,
        column="race:o",
        y={"count()", axis={title="count", grid=false}},
        x={"pred:n", axis={title=""}},
        color={"pred:n", scale={range=["#675193", "#ca8861"]}},
        spacing=10,
        config={
            view={stroke=:transparent},
            axis={domainWidth=1}
        }
    ) |> save(joinpath(@OUTPUT,"COMPAS_plot2.svg"))

# \figalt{count of predicted labels}{COMPAS_plot2.svg}

# ## Fairness Metrics
#
# Now we compute the False Negative Rate, False Positive Rate, True Negative Rate and True Positive Rate for each race. Other metrics, such as the Equal Opportunity Score, can be calculated similarly.

y_test = convert(CategoricalArray, y[test]) # ground truth for the test rows, computed once

for r in ["African-American", "Caucasian", "Hispanic"]
    indices = [x == r for x in df_test.race] # boolean mask selecting the rows of this group
    ŷ = convert(CategoricalArray, df_test[indices, :pred])
    println("Printing values for the race : ", r)
    println("False Negative Rate : ", false_negative_rate(ŷ, y_test[indices]))
    println("False Positive Rate : ", false_positive_rate(ŷ, y_test[indices]))
    println("True Negative Rate : ", true_negative_rate(ŷ, y_test[indices]))
    println("True Positive Rate : ", true_positive_rate(ŷ, y_test[indices]))
    println()
end
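
# For reference, the four rates can be written directly in terms of confusion-matrix counts. The helper below is a plain-Julia sketch, not an MLJ function; `toy_rates`, `pred` and `truth` are hypothetical names, and the sketch assumes 0/1 labels with 1 as the positive class.

function toy_rates(pred, truth)
    tp = count((pred .== 1) .& (truth .== 1)) # true positives
    fp = count((pred .== 1) .& (truth .== 0)) # false positives
    tn = count((pred .== 0) .& (truth .== 0)) # true negatives
    fn = count((pred .== 0) .& (truth .== 1)) # false negatives
    return (fnr = fn / (fn + tp), fpr = fp / (fp + tn),
            tnr = tn / (fp + tn), tpr = tp / (fn + tp))
end

toy_rates([1, 0, 1, 0], [1, 1, 0, 0])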

# The analysis above was performed on the sensitive attribute race. A similar analysis could be performed on the other protected attributes, sex and age.
6 changes: 6 additions & 0 deletions end-to-end/COMPAS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
@def hascode = true
@def showall = true

# COMPAS Fairness analysis

\tutorial{EX-COMPAS}