The mlr3fairness package
In this section, we first give an overview of related software. Next, we give a very brief introduction to the mlr3 ecosystem of packages. Finally, we present the extensions implemented for fairness.
The mlr3 ecosystem
mlr3fairness is tightly integrated into the ecosystem of packages around the ML framework mlr3 (Lang et al. 2019). mlr3 provides the infrastructure to fit, resample, and evaluate over 100 ML algorithms using a unified API. Packages from the ecosystem can be installed and updated via the mlr3verse (Lang and Schratz 2023) package. Numerous extension packages provide additional functionality. In the context of fairness, the following extension packages deserve special mention:
- mlr3pipelines (Binder et al. 2021) for pre- and postprocessing via pipelining. This allows composing bias mitigation techniques with arbitrary ML algorithms shipped with mlr3, as well as fusing ML algorithms with preprocessing steps such as imputation or class balancing. It furthermore integrates with mcboost (Pfisterer et al. 2021), which implements additional bias mitigation methods. We present an example in the supplementary material.
- mlr3tuning and bbotk (Becker, Richter, et al. 2023) for extensive tuning capabilities.
- mlr3proba (Sonabend et al. 2021) for survival analysis.
- mlr3benchmark for post-hoc analysis of benchmarked approaches.
- mlr3oml as a connector to OpenML (Vanschoren et al. 2014), an online scientific platform for collaborative ML.
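For completeness, the packages used throughout this section can be installed from CRAN in the usual way:

```r
# Install the mlr3 meta-package and the fairness extension from CRAN
install.packages(c("mlr3verse", "mlr3fairness"))
```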
In order to provide the required understanding for mlr3fairness, we briefly introduce some terminology and syntax. A full introduction can be found in the mlr3 book (Bischl et al. 2024).
A Task in mlr3 is a basic building block holding the data, storing covariates and the target variable along with some meta-information. The shorthand constructor function tsk() can be used to quickly access example tasks shipped with mlr3 or mlr3fairness. In the following chunk, we retrieve the binary classification task with id "adult_train" from the mlr3fairness package. It contains a part of the Adult data set (Dua and Graff 2017). The task is to predict whether an individual earns more than $50,000 per year. The column "sex" is set as a binary sensitive attribute with levels "Female" and "Male".
library("mlr3verse")
library("mlr3fairness")
task = tsk("adult_train")
print(task)
##
## ── <TaskClassif> (30718x13) ──────────────────────────────────────────────────────────────────
## • Target: target
## • Target classes: <=50K (positive class, 75%), >50K (25%)
## • Properties: twoclass
## • Features (12):
## • fct (7): education, marital_status, occupation, race, relationship, sex, workclass
## • int (5): age, capital_gain, capital_loss, education_num, hours_per_week
## • Protected attribute: sex
The second building block is the Learner. It is a
wrapper around an ML algorithm, e.g., an implementation of logistic
regression or a decision tree. It can be trained on a Task
and used for obtaining a Prediction on an independent test
set which can subsequently be scored using a Measure to get
an estimate for the predictive performance on new data. The shorthand
constructors lrn() and msr() allow for the
instantiation of implemented Learners and
Measures, respectively. In the following example, we first instantiate a learner, then split our data into a training and a test set, train the learner on the training set, and finally evaluate predictions on the held-out test data. The train-test split is given by row indices, here stored in the idx variable.
learner = lrn("classif.rpart", predict_type = "prob")
idx = partition(task)
learner$train(task, idx$train)
prediction = learner$predict(task, idx$test)
We then employ the classif.acc measure, which computes the accuracy of a prediction compared to the true labels:
measure = msr("classif.acc")
prediction$score(measure)
## classif.acc
## 0.8443
In the example above, we obtain an accuracy score of 0.8443, meaning our ML model correctly classifies roughly 84% of the samples in the test data. As the split into training and test set is stochastic, the procedure should be repeated multiple times for smaller datasets (Bischl et al. 2012) and the resulting performance values should be aggregated. This process is called
resampling, and can easily be performed with the resample()
function, yielding a ResampleResult object. In the
following, we employ 10-fold cross-validation as a resampling
strategy:
resampling = rsmp("cv", folds = 10)
rr = resample(task, learner, resampling)
We can call the $aggregate() method on the ResampleResult to obtain the accuracy aggregated across all 10 replications.
rr$aggregate(measure)
## classif.acc
## 0.8408
Here, we obtain an accuracy of 0.8408, slightly lower than the previous score. Furthermore, because it aggregates over ten splits, this estimate has a lower variance, at the cost of additional computation time. To properly compare
competing modelling approaches, candidates can be benchmarked against
each other using the benchmark() function (yielding a
BenchmarkResult). In the following, we compare the decision
tree from above to a logistic regression model. To do this, we use the
benchmark_grid function to compare the two
Learners across the same Task and resampling
procedure. Finally, we aggregate the measured scores each learner
obtains on each cross-validation split using the
$aggregate() function.
learner2 = lrn("classif.log_reg", predict_type = "prob")
grid = benchmark_grid(task, list(learner, learner2), resampling)
bmr = benchmark(grid)
bmr$aggregate(measure)[, .(learner_id, classif.acc)]
## learner_id classif.acc
## <char> <num>
## 1: classif.rpart 0.8408
## 2: classif.log_reg 0.8469
After running the benchmark, we can again call $aggregate() to obtain aggregated scores. The mlr3viz package comes with several ready-made visualizations for objects from mlr3 via ggplot2's (Wickham 2016) autoplot() function. For a BenchmarkResult, the autoplot() function provides a box plot comparison of performances across the cross-validation folds for each Learner. Figure @ref(fig:bmrbox) contains the box plot
comparison. We can see that log_reg has higher accuracy and
lower interquartile range across the 10 folds, and we might therefore
want to prefer the log_reg model.
Model comparison based on accuracy for decision trees (rpart) and logistic regression (log_reg) across resampling splits.
Selecting the sensitive attribute
For a given task, we can select one or multiple sensitive attributes.
In mlr3fairness, the sensitive attribute is identified by the column role pta (protected attribute) and can be set as follows:
task$set_col_roles("marital_status", add_to = "pta")
In the example above, we add "marital_status" as an
additional sensitive attribute. This information is then automatically
passed on when the task is used, e.g., when computing fairness metrics.
If more than one sensitive attribute is specified, metrics will be
computed based on intersecting groups formed by the columns.
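The following sketch illustrates this end to end: it recreates the task, adds the second sensitive attribute, and scores a fairness metric, which is then computed over the intersecting groups (e.g., married females vs. married males, and so on):

```r
library("mlr3verse")
library("mlr3fairness")

task = tsk("adult_train")
# "sex" is already set as sensitive attribute in the shipped task;
# adding "marital_status" means metrics are computed over the
# intersection of both columns.
task$set_col_roles("marital_status", add_to = "pta")
task$col_roles$pta

learner = lrn("classif.rpart", predict_type = "prob")
idx = partition(task)
learner$train(task, idx$train)
prediction = learner$predict(task, idx$test)
# The task must be passed so the measure can access the pta columns.
prediction$score(msr("fairness.eod"), task = task)
```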
Quantifying fairness
With the mlr3fairness package loaded, fairness measures can be constructed via msr() like any other measure in mlr3. They are listed with the prefix fairness, and calling msr() without any arguments returns a list of all available measures. Table
@ref(tab:metrics) provides an overview of some popular fairness measures
which are readily available.
| key | description |
|---|---|
| fairness.acc | Accuracy equality |
| fairness.mse | Mean squared error equality (Regression) |
| fairness.eod | Equalized odds |
| fairness.tpr | True positive rate equality / Equality of opportunity |
| fairness.fpr | False positive rate equality / Predictive equality |
| fairness.tnr | True negative rate equality |
| fairness.fnr | False negative rate equality |
| fairness.fomr | False omission rate equality |
| fairness.npv | Negative predictive value equality |
| fairness.ppv | Positive predictive value equality |
| fairness.cv | Demographic parity / Equalized positive rates |
| fairness.pp | Predictive parity / Equalized precision |
| fairness.{tp, fp, tn, fn} | Equal true positives, false positives, true negatives, false negatives |
| fairness.acc_eod=.05 | Accuracy under equalized odds constraint |
| fairness.acc_ppv=.05 | Accuracy under ppv constraint |
Furthermore, new custom fairness measures can be easily implemented, either by implementing them directly or by composing them from existing metrics. This process is extensively documented in an accompanying measures vignette available with the package.
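As an illustration (a sketch only; the exact constructor arguments are documented in the vignette, and the helper name groupdiff_absdiff is our reading of the package API), a new group-wise measure could be composed from an existing base measure:

```r
library("mlr3verse")
library("mlr3fairness")

# Sketch: compose a fairness measure as the absolute group-wise
# difference in false positive rates (see the measures vignette
# for the exact constructor arguments).
fpr_diff = msr("fairness",
  base_measure = msr("classif.fpr"),
  operation = groupdiff_absdiff
)
```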
Here we choose the binary accuracy measure "classif.acc" and the equalized odds metric "fairness.eod" from above. The constructed list of measures can then be used to score a Prediction, a ResampleResult, or a BenchmarkResult, e.g.:
measures = list(msr("classif.acc"), msr("fairness.eod"))
rr$aggregate(measures)
## classif.acc fairness.equalized_odds
## 0.84078 0.07837
We can clearly see a comparatively large difference in equalized odds of around 0.08. This means that the false positive rates (FPR) and true positive rates (TPR) across groups differ by roughly 0.08 on average, indicating that our model might exhibit a bias. Looking at the individual components yields a clearer picture. Here, we look at the confusion matrices of the combined predictions of the 10 folds, grouped by the sensitive attribute:
fairness_tensor(rr)
## $Male
## truth
## response <=50K >50K
## <=50K 0.43030 0.10033
## >50K 0.03408 0.11202
##
## $Female
## truth
## response <=50K >50K
## <=50K 0.282668 0.020900
## >50K 0.003907 0.015789
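The same measures can also be applied to the BenchmarkResult from the earlier comparison, allowing us to inspect accuracy and fairness jointly for both learners (exact values depend on the resampling splits):

```r
# Score the earlier benchmark result on accuracy and equalized odds
measures = list(msr("classif.acc"), msr("fairness.eod"))
bmr$aggregate(measures)[, .(learner_id, classif.acc, fairness.equalized_odds)]
```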
Plotting the prediction density or comparing measures graphically often provides additional insights. For example, in Figure @ref(fig:predplots), we can see that females are more often predicted to earn below $50,000. Similarly, we can see that both the FPR and the TPR differ considerably between groups.
fairness_prediction_density(prediction, task)
compare_metrics(prediction, msrs(c("fairness.fpr", "fairness.tpr", "fairness.eod")), task)
Visualizing predictions of the decision tree model. Left: prediction densities for the negative class for Female and Male. Right: comparison of the FPR, TPR, and EOd fairness metrics. The plots show a higher likelihood of the "<=50K" class for females, resulting in fairness metrics different from 0.
Bias mitigation
As mentioned above, several ways to improve a model's fairness exist. While non-technical interventions, such as collecting more data, should be preferred, mlr3fairness provides several bias mitigation techniques that can be used together with a Learner to obtain fairer models. Table @ref(tab:biasmitigation) provides an overview of the implemented bias mitigation techniques. They are implemented as PipeOps from the mlr3pipelines package and can be combined with arbitrary learners using the %>>% operator to build a pipeline that can later be trained. In the following example, we show how to combine a learner with a reweighing scheme (reweighing_wts), or, alternatively, how to post-process predictions using the equalized odds debiasing (EOd) strategy. An introduction to mlr3pipelines is available in the corresponding mlr3book chapter (Bischl et al. 2024).
po("reweighing_wts") %>>% lrn("classif.glmnet")
po("learner_cv", lrn("classif.glmnet")) %>>% po("EOd")
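A pipeline such as the first one above is not yet trainable by itself; wrapping it with as_learner() turns it into a GraphLearner that behaves like any other Learner (a sketch using the decision tree from earlier to avoid the extra glmnet dependency):

```r
# Wrap the reweighing preprocessing and a learner into a single GraphLearner
graph_learner = as_learner(
  po("reweighing_wts") %>>% lrn("classif.rpart", predict_type = "prob")
)
# Evaluate it like any other learner
rr_rw = resample(task, graph_learner, rsmp("cv", folds = 3))
rr_rw$aggregate(list(msr("classif.acc"), msr("fairness.eod")))
```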
| Key | Description | Type | Reference |
|---|---|---|---|
| EOd | Equalized-Odds Debiasing | Postprocessing | Hardt, Price, and Srebro (2016) |
| reweighing_os | Reweighing (Oversampling) | Preprocessing | Kamiran and Calders (2012) |
| reweighing_wts | Reweighing (Instance weights) | Preprocessing | Kamiran and Calders (2012) |
It is simple for users or package developers to extend mlr3fairness with additional bias mitigation methods. As an example, the mcboost package adds further post-processing methods that can improve fairness.
Along with pipeline operators, mlr3fairness contains several machine learning algorithms, listed in Table @ref(tab:fairlearns), that can directly incorporate fairness constraints. They can similarly be constructed using the lrn() shorthand.
| Key | Package | Reference |
|---|---|---|
| regr.fairfrrm | fairml | Scutari, Panero, and Proissl (2021) |
| classif.fairfgrrm | fairml | Scutari, Panero, and Proissl (2021) |
| regr.fairzlm | fairml | Zafar et al. (2017) |
| classif.fairzlrm | fairml | Zafar et al. (2017) |
| regr.fairnclm | fairml | Komiyama et al. (2018) |
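These constrained learners are used like any other Learner. For example, assuming the fairml package is installed and the learner supports the task's feature types, the fair generalized ridge regression classifier can be evaluated directly (a sketch):

```r
# Fairness-constrained learner from the fairml package (sketch;
# requires the fairml package to be installed)
fair_learner = lrn("classif.fairfgrrm")
rr_fair = resample(task, fair_learner, rsmp("cv", folds = 3))
rr_fair$aggregate(list(msr("classif.acc"), msr("fairness.eod")))
```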
Reports
Because fairness aspects cannot always be investigated based on the fairness definitions above (e.g., due to biased sampling or labelling procedures), it is important to document the data collection process and the resulting data, as well as the models trained on this data. Informing auditors about those aspects of a deployed model can lead to better assessments of a model's fairness. Questionnaires for ML models (M. Mitchell et al. 2019) and data sets (Gebru et al. 2021) have been proposed in the literature. We further add automated report templates using R markdown (Xie, Dervieux, and Riederer 2020) for data sets and ML models. In addition, we provide a template for a fairness report inspired by the Aequitas Toolkit (Saleiro et al. 2018), which includes many fairness metrics and visualizations and thus provides a good starting point for generating a fairness report. A preview of the different reports can be obtained from the Reports vignette in the package documentation.
| Report | Description | Reference |
|---|---|---|
| report_modelcard() | Modelcard for ML models | M. Mitchell et al. (2019) |
| report_datasheet() | Datasheet for data sets | Gebru et al. (2021) |
| report_fairness() | Fairness Report | – |