Introduction

Predictive modeling is a process that uses mathematical and computational methods to forecast outcomes. Many algorithms in this area have been developed, and novel ones are continuously being proposed. Therefore, there are countless possible models to choose from and many ways to train a new complex model. A poorly fitted or overfitted model is usually of no use when confronted with future data. Its predictions will be misleading (Sheather 2009) or harmful (O’Neil 2016). That is why methods that support model diagnostics are important.

Diagnostics are often carried out only by checking model assumptions. However, such checks are frequently neglected for complex machine learning models, which may be used as if they were assumption-free. Still, there is a need to verify their quality. We strongly believe that a genuine diagnosis or audit requires a broad approach to model exploration. The audit includes three objectives.

  • Objective 1: Enrichment of information about model performance.
  • Objective 2: Identification of outliers, influential and abnormal observations.
  • Objective 3: Examination of other problems relating to a model by analyzing distributions of residuals, in particular, problems with bias, heteroscedasticity of variance and autocorrelation of residuals.

In this paper, we introduce the auditor package for R, which is a tool for diagnostics and visual verification. As it focuses on residuals and does not require any additional model assumptions, most of the presented methods are model-agnostic. A consistent grammar across various tools reduces the amount of effort needed to create informative plots and makes the validation more convenient and accessible.

graphic without alt text
Figure 1: The Anscombe Quartet data sets are identical when examined with the use of simple summary statistics. The difference is noticeable after plotting the data.

Diagnostic methods have been a subject of much research (Atkinson 1985). Atkinson and M. Riani (2012) focus on graphical methods of diagnostics for regression analysis. Liu, X. Wang, M. Liu, and J. Zhu (2017) present an overview of interactive visual model validation. One of the most popular tools for verification are measures of the differences between the values predicted by a model and the observed values (Willmott, S. G. Ackleson, R. E. Davis, J. J. Feddema, K. M. Klink, D. R. Legates, J. O’Donnell, and C. M. Rowe 1985). These tools include the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE) (Hastie, R. Tibshirani, and J. Friedman 2001). Such measures are used for well-researched and easily interpretable linear models as well as for complex models such as random forests (Ho 1995), gradient-boosted trees (Chen and C. Guestrin 2016), or neural networks (Venables and B. D. Ripley 2002).

However, no matter which measure of model performance we use, it does not reflect all aspects of the model. For example, Breiman (2001) points out that a linear regression model validated only on the basis of \(R^2\) may lead to many false conclusions. The best known example of this issue is the Anscombe Quartet (Anscombe 1973). It contains four different data sets constructed to have nearly identical simple statistical properties such as mean, variance, correlation, etc. These measures directly correspond to the coefficients of the linear models. Therefore, by fitting a linear regression to the Anscombe Quartet we obtain four almost identical models (see Figure 1). However, residuals of these models are very different. The Anscombe Quartet is used to highlight that the numerical measures should be supplemented by graphical data visualizations.

The analysis of diagnostics is well-researched for linear and generalized linear models. Such analysis is typically done by extracting raw, studentized, deviance, or Pearson residuals and examining residual plots. Common problems with model fit and basic diagnostics methods are presented in Faraway (2002) and Harrell Jr. (2006).

Model validation may involve both checking the overall trend in residuals and looking at residual values of individual observations (Littell, G. A. Milliken, W. W. Stroup, R. D. Wolfinger, and O. Schabenberger 2007). Gałecki and T. Burzykowski (2013) discussed methods based on residuals for individual observation and groups of observations.

Diagnostics methods are commonly used for linear regression (Faraway 2004). Complex models are treated as if they were assumption-free, which is why their diagnostics are often ignored. Considering the above, there is a need for more extensive methods and software dedicated to model auditing. Many diagnostic tools, such as plots and statistics developed for linear models, are still useful for exploring machine learning models. Applying the same tools to all models facilitates their comparison.

The paper is organized as follows. Section 2 summarizes related work and the state of the art. Section 3 describes the architecture of the auditor package. Section 4 introduces the notation. Selected tools that help to validate models are presented in Section 5, and conclusions can be found in Section 6.

Related work

In this chapter, we give an overview of common methods and tools for auditing and examining the validity of models. There are several approaches to validation. They include diagnostics for predictor variables before and after model fit, methods dedicated to specific models, and model-agnostic approaches.

Data diagnostics before model fitting

The problem of data diagnostics is related to Objective 2 presented in the Introduction, that is, the identification of problems with observations. There are several tools that address this issue. We review the most popular of them.

  • One of the tools that supports the identification of errors in data is the dataMaid package (Petersen and C. T. Ekstrom 2018). It creates a report that contains summaries and error checks for each variable in data. Package lumberjack (van der Loo 2017) provides row-wise analysis. It allows for monitoring changes in data as they get processed. The validatetools (de Jonge and M. van der Loo 2018) is a package for managing validation rules.
  • The datadist function from the rms package (Harrell Jr 2018) computes distributional summaries for predictor variables. They include the overall range and certain quantiles for continuous variables, as well as the distinct values for discrete variables. Because datadist stores these summaries, it automates the process of fitting and validating several models.
  • While the above packages use pipeline approaches, there are also tools that focus on a specific step of data diagnostics. The package corrgram (Wright 2018) calculates correlations of variables and displays corrgrams. Corrgrams (Friendly 2002) are visualizations of correlation matrices that help to identify relationships between variables.

Diagnostics methods for linear models

As linear models have a very simple structure and do not require high computational power, they have been, and still are, used very frequently. Therefore, there are many tools that validate different aspects of linear models. Below, we overview the most widely known tools implemented in R packages.

  • The stats package provides basic diagnostic plots for linear models. Function plot generates six types of charts for "lm" and "glm" objects, such as a plot of residuals against fitted values, a scale-location plot of \(\sqrt{|residuals|}\) against fitted values, and a normal quantile-quantile plot. These visual validation tools address Objective 3 of the diagnostics, related to the examination of a model by analyzing the distribution of residuals. The other three plots, which include a plot of Cook’s distances, a plot of residuals against leverages, and a plot of Cook’s distances against \(\frac{leverage}{1-leverage}\), address the identification of influential observations (Objective 2).
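As a minimal sketch (using the built-in cars data purely for illustration), the six diagnostic charts can be requested through the which argument of plot applied to an "lm" object:

```r
# Base-R diagnostic plots for a fitted linear model.
# `which` selects among the six charts described above.
fit <- lm(dist ~ speed, data = cars)

par(mfrow = c(2, 2))
plot(fit, which = c(1, 2, 3, 5))  # residuals vs fitted, normal Q-Q,
                                  # scale-location, residuals vs leverage
```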

  • Package car (Fox and S. Weisberg 2011) extends the capabilities of stats by including more types of residuals, such as Pearson and deviance residuals. It is possible to plot residuals against the values of selected variables and to group residuals by the levels of factor variables. What is more, car provides further diagnostic plots such as, among others, the partial residual plot (crPlot), index plots of influence (infIndexPlot), and a bubble plot of studentized residuals versus hat values (influencePlot). These plots allow for checking both the effect of individual observations and the distribution of residuals, which addresses Objective 2 and Objective 3, respectively.

  • A linear regression model is still one of the most popular tools for data analysis due to its simple structure. Therefore, there is a rich variety of methods for checking its assumptions, for example, the normality of the residual distribution and the homoscedasticity of the variance. The package nortest (Gross and U. Ligges 2015) provides five tests for normality: the Anderson-Darling (Anderson and D. A. Darling 1952), the Cramer-von Mises (Cramer 1928; Von Mises 1928), the Kolmogorov-Smirnov (Stephens 1974), the Pearson chi-square (Pearson 1900), and the Shapiro-Francia (Sanford Shapiro and R. S. Francia 1972) tests. The lmtest package (Zeileis and T. Hothorn 2002) also contains a collection of diagnostic tests: the Breusch-Pagan (Breusch and A. R. Pagan 1979), the Goldfeld-Quandt (Goldfeld and R. E. Quandt 1965), and the Harrison-McCabe (Harrison and B. P. M. McCabe 1979) tests for heteroscedasticity, and the Harvey-Collier (Harvey and P. Collier 1977), the Rainbow (Utts 1982), and the RESET (Ramsey 1969) tests for nonlinearity and misspecified functional form. A unified approach for examining, monitoring, and dating structural changes in linear regression models is provided in the strucchange package (Zeileis, F. Leisch, K. Hornik, and C. Kleiber 2002). It includes methods to fit, plot, and test fluctuation processes and F-statistics. The gvlma package implements the global procedure for testing the assumptions of the linear model (see Peña and E. H. Slate (2006) for more details).

    The Box-Cox power transformation introduced by Box and D. R. Cox (1964) is a way to transform the data to follow a normal distribution. For simple linear regression, it is often used to satisfy the assumptions of the model. Package MASS (Venables and B. D. Ripley 2002) contains functions that compute and plot profile log-likelihoods for the parameter of the Box-Cox power transformation.
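    A minimal sketch of this workflow (again using the built-in cars data as an illustrative stand-in) might look as follows; boxcox plots the profile log-likelihood and returns the evaluated grid:

```r
library(MASS)

# Profile log-likelihood for the Box-Cox transformation parameter lambda.
# The response must be strictly positive.
fit <- lm(dist ~ speed, data = cars)
bc <- boxcox(fit, lambda = seq(-2, 2, 0.1))

# lambda maximizing the profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]
```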

  • The broom package (Robinson 2018) provides summaries for about 30 classes of models. It produces results, such as coefficients and p-values for each variable, \(R^2\), adjusted \(R^2\), and residual standard error.

Other model-specific approaches

There are also several tools to generate validation plots for time series, principal component analysis, clustering, and others.

  • Tang, M. Horikoshi, and W. Li (2016) introduced the ggfortify interface for visualizing many popular statistical results. Plots are generated with ggplot2 (Wickham 2009), which makes them easy to modify. With the single function autoplot, it is possible to generate validation plots for a wide range of models. It works for, among others, lm, glm, ts, glmnet, and survfit objects. The autoplotly (Tang 2018) package is an extension of ggfortify that provides functionalities for producing plots with plotly (Sievert, C. Parmer, T. Hocking, S. Chamberlain, K. Ram, M. Corvellec, and P. Despouy 2017). This allows for both modification of and interaction with plots.

    However, ggfortify and autoplotly do not support some popular types of models, for instance, random forests from the randomForest (Liaw and M. Wiener 2002) and ranger (Wright and A. Ziegler 2017) packages.

  • The hnp package (Moral, J. Hinde, and C. Demétrio 2017) provides half-normal plots with simulated envelopes. These charts evaluate the goodness of fit of any generalized linear model and its extensions. It is a graphical method for comparing two probability distributions by plotting their quantiles against each other. The package offers the possibility of extending hnp to new model classes. However, this package provides only one tool for model diagnostics. In addition, the plots are not based on ggplot2, which makes them difficult to modify.

Model-agnostic approach

The tools presented above target specific model classes. The model-agnostic approach allows us to compare different models.

  • The DALEX (Descriptive mAchine Learning EXplanations) (Biecek 2018) is a methodology for the exploration of black-box models. Its main functionalities focus on understanding or proving how the input variables impact the final predictions. There are also two simple diagnostics: a reversed empirical cumulative distribution function for absolute values of residuals and a box plot of absolute values of residuals. As methods in the DALEX are model-agnostic, they allow for the comparison of two or more models.
  • The package iml (Molnar, B. Bischl, and G. Casalicchio 2018) also contains methods for the structure-agnostic exploration of models, for example, a measure of a feature’s importance obtained by calculating the change in the model performance after permuting the values of a variable.

Model-agnostic audit

In this paper, we present the auditor package for R, which fills the gap of model-agnostic validation. As it extends methods used for linear regression, it may be used to verify any predictive model.

Package Architecture

The auditor package works for any predictive model that returns a numeric value. It offers a consistent grammar of model validation, which is an efficient and convenient way to generate plots and diagnostic scores. A diagnostic score is a number that evaluates one of the properties of a model, for example, the accuracy of the model, the independence of residuals, or the influence of an observation.

Figure 2 presents the architecture of the package. The auditor provides two pipelines for model validation. The first of them consists of two steps: function audit wraps up the model with meta-data, then the result is passed to the plot or score function. The second pipeline includes an additional step, which consists of calling one of the functions that generate computations for plots and scores. These functions are: modelResiduals, modelEvaluation, modelFit, modelPerformance, and observationInfluence. In the following, we call them computational functions. Results of these functions are tidy data frames (Wickham 2014).

graphic without alt text
Figure 2: Architecture of the auditor. The blue color indicates the first pipeline, while orange indicates the second. Function audit takes model and data or “explainer” object created with the DALEX package.

Both pipelines for model audit are compared below.

  1. model %>% audit() %>% computational function %>% plot(type=…)
    We recommend this pipeline. Function audit wraps up a model with the meta-data used for modeling and creates a "modelAudit" object. One of the computational functions takes the "modelAudit" object and computes the results of the validation. Then, the outputs may be printed or passed to the functions score and plot with a defined type. We describe the types of plots in Section 5. This approach requires one additional function within the pipeline. However, once created, the output of the computational function contains all the calculations necessary for the related plots. Therefore, generating multiple plots is fast.

  2. model %>% audit() %>% plot(type=…)
    This pipeline is shorter than the previous one. The only difference is that it does not include a computational function. Calculations are carried out every time the generic plot function is called. Omitting one step might be convenient for ad-hoc model analyses.
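Concretely, the two pipelines can be sketched as follows (using a linear model on the built-in cars data as an illustrative stand-in; the %>% pipe comes from magrittr):

```r
library("auditor")
library("magrittr")

model <- lm(dist ~ speed, data = cars)

# Pipeline 1: the computational function caches all calculations,
# so generating further related plots is fast
model %>%
  audit(data = cars, y = cars$dist) %>%
  modelResiduals() %>%
  plot(type = "Residual")

# Pipeline 2: shorter, but calculations are redone on every plot call
model %>%
  audit(data = cars, y = cars$dist) %>%
  plot(type = "Residual")
```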

Implemented types of plots are presented in Table 1, and scores are presented in Table 2. All plots are generated with ggplot2, which provides a convenient way to modify and combine plots.

Table 1: Columns contain respectively: name of the plot, name of the computational function, value for type parameter of the function plot, indications whether the plot can be applied to regression and classification tasks.
Plot Function plot(type = ...) Reg. Class.
Autocorrelation Function modelResiduals "ACF" + +
Autocorrelation modelResiduals "Autocorrelation" + +
Cooks’s Distances observationInfluence "CooksDistance" + +
Half-Normal modelFit "HalfNormal" + +
LIFT Chart modelEvaluation "LIFT" - +
Model Correlation modelResiduals "ModelCorrelation" + +
Model PCA modelResiduals "ModelPCA" + +
Model Ranking modelPerformance "ModelRanking" + +
Predicted Response modelPerformance "ModelPerformance" + +
REC Curve modelResiduals "REC" + +
Residuals modelResiduals "Residual" + +
Residual Boxplot modelResiduals "ResidualBoxplot" + +
Residual Density modelResiduals "ResidualDensity" + +
ROC Curve modelEvaluation "ROC" - +
RROC Curve modelResiduals "RROC" + +
Scale-Location modelResiduals "ScaleLocation" + +
Two-sided ECDF modelResiduals "TwoSidedECDF" + +
Table 2: Columns contain respectively: name of a score, name of a computational function, value for type parameter of function the score, indications whether the score can be applied to regression and classification tasks.
Score Function score(type = ...) Reg. Class.
Cook’s Distance observationInfluence "CooksDistance" + +
Durbin-Watson modelResiduals "DW" + +
Half-Normal modelFit "HalfNormal" + +
Mean Absolute Error modelResiduals "MAE" + +
Mean Squared Error modelResiduals "MSE" + +
Area Over the REC modelResiduals "REC" + +
Root Mean Squared Error modelResiduals "RMSE" + +
Area Under the ROC modelEvaluation "ROC" - +
Area Over the RROC modelResiduals "RROC" + +
Runs modelResiduals "Runs" + +
Peak modelResiduals "Peak" + +

Notation

Let us use the following notation: \(x_i = (x_i^{(1)}, x_i^{(2)}, ..., x_i^{(p)}) \in \mathcal{X} \subset \mathcal{R}^{p}\) is a vector in space \(\mathcal{X}\), and \(y_i \in \mathcal{R}\) is an observed response associated with \(x_i\). We denote a single observation as a pair \((y_i, x_i)\), and \(n\) is the number of observations.

Let us denote a model as a function \(f: \mathcal{X} \to \mathcal{R}\). We denote predictions of the model \(f\) for a particular observation as \[f(x_i) = \hat{y_i}.\] The raw residual, or simply the residual, is the difference between the observed value \(y_i\) and the predicted value \(\hat{y_i}\). We denote the residual of a particular observation as \[r_i = y_i - \hat{y_i}.\]

Illustrations

Diagnostics allow for the evaluation of different properties of a model. These properties may be related to the following questions: Which model has better performance? Does the model fit the data? Which observations are abnormal? The questions are directly related to the diagnostics objectives described in Section 1. The first of them refers to the evaluation of model performance, which was proposed as Objective 1. The second question concerns the examination of the distribution of residuals (Objective 3). The last one refers to outliers and influential observations (Objective 2).

In this section, we illustrate selected validation tools that allow for the exploration of the above issues. To demonstrate applications of the auditor, we use the data set apartments available in the DALEX package. First, we fit two models: a simple linear regression and a random forest.

  library("auditor")
  library("DALEX")
  library("randomForest")

  lm_model <- lm(m2.price ~ ., data = apartments)
  set.seed(59)
  rf_model <- randomForest(m2.price ~ ., data = apartments)

The next step creates "modelAudit" objects related to these two models.

  lm_audit <- audit(lm_model, label = "lm", 
                data = apartmentsTest, y = apartmentsTest$m2.price)
  rf_audit <- audit(rf_model, label = "rf", 
                data = apartmentsTest, y = apartmentsTest$m2.price)

Below, we create objects of class "modelResidual", which are needed to generate plots. The parameter variable determines the order of residuals in the plot. When the variable argument is set to "Fitted values", residuals are sorted by the values of the predicted responses. Entering the name of a variable, "m2.price", implies that residuals will be ordered by the values of this variable.

  lm_res_fitted <- modelResiduals(lm_audit, variable = "Fitted values")
  rf_res_fitted <- modelResiduals(rf_audit, variable = "Fitted values")
  
  lm_res_observed <- modelResiduals(lm_audit, variable = "m2.price")
  rf_res_observed <- modelResiduals(rf_audit, variable = "m2.price")

Model Ranking Plot

In this subsection, we propose the Model Ranking plot, which compares model performance across multiple measures (see Figure 3). The implemented measures are listed in Table 2 in Section 3. The descriptions of all scores are in Gosiewska and P. Biecek (2018).

The Model Ranking plot consists of two parts. On the left side, there is a radar plot. Colors correspond to models, and edges to the values of scores. Score values are inverted and rescaled to \([0,1]\).

Let us use the following notation: \(m_i \in \mathcal{M}\) is a model in a finite set of models \(\mathcal{M}\), where \(|\mathcal{M}| = k\), and \(score: \mathcal{M} \to \mathbb{R}\) is a loss function for the model under consideration. Higher values mean worse model performance. The \(score(m_i)\) is the performance of model \(m_i\).

Definition 1. We define the inverted score of model \(m_i\) as \[invscore(m_i) = \frac{1}{score(m_i)} \min_{j=1,\ldots,k}{score(m_j)}.\]

Models with a larger \(invscore\) are closer to the centre. Therefore, the best model is located farthest from the center of the plot. On the right side of the plot, there is a table with the results of scoring. The third column contains scores scaled to one of the models.

Let \(m_l \in \mathcal{M}\) where \(l \in \{ 1,2, ..., k \}\) be a model to which we scale.

Definition 2. We define the scaled score of model \(m_i\) to model \(m_l\) as \[scaled_l(m_i) = \frac{score(m_l)}{score(m_i)}.\]

As the values of \(scaled_l(m_i)\) lie between \(0\) and \(1\) whenever \(m_l\) is the best-performing model, the comparison of models is easy, regardless of the ranges of the scores.
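Both definitions can be sketched in a few lines of base R; the score values below are hypothetical losses, not outputs of the auditor:

```r
# Hypothetical loss values for three models (lower = better)
scores <- c(lm = 120, rf = 95, gbm = 100)

# Definition 1: inverted score -- the best model gets 1, worse models get less
invscore <- min(scores) / scores

# Definition 2: scores scaled to a chosen reference model m_l (here: "lm")
scaled_to_lm <- scores[["lm"]] / scores
```

With these values, invscore equals 1 for rf and is below 1 for the other models; choosing the best model as the reference \(m_l\) keeps all scaled scores within \([0,1]\).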

The plot below is generated by plot function with parameter type = "ModelRanking" or by function plotModelRanking. The scores included in the plot may be specified by scores parameter.

  rf_mp <- modelPerformance(rf_audit)
  lm_mp <- modelPerformance(lm_audit)
  plot(rf_mp, lm_mp, type = "ModelRanking")
graphic without alt text
Figure 3: Model Ranking Plot. Random forest (red) has better performance in aspect of MAE and REC scores, while linear model (blue) is better in aspect of MSE and RROC scores.

REC Curve Plot

The Regression Error Characteristic (REC) curve (see Figure 4) is a generalization of the Receiver Operating Characteristic (ROC) curve for binary classification (Swets 1988).

The REC curve estimates the Cumulative Distribution Function of the error. The x axis of the plot shows the error tolerance, and the y axis shows the accuracy at the given tolerance level. Bi and K. P. Bennett (2003) define the accuracy at tolerance \(\epsilon\) as the percentage of observations predicted within the tolerance \(\epsilon\). In other words, residuals larger than \(\epsilon\) are considered as errors.

Let us consider the pairs \((y_i, x_i)\) introduced in Section 4. Bi and K. P. Bennett (2003) define an accuracy as follows.

Definition 3. The accuracy at tolerance level \(\epsilon\) is given by \[acc(\epsilon) = \frac{|\{ (x_i, y_i): loss(f(x_i), y_i) \leq \epsilon,\ i = 1, \ldots, n \}|}{n}.\]

The REC curves implemented in the auditor are plotted for a special case of Definition 3, where the loss is defined as \[loss(f(x_i),y_i) = |f(x_i) - y_i| = |r_i|.\] The shape of the curve illustrates the behavior of errors. The quality of the model can be evaluated and compared for different tolerance levels. A stable growth of the accuracy does not indicate any problems with the model. A small increase of accuracy near \(0\), as well as areas where the growth is fast, signal a bias of the model predictions.
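Under this absolute-error loss, the REC curve, and the Area Over the REC Curve discussed below, can be sketched directly from the residuals (the residual values here are hypothetical, and this is not the auditor implementation):

```r
# Hypothetical residuals r_i
r <- c(-3, -1, 0.5, 2, 8)

# Accuracy at tolerance eps: fraction of |r_i| <= eps (Definition 3)
eps <- c(0, sort(unique(abs(r))))
acc <- sapply(eps, function(e) mean(abs(r) <= e))

# Area Over the REC Curve (AOC), here approximated by the trapezoidal rule
aoc <- sum(diff(eps) * (head(1 - acc, -1) + tail(1 - acc, -1)) / 2)

plot(eps, acc, type = "s",
     xlab = "error tolerance", ylab = "accuracy")
```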

The plot below is generated by plot function with parameter type = "REC" or by plotREC function.

  plot(rf_res_fitted, lm_res_fitted, type = "REC")
graphic without alt text
Figure 4: REC curve. The curve for the linear model (blue) suggests that the model is biased. It displays poor accuracy when the tolerance ϵ is small. However, once ϵ exceeds the error tolerance 130, the accuracy rapidly increases. The random forest (red) has a stable increase of accuracy when compared to the linear model. However, there is a fraction of large residuals.

As it is often difficult to compare models on the plot, there is a REC score implemented in the auditor. This score is the Area Over the REC Curve (AOC), which is a biased estimate of the expected error of a regression model. As Bi and K. P. Bennett (2003) proved, the AOC provides a measure of the overall performance of a regression model.

Scores may be obtained by the score function with type = "REC" or by the scoreREC function.

  scoreREC(lm_res_fitted)
  scoreREC(rf_res_fitted)

Residual Boxplot

The residual boxplot shows the distribution of the absolute values of the residuals \(r_i\). It may be used for the analysis and comparison of residuals. Example plots are presented in Figure 5. Boxplots (Tukey 1977) usually consist of five components. The box itself corresponds to the first quartile, the median, and the third quartile. The whiskers extend to the smallest and largest values that lie no further than 1.5 times the Interquartile Range (IQR) from the first and third quartile, respectively. Residual boxplots contain a sixth component, namely a red dot that marks the Root Mean Square Error (RMSE). In the case of an appropriate model, most of the residuals should lie near zero. A large spread of values indicates problems with the model.
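The components listed above can be sketched in base R (the residuals below are simulated; this is a sketch, not the auditor implementation):

```r
set.seed(1)
r <- rnorm(100)                            # simulated residuals
abs_r <- abs(r)

# Box: quartiles and median of |r_i|
q <- quantile(abs_r, c(0.25, 0.5, 0.75))
iqr <- q[[3]] - q[[1]]

# Upper whisker: largest value within 1.5 IQR of the third quartile
upper_fence <- q[[3]] + 1.5 * iqr
whisker_end <- max(abs_r[abs_r <= upper_fence])

# The additional red-dot component of the residual boxplot
rmse <- sqrt(mean(r^2))
```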

The plot presented below is generated by the plotResidualBoxplot function or by the plot function with parameter type = "ResidualBoxplot".

  plot(lm_res_fitted, rf_res_fitted, type = "ResidualBoxplot")
graphic without alt text
Figure 5: Boxplots of absolute values of residuals. Dots are in similar places, hence RMSE for both models is almost identical. However, the distribution of residuals of these two models is different. For the linear model (blue), most of the residuals are around the average. For the random forest (red), most residuals are small. Nevertheless, there is also a fraction of large residuals.

Residual Density Plot

The Residual Density plot detects incorrect behavior of residuals. An example is presented in Figure 6. The plot shows the estimated densities of residuals. For some models, the expected shape of the density derives from the model assumptions. For example, the residuals of a simple linear model should be normally distributed. However, even if the model does not make an assumption about the distribution of residuals, such a plot may be informative. If most of the residuals are not concentrated around zero, it is likely that the model predictions are biased. The values of the errors are displayed as marks along the x axis. This makes it possible to ascertain whether there are individual observations, or groups of observations, with residuals significantly larger than others.

The plot below is generated by the plotResidualDensity function or by the plot function with parameter type = "ResidualDensity".

  plot(rf_res_observed, lm_res_observed, type = "ResidualDensity")
graphic without alt text
Figure 6: Residual Density Plot. The density of residuals for the linear model (blue) forms two peaks. There are no residuals with values around zero. The residuals do not follow the normal distribution, which is one of the assumptions of simple linear regression. There is an asymmetry in the residuals generated by the random forest (red).

Two-sided ECDF Plot

The Two-sided ECDF plot (see Figure 7) shows Empirical Cumulative Distribution Functions (ECDF) for the positive and negative values of residuals separately.

Let \(x_1, ..., x_n\) be a random sample from a cumulative distribution function \(F(t)\). The following definition comes from van der Vaart (2000).

Definition 4. The empirical cumulative distribution function is given by \[F_n(t) = \frac{1}{n} \sum_{i=1}^n \mathbb{1} \{ x_i \leq t\}.\] The empirical cumulative distribution function presents the fraction of observations that are less than or equal to \(t\). It is an estimator of the cumulative distribution function \(F(t)\).

On the positive side of the x-axis, there is the ECDF of the positive values of residuals. On the negative side, there is a transformation of the ECDF: \[F_{rev}(t) = 1 - F(t).\] Let \(n_N\) and \(n_P\) be the numbers of negative and positive values of residuals, respectively. The negative part of the plot is normalized by multiplying it by the ratio \(n_N / (n_N + n_P)\). Similarly, the positive part is normalized by multiplying it by the ratio \(n_P / (n_N + n_P)\). Due to the applied scale, the ends of the two curves add up to \(100\%\) in total. The plot shows the distribution of residuals divided into groups with positive and negative values. It helps to identify the asymmetry of the residuals. Points represent individual residual values, which makes it possible to identify ‘outliers’.
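The normalization can be sketched as follows (hypothetical residuals; the negative side uses the left-continuous version of the ECDF so that the curve starts at its full height):

```r
# Hypothetical residuals
r <- c(-4, -2.5, -1, 0.5, 1, 2, 6)
neg <- sort(r[r < 0]); pos <- sort(r[r > 0])
n_N <- length(neg); n_P <- length(pos)

# Negative side: (1 - ECDF), scaled by n_N / (n_N + n_P)
y_neg <- (1 - (seq_along(neg) - 1) / n_N) * n_N / (n_N + n_P)

# Positive side: ECDF, scaled by n_P / (n_N + n_P)
y_pos <- (seq_along(pos) / n_P) * n_P / (n_N + n_P)

# The ends of the two curves add up to 100%
max(y_neg) + max(y_pos)  # = 1
```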

The plot below is generated by plotTwoSidedECDF function or by plot function with parameter type = "TwoSidedECDF".

  plot(rf_res_fitted, lm_res_fitted, type = "TwoSidedECDF")
graphic without alt text
Figure 7: Two-sided ECDF plot. The plot shows that the majority of residuals for the random forest (red) are smaller than the residuals for the linear model (blue). However, the random forest also has a fraction of large residuals.

Conclusion and future work

In this article, we presented the auditor package and selected diagnostic scores and plots. We discussed the existing methods of model validation and proposed new visual approaches. We also specified three objectives of model audit (see Section 1), proposed relevant verification tools, and demonstrated their usage. Model Ranking Plot and REC Curve enrich the information about model performance (Objective 1). Residual Boxplot, Residual Density, and Two-Sided ECDF Plots expand the knowledge about the distribution of residuals (Objective 3). What is more, the latter two tools allow for identification of outliers (Objective 2). Finally, we proposed two new plots, the Model Ranking Plot and the Two-Sided ECDF Plot.

We implemented all the presented scores and plots in the auditor package for R. The included functions are based on the uniform grammar introduced in Figure 2. Documentation and examples are available at https://mi2datalab.github.io/auditor/. The stable version of the package is on CRAN, the development version is on GitHub (https://github.com/MI2DataLab/auditor). A more detailed description of the methodology is available in the extended version of this paper on arXiv: https://arxiv.org/abs/1809.07763 (Gosiewska and P. Biecek 2018).

There are many potential areas for future work that we would like to explore, including more extensions of model-specific diagnostics to model-agnostic methods and residual-based methods for investigating interactions. Another potential aim would be to develop methods for local audit based on the diagnostics of a model around a single observation or a group of observations.

Acknowledgements

We would like to acknowledge and thank Aleksandra Grudziąż and Mateusz Staniak for valuable discussions. Also, we wish to thank Dr. Rafael De Andrade Moral for his assistance and help related to the hnp package.

The work was supported by NCN Opus grant 2016/21/B/ST6/02176.

Anderson, T. W. and D. A. Darling. 1952. Asymptotic theory of certain goodness of fit criteria based on stochastic processes. Ann. Math. Statist. 23 (2). https://doi.org/10.1214/aoms/1177729437.
Anscombe, F. J. 1973. Graphs in statistical analysis. The American Statistician 27 (1). https://doi.org/10.1080/00031305.1973.10478966.
Atkinson, A. C. 1985. Plots, Transformations, and Regression: An Introduction to Graphical Methods of Diagnostic Regression Analysis. Oxford Statistical Science Series. Clarendon Press. https://books.google.pl/books?id=oFjgnQEACAAJ.
Atkinson, A. and M. Riani. 2012. Robust Diagnostic Regression Analysis. Springer Series in Statistics. Springer-Verlag. https://books.google.pl/books?id=sZ3SBwAAQBAJ.
Bi, J. and K. P. Bennett. 2003. Regression error characteristic curves. In ICML.
Biecek, P. 2018. DALEX: Explainers for Complex Predictive Models. ArXiv e-prints.
Box, G. E. P. and D. R. Cox. 1964. An analysis of transformations. Journal of the Royal Statistical Society B.
Breiman, L. 2001. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statist. Sci. 16 (3). https://doi.org/10.1214/ss/1009213726.
Breusch, T. S. and A. R. Pagan. 1979. A simple test for heteroscedasticity and random coefficient variation. Econometrica 47 (5). ISSN 0012-9682, 1468-0262. http://www.jstor.org/stable/1911963.
Chen, T. and C. Guestrin. 2016. XGBoost: A scalable tree boosting system. CoRR abs/1603.02754. http://arxiv.org/abs/1603.02754.
Cramér, H. 1928. On the composition of elementary errors: Second paper: Statistical applications. Scandinavian Actuarial Journal (1).
de Jonge, E. and M. van der Loo. 2018. Validatetools: Checking and Simplifying Validation Rule Sets. R package version 0.4.3. https://CRAN.R-project.org/package=validatetools.
Faraway, J. J. 2002. Practical Regression and Anova Using R. University of Bath. https://books.google.pl/books?id=UjhBnwEACAAJ.
———. 2004. Linear Models with R. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis. https://books.google.pl/books?id=fvenzpofkagC.
Fox, J. and S. Weisberg. 2011. An R Companion to Applied Regression. Sage, Thousand Oaks, CA, 2nd edition. http://socserv.socsci.mcmaster.ca/jfox/Books/Companion.
Friendly, M. 2002. Corrgrams: Exploratory displays for correlation matrices. The American Statistician 56 (4).
Gałecki, A. and T. Burzykowski. 2013. Linear Mixed-Effects Models Using R: A Step-by-Step Approach. Springer Texts in Statistics. Springer-Verlag. https://books.google.pl/books?id=rbk_AAAAQBAJ.
Goldfeld, S. M. and R. E. Quandt. 1965. Some tests for homoscedasticity. Journal of the American Statistical Association 60 (310). https://doi.org/10.1080/01621459.1965.10480811.
Gosiewska, A. and P. Biecek. 2018. auditor: An R Package for Model-Agnostic Visual Validation and Diagnostic. ArXiv e-prints.
Gross, J. and U. Ligges. 2015. Nortest: Tests for Normality. R package version 1.0-4. https://CRAN.R-project.org/package=nortest.
Harrell Jr, F. E. 2018. Rms: Regression Modeling Strategies. R package version 5.1-2. https://CRAN.R-project.org/package=rms.
Harrell Jr., F. E. 2006. Regression Modeling Strategies. Springer-Verlag Berlin Heidelberg.
Harrison, M. J. and B. P. M. McCabe. 1979. A test for heteroscedasticity based on ordinary least squares residuals. Journal of the American Statistical Association 74 (366). ISSN 0162-1459. http://www.jstor.org/stable/2286361.
Harvey, A. C. and P. Collier. 1977. Testing for functional misspecification in regression analysis. Journal of Econometrics 6 (1): 103–119. ISSN 0304-4076. https://doi.org/10.1016/0304-4076(77)90057-4.
Hastie, T., R. Tibshirani, and J. Friedman. 2001. The Elements of Statistical Learning. Springer Series in Statistics. Springer-Verlag, New York, NY, USA.
Ho, T. K. 1995. Random decision forests. In Proceedings of the Third International Conference on Document Analysis and Recognition (Volume 1), ICDAR '95, pages 278–. Washington, DC, USA. IEEE Computer Society. http://dl.acm.org/citation.cfm?id=844379.844681.
Liaw, A. and M. Wiener. 2002. Classification and regression by randomForest. R News 2 (3). https://CRAN.R-project.org/doc/Rnews/.
Littell, R. C., G. A. Milliken, W. W. Stroup, R. D. Wolfinger, and O. Schabenberger. 2007. SAS for Mixed Models, Second Edition. SAS Institute. https://books.google.pl/books?id=z9qv32OyEu4C.
Liu, S., X. Wang, M. Liu, and J. Zhu. 2017. Towards better analysis of machine learning models: A visual analytics perspective. Visual Informatics 1 (1): 48–56. ISSN 2468-502X. https://doi.org/10.1016/j.visinf.2017.01.006.
Molnar, C., B. Bischl, and G. Casalicchio. 2018. iml: An R package for interpretable machine learning. JOSS 3 (26): 786. https://doi.org/10.21105/joss.00786.
Moral, R., J. Hinde, and C. Demétrio. 2017. Half-normal plots and overdispersed models in R: The hnp package. Journal of Statistical Software 81 (10). ISSN 1548-7660. https://doi.org/10.18637/jss.v081.i10.
O'Neil, C. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown Publishing Group, New York, NY, USA. ISBN 9780553418811.
Peña, E. A. and E. H. Slate. 2006. Global validation of linear model assumptions. Journal of the American Statistical Association 101 (473). PMID: 20157621. https://doi.org/10.1198/016214505000000637.
Petersen, A. H. and C. T. Ekstrom. 2018. dataMaid: A Suite of Checks for Identification of Potential Errors in a Data Frame as Part of the Data Screening Process. R package version 1.1.2. https://CRAN.R-project.org/package=dataMaid.
Ramsey, J. B. 1969. Tests for specification errors in classical linear least-squares regression analysis. Journal of the Royal Statistical Society B 31 (2). ISSN 0035-9246. http://www.jstor.org/stable/2984219.
Robinson, D. 2018. Broom: Convert Statistical Analysis Objects into Tidy Data Frames. R package version 0.4.4. https://CRAN.R-project.org/package=broom.
Pearson, K. 1900. X. On the Criterion That a Given System of Deviations from the Probable in the Case of a Correlated System of Variables Is Such That It Can Be Reasonably Supposed to Have Arisen from Random Sampling, volume 50. Taylor & Francis. https://doi.org/10.1080/14786440009463897.
Shapiro, S. S. and R. S. Francia. 1972. An approximate analysis of variance test for normality. Journal of the American Statistical Association 67.
Sheather, S. 2009. A Modern Approach to Regression with R. Springer Texts in Statistics. Springer-Verlag. https://books.google.pl/books?id=zS3Jiyxqr98C.
Sievert, C., C. Parmer, T. Hocking, S. Chamberlain, K. Ram, M. Corvellec, and P. Despouy. 2017. Plotly: Create Interactive Web Graphics via 'plotly.js'. R package version 4.7.1. https://CRAN.R-project.org/package=plotly.
Stephens, M. A. 1974. EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association 69 (347). ISSN 0162-1459. http://www.jstor.org/stable/2286009.
Swets, J. 1988. Measuring the accuracy of diagnostic systems. Science 240 (4857). ISSN 0036-8075. https://doi.org/10.1126/science.3287615.
Tang, Y., M. Horikoshi, and W. Li. 2016. ggfortify: Unified Interface to Visualize Statistical Results of Popular R Packages. The R Journal 8 (2). https://journal.r-project.org/archive/2016/RJ-2016-060/index.html.
Tang, Y. 2018. Autoplotly - Automatic Generation of Interactive Visualizations for Popular Statistical Results. ArXiv e-prints.
Tukey, J. W. 1977. Exploratory Data Analysis. Addison-Wesley Series in Behavioral Science. Addison-Wesley Publishing Company. https://books.google.pl/books?id=UT9dAAAAIAAJ.
Utts, J. M. 1982. The rainbow test for lack of fit in regression. Communications in Statistics - Theory and Methods 11 (24). https://doi.org/10.1080/03610928208828423.
van der Loo, M. 2017. Lumberjack: Track Changes in Data. R package version 0.2.0. https://CRAN.R-project.org/package=lumberjack.
van der Vaart, A. W. 2000. Asymptotic Statistics. Cambridge University Press. https://books.google.pl/books?id=UEuQEM5RjWgC.
Venables, W. N. and B. D. Ripley. 2002. Modern Applied Statistics with S. Springer-Verlag, New York, 4th edition. http://www.stats.ox.ac.uk/pub/MASS4.
Von Mises, R. 1928. Wahrscheinlichkeit, Statistik und Wahrheit. Number 3 in Schriften zur wissenschaftlichen Weltauffassung. Springer-Verlag. https://books.google.pl/books?id=W1IaAAAAIAAJ.
Wickham, H. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag. http://ggplot2.org.
———. 2014. Tidy data. The Journal of Statistical Software 59. http://www.jstatsoft.org/v59/i10/.
Willmott, C. J., S. G. Ackleson, R. E. Davis, J. J. Feddema, K. M. Klink, D. R. Legates, J. O'Donnell, and C. M. Rowe. 1985. Statistics for the evaluation and comparison of models. Journal of Geophysical Research 90 (C5): 8995. https://doi.org/10.1029/jc090ic05p08995.
Wright, M. N. and A. Ziegler. 2017. ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77 (1). https://doi.org/10.18637/jss.v077.i01.
Wright, K. 2018. Corrgram: Plot a Correlogram. R package version 1.13. https://CRAN.R-project.org/package=corrgram.
Zeileis, A. and T. Hothorn. 2002. Diagnostic checking in regression relationships. R News 2 (3). https://CRAN.R-project.org/doc/Rnews/.
Zeileis, A., F. Leisch, K. Hornik, and C. Kleiber. 2002. strucchange: An R package for testing for structural change in linear regression models. Journal of Statistical Software 7 (2). ISSN 1548-7660. https://doi.org/10.18637/jss.v007.i02.

  1. The residual of an observation is the difference between the observed value and the value predicted by a model.↩︎