Abstract
Linear modeling is ubiquitous, but performance can suffer when the model is misspecified. We have recently demonstrated that latent groupings in the levels of categorical predictors can complicate inference in a variety of fields including bioinformatics, agriculture, industry, engineering, and medicine. Here we present the R package slgf which enables the user to easily implement our recently-developed approach to detect group-based regression effects, latent interactions, and/or heteroscedastic error variance through Bayesian model selection. We focus on the scenario in which the levels of a categorical predictor exhibit two latent groups. We treat the detection of this grouping structure as an unsupervised learning problem by searching the space of possible groupings of factor levels. First we review the suspected latent grouping factor (SLGF) method. Next, using both observational and experimental data, we illustrate the usage of slgf in the context of several common linear model layouts: one-way analysis of variance (ANOVA), analysis of covariance (ANCOVA), a two-way replicated layout, and a two-way unreplicated layout. We have selected data that reveal the shortcomings of classical analyses to emphasize the advantage our method can provide when a latent grouping structure is present.
Linear models with categorical predictors (i.e., factors) are pervasive in the social, natural, and engineering sciences, among other fields. Conventional approaches to fit these models may fail to account for subtle latent structures, including latent regression effects, interactions, and heteroscedasticity within the data. These latent structures are frequently governed by the levels of a factor. Several examples of such datasets can be found in Franck, Nielsen, and Osborne (2013), Franck and Osborne (2016), Kharrati-Kopaei and Sadooghi-Alvandi (2007), and Metzger and Franck (2021). Our recent work (Metzger and Franck 2021) developed latent grouping factor-based methodology to detect latent structures using Bayesian model selection. The current work provides an overview of the slgf package that enables users to easily implement the suspected latent grouping factor (SLGF) methodology, and expands on the previous work by allowing for more flexible model specification.
Consider Figure @ref(fig:figureintro), which illustrates four relevant data sets analyzed in this paper. In each panel, the levels of a user-specified factor are found to exhibit a latent grouping structure that partitions the data into two groups with distinct regression effects (indicated by color-coding) and/or error variances (filled and open geometry). With the slgf package, the user specifies the factor suspected of governing this latent structure. The package protects the user against detecting spurious latent grouping structures since it can accommodate non-grouped candidate models. It can also accommodate additional linear model terms of interest. The slgf package then assesses the plausibility of each model and the corresponding structures via Bayesian model selection. An overview of slgf functionality for these data follows and full details of each analysis (including candidate models) appear in Section Using the slgf package. The slgf package focuses on assessing the plausibility of two-group structures in linear models with categorical predictors using fractional Bayes factors. A discussion comparing slgf and other R packages that address latent group models is in Section Conclusion.
Smell data, textile data, locknut data, and bottles data. Color (red/gray) shows latent grouping structure (i.e., group-based regression effects) for the smell, textile, and bottles data, and fill (solid/open geometry) shows group-based variances for the smell, textile, and locknut data.
The top left panel of Figure @ref(fig:figureintro) represents a
one-way analysis of variance (ANOVA) study where a continuous
measurement of olfactory function (vertical axis) is modeled as a
function of age, where age is a factor represented in five categories
(horizontal axis) (O’Brien and Heft 1995).
We find the highest posterior model probability (61%) for the model in which levels 1, 2, and 3 of the SLGF age share a mean effect and error variance distinct from those of levels 4 and 5. We call this the smell data set.
The top right panel shows an analysis of covariance (ANCOVA), where
the breaking strength of a starch film (vertical axis) is measured as a
function of the SLGF (starch type) and a continuous measurement of film
thickness (horizontal axis) (Furry 1939).
We find the highest posterior model probability (59%) for the model where potato starch (unshaded gray squares) has a larger error variance than the shaded points, and the red points (canna and corn starch) have a distinct slope from the gray points. We call this the textile data set.
The bottom left panel shows the example described by Meek and Ozgur (1991), where the torque required
to tighten a locknut (vertical axis) was measured as a function of a
plating process and a threading technique. The plating processes
analyzed included treatments with cadmium and wax (CW), heat treating
(HT), and phosphate and oil (PO). The threading techniques studied
include bolt and mandrel, the types of fixture on which each locknut was
affixed to conduct the test. We find the highest posterior model
probability (85%) for the model where bolt by HT and bolt by PO
measurements have a larger error variance than those from bolt by CW,
mandrel by HT, mandrel by PO, and mandrel by CW. We call this the
locknut data.
Finally, in the bottom right panel, the data set of Ott and Snee (1973) represents an unreplicated
two-way layout where six machine nozzles were used to fill bottles on
five occasions (horizontal axis). The weight of each bottle (vertical
axis) was measured, and we find the highest posterior model probability (\(>99\%\)) for the structure in which nozzle 5 is out of alignment with the others. We call this the bottles data.
The slgf
package implements a combinatoric approach that evaluates all possible
assignments of SLGF levels into two groups. We refer to each of these
assignments as schemes. For example, in the smell
data, the scheme that is visualized assigns age levels 1, 2, and 3 into
one group and levels 4 and 5 into the other, denoted {1,2,3}{4,5}. More
details on how schemes are established can be found in Subsection Grouping schemes and model classes.
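To make the scheme count concrete, the base R sketch below enumerates all two-group schemes for a factor. This is illustrative only, not the package's internal code, and the function name enumerate_schemes is ours.

```r
# Illustrative sketch (not slgf internals): enumerate all two-group schemes
# for the levels of a factor. Each {group}{complement} pair is counted once
# by keeping only subsets that contain the first level.
enumerate_schemes <- function(levels, min_levels = 1) {
  K <- length(levels)
  schemes <- list()
  for (size in seq_len(K - 1)) {
    for (g in combn(levels, size, simplify = FALSE)) {
      if (!(levels[1] %in% g)) next
      comp <- setdiff(levels, g)
      if (length(g) < min_levels || length(comp) < min_levels) next
      schemes[[length(schemes) + 1]] <- list(group1 = g, group2 = comp)
    }
  }
  schemes
}

# For the five age levels of the smell data: 2^(5-1) - 1 = 15 schemes
length(enumerate_schemes(as.character(1:5)))
```

Raising min_levels prunes schemes whose smaller group is below the threshold, mirroring the role of the min_levels_beta and min_levels_Sigma arguments discussed later.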
The user may specify an SLGF for regression effects, another SLGF for
error variances, require them to be the same, or specify no SLGF for one
or both of these. For example, the smell data has age as
the SLGF for both. In Subsection Case study 2: textile
data, we analyze a data set with distinct regression and
error variance SLGFs.
In this paper, we provide an overview of the slgf package that enables analysis of data sets like those in Figure @ref(fig:figureintro) via Bayesian model selection. In Section SLGF methodology, we briefly review the SLGF methodology. In Section Using the slgf package, we illustrate the package functionality for the four data sets illustrated in Figure @ref(fig:figureintro). For each data set, we will demonstrate the relevant code and package functionality along with a comparison between the results of a classical approach and our approach. In Section Conclusion, we summarize the package and its functionality.
For a thorough review of the SLGF model specification see Metzger and Franck (2021). First consider the linear model
\[\begin{equation} \boldsymbol{Y}=\boldsymbol{\mathbf{1}}^T \alpha + \boldsymbol{\mathbf{W}}\boldsymbol{\nu}+\boldsymbol{\mathbf{V}}\boldsymbol{\tau}+\boldsymbol{\mathbf{U}}\boldsymbol{\rho}+\boldsymbol{\varepsilon,} (\#eq:equationWVUmodel) \end{equation}\]
where \(1^T\) is an \(N\times 1\) vector of 1s, \(\alpha\) is an intercept, \(\boldsymbol{\nu}\) represents the full SLGF effect with \(K\) degrees of freedom, \(\boldsymbol{\tau}\) represents the regression effects that do not arise from latent groupings (i.e., all other regression effects of interest), and the \(\boldsymbol{\rho}\) terms indicate statistical interactions between the SLGF and other regression effects; \(W\), \(V\), and \(U\) partition the overall model matrix into the model matrices corresponding to the SLGF effects \(\boldsymbol{\nu}\), the additional effects \(\boldsymbol{\tau}\), and the SLGF interactions \(\boldsymbol{\rho}\), respectively; and finally \(\boldsymbol{\varepsilon}\) represents an \(N\times 1\) vector of errors with \(\boldsymbol{\varepsilon}\sim N(\boldsymbol{0},\Sigma)\) for \(\Sigma=\sigma^2I\), where \(I\) is an \(N\times N\) identity matrix.
Because a central goal of the SLGF methodology is to compare models with and without latent grouping structures, we next develop notation to indicate whether model terms in Equation (@ref(eq:equationWVUmodel)) involve groupings of factor levels or not. If a model contains a one degree of freedom group effect instead of the full \(K\) degree of freedom SLGF effect, we denote the effect \(\tilde{\boldsymbol{\nu}}\) instead, with corresponding \(\tilde{W}\) to ensure they remain conformable. Similarly, if the interaction \(\boldsymbol{\rho}\) is with the group effect rather than the full SLGF effect, we denote it \(\tilde{\boldsymbol{\rho}}\). When there are group-based error variances, we let \(\tilde{\boldsymbol{\varepsilon}}\) denote the vector of heteroscedastic errors, where the elements of \(\tilde{\boldsymbol{\varepsilon}}\) are either \(N(0,\sigma^2_1)\) or \(N(0,\sigma^2_2)\) depending on their membership in group 1 or 2, respectively.
For example, for the smell data in the top left panel of
Figure @ref(fig:figureintro), the most probable model can be represented
as \(Y=1^T\alpha +
\tilde{W}\tilde{\boldsymbol{\nu}}+\tilde{\boldsymbol{\varepsilon}}\),
with a 1 degree of freedom group effect \(\tilde{\boldsymbol{\nu}}\) (color-coding)
and heteroscedastic error term \(\tilde{\boldsymbol{\varepsilon}}\)
(shading). This model (posterior model probability 0.65) was found to be
far more probable than the ordinary one-way analysis of variance model
\(Y=1^T\alpha +
W{\boldsymbol{\nu}}+{\boldsymbol{\varepsilon}}\) (posterior model
probability less than 0.0001), the model with a 4 degree of freedom mean
effect \({\boldsymbol{\nu}}\) and
homoscedastic errors \({\boldsymbol{\varepsilon}}\). Similarly,
the most probable model for the bottles data (bottom right panel) is \(Y=1^T\alpha +
W{\boldsymbol{\nu}}+\tilde{U}\tilde{\boldsymbol{\rho}}+{\boldsymbol{\varepsilon}}\)
with a 4 degree of freedom nozzle effect \({\boldsymbol{\nu}}\), an 8 degree of
freedom group-by-nozzle interaction \(\tilde{\boldsymbol{\rho}}\), and
homoscedastic errors \(\boldsymbol{\varepsilon}\).
Recall schemes are the possible assignments of factor levels to two latent groups. While the schemes shown in Figure @ref(fig:figureintro) may seem visually obvious, the slgf package considers all possible such assignments of factor levels into two groups. This (i) obviates the need for the user to specify specific schemes, and (ii) apportions prior model probabilities commensurately with the actual number of models corresponding to an SLGF to prevent detection of spurious latent grouping structure. Problems will differ in the number of schemes under consideration. The package slgf automatically determines the schemes once the set of candidate models has been established by the user. The minimum number of levels that can comprise a grouping scheme can be adjusted by the user to lower the number of candidate models or to avoid creating model effects with too few degrees of freedom to be estimated. The user may specify the SLGF for regression effects and/or error variances, or neither. These SLGFs may or may not be different factors. If they are the same, the user may require that the grouping schemes must be equal or that they may be distinct. For example, in the textile data in the top right panel of Figure @ref(fig:figureintro), the SLGF is starch for both regression effects and error variances, but the user should allow for distinct schemes since the variance scheme appears to be {potato}{canna,corn} and the regression effect scheme appears to be {corn}{canna,potato}.
A model class describes the structure of the model including
specification of effects related to the hidden groups. Model classes may
include, for example, the set of models with group-based regression
effects but no group-based variances; or, a single model with no
group-based regression effects or variances. For example, in the
smell data represented in top left panel of Figure
@ref(fig:figureintro), we consider the following 62 models comprising
six model classes:
- A single model with a 1 degree of freedom global mean effect and homoscedastic error variance;
- A single model with a 4 degree of freedom mean effect and homoscedastic error variance;
- 15 models (corresponding to the 15 possible grouping schemes) with a 1 degree of freedom global mean effect and group-based heteroscedastic error variances;
- 15 models with a 4 degree of freedom mean effect and group-based heteroscedastic error variances;
- 15 models with a 1 degree of freedom group-based mean effect and homoscedastic error variance;
- 15 models with a 1 degree of freedom group-based mean effect and group-based error variances.
For our analysis, we specified that the regression effect and variance grouping schemes must be equivalent, and that one level of the age factor could comprise a group. The user can relax these specifications as desired.
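The model count above can be verified directly:

```r
# Model count for the smell data: five age levels give 2^(5-1) - 1 = 15
# two-group schemes; two classes contain a single model each, and the four
# scheme-based classes contain one model per scheme.
n_schemes <- 2^(5 - 1) - 1
n_models <- 1 + 1 + 4 * n_schemes
n_models  # 62
```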
With slgf, the user can choose to implement noninformative priors on the regression effects (the default), or the Zellner-Siow mixture of \(g\)-priors on these effects. We first enumerate the noninformative priors. Let \(\boldsymbol{\beta}\) represent the full set of regression effects. For simplicity, we parametrize on the precision scale where \(\varphi=\frac{1}{\sigma^2}\) and the corresponding precision matrix \(\varphi I_{N \times N}\) is denoted \(\Phi\). For a model \(m_s^c\) where \(c\) indexes class and \(s\) indexes grouping scheme, slgf imposes \[\label{eqn2} P(\boldsymbol{\beta},\varphi|m_s^c)\propto \varphi (\#eq:eqn2)\] for homoscedastic models, and \[\label{eqn3} P(\boldsymbol{\beta},\varphi_1, \varphi_2|m_s^c)\propto \varphi_1\cdot \varphi_2 (\#eq:eqn3)\]
for heteroscedastic models.
Alternatively, in contexts with limited data, such as the two-way
unreplicated bottles data in the bottom right panel of
Figure @ref(fig:figureintro), we recommend employing the Zellner-Siow
mixture of \(g\)-prior (A. Zellner and Siow 1980; Arnold Zellner 1986; Liang
et al. 2008), which reduces the minimal training sample size
necessary for the computation of the fractional Bayes factor (see
Subsection Fractional Bayes factors
and posterior model probabilities for further detail). We
have generally found that in cases where the number of data points is
close to the number of parameters in some of the larger candidate models
(e.g., case study 4, bottles data), the mixture of \(g\)-priors approach outperforms the
noninformative priors due to the drastic reduction in the required
proportion of the data needed to implement the fractional Bayes factor
approach. For homoscedastic models, recall \(\Phi=\phi I\) where \(I\) is an \(N\times N\) identity matrix. Let \[\label{eqn4}
P(\alpha,\varphi|m_s^c)\propto\varphi (\#eq:eqn4)\] and
\[\label{eqn5} \boldsymbol{\beta}_{-\alpha}|\Phi,g,m_s^c \sim N(\boldsymbol{0},\,g(X^T\Phi X)^{-1}). (\#eq:eqn5)\]
Next, for heteroscedastic models, first denote \(\tilde{\Phi}\) as a diagonal precision matrix where the \(i\)th diagonal element is either \(\varphi_1\) or \(\varphi_2\), depending upon the grouping membership of the \(i\)th observation. Let \[\label{eqn6} P(\alpha, \varphi_1, \varphi_2|m_s^c)\propto \varphi_1\cdot\varphi_2 (\#eq:eqn6)\]
and
\[\label{eqn7} \boldsymbol{\beta}_{-\alpha}|\tilde{\Phi},g,m_s^c \sim N(\boldsymbol{0},\,g(X^T\tilde{\Phi} X)^{-1}). (\#eq:eqn7)\]
In both homoscedastic and heteroscedastic cases, \[\label{eqn8} g\sim \text{InvGamma}\big(\frac{1}{2},\, \frac{N}{2}\big). (\#eq:eqn8)\]
Thus for homoscedastic models, the full prior on all parameters is the product of Equations (@ref(eq:eqn4)), (@ref(eq:eqn5)), and (@ref(eq:eqn8)). For heteroscedastic models, it is the product of Equations (@ref(eq:eqn6)), (@ref(eq:eqn7)), and (@ref(eq:eqn8)).
Note that if we form a standard Bayes factor for models using improper priors on parameters, the unspecified proportionality constants associated with the improper priors (Equations @ref(eq:eqn2), @ref(eq:eqn3), @ref(eq:eqn4), and @ref(eq:eqn6)) would not cancel one another and the Bayes factor would be defined only up to an unspecified constant. Thus we invoke a fractional Bayes factor approach (O’Hagan 1995) to compute well-defined posterior model probabilities for each model. More details follow.
The slgf package obtains posterior model probabilities through the use of fractional Bayes factors. Briefly, a Bayes factor is defined as the ratio of two models’ integrated likelihoods. The integrated likelihood is obtained by integrating parameters out of the joint distribution of data and parameters. In some cases, this integration is analytic, but in others, it is undertaken with a Laplace approximation; the corresponding simplified expressions and methods used to optimize them are described in detail later in this section. In the SLGF context, let \(\mathcal{M}\) represent the full set of models under consideration, representing all classes and grouping schemes of interest. Denote \(\boldsymbol{\theta}\) as the full set of unknown parameters associated with a model \(m_s^c\in \mathcal{M}\) and \(\pi(\boldsymbol{\theta}|m_s^c)\) as the prior on these parameters given model \(m_s^c\). The parameter vector \(\boldsymbol{\theta}\) depends on the class and scheme of model \(m_s^c\). The integrated likelihood is
\[P(\boldsymbol{Y}|m_s^c)=\int_{{\boldsymbol{\Theta}}} P(\boldsymbol{Y}|\boldsymbol{\theta},m_s^c)\pi(\boldsymbol{\theta}|m_s^c)d\boldsymbol{\theta},\]
with Bayes factor comparing models \(m_s^c\) and \(m_{s'}^{c'}\)
\[BF=\frac{P(\boldsymbol{Y}|m_s^c)}{P(\boldsymbol{Y}|m_{s'}^{c'})}.\]
Since the priors used by the slgf package are improper, \(\pi(\boldsymbol{\theta}|m_s^c)\) is defined only up to an unspecified constant. Thus, \(BF\) is defined only up to a ratio of unspecified constants. To overcome this issue and enable improper priors on parameters to be used in the course of Bayesian model selection, the fractional Bayes factor (O’Hagan 1995) was developed. A fractional Bayes factor is a ratio of two fractional marginal model likelihoods, where a fractional marginal likelihood is defined as \[\label{eqn:q} q^b(\boldsymbol{Y}|m_s^c)=\frac{\int P(\boldsymbol{Y}|\boldsymbol{\theta},m_s^c)\pi(\boldsymbol{\theta}|m_s^c)d\boldsymbol{\theta}}{\int P(\boldsymbol{Y}|\boldsymbol{\theta},m_s^c)^b\pi(\boldsymbol{\theta}|m_s^c)d\boldsymbol{\theta}}. (\#eq:eqnq)\]
The \(q^b(\boldsymbol{Y}|m_s^c)\) quantity in Equation (@ref(eq:eqnq)) is the integrated likelihood based on the remaining \(1-b\) fraction of the data, where the improper prior has been updated to a proper distribution using a \(b\) fraction of the data. Thus all normalizing constants are specified. The fractional Bayes factor is then
\[FBF=\frac{q^b(\boldsymbol{Y}|m_s^c)}{q^b(\boldsymbol{Y}|m_{s'}^{c'})}\]
for some fractional exponent \(0<b<1\). Thus we must compute the integrals \(\int P(\boldsymbol{Y}|\boldsymbol{\theta},m_s^c)\pi(\boldsymbol{\theta}|m_s^c)d\boldsymbol{\theta}\) and \(\int P(\boldsymbol{Y}|\boldsymbol{\theta},m_s^c)^b\pi(\boldsymbol{\theta}|m_s^c)d\boldsymbol{\theta}\), the numerator and denominator of Equation (@ref(eq:eqnq)), respectively, for all \(m_s^c\in \mathcal{M}\). Although O’Hagan (1995) provides several recommendations for choice of \(b\), slgf exclusively implements \(b=\frac{m_0}{N}\) where \(m_0\) is the minimal training sample size required for the denominator of Equation (@ref(eq:eqnq)) to be proper for all models. If \(m_0\) is too small, then the denominator of Equation (@ref(eq:eqnq)) diverges. The user must specify \(m_0\); if their choice is too low, then slgf increases it until all relevant integrals converge. For further details, see O’Hagan (1995), p. 101; for recommendations on choosing \(m_0\) in practice, see Subsection Choice of \(\mathbf{m_0}\).
Next we discuss the technical details on how these integrals are computed via Laplace approximation. Specifically, we will describe how \(\log\left(q^b(\boldsymbol{Y}|m_s^c)\right)\) is computed in each case. In the case of noninformative regression priors for homoscedastic models, \(\boldsymbol{\beta}\) and \(\sigma^2\) are integrated analytically. Let \(\hat{\boldsymbol{Y}}\) represent the fitted values of \(m_s^c\) and \(\text{SSResid}\) the residual sum of squares of this model. We obtain \[\log\left(q^b(\boldsymbol{Y}|m_s^c)\right)=\left(-\frac{N(1-b)}{2}\right)(\log\pi+\log(\text{SSResid}))+\left({\frac{Nb-1}{2}}\right)\log b+\log\left(\frac{\Gamma\left(\frac{ N-P}{2}\right)}{\Gamma\left(\frac{Nb-P}{2}\right)}\right)\]
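As an illustration, the closed form above can be evaluated directly. The helper below is a sketch with a hypothetical function name, not the package's internal implementation:

```r
# Sketch: closed-form log fractional marginal likelihood for a homoscedastic
# model under the noninformative prior, following the expression above.
log_q_b <- function(y, X, b) {
  N <- length(y)
  P <- ncol(X)
  ss_resid <- sum(lm.fit(X, y)$residuals^2)
  -(N * (1 - b) / 2) * (log(pi) + log(ss_resid)) +
    ((N * b - 1) / 2) * log(b) +
    lgamma((N - P) / 2) - lgamma((N * b - P) / 2)
}

# A fractional Bayes factor comparing two designs X1, X2 with a common b is
# exp(log_q_b(y, X1, b) - log_q_b(y, X2, b)).
```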
In the case of noninformative regression priors for heteroscedastic models, both the numerator and denominator integrals of Equation (@ref(eq:eqnq)) must be approximated with a Laplace approximation because although \(\boldsymbol{\beta}\) can be integrated analytically, \(\sigma^2_1\) and \(\sigma^2_2\) cannot be. The integrals are computed on the log-scale for numeric stability. Equation (@ref(eq:eqnq)) on the log-scale simplifies to:
\[\log\left(q^b(\boldsymbol{Y}|m_s^c)\right)=\frac{N(b-1)}{2}\log(2\pi)+\frac{P+1}{2}\log b + \frac{1}{2}\log\left(\frac{|H_b^{\star}|}{|H^{{\star}}|}\right) + \log\left(\frac{P(\boldsymbol{Y}|\boldsymbol{\theta}^{{\star}})\pi(\boldsymbol{\theta}^{\star}|m_s^c)}{P(\boldsymbol{Y}|\boldsymbol{\theta}_b^{{\star}})^b\pi(\boldsymbol{\theta}^{\star}_b|m_s^c)}\right)\]
where \(\boldsymbol{\theta}^{\star}\) and \(H^{\star}\) denote the mode and Hessian of \(P(\boldsymbol{Y}|\boldsymbol{{\theta}},m_s^c)\pi(\boldsymbol{\theta}|m_s^c)\), and \(\boldsymbol{\theta}^{\star}_b\) and \(H^{\star}_b\) denote the mode and Hessian of \(P(\boldsymbol{Y}|\boldsymbol{{\theta}},m_s^c)^b\pi(\boldsymbol{\theta}|m_s^c)\). These modes and Hessians are computed with optim using the Nelder-Mead algorithm.
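The Laplace device used here can be sketched generically. The toy function below is ours (not the package's); it checks the approximation against a Gaussian integrand whose integral is known exactly:

```r
# Generic Laplace approximation on the log scale:
# log int exp(h(theta)) dtheta ~ h(theta*) + (d/2) log(2*pi) - (1/2) log|H*|,
# where theta* maximizes h and H* is the Hessian of -h at theta*.
laplace_log_int <- function(h, init) {
  opt <- optim(init, function(th) -h(th), method = "Nelder-Mead",
               hessian = TRUE)
  d <- length(init)
  log_det_H <- as.numeric(determinant(opt$hessian, logarithm = TRUE)$modulus)
  -opt$value + (d / 2) * log(2 * pi) - 0.5 * log_det_H
}

# Bivariate standard normal integrand: exact answer is log(2*pi) = 1.8379...
laplace_log_int(function(th) -sum(th^2) / 2, init = c(1, 1))
```

The approximation is exact for Gaussian integrands, which is why this check recovers log(2*pi) up to optimizer error.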
In the Zellner-Siow mixture of \(g\)-prior case, \(\alpha\) and \(\boldsymbol{\beta}_{-\alpha}\) are integrated analytically. For homoscedastic models, \(\sigma^2\) is as well, and only \(g\) is integrated with a Laplace approximation. Again, marginal model likelihoods are computed on the log-scale. The mode of \(P(\boldsymbol{Y}|g,m_s^c)^b\pi(\boldsymbol{\theta}|m_s^c)\), denoted \(g^{\star}_b\), is found by setting the derivative of its logarithm, \(\frac{(Nb-1-P)b}{2(1+bg)} - \frac{(Nb-1)b(1-R^2)}{2(1+bg(1-R^2))}-\frac{3}{2g}+\frac{N}{2g^2}\), equal to zero and solving with the base R function uniroot, where \(R^2\) is the coefficient of determination for \(m_s^c\). The Hessian is then evaluated at this solution \(g^{\star}_b\); the closed-form Hessian of the log of \(P(\boldsymbol{Y}|g,m_s^c)^b\pi(\boldsymbol{\theta}|m_s^c)\) evaluated at \(g^{\star}_b\) is given by \(H^{\star}_b=\frac{1}{2}\left(\frac{(Nb-1)b^2(1-R^2)^2}{(1+bg^{\star}_b(1-R^2))^2}-\frac{(Nb-P-1)b^2}{(1+bg^{\star}_b)^2}+\frac{3}{(g^{\star}_b)^2} - \frac{2N}{(g^{\star}_b)^3} \right)\). For \(b=1\), these expressions describe the numerator of Equation (@ref(eq:eqnq)); see Liang et al. (2008) for further mathematical details. The Laplace approximation for Equation (@ref(eq:eqnq)) on the log-scale is then given by:
\[\begin{split} \log\left(q^b(\boldsymbol{Y}|m_s^c)\right)=&\log\left(\frac{\Gamma\left(\frac{N-1}{2}\right)}{\Gamma\left(\frac{Nb-1}{2}\right)}\right)+\frac{Nb-1}{2}\left(\log(\text{SSTotal})+\log\pi\right)+ \frac{1}{2}\log\left(\frac{|H_b^{\star}|}{|H^{{\star}}|}\right) + \\ & \ \ \log\left(\frac{P(\boldsymbol{Y}|\boldsymbol{\theta}^{{\star}})\pi(\boldsymbol{\theta}^{\star}|m_s^c)}{P(\boldsymbol{Y}|\boldsymbol{\theta}_b^{{\star}})^b\pi(\boldsymbol{\theta}^{\star}_b|m_s^c)}\right). \end{split}\]
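For intuition, this mode-finding step can be sketched with uniroot applied to the derivative of the log integrand. The values of N, P, \(R^2\), and b below are arbitrary toy inputs, and the function is ours, not the package's:

```r
# Toy sketch (not slgf internals): locate g*_b as the root of the derivative
# of log[ P(Y|g, m)^b * pi(g) ] with respect to g, using uniroot.
g_mode <- function(N, P, R2, b) {
  score <- function(g) {
    (N * b - 1 - P) * b / (2 * (1 + b * g)) -
      (N * b - 1) * b * (1 - R2) / (2 * (1 + b * g * (1 - R2))) -
      3 / (2 * g) + N / (2 * g^2)
  }
  # the score is positive as g -> 0 and negative for large g, so a root exists
  uniroot(score, interval = c(1e-8, 1e8))$root
}

g_mode(N = 30, P = 3, R2 = 0.8, b = 2 / 30)  # a positive scalar mode
```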
For heteroscedastic models, a three-dimensional Laplace approximation is used to integrate \(\sigma^2_1\), \(\sigma^2_2\), and \(g\). To obtain \(\boldsymbol{\theta}^{\star}_b\) and \(\boldsymbol{\theta}^{\star}\), we first transform \(\gamma_1=\log\left(\frac{1}{\sigma^2_1}\right)\) and \(\gamma_2=\log\left(\frac{1}{\sigma^2_2}\right)\) to stabilize the optimization. We optimize
\[\begin{split} \log &P(\boldsymbol{Y}|g,\sigma^2_1,\sigma^2_2)^b\pi(\sigma^2_1,\sigma^2_2,g)=\frac{n_1b}{2}\gamma_1+\frac{n_2b}{2}\gamma_2-\frac{P}{2}\log g+\frac{1}{2}\log|X^T\tilde{\Sigma}X|-\\ & \frac{1}{2}\log\left|\frac{bg+1}{bg}X^T(\tilde{\Sigma}-Z_{\tilde{\Sigma}})X\right|- \\ & \frac{b}{2}\boldsymbol{Y}^T \left(\tilde{\Sigma}-Z_{\tilde{\Sigma}}-(\tilde{\Sigma}-Z_{\tilde{\Sigma}})X\left(\frac{bg+1}{bg}X^T\tilde{\Sigma}X-X^T Z_{\tilde{\Sigma}}X\right)^{-1}X^T(\tilde{\Sigma}-Z_{\tilde{\Sigma}}) \right) \boldsymbol{Y}-\\ & \frac{3}{2}\log(g)-\frac{N}{2g}+\log(J) \end{split}\]
using the Nelder-Mead method from optim, where \(Z_{\tilde{\Sigma}}=\tilde{\Sigma}Z(Z^T \tilde{\Sigma} Z)^{-1}Z^T \tilde{\Sigma}\), \(Z=\boldsymbol{1}^T\), and \(\log(J)=-(\gamma_1+\gamma_2)\) is the log of the Jacobian determinant of the log-precision transformation. For \(b=1\), these equations yield the integrand of the numerator of Equation (@ref(eq:eqnq)).
With the modes computed, the Hessians of \(\log P(\boldsymbol{Y}|g,\sigma^2_1,\sigma^2_2)^b\pi(\sigma^2_1,\sigma^2_2,g)\) are calculated with the function hessian from the package numDeriv. Finally, with the modes and Hessians computed, the Laplace approximation for Equation (@ref(eq:eqnq)) is given by:
\[\log \left(q^b(\boldsymbol{Y}|m_s^c)\right)=\frac{Nb-1}{2}\log(2\pi)+\frac{P+1}{2}\log(b)+ \frac{1}{2}\log\left(\frac{|H_b^{\star}|}{|H^{{\star}}|}\right) + \log\left(\frac{P(\boldsymbol{Y}|\boldsymbol{\theta}^{{\star}})\pi(\boldsymbol{\theta}^{\star}|m_s^c)}{P(\boldsymbol{Y}|\boldsymbol{\theta}_b^{{\star}})^b\pi(\boldsymbol{\theta}^{\star}_b|m_s^c)}\right).\]
For consistency, all marginal model likelihoods, even those that are fully tractable, are computed via the FBF machinery. Once log-fractional marginal likelihoods have been computed for all models, we subtract the maximum from the set, rescaling the log-fractional marginal likelihoods to have a maximum of 0. Each value is then exponentiated to obtain a set of fractional marginal likelihoods with maximum 1. This adjustment helps to avoid numerical underflow when computing posterior model probabilities.
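This max-subtraction is the standard guard against underflow; a minimal example:

```r
# Without rescaling, the log marginal likelihoods underflow to zero when
# exponentiated; subtracting the maximum first maps the largest value to 1.
log_q <- c(-1502.3, -1498.7, -1510.9)  # hypothetical log marginal likelihoods
exp(log_q)                # underflows: all three values are exactly 0
q <- exp(log_q - max(log_q))
q                         # positive, with maximum exactly 1
```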
The user must specify the argument m0, the minimal training sample size such that all marginal model likelihoods are well-defined. If prior="flat", then we recommend that the user begin by letting m0 equal the dimension of the improper priors: that is, the number of coefficients in the most complex model under consideration plus the number of variances under consideration. If prior="zs", then m0 can generally be much smaller (in practice, we have found that m0=2 performs well) because the prior on the regression effects is proper. If the user’s choice is too low, then ms_slgf will incrementally increase it by 1 until all marginal model probabilities are numerically stable. If m0 reaches \(N\), corresponding to 100% of the data being used for training, ms_slgf will terminate and the user should specify a different set of models.
With this adjusted set of fractional marginal likelihoods, we next
consider the priors for the model space. The function
ms_slgf imposes a uniform prior by model class, and for
classes containing multiple models, the prior on each class is uniformly
divided among the models it contains. We finally compute posterior model
probabilities for each model:
\[P(m^{\prime}|\boldsymbol{Y})=\frac{P(\boldsymbol{Y}|m^{\prime})P(m^{\prime})}{\underset{\mathcal{M}}{\sum}P(\boldsymbol{Y}|m)P(m)}.\]
The prior probability placed on each model can be found in the
models$ModPrior vector in output from
ms_slgf.
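Continuing the smell data example (62 models in six classes), the model-space prior and the resulting posterior probabilities can be sketched as follows; the log marginal likelihoods here are random placeholders:

```r
# Uniform prior by class, split evenly among the models within each class.
class_of <- c(1, 2, rep(3:6, each = 15))            # 62 models, 6 classes
class_size <- as.numeric(table(class_of))[class_of]
mod_prior <- (1 / 6) / class_size                   # sums to 1

set.seed(1)
log_q <- rnorm(62)                  # placeholder log marginal likelihoods
q <- exp(log_q - max(log_q))        # rescaled marginal likelihoods
post <- mod_prior * q / sum(mod_prior * q)          # posterior probabilities
```

Because classes 1 and 2 each contain a single model, those models receive prior probability 1/6 apiece, while each scheme-based model receives (1/6)/15.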
Our approach provides maximum a posteriori (MAP) estimates for all relevant quantities: \(\hat{\boldsymbol{\beta}}\); \(\hat{\boldsymbol{\sigma}}^2=\{\hat{\sigma}^2\}\) or \(\hat{\boldsymbol{\sigma}}^2=\{\hat{\sigma}^2_1,\hat{\sigma}^2_2\}\) in the homoscedastic and heteroscedastic cases, respectively; and \(\hat{g}\) in the Zellner-Siow mixture of \(g\)-prior case.
Because the prior on \(\boldsymbol{\beta}\) is either flat or centered at 0, the MAP estimator is simply the usual maximum likelihood estimator:
\[\hat{\boldsymbol{\beta}}=\underset{\boldsymbol{\beta}}{\arg\max}\,P(\boldsymbol{Y}|X,\boldsymbol{\beta},\Sigma)\] so that \(\hat{\boldsymbol{\beta}}=(X^TX)^{-1}X^T\boldsymbol{Y}\). The variance(s) and \(g\) are computed via the base R function optim during the Laplace approximation stage. For computational efficiency, \({\boldsymbol{\beta}}\) is integrated out of \(P(\boldsymbol{Y}|X,\boldsymbol{\theta})P(\boldsymbol{\theta})\) and the variances are estimated on the log-scale, so we let \(\hat{\boldsymbol{\lambda}}:=\{\hat{\lambda}\}\) in homoscedastic models or \(\{\hat{\lambda}_1,\hat{\lambda}_2\}\) in heteroscedastic models. Then
\[\hat{\boldsymbol{\lambda}}=\underset{\boldsymbol{\lambda}}{\arg\max}\int
P(\boldsymbol{\mathbf{Y}}|X,\boldsymbol{\beta},\Sigma)P(\boldsymbol{\beta})P(\Sigma)d\boldsymbol{\beta}\]
or, \[\{\hat{\boldsymbol{\lambda}},\hat{g}\}=\underset{\boldsymbol{\lambda},g}{\arg\max}\int
P(\boldsymbol{\mathbf{Y}}|X,\boldsymbol{\beta},\Sigma,g)P(\alpha)P(\boldsymbol{\beta}_{-\alpha}|\Sigma)P(g)d\boldsymbol{\beta}.\]
Then, \(\hat{\boldsymbol{\sigma}}^2=\exp\{\hat{\boldsymbol{\lambda}}\}\)
for \(\hat{\boldsymbol{\sigma}}^2=\{\hat{\sigma}^2\}\)
or \(\hat{\boldsymbol{\sigma}}^2=\{\hat{\sigma}^2_1,\hat{\sigma}^2_2\}\).
The output values coefficients, variances, and
gs (only if prior="zs") are lists where each
element contains the estimates for each model’s \(\hat{\boldsymbol{\beta}}\), \(\hat{\boldsymbol{\sigma}^2}\), and \(\hat{\bf{g}}\), respectively.
The function ms_slgf() is the main function of slgf
that implements the methodology we have described. Each argument of
ms_slgf() and its output will be illustrated in the case
studies found in Subsections Case
study 1: smell data, Case study 2: textile
data, Case study 3:
locknut data, and Case study 4: bottles
data. The ms_slgf() function requires several
inputs to compute and output posterior model probabilities for all
models, schemes, and model classes of interest. The user begins with a
data.frame containing a continuous response, at least one
categorical predictor, and any other covariates of interest. The
data.frame cannot contain column names with the character
string group, because ms_slgf() will search
for this string when fitting group-based models. The user must first
identify an SLGF for the fixed effects and/or the variance. The user
indicates, via the arguments response,
slgf_beta, and slgf_Sigma, character strings
corresponding to the response, the suspected latent fixed effect
grouping factor, and the suspected latent variance grouping factor,
respectively. If no latent regression effect structure or variance
structure is to be considered, the user may specify
slgf_beta=NA, slgf_Sigma=NA, or both. We note
that if the user does not specify any SLGFs, the model selection is
still undertaken through fractional Bayes factors as described
previously. If the user chooses the same categorical variable for both
latent grouping factors, the argument same_scheme, which
defaults to FALSE, can indicate whether the grouping
schemes for the regression effect and variance structures must be
equivalent.
Next the user determines the model classes they wish to evaluate. The
argument usermodels is a list where each
element is a model specification of R class formula or
character. The user also specifies, via the argument
het, which classes should additionally be considered in a
heteroscedastic context. This argument is a vector of the same length as
usermodels, containing an indicator for each model class:
1 indicates the class will also be considered with group-based variances,
and 0 indicates it will not. Together the arguments usermodels and
het indicate which fixed effect structures are of interest,
and which should be further considered for heteroscedasticity, thus
implicitly creating the full set of model classes considered.
Next the user chooses a prior to place on the regression effects. As
described in Subsection Parameter
priors, prior="flat" (the default) implements
the noninformative prior and prior="zs" imposes the
Zellner-Siow mixture of \(g\)-priors.
Finally the user must specify the minimum number of levels of the
SLGF that can comprise a group, via the arguments
min_levels_beta and min_levels_Sigma, which
default to 1. The number of possible grouping schemes increases with the
number of levels of the SLGF. To speed up the computation, the user can
increase these arguments and thus reduce the number of candidate models.
Because we partition the levels into exactly two groups, these arguments
may not exceed half the number of levels of the SLGF. Additionally, when
considering data with limited degrees of freedom, increasing
min_levels_beta and/or min_levels_Sigma may be
necessary to ensure effects can be estimated.
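To make the combinatorics concrete, the following sketch (in Python for illustration; the grouping_schemes helper is ours and not part of slgf) enumerates the unordered two-group partitions of a factor's levels subject to a minimum group size, and reproduces the count of 62 models considered in the smell analysis that follows.

```python
from itertools import combinations

def grouping_schemes(levels, min_levels=1):
    """Enumerate unordered partitions of factor levels into two groups,
    each containing at least min_levels levels."""
    levels = list(levels)
    K = len(levels)
    schemes = set()
    for k in range(min_levels, K - min_levels + 1):
        for g1 in combinations(levels, k):
            g2 = tuple(l for l in levels if l not in g1)
            schemes.add(frozenset([g1, g2]))  # unordered pair of groups
    return schemes

# Five levels of agecat with min_levels=1: 2^(5-1) - 1 = 15 schemes.
S = len(grouping_schemes([1, 2, 3, 4, 5]))

# Smell analysis: classes olf~1, olf~agecat, olf~group, each with a
# heteroscedastic variant (het=c(1,1,1)), and same_scheme=TRUE:
#   olf~1:      1 homoscedastic + 15 heteroscedastic
#   olf~agecat: 1 homoscedastic + 15 heteroscedastic
#   olf~group: 15 homoscedastic + 15 heteroscedastic (shared scheme)
n_models = (1 + S) + (1 + S) + (S + S)
```

Raising min_levels shrinks this enumeration, which is why increasing min_levels_beta or min_levels_Sigma speeds up the computation.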
First we revisit the smell data set analyzed by O’Brien and Heft (1995). They measured olfactory
acuity (denoted olf) on a continuous scale as a function of
age (agecat), where age groups were divided into five
categorical levels. See Figure @ref(fig:figuresmell). We note that
levels 4 and 5 of agecat appear to have larger variance
than levels 1, 2, and 3, but standard analysis of variance models assume
homoscedasticity. We first demonstrate how a classical analysis might
misrepresent the data. A standard one-way ANOVA compares the null
model, which has a single mean, against the full model, which has 4 degrees
of freedom for the agecat effects; both assume homoscedastic error variance.
The smell data are analyzed for group-based means and variances. The model with group-based means and variances under scheme {1,2,3}{4,5} receives a posterior model probability of approximately 61%.
> smell$agecat <- as.factor(smell$agecat) # coerce agecat to a factor variable
> smell_null <- lm(olf~1, data=smell) # fit a null model with a single mean
> smell_full <- lm(olf~agecat, data=smell) # fit a full model with 4 agecat effects
> print(smell_null)
Call:
lm(formula = olf ~ 1, data = smell)
Coefficients:
(Intercept)
1.234
> print(smell_full)
Call:
lm(formula = olf ~ agecat, data = smell)
Coefficients:
(Intercept) agecat2 agecat3 agecat4 agecat5
1.31689 0.02824 -0.01075 -0.11580 -0.25728
> anova(smell_null, smell_full) # compare the null and full models
Analysis of Variance Table
Model 1: olf ~ 1
Model 2: olf ~ agecat
Res.Df RSS Df Sum of Sq F Pr(>F)
1 179 7.7585
2 175 5.6197 4 2.1388 16.651 1.395e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(smell_null)$sigma^2
0.04334349
> summary(smell_full)$sigma^2
0.03211259
This approach, which assumes all levels of agecat have
equal error variance, favors the model with a 4 degree of freedom
agecat effect. Note the estimated error variance under the full model is
\(\hat{\sigma}^2_{\text{full}}=0.03211\).
Based on Figure @ref(fig:figuresmell), we suspect this value may
overestimate the error variance for levels 1, 2, and 3, while
underestimating that of levels 4 and 5. We also suspect that the full
model may be overly complex, as the means for levels 1, 2, and 3 appear
to be plausibly equivalent. That is, the apparent latent grouping scheme
for both regression effects and error variances is {1,2,3}{4,5}, or
equivalently, {4,5}{1,2,3}.
Next, consider the slgf
approach. We will consider the classes of models with group-based means,
group-based variances, and both group-based means and variances. We
specify dataf=smell and response="olf", along
with lgf_beta="agecat" and
lgf_Sigma="agecat" as the suspected latent grouping factor
for both regression effects and variances. We set the minimum number of
levels for a group to 1 with min_levels_beta=1 and
min_levels_Sigma=1. Note that fewer grouping schemes would
be considered if we let these arguments equal 2. For simplicity, since
the mean and variance grouping schemes both visually appear to be
{1,2,3}{4,5}, we will restrict the schemes to be equivalent with
same_scheme=TRUE. Via the usermodels argument,
we will consider the null model class olf\(\sim\)1, the full model class
olf\(\sim\)agecat, and the
group-means model class olf\(\sim\)group, which will
automatically consider all possible grouping schemes. Similarly, we will
consider each of these formulations with both homoscedastic and
group-based variances via the argument het=c(1,1,1).
With a relatively large amount of data, we use the noninformative
prior="flat". Finally we specify a minimal training sample
size of m0=9; if this value is specified too
small, ms_slgf() will automatically increase it to the
smallest value for which the relevant integrals converge and/or the
necessary optimizations can be performed. We run ms_slgf() to
obtain the posterior model probabilities for all 62 models under
consideration. We inspect the two most probable models, with indices 62
and 32, which comprise over 99% of the posterior probability over the
model space considered:
> smell_out <- ms_slgf(dataf=smell, response="olf", lgf_beta="agecat",
min_levels_beta=1, lgf_Sigma="agecat",
min_levels_Sigma=1, same_scheme=TRUE,
usermodels=list("olf~1", "olf~agecat", "olf~group"),
het=c(1,1,1), prior="flat", m0=9)
> smell_out$models[c(1,2),c(1,2,3,5,7)]
Model Scheme.beta Scheme.Sigma FModProb Cumulative
62 olf~group {4,5}{1,2,3} {4,5}{1,2,3} 0.6054935 0.6054935
32 olf~agecat None {4,5}{1,2,3} 0.3878754 0.9933688
The most probable model, as suspected, is olf\(\sim\)group, indicating
group-based means where Scheme.beta is {4,5}{1,2,3}. Note
also Scheme.Sigma indicates group-based heteroscedasticity
with the same scheme. This model received posterior probability of
approximately 61%. The next most probable model also has group-based
heteroscedasticity with scheme {4,5}{1,2,3}, but its mean structure is
the full olf\(\sim\)agecat model: not
group-based mean effects, but rather a 4 degree of freedom
agecat effect. By inspecting
smell_out$scheme_probabilities_Sigma, we see that models
with variance grouping scheme {4,5}{1,2,3} comprise over 99% of the
posterior probability. By contrast, the models with fixed effect
grouping scheme {4,5}{1,2,3} (that is, the homoscedastic and
heteroscedastic versions) comprise 61% of the posterior probability. We
find these posterior probabilities to be intuitive, easy-to-interpret
quantifications of uncertainty.
The output fields coefficients and
variances contain lists with the coefficients and
variance(s) associated with each model. The output field
model_fits contains the fitted linear model object for each
model specification. Note the most probable model has index 62, so we
inspect the 62nd elements of the lists
smell_out$coefficients and
smell_out$variances, which contain the MAP estimates for
each model's regression effects and variance(s), respectively. The
group-based variance estimates are \(\hat{\sigma}^2_{\text{\{4,5\}}}=0.0587\)
and \(\hat{\sigma}^2_{\text{\{1,2,3\}}}=0.0121\).
We contrast these variances against the estimate \(\hat{\sigma}^2_{\text{full}}=0.032\), which
appears to have overestimated the variance of levels 1, 2, and 3, while
simultaneously underestimating that of levels 4 and 5.
> smell_out$coefficients[[62]]
(Intercept) group{4,5}
1.3252211 -0.1940328
> smell_out$variances[[62]]
{4,5} {1,2,3}
0.05868885 0.01211084
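As a brief aside on interpreting these MAP estimates (assuming R's default treatment coding with {1,2,3} as the reference group, which is our reading of the output above), the two group means can be recovered with simple arithmetic, sketched here in Python:

```python
# MAP regression effects from smell_out$coefficients[[62]], assuming
# treatment coding with group {1,2,3} as the reference level.
intercept = 1.3252211         # estimated mean acuity for group {1,2,3}
group_45_effect = -0.1940328  # offset for group {4,5}

mean_123 = intercept
mean_45 = intercept + group_45_effect  # roughly 1.131
```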
We reanalyze the breaking strength data set of Furry (1939), also investigated by Metzger and Franck (2021), to illustrate the
additional flexibility of slgf
beyond the original work. The breaking strength of a starch film,
strength (measured in grams), is analyzed according to the
thickness of the film, film (measured in \(10^{-4}\) inches), and the type of starch
used to create the film, starch (canna, corn, or potato). As
usual, we begin by plotting the data to ascertain whether there is a
latent grouping factor present. By inspection we note that the potato
films, represented by squares in Figure @ref(fig:figurechips), appear to
have a higher variability than the corn (filled red circles) and canna
(filled gray triangles) films.
The breaking strength data set from Furry (1939) represents the breaking strength of starch films depending on the thickness of a film coating and the type of starch used to make the film. The left panel shows the data. The center panel shows an ANCOVA model. The right panel shows the most probable model (\(P(m|\boldsymbol{Y})\approx 66\%\)) containing a latent group-based interaction between groups {canna, potato}{corn} (gray points vs. red points) and film thickness, as well as distinct variances between groups {canna, corn}{potato} (filled points vs. open points).
We first illustrate a typical ANCOVA approach, in which three parallel lines, one for each level of starch, are fit with a common error variance. This model leads to the fit shown in the center panel of Figure @ref(fig:figurechips). Note only the film thickness effect is statistically significant according to a traditional hypothesis testing approach with \(\alpha=0.05\). The estimated error variance of this model is \(\hat{\sigma}^2_{\text{ANCOVA}}=27126.09\).
> textile_ancova <- lm(strength~film+starch, data=textile)
> summary(textile_ancova)
Call:
lm(formula = strength ~ film + starch, data = textile)
Residuals:
Min 1Q Median 3Q Max
-203.63 -99.45 -57.84 56.72 637.61
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 158.26 179.78 0.880 0.383360
film 62.50 17.06 3.664 0.000653 ***
starchcorn -83.67 86.10 -0.972 0.336351
starchpotato 70.36 67.78 1.038 0.304795
We contrast these findings against our methodology with slgf.
The following arguments are input: dataf=textile specifies
the data frame; response="strength" specifies the column of
textile that contains the response variable;
lgf_beta="starch" and lgf_Sigma="starch"
indicate that the categorical variable starch should be
used as the latent grouping factor for both regression effects and
variances; same_scheme=FALSE indicates that the latent
regression effect and variance grouping structures do not need to be
partitioned by the same levels of starch;
min_levels_beta=1 and min_levels_Sigma=1
indicate that a latent group may contain as few as one level of
starch; the usermodels argument indicates that we will
consider the main effects model strength\(\sim\)film+starch,
the interaction model strength\(\sim\)film*starch,
and models with group-based regression effects,
strength\(\sim\)film+group and
strength\(\sim\)film*group;
the argument het=c(1,1,1,1) indicates that each of these
four model specifications will also be considered with group-based
variances; prior="flat" places a flat prior on the
regression effects; and m0=8 specifies the minimal training
sample size.
> data(textile)
> out_textile <- ms_slgf(dataf = textile, response = "strength",
lgf_beta = "starch", lgf_Sigma = "starch",
same_scheme=FALSE, min_levels_beta=1, min_levels_Sigma=1,
usermodels = list("strength~film+starch", "strength~film*starch",
"strength~film+group", "strength~film*group"),
het=c(1,1,1,1), prior="flat", m0=8)
> out_textile$models[1:5,c(1,2,3,5)]
Model Scheme.beta Scheme.Sigma FModProb
31 strength~film*group {corn}{canna,potato} {potato}{canna,corn} 0.6596667376
8 strength~film*starch None {potato}{canna,corn} 0.3337588991
30 strength~film*group {canna}{corn,potato} {potato}{canna,corn} 0.0018692078
28 strength~film*group {corn}{canna,potato} {corn}{canna,potato} 0.0010854755
7 strength~film*starch None {corn}{canna,potato} 0.0006831597
Refer to code and output above, where we provide the five most
probable models. Note the three most probable models all have the latent
variance grouping scheme {potato}{canna, corn}; again over 99% of the
posterior model probability is accounted for by this variance scheme.
This visually agrees with the plot, which shows that the potato starch
films seem to have higher variability than the canna and corn starch
films. The regression effect structure is less clear: the most probable
model is the film*group model, which contains main
effects for film and group as well as their
interaction, with scheme {corn}{canna,potato}. We plot this model in
the right panel of Figure @ref(fig:figurechips) to illustrate its
plausibility. It does appear that the slope for corn is steeper than
that of potato and canna, which can be contracted into a single group to
simplify the model. Meanwhile, the error variance for potato appears
larger than that of canna and corn, as evidenced by the large spread
of square potato points around the gray line. Thus we assert that the
most probable model under our methodology is reasonable and appropriate.
The group-based variance estimates are \(\hat{\sigma}^2_{\text{\{potato\}}}=57734.046\)
and \(\hat{\sigma}^2_{\text{\{canna,corn\}}}=5791.713\),
indicating the standard ANCOVA model underestimates the error variance
of the potato observations, and overestimates those of the canna and
corn observations.
Finally we illustrate the output
scheme_probabilities_beta and
scheme_probabilities_Sigma, which sum up the probabilities
for all model specifications associated with each possible grouping
scheme. We see moderately high cumulative probability for models with
regression grouping scheme {corn}{canna,potato}, followed closely by
models with no grouping scheme for regression effects:
> out_textile$scheme_probabilities_beta
Scheme.beta Cumulative
2 {corn}{canna,potato} 0.592860983
4 None 0.403632744
1 {canna}{corn,potato} 0.002502435
3 {potato}{canna,corn} 0.001003838
Intuitively, based on the wider spread of the square potato points in Figure @ref(fig:figurechips), we see high cumulative probability for the variance grouping scheme {potato}{canna,corn}:
> out_textile$scheme_probabilities_Sigma
Scheme.Sigma Cumulative
3 {potato}{canna,corn} 9.975853e-01
2 {corn}{canna,potato} 2.184257e-03
1 {canna}{corn,potato} 2.304323e-04
4 None 1.632320e-08
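These scheme probabilities are obtained by summing posterior model probabilities over every model that shares a given scheme, by the law of total probability. A small sketch (in Python for illustration, transcribing only the five models printed earlier, so the partial sums slightly undershoot the full table):

```python
from collections import defaultdict

# (model, variance scheme, posterior probability) for the five most
# probable textile models, transcribed from the ms_slgf() output above.
top_models = [
    ("strength~film*group",  "{potato}{canna,corn}", 0.6596667376),
    ("strength~film*starch", "{potato}{canna,corn}", 0.3337588991),
    ("strength~film*group",  "{potato}{canna,corn}", 0.0018692078),
    ("strength~film*group",  "{corn}{canna,potato}", 0.0010854755),
    ("strength~film*starch", "{corn}{canna,potato}", 0.0006831597),
]

# A scheme's probability is the sum of the posterior probabilities of
# every model that uses that scheme (law of total probability).
scheme_prob = defaultdict(float)
for _, scheme, prob in top_models:
    scheme_prob[scheme] += prob
```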
We consider the two-way replicated layout of Meek and Ozgur (1991), where the torque
(torque) required to tighten a locknut was measured as a
function of a plating process (plating) and a threading
method (fixture).
The most probable model (\(P(m|Y)\approx 85\%\)) contains a full fixture by plating interaction effect with no grouping structure, and group-based variances based on the levels of this interaction with scheme {boltCW, mandrelCW, mandrelHT, mandrelPO}{boltHT, boltPO} (filled points vs. open points).
A two-way analysis with an interaction yields the following ANOVA table. The fixture and plating main effects, along with fixture by plating interaction, are all statistically significant at level \(\alpha=0.005\). Additionally, we find \(\hat{\sigma}^2_{\text{Full}}=36.58\):
> anova(lm(Torque~Fixture+Plating+Fixture*Plating, data=locknut))
Analysis of Variance Table
Response: Torque
Df Sum Sq Mean Sq F value Pr(>F)
Fixture 1 821.4 821.40 22.4563 1.604e-05 ***
Plating 2 2290.6 1145.32 31.3118 9.363e-10 ***
Fixture:Plating 2 665.1 332.55 9.0916 0.0003952 ***
Residuals 54 1975.2 36.58
Upon inspection of Figure @ref(fig:figurelocknut), we suspect that
two latent characteristics are at play. First, based on the non-parallel
lines representing the plating effects, there may be an interaction
between fixture and latent groups of plating levels, so we will consider
lgf_beta="Plating". Note
since fixture has only two levels, it is not feasible to consider
group-based effects based on fixture, since the one degree of freedom
fixture effect would be equivalent to a group effect.
Regarding the variance structure, the variance of the torque amount
at levels PO and HT appears higher, but only for the bolt fixture. This
suggests that the levels of the interaction govern the variance groups;
that is, the variance SLGF should be the fixture-by-plating interaction.
Since no such variable appears in the locknut data set, we
manually create a new variable, Interaction, whose levels
are the interaction levels, by pasting together the main effect
variables, and specify lgf_Sigma="Interaction":
locknut$Interaction <- paste0(locknut$Fixture, "*", locknut$Plating)
Thus we consider the following model specifications. Liang et al. (2008) (p. 420) note that the
Zellner-Siow mixture of \(g\)-priors
provides a fully Bayesian, consistent model selection procedure for
small \(n\) along with relatively
straightforward expressions for the marginal model probabilities. This
approach is implemented by the user with the argument
prior="zs":
> data(locknut)
> locknut$Interaction <- paste0(locknut$Fixture, "*", locknut$Plating)
> out_locknut <- ms_slgf(dataf=locknut, response="Torque", same_scheme=FALSE,
lgf_beta="Plating", min_levels_beta=1,
lgf_Sigma="Interaction", min_levels_Sigma=1,
usermodels=list("Torque~Fixture+Plating+Fixture*Plating",
"Torque~Fixture+group+Fixture*group"),
het=c(1,1), prior="zs", m0=2)
This formulation favors the same main and interaction effects favored by the standard model. However, slgf favors group-based variances with scheme {bolt*HT, bolt*PO}{bolt*CW, mandrel*CW, mandrel*HT, mandrel*PO} with posterior probability of approximately 85%. This variance structure was expected based on the relatively larger spread of the open points in Figure @ref(fig:figurelocknut). As we have noted previously, the group variance estimates show that the homoscedastic model overestimates the variance for some levels of fixture and plating, and underestimates it for others. Since model 13 was the most probable model, we print its variances, obtaining \(\hat{\sigma}^2_{\text{bolt*HT,bolt*PO}}\approx 85.0\) and \(\hat{\sigma}^2_{\text{bolt*CW,mandrel*CW,mandrel*HT,mandrel*PO}}\approx 11.6\):
> out_locknut$variances[[13]]
{bolt*HT,bolt*PO} {bolt*CW,mandrel*CW,mandrel*HT,mandrel*PO}
85.00448 11.58652
Finally, we consider the data of Ott and Snee
(1973), where a machine with six heads (head) is
designed to fill bottles (weight). The weight of each
bottle is measured once over five time points (time) as a
two-way unreplicated layout. A visual inspection of the data (Figure
@ref(fig:figurebottles), left panel) indicates that one of the filling
heads is behaving differently from the other five. There appears to be an
interaction between head and time, but we lack the degrees of freedom to
fit such a model. If we fit the standard main effects model, we
obtain the clearly inappropriate fit shown in the center panel of Figure
@ref(fig:figurebottles).
Since it appears that head {5} is out of calibration in some way as
compared to heads {1,2,3,4,6}, we instead consider the group-based
interaction model weight\(\sim\)time+group:time where
heads is the regression effect SLGF. For this illustration, we consider
only homoscedastic models. In this data-poor context, we recommend the
use of the Zellner-Siow mixture of \(g\)-prior by specifying
prior="zs" in the ms_slgf function. The
minimal training sample size can be much lower, as this prior is proper.
We inspect the posterior model probabilities of the most probable model
and the additive main effects model:
> bottles_me <- lm(weight~time+heads, data=bottles)
> bottles2 <- data.frame(weight=bottles$weight, time=as.factor(bottles$time),
heads=as.factor(bottles$heads))
> bottles_out <- ms_slgf(dataf=bottles2, response="weight", lgf_beta="heads",
min_levels_beta=1, lgf_Sigma=NA, min_levels_Sigma=NA, same_scheme=FALSE,
usermodels=list("weight~time+group:time", "weight~time+heads"),
het=c(0,0), prior="zs", m0=2)
> bottles_out$models[1:2,c(1,2,4,5)]
Model Scheme.beta Log-Marginal FModProb
5 weight~time+group:time {5}{1,2,3,4,6} -103.168 0.9991932
32 weight~heads+time None -114.726 0.0002158313
The group-based approach overwhelmingly favors the model with a main effect for ‘time’ along with the group-based interaction ‘group:time’ with scheme {5}{1,2,3,4,6}. We also note that the error variance for the main effects model is \(\hat{\sigma}^2_{\text{ME}}=130.1233\), while the estimate for the group-based interaction model is \(\hat{\sigma}^2_{\text{\{5\}\{1,2,3,4,6\}}}=39.76\), suggesting the main effects model seriously overestimates the error variance and thus may lead to misleading inference on regression effects.
The bottles data set from represents fill weights by six machine heads over five time points. The left panel shows the data, with head 5 appearing to be out of calibration. The center panel shows a main effects model, with a realistic fit for heads 1, 2, 3, 4, and 6, but not 5. The right panel shows the most probable group-based interaction (\(P(m|\boldsymbol{Y})> 99.9\%\)) with main effects for time and a group-by-time interaction with scheme {5}{1,2,3,4,6}.
We note that there will be a linear dependency between the
group-by-time interaction and the time main effect for time 5. The
NA values can be seen by inspecting the coefficients of the
corresponding model. These effects are not counted in the dimensionality
of the model when computing \(q^b(\boldsymbol{Y}|m)\).
> bottles_out$coefficients[[5]]
(Intercept) heads2 heads3 heads4
53.24 1.80 4.80 -6.80
heads5 heads6 group{1,2,3,4,6}:time1 group{5}:time1
-8.24 -1.00 14.00 -13.00
group{1,2,3,4,6}:time2 group{5}:time2 group{1,2,3,4,6}:time3 group{5}:time3
-1.40 20.00 -8.20 4.00
group{1,2,3,4,6}:time4 group{5}:time4 group{1,2,3,4,6}:time5 group{5}:time5
27.40 -11.00 NA NA
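The dimension bookkeeping just described can be sketched as follows (in Python for illustration, transcribing the coefficients printed above, with NaN standing in for R's NA):

```python
import math

# Coefficients printed above for the most probable bottles model; the two
# aliased group{...}:time5 effects are NaN (R's NA).
coefs = [53.24, 1.80, 4.80, -6.80, -8.24, -1.00,
         14.00, -13.00, -1.40, 20.00, -8.20, 4.00,
         27.40, -11.00, float("nan"), float("nan")]

# Aliased (NA) effects are excluded from the model dimension used when
# computing the fractional marginal likelihood q^b(Y|m).
dimension = sum(1 for c in coefs if not math.isnan(c))
```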
This manuscript has provided an overview of the slgf package in R, which is available from the Comprehensive R Archive Network. Source code can be found on GitHub at https://github.com/metzger181osu/slgf. The slgf package allows the user to determine whether latent groupings of a categorical predictor's levels provide a better characterization of the response variable compared with ordinary linear models that do not account for the suspected latent groupings. This is accomplished through the suspected latent grouping factor methodology of Metzger and Franck (2021). The methodology allows for formal comparisons between ordinary linear models and latent grouping models, which protects the user from automatically selecting a spurious clustering structure that is not well supported by the data. We illustrate the ability to detect the lack of a grouping structure in the simulation studies of Metzger and Franck (2021).
The slgf package allows the user to (i) explore different grouping schemes for fixed effects and error variances, and (ii) specify entirely separate latent grouping factors for fixed effects and variances. We illustrate (i) in Case study 2: textile data, where the top model shows a different regression line for corn compared to canna and potato, but the error variance for potato is different from that of canna and corn (see Figure @ref(fig:figurechips)). To show (ii), we considered the locknut example of Subsection Case study 3: locknut data, where we considered whether plating (CW, HT, PO) exhibited a fixed effect latent grouping structure, and whether the fixture-by-plating interaction (bolt*CW, bolt*HT, bolt*PO, mandrel*CW, mandrel*HT, mandrel*PO) exhibited a variance latent grouping structure. As described in that subsection, we found no latent grouping structure for the fixed effects of plating, while the torque error variance for bolt*HT and bolt*PO differs from that of the other interaction levels.
The slgf package provides functionality to detect plausible underlying cluster structures among levels of categorical predictors in linear models. This exercise in cluster detection is in some ways similar to considering a finite mixture model. R packages already exist to fit finite mixture models using the EM algorithm, such as mixtools (Benaglia et al. 2009). The flexmix package (Gruen and Leisch 2023) in particular is notable for its ability to fit mixture models to regression data (including Gaussian, binomial, and Poisson models). Additionally, the package MultiLCIRT also considers latent variables for the item response theory setting; see Bartolucci, Bacci, and Gnaldi (2014), who use BIC for model selection rather than fractional Bayes factors.
In contrast to fitting finite mixture models for the purpose of parameter estimation and inference, slgf assesses the plausibility of cluster structures for small to medium-sized data sets via model selection. Additionally, slgf can avoid problems with convergence in the EM algorithm that may arise in small-sample scenarios, particularly when the number of data points is relatively low and the model being fit (e.g., a two component mixture model) is larger than the actual model generating the data (e.g., a one component mixture model with no cluster structure).
Specifically, slgf circumvents convergence issues by considering all possible two-group partitions of the SLGF's levels within the user-specified model classes, obtaining integrated likelihoods and posterior model probabilities for each model, and quantifying the overall probability of a cluster structure as the sum of all posterior probabilities for models with two groups, by the law of total probability. The slgf package thus excels in smaller-data settings where assessing the plausibility of a cluster structure is the core goal, while packages like flexmix excel in cases where the main goal is to fit specified mixture models and conduct inference on parameters.
In addition to the basic slgf demonstration shown in Case study 1: smell data, we illustrate slgf functionality for mixtures of \(g\)-priors (Liang et al. 2008) and a two-way unreplicated layout in Case study 4: bottles data. Mixtures of \(g\)-priors have been shown to work well with fractional Bayes factor methods to reduce the training fraction when sample size is small relative to the number of model parameters (Metzger and Franck 2021).
Finally, although the methodology described here and in Metzger and Franck (2021) exclusively handles two latent groups, we call on any readers with a compelling data set that may exhibit more than two latent groups to contact the authors so that we might explore a generalization of our method to more than two groups.
We have provided an overview of functionality that we hope will enable scientists from diverse fields to access the SLGF methodology of Metzger and Franck (2021) via the slgf package to detect hidden groupings in the levels of categorical predictors that might impact outcomes of interest across a wide range of human endeavors.