Abstract
The bayesdfa package provides a flexible Bayesian modeling framework for applying dynamic factor analysis (DFA) to multivariate time-series data as a dimension reduction tool. The core estimation is done with the Stan probabilistic programming language. In addition to being one of the few Bayesian implementations of DFA, novel features of this model include (1) optionally modeling latent process deviations as drawn from a Student-t distribution to better model extremes, and (2) optionally including autoregressive and moving-average components in the latent trends. Besides estimation, we provide a series of plotting functions to visualize trends, loadings, and model predicted values. A secondary analysis for some applications is to identify regimes in latent trends. We provide a flexible Bayesian implementation of a Hidden Markov Model — also written with Stan — to characterize regime shifts in latent processes. We provide simulation testing and details on parameter sensitivities in supplementary information.A goal of many multivariate statistical techniques is to reduce dimensionality in observed data to identify shared or latent processes. Factor analysis models represent a general class of models used to relate multiple observations to a lower dimension (factors), while also considering different covariance structures of the observed data. Factors are not directly observed, but represent a hidden, shared process among variables. Though goals of factor analysis are sometimes similar to techniques such as principal component analysis (PCA), factor analysis models explicitly estimate residual error terms, whereas PCA does not (T. W. Anderson and Rubin 1956; Jolliffe 1986). These factor models are written as \({ y }_{ i }={ u }_{ i }+\textbf{ Z }{f}_{ i }+{ \varepsilon }_{ i }\), where observed data \({ y }_{ i }\) is a linear combination of an intercept \({ u }_{ i }\) and the product of latent factors \({f}_{ i }\) and loadings \(\textbf{ Z }\) (loadings are sometimes referred to in the literature as \(\textbf{L}\)).
In a time-series setting, factor models may be extended to dynamic factor analysis (DFA) models. DFA models aim to reduce the dimensionality of a collection of time series by estimating a set of shared trends and factors, representing the linear effects of each trend on the observed data (Molenaar 1985; Zuur et al. 2003; Stock and Watson 2005). The number of trends \(m\) is chosen to be less or equal than the number of time series \(n\). The general form of the DFA model can be formulated as a state-space model (Petris 2010). The latent processes (also referred to as ‘trends’) are generally modeled as random walks, so that trend \(i\) is modeled as \({ x }_{ i,t+1 }={ x }_{ i,t }+{ w }_{ i,t }\) where \({ x }_{ i, t }\) is the value of the \(i\)-th latent trend at time \(t\), and the deviations \({ w }_{ i,t }\) are modeled as white noise. Across trends, these deviations are modeled as \(\textbf{w}_{t} \sim \mathrm{MVN}(0, \textbf{Q})\). The latent trends \({ x }_{ i,t }\) are linked to data via a loadings matrix \(\textbf{Z}\) whose values do not evolve through time, \({ y }_{ t }=\textbf{Z}{ x }_{ t } + \textbf{a} + \textbf{B}{ d }_{ t } + \textbf{e}_{t}\). The loadings matrix \(\textbf{Z}\) is dimensioned \(n\: \times\: m\) so that \({Z}_{j,i}\) represents the effect of trend \(i\) on time series \(j\). The parameters \(\textbf{a}\) and \(\textbf{B}\) are optional parameters, representing time-series-specific intercepts and effects of covariates, \({d}_{t}\). Finally, the residual errors are assumed to be \(\textbf{e}_{t} \sim \mathrm{MVN}(0,\textbf{R})\), where \(\textbf{R}\) is an estimated covariance matrix.
Estimation of DFA models is typically done in a maximum likelihood framework, using the expectation-maximization (EM) algorithm or other optimization tools. Implementation of these methods is available in multiple R packages including dlm (Petris 2010), KFAS (Helske 2017), MARSS (Holmes, Ward, and Wills 2012), and tsfa (Gilbert and Meijer 2005). Challenges in parameter estimation and interpretation for DFA models have been well studied. Without constraints, parameters in the DFA model are not identifiable (Harvey 1990; Zuur et al. 2003). To ensure identifiability of variance parameters, for example, the covariance matrix \(\textbf{Q}\) is generally fixed as an identity matrix (Harvey 1990). To avoid confounding the latent trends and loadings matrix \(\textbf{Z}\), elements of \(\textbf{Z}\) must also be constrained. A common choice of constraints is for the elements in the first \(m-1\) rows of \(\textbf{Z}\) to be set to zero if the column index is greater than the row index, \(j > i\) (Harvey 1990), though other constraints have been proposed (Bai and Wang 2015). For a 3-trend DFA model for instance, these constraints would mean that the \(\textbf{Z}\) matrix parameters would be configured as \[\begin{matrix} \begin{bmatrix} { Z }_{ 1,1 } & 0 & 0 \\ { Z }_{ 2,1 } & { Z }_{ 2,2 } & 0 \\ { Z }_{ 3,1 } & { Z }_{ 3,2 } & { Z }_{ 3,3 } \\ ... & ... & ... \end{bmatrix} \end{matrix}.\]
Several previous approaches to DFA estimation in a maximum likelihood framework also center (subtract the sample means) or standardize (subtract the sample means and divide by the sample standard deviations) data prior to fitting DFA models and set the intercepts \(a\) equal to zero to avoid potential confounding of level parameters (Holmes, Ward, and Scheuerell 2012). We adopt a similar approach, allowing users to either center or standardize data before estimation, and not including the intercepts as estimated parameters.
We developed our DFA model in a Bayesian framework, using Stan and the package rstan (Stan Development Team 2016), which implements Markov chain Monte Carlo (MCMC) using the No-U Turn Sampling (NUTS) algorithm (Hoffman and Gelman 2014; Carpenter et al. 2017). Although estimation of the DFA model in a Bayesian setting is not new (Aguilar and West 2000; Koop and Korobilis 2010; Stock and Watson 2011), it presents several interesting challenges over the EM algorithm. In addition to the constraints on \(\textbf{Q}\) and \(\textbf{Z}\), Bayesian estimation suffers from a problem of label switching. In particular, elements of \(\textbf{F}\) or \(\textbf{Z}\) may flip sign within an MCMC chain, or multiple chains may converge on parameters that are identical in magnitude but with different signs.
To minimize issues with label switching, previous work on Bayesian
factor analysis has proposed additional constraints on the loadings
matrix, including setting the elements of \(\textbf{Z}\) to be constrained (-1, 1), or
adding a positive constraint to the diagonal, \({ Z }_{ ii }>0\) (Aguilar and West 2000; Geweke and Zhou 1996).
Though these constraints generally help, there may be situations where
MCMC chains still do not converge. To address this issue, we adopt the
parameter-expanded priors for the loadings and trends proposed by Ghosh and Dunson (2009). To ensure that the sign
of the estimated quantities is the same across MCMC chains, we created
the function flip_trends() to flip the posterior samples of
MCMC chains relative to the first chain as needed.
There are several approaches for modeling extreme deviations in time series models. Techniques include modeling deviations as a two-component mixture (Ward et al. 2007; Evin, Merleau, and Perreault 2011), or modeling deviations with non-Gaussian distributions including the Student-t distribution (Praetz 1972; S. C. Anderson et al. 2017; S. C. Anderson and Ward 2018). There are several existing packages to include Student-t distributions; these include heavy for applications to regression and mixed effects models (Osorio and F. 2018), bsts for univariate time series models (Scott 2018), and stochvol for stochastic volatility models (Kastner 2016). Because switching from a Gaussian to Student-t distribution only introduces a single parameter, \(\nu\), the degrees of freedom, we extend the latter approach to a multivariate setting to model extreme events in the latent trends, so that deviations in the trends are modeled as \({ w }_{ t } \sim \mathrm{MVT}(\nu, 0, \textbf{Q})\). As before, \(\textbf{Q}\) is fixed as an identity matrix \(\textbf{I}\). Our parameterization constrains DFA models to have the same degrees of freedom \(\nu\) in the residuals of the multiple trends, which may be fixed a priori or treated as a free parameter with a gamma(shape = 2, rate = 0.1)\[2,$\infty$\] prior (Juárez and Steel 2010).
The trends of the dynamic factor model are most commonly modeled as
non-stationary random walks, \({ x }_{ i,t+1
}={ x }_{ i,t }+{ w }_{ i,t }\), where the \({ w }_{ i,t } \sim N(0,1)\) are Gaussian
white noise. Like with other vector autoregressive time series models,
this framework can be easily extended to include optional autoregressive
(AR) or moving average (MA) components (Chow et
al. 2011). We allow for AR(1) and MA(1) processes to be specified
with boolean arguments to the fit_dfa() function. For both
the AR(1) and MA(1) components, we assume separate parameters for each
trend. Including the AR(1) component \(\phi_{i}\) makes the trend process become
\({ x }_{ i,t+1 }=\phi_{i}{ x }_{ i,t }+{ w
}_{ i,t }\), where values of \(\phi_{i}\) close to 1 make the trend behave
as a random walk, and small values of \(\phi_{i}\) close to 0 make the trend behave
as white noise. Similarly, we model the MA(1) component as an AR(1)
process on the error terms \({ w }_{ i,t
}\). Instead of being independent at each time step, \({ \theta }_{ i }\) controls the degree of
autocorrelation among deviations, \({ w }_{
i,t }\sim \mathrm{N}\left( { { \theta }_{ i }w }_{ i,t-1 }, 1
\right)\). For stationarity and invertability, we constrain \(\left| \phi _{ i } \right|\) < 1 and
\(\left| \theta _{ i } \right|\) <
1.
Like factor analysis models, there are many solutions from a DFA model capable of producing the same fit to the data. Following previous authors, we use a varimax rotation of the loadings matrix \(\textbf{Z}\) to transform the posterior loadings and trends (Kaiser 1958; Harvey 1990; Holmes, Ward, and Scheuerell 2012). If \(\widehat { \textbf{Z} }\) is the posterior mean of the loadings matrix from a DFA model of 4 time series and 2 trends for example, the rotation matrix \(\textbf{ W }^{ * }=\mathrm{varimax}(\widehat{ \textbf{Z} } )\) is dimensioned \(2\: \times\: 2\). The rotated loadings matrix can then be calculated as \(\widehat{ \textbf{Z} }^{ * } = \widehat{\textbf{Z}}\textbf{ W }^{ * }\) and rotated trends calculated as \(\widehat{ \textbf{x} }^{ * }=\textbf{ W }^{ *-1 }\widehat{ \textbf{x} }\), where \(\widehat{\textbf{x}}\) is the posterior mean of the trends.
Since the number of trends in a DFA model is not a parameter,
comparing data support across models is often necessary. Using model
selection tools to identify data support is available via Akaike’s
Information Criterion (AIC) in packages implementing maximum likelihood
for estimation of state-space models (Petris
2010; Holmes, Ward, and Wills 2012). In addition to comparing the
relative support of different number of trends, model selection for
Bayesian dynamic factor models may be useful for evaluating the error
structure for the residual error covariance matrix \(\textbf{R}\), whether covariates should be
included, whether latent trends are better modeled with a distribution
allowing for extremes (MVT versus MVN), and whether the latent trends
support estimation of AR or MA components. For our Bayesian DFA models,
we extend the loo package
(Vehtari, Gelman, and Gabry 2016a, 2016b)
to generate estimates of LOOIC (Leave-One-Out Information Criterion) for
fitted models. To ease the selection process, bayesdfa includes
the function find_dfa_trends() to run multiple models
specified by the user. It returns a table of LOOIC values (denoting
which of those failed convergence criteria) and the model with the
lowest LOOIC value.
As a diagnostic tool, we include the function
find_swans() to fitted DFA models. We adopt the same
approach and terminology for ‘black-swan events’ as in S. C. Anderson et al. (2017), where black-swan
events are rare and unexpected extremes. Our find_swans()
function first-differences the posterior mean estimates of each DFA
trend and evaluates the probability of observing a difference that is
more extreme than expected under a normal distribution with the same
scale parameter. Events beyond a user-defined threshold (e.g. 1 in 100,
or 1 in 10,000) are then classified as outliers and plotted.
To evaluate the ability of the Bayesian DFA model to identify
anomalies in latent processes, we created simulated data using our
sim_dfa() function. We generated simulated multivariate
time series (\(n = 4\) time series with
\(T = 20\) time steps each) with \(m = 2\) underlying latent trends. Extremes
were included as a step-change in the midpoint of the first trend in
each simulated dataset. We varied the value of the step from -4 to -8,
which represent unlikely events under the assumption that temporal
deviations in the latent trends are distributed according to \(N(0,1)\). Because increased observation
error may corrupt inference about anomalies in the trends, we considered
three levels of observation error \((\sigma =
0.25, 0.75, 1.25)\). We generated 200 simulated samples for each
permutation of parameters, resulting in a total of 3000 datasets.
We fit the Bayesian DFA model with Student-t errors to each simulated dataset. As expected, the posterior estimates from these simulations illustrate that the ability to estimate low degrees of freedom is related to the magnitude of extremes (Figure 1). Similarly, higher observation error corrupts the ability to estimate extreme events, even when they are large in magnitude (Figure 1).
An alternative approach to DFA for dimension reduction of multivariate time series data are Hidden Markov Models (HMMs). Like DFA models, they model a latent process for a time series (or collection of multivariate time series). Instead of the latent process being modeled continuously (e.g. as a random walk in DFA), HMMs conceive the latent process as a series of discrete-time, discrete-state first-order Markov chains \(s_t \in \left\{ 1, \dots, G \right\}\) with the number of possible states \(G\) specified a priori. State transition is characterized by the \(G\: \times\: G\) transition matrix with simplex rows \(\mathbf{A} = \left\{ a_{ig} \right\}\) where \(a_{ig} = p({ s }_{ t } = g | { s }_{ t-1 } = i)\) represents the probability of transitioning from state \(i\) to \(g\). Useful quantities from HMMs include the transition probabilities between latent states, and the probability of being in a given lantent state at each point in time (Zucchini, MacDonald, and Langrock 2017).
HMMs can be applied to raw multivariate data to identify latent
states; however, they may also be linked with DFA to identify regimes
and transitions in the latent DFA trends. Similar to DFA, applications
of HMMs are widely available in R, including via the packages depmixS4
(Visser and Speekenbrink 2010), HMM (Himmelmann 2010), and msm (Jackson 2011). Consistent with our
implementation of the Bayesian DFA model, we include fully Bayesian
inference in Stan based on (Damiano, Peterson,
and Weylandt 2018). We apply independent HMM models to each DFA
trend to identify alternate states or regimes. Like with the estimation
of DFA models, we use the LOOIC metric to evaluate the relative support
for HMMs with different numbers of underlying states, selecting the
converged model with the lowest LOOIC. By default, we assume the
observation model of the input time series to be normally distributed
with the scale parameter equal to the estimated residual variance.
However, for some applications, such as datasets with changing sampling
frequencies over time, uncertainty in DFA trends may also vary through
time. To propogate this uncertainty forward, we also allow the residual
variance to be entered as a known quantity for every data point in our
find_regimes() function.
To illustrate an example application of the bayesdfa package to real data, we use monthly anomalies of sea surface temperature (SST, measured in \(C^\circ\)). SST is observed from satellite and buoy data at fixed locations, and model-based interpolations are used to generate estimates at additional gridded locations1. We used estimates generated at the locations of 4 observing stations used by the Pacific Fisheries Environmental Laboratory2 from the west coast of North America (USA). The four stations have some degree of correlation with one another, and are separated by approximately 6 degrees of latitude from one another. In summary, we work with \(n = 4\) monthly time series with \(T = 167\) observations each (from 2003–01 to 2016–05) and no missing values.
Initially, we fit a DFA model with 2 hidden trends, and will assume
the 4 time series to have the same error variances \(\textbf{R}\). We will fit the DFA model
with possible extremes, modeling process error with a Student-t
distribution by using the argument estimate_nu(). To
evaluate whether these data support an extreme DFA with trends modeled
as a t-distribution, we will fit two competing forms: one modeling the
random walks with a Gaussian distribution, and the other using a
Student-t distribution. Generating posterior samples for each model
takes approximately 7 minutes per chain, when MCMC chains aren’t run in
parallel.
After fitting the models, we confirm whether the MCMC chains are
consistent with convergence using a threshold value of \(\hat { R }=1.05\) (Gelman et al. 2014) using our
is_converged() function. We also visually inspect chain
traceplots (e.g. Figure 3) and check the minimum effective sample size
across parameters: NaN.
As a consistency diagnostic, we also retrieve the estimated degrees of freedom from the Student-t model \(\nu\). By visual inspection, Figure 4 shows that the posterior distribution on \(\nu\) is lower than the prior distribution.
We will focus the remaining portion of our analysis on the results from the DFA model with Student-t deviations. In Figure 5, we observe that Trend 1 and Trend 2 both support SST anomalies increasing over the latter half of the time series. Both trends appear to have reversed direction (reverting to the mean in the last 2–3 years) and this pattern is more evident in Trend 1. Because we do not model seasonality explicitly, for example by including a covariate effect for the month, each of the estimated trends also includes the within-year variability that describes seasonal patterns in observed sea surface temperature.
stats::varimax() rotation. In the violin plot of Figure 6, we note that more southern stations (24 and 30N) contribute largely to Trend 1, while the more northern stations appear to load more heavily on Trend 2.
stats::varimax() rotation. For each trend, we apply independent HMMs to examine the support for differing numbers of underlying regimes. Both the posterior mean and standard deviation (optional argument) will be the inputs to the HMM.
Using LOOIC as a metric of support for the number of regimes, the estimates reported in Table 1 support the inclusion of 2 regimes for both Trends 1 and 2.
| Regimes | LOOIC Trend 1 | LOOIC Trend 2 |
|---|---|---|
| 1 | 855.5 | 756.7 |
| 2 | 31.0 | 30.3 |
| 3 | 69.1 | 99.9 |
| 4 | 139.7 | 164.6 |
Our fit_regimes() function computes the probability of
each time point being in one of the regime states, which may also be
visualized using plot_regime_model(). For example, the
output of the 2-regime model for Trend 1 in Figure 7 suggests a change
in the middle of the time series, then changing back again to State 1.
Similarly, by the end of the series, the HMM assigns Trend 1 to being in
State 1.
There are a number of extensions to our implementation of the Bayesian DFA model with extremes that could make the model more applicable to a wider range of problems. Examples for the process model include adopting a skew-t distribution for asymmetric extremes. For models estimating multiple trends, multiple parameters may be treated hierarchically (e.g. covariate effects, variance parameters). For the observation or data model, our implementation of the Bayesian DFA model only includes data arising from a Gaussian or Student-t distribution, though this could be extended to include discrete or other continuous densities. Finally, spatial dynamic factor models (sDFA) have emerged as a useful tool for complicated multivariate spatial datasets (Lopes, Gamerman, and Salazar 2011; Thorson et al. 2015), and could be similarly implemented in Stan.
This paper presents the bayesdfa package for applying Bayesian DFA to multivariate time series as a dimension reduction tool, particularly if extreme events may be present in observed data. In addition to allowing for the inclusion of covariates, we also extend the conventional dynamic factor model to include optionial moving average and autoregressive components in the latent trends. Applying this package to a dataset of sea surface temperature from the Northeast Pacific Ocean, we fit DFA models with Gaussian and Student-t errors. Though the model with Student-t errors has slightly lower LOOIC, the results from the two models are similar. Output from these 2-trend DFA models of sea surface temperature are useful in demonstrating a north-to-south gradient in temperature anomalies (Figure 6). Standardized temperature data from southern stations experience more interannual variability and temperatures that are greater in magnitude compared to northern stations (Figure 5). We also illustrate how latent trends from DFA models can be analyzed in a HMM framework to identify regimes and transitions; applied to the sea surface temperature data, both Trend 1 and Trend 2 support 2-regime models (roughly interpreted as ‘warm’ and ‘cool’ regimes; Figure 7).
This work was funded by NOAA’s Fisheries and the Environment (FATE) Program. Development of this package benefitted from discussions with other members of our working group (Jin Gao, Chris Harvey, Sam McClatchie, Stepahni Zador) and scientists at the Northwest Fisheries Science Center (including Mark Scheuerell, James Thorson, Eli Holmes, and Kelly Andrews). 2 anonymous reviewers helped improve the clarity and plots of this paper.