Abstract
Linear models with fixed effects and many dummy variables are common in some fields. Such models are straightforward to estimate unless the factors have too many levels. The R package lfe solves this problem by implementing a generalization of the within transformation to multiple factors, tailored for large problems.

A typical linear model looks like:
> y ~ x1+x2+x3 + f1+f2+f3
where f1, f2, f3 are arbitrary factors, and
x1, x2, x3 are other covariates. Coefficients may easily be
estimated by lm():
> lm(y ~ x1+x2+x3 + f1+f2+f3)
However, in some applications, in particular in econometrics, the
factors have too many levels and are still needed as fixed effects, as
in Abowd, Kramarz, and Margolis (1999).
The use of fixed effects as opposed to random effects is typically
necessitated by the possibility that the factors and the other
regressors are correlated. Abowd, Kramarz, and Margolis (1999) study log wage (logwage) as an outcome, where fixed effects for individuals (id) and firms (firm) are included, along with some ordinary covariates (exemplified by x), i.e. a model of the type:
> logwage ~ x + id + firm
where id is a factor with one level for each employee,
and firm is a factor with one level for each firm.
Employees change firm now and then, so it is not a nested model. The model is used to account for arbitrarily distributed, time-constant individual and firm heterogeneity, and to study the correlation between the firm effect and the employee effect. There are also similar cases with 3 factors, e.g. in Torres, Portugal, Addison, and Guimarães (2013), where job title is also considered. They have a dataset of 27 million observations, with 5.5 million, 568,000, and 96,000 levels in the three factors. In other applications the factors are primarily used as controls, as in Markussen and Røed (2012), where a variety of models are used, controlling for
various combinations of interactions between factors. These datasets are
typically sourced from large public registries, and can contain tens of
millions of observations, with hundreds of thousands or millions of
factor levels. This far exceeds the capabilities of lm(),
and can also be too demanding for the sparse methods in package Matrix (Bates and Maechler 2013).
When estimating such models, the case with a single factor is special. It can be estimated by using the within-groups transformation, in which the group means are subtracted from the covariates, resulting in a system without the factor, via the Frisch-Waugh-Lovell theorem; see e.g. Wooldridge (2002, sec. 10.5). The coefficients for the factor levels can easily be recovered as the group means of the residuals. This estimation method can be found in package plm (Croissant and Millo 2008). The method can also be used with more than one factor, by using the within transformation to project out the factor with the highest number of levels, coding the others as dummy variables. However, if all of the factors have many levels, this can still result in a system which is both too large and non-sparse. Moreover, having such sets of dummies in datasets which are not balanced may lead to non-trivial identification problems.

We will illustrate how to use the package lfe (Gaure 2013a) to solve some of these problems. For clarity we construct quite small datasets by drawing random covariates, rather than refer to large real datasets which are not publicly available.
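To fix ideas about the within transformation itself, here is a minimal sketch of ours with synthetic data (the variables are purely illustrative), verifying that group centring reproduces the coefficient from a dummy-variable regression, and that the group means of the residuals recover the factor coefficients:

> set.seed(1)
> f <- factor(sample(5, 100, replace=TRUE))
> x <- rnorm(100)
> y <- 0.7*x + (1:5)[f] + rnorm(100)
> # centre y and x on their group means
> yw <- y - ave(y, f); xw <- x - ave(x, f)
> beta <- coef(lm(yw ~ xw - 1))  # equals coef(lm(y ~ x + f))['x']
> # the factor coefficients are the group means of the residuals
> alpha <- tapply(y - beta*x, f, mean)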
The package lfe is designed to handle the above estimation
problem. It also contains some methods for solving identification
problems, though not completely in all cases. Since lfe handles
quite ordinary linear models which conceptually could be handled by
lm(), we do not go into detail about the feasibility of
fixed effect linear models as such, only the problems which are more or
less peculiar to models with a large number of dummies.
The short story is that coefficients for x1,
x2, and x3 in the above model can be estimated
by:
> library(lfe)
> est <- felm(y ~ x1+x2+x3 + G(f1)+G(f2)+G(f3))
whereas the coefficients for the factor levels of f1,
f2, and f3 can be retrieved by:
> alpha <- getfe(est)
A longer story follows, the theoretical part of which is mainly based on Gaure (2013b).
We write our model in matrix form
\[ y = X\beta + D\alpha + \epsilon, (\#eq:fullsystem) \]
where \(D\) is a matrix of dummies coding the factor levels, \(X\) is a matrix with the other covariates, \(y\) is the response vector, and \(\epsilon\) is a normally distributed error term. Performing an ordinary least squares regression (OLS) on this system yields the OLS estimates \(\hat\beta\) for the \(X\) covariates and \(\hat\alpha\) for the factor levels. The Frisch-Waugh-Lovell theorem states that if \(P\) is the projection onto the orthogonal complement of the range of \(D\), then the projected system \[Py = PX\beta + P\epsilon,\] yields the same \(\hat\beta\) when estimated with OLS. Moreover, the matrix \((X^tPX)^{-1}\) which is used to find the covariance matrix of \(\hat\beta\), is identical to the \(\hat\beta\)-part of the corresponding matrix in the full system @ref(eq:fullsystem). Similarly, the residuals are identical. The projected system does not contain the dummies for the factor levels, and may therefore be manageable with conventional methods.
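For a small system, the theorem is easy to check numerically. The following sketch (made-up data of ours, not part of the package) forms \(P\) explicitly and compares the projected regression with the full one:

> set.seed(2)
> f <- factor(sample(4, 50, replace=TRUE))
> x <- rnorm(50)
> y <- x + rnorm(50)
> D <- model.matrix(~ f - 1)  # dummy matrix for the factor levels
> P <- diag(50) - D %*% solve(crossprod(D)) %*% t(D)
> # the two estimates of beta coincide up to rounding
> coef(lm(y ~ x + f))['x']
> coef(lm(P %*% y ~ P %*% x - 1))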
Unfortunately, \(P\) is an \(n \times n\) matrix, where \(n\) is the number of observations, so it is
impractical to compute \(P\) when \(n\approx 10^7\). However, it is easier to
compute \(Px\) for a vector \(x\), i.e. \(y\) and the columns of \(X\). In the special case when \(D\) encodes a single factor, the
transformation \(x \mapsto Px\) is the
within transformation, i.e. centring of the groups on their means. It is
shown in Gaure (2013b) that when there is
more than one factor, say \(e > 1\)
factors, we may consider the centring transformation for each of them, a
projection \(P_i\) for each \(i = 1\ldots e\), and compute \(Px\) as \[Px =
\lim_{m\to\infty} \left(\left(P_1 P_2 \cdots P_e\right)^m
x\right).\] This approach is known as the method of
alternating projections and is based on a result by Halperin (1962, Theorem 1). The procedure can
easily be implemented with an R function taking as input a vector
x and the list of factors flist:
> demean <- function(x, flist) {
+   cx <- x; oldx <- x - 1
+   while(sqrt(sum((cx - oldx) ^ 2)) >= 1e-8) {
+     oldx <- cx
+     for(f in flist) cx <- cx - ave(cx, f)
+   }
+   return(cx)
+ }
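As a quick sanity check of our own (with made-up data), the result of demean() should agree, up to the convergence tolerance, with the residuals from regressing on the factors alone:

> set.seed(3)
> g1 <- factor(sample(8, 200, replace=TRUE))
> g2 <- factor(sample(8, 200, replace=TRUE))
> x <- rnorm(200)
> all.equal(demean(x, list(g1, g2)),
+           unname(resid(lm(x ~ g1 + g2))), tolerance=1e-6)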
This algorithm was also arrived at by Guimarães and Portugal (2010, p. 637) as a technical simplification of their iterated estimation approach to the same problem.
For efficiency reasons, this linear transformation has been written
in C, made threaded to centre vectors in parallel, and is available as
the function demeanlist() in lfe, though the
function felm() wraps it in an lm() like
function.
We create a simple example to illustrate the usage. We have 100,000
observations of a covariate x, and two factors
f1, f2, each with 10,000 randomly drawn
levels. We create an outcome variable y and estimate the
x coefficient. The G() syntax is used to specify which factors should be projected out of the system, a syntax similar to that of the Error() term in aov(). G() is not an R function in itself; it translates to as.factor() inside felm() after the G() terms have been removed from the model for special handling.
> library(lfe)
> set.seed(42)
> x <- rnorm(100000)
> f1 <- sample(10000, length(x), replace=TRUE)
> f2 <- sample(10000, length(x), replace=TRUE)
> y <- 2.13*x + cos(f1) + log(f2+1) + rnorm(length(x), sd=0.5)
> est <- felm(y ~ x + G(f1) + G(f2))
> summary(est)
Call:
felm(formula = y ~ x + G(f1) + G(f2))
Residuals:
       Min         1Q     Median         3Q        Max 
-1.9531308 -0.3018539 -0.0003573  0.3007738  2.2052754 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
x 2.130889   0.001768    1205   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5013 on 80000 degrees of freedom
Multiple R-squared: 0.9683   Adjusted R-squared: 0.9603
F-statistic: 122.1 on 19999 and 80000 DF, p-value: < 2.2e-16
The result of felm() is a ‘felm’ object
which is quite similar to an ‘lm’ object. However, it is
not fully compatible with it, so some methods for ‘lm’
objects may work on a ‘felm’ object, others may not. In an
earlier version, the ‘felm’ object inherited from the class
‘lm’, but there are important differences in the structure,
so this inheritance has been removed. In particular there is no
qr-decomposition in the ‘felm’ object, so methods for
‘lm’ objects which depend on the qr-decomposition, such as
anova(), cannot be used. Also, the
‘felm’-object does not contain a copy of the dataset. This
has been removed to conserve memory for very large datasets. Simple
extractors like coef() and residuals() work.
Extracting the covariance matrix with vcov() also works,
via a separate S3 method, albeit only for the \(\hat\beta\)s. Also, a
summary() S3-method, and an accompanying
print() method have been included. It is possible to try
‘lm’-methods explicitly as in:
getS3method('vcov', 'lm')(est), though the author does not
guarantee any success with this approach.
The careful reader has noticed that the behaviour of
summary() on a ‘felm’ object with respect to
degrees of freedom and \(R^2\) is the
same as that on an ‘lm’ object which includes an
intercept. There is no explicit intercept in the result of
felm(), but the factor structure includes one
implicitly.
If the factors are used as controls only, the above procedure is all we have to worry about, but if we also need the group coefficients \(\hat\alpha\), what econometricians often refer to as the fixed effects, we can solve the equation
\[ D\hat\alpha = (I-P)(y-X\hat\beta), (\#eq:kaczmarz) \]
for \(\hat\alpha\). The right hand
side is easily computed when we have \(\hat\beta\), \(PX\) and \(Py\). The equation is solved by the
Kaczmarz method (Kaczmarz 1937), as
described in Gaure (2013b). The method is
available as a function kaczmarz(), but the wrapper
getfe() is more useful. A solution of equation @ref(eq:kaczmarz) is not unique: the matrix \(D\) dummy-encodes all the factor levels, hence there will be multicollinearities (unless there is only one factor), both obvious ones and possibly spurious ones.
The multicollinearities are resolved by applying an estimable function.
lfe contains a function efactory() for creating
estimable functions, but users can also supply their own.
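To illustrate the idea of the Kaczmarz method, here is a minimal sketch of ours (not the package’s optimized C implementation): each step projects the current iterate onto the solution hyperplane of a single equation, sweeping repeatedly over all equations:

> kaczmarz_demo <- function(A, b, sweeps=100) {
+   v <- numeric(ncol(A))  # zero start yields the least-norm solution
+   for(s in seq_len(sweeps))
+     for(i in seq_len(nrow(A))) {
+       a <- A[i,]
+       # project v onto the hyperplane <a,v> = b[i] (rows must be nonzero)
+       v <- v + (b[i] - sum(a*v)) / sum(a*a) * a
+     }
+   v
+ }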
getfe() returns a ‘data.frame’ with some
information in addition to the coefficients. The reason for this is that the intended use of lfe is for factors with so many levels that they must probably be analyzed further, according to the researcher’s needs, before being presented. We continue the example:
> alpha <- getfe(est)
> nrow(alpha)
[1] 20000
> alpha[9998:10003,]
             effect obs comp fe   idx
f1.9998  -0.2431720   9    1 f1  9998
f1.9999  -0.9733257   5    1 f1  9999
f1.10000 -0.8456289   9    1 f1 10000
f2.1      0.4800013   9    1 f2     1
f2.2      1.4868744  14    1 f2     2
f2.3      1.5002583  11    1 f2     3
The ‘obs’ column is the number of observations of this level. The ‘fe’ and ‘idx’ columns are factors containing the same information as the row name, but split into the name of the factor and the level for convenience. The ‘comp’ column is important in that it is used for identification purposes, and here is a main deviation from how lm() treats identification in the presence of factors. The default method when using lm() is taken from the global ‘contrasts’ option, which defaults to using treatment contrasts for each factor. This introduces a reference level for each factor, which is equivalent to forcing the coefficient for the reference level to zero; the other coefficients are only meaningful when compared to this zero coefficient. It is also possible to specify different constraints for the lm() coefficients, such as forcing the sum of the coefficients to zero, or an arbitrary one, via the ‘contrasts’ option or the ‘contrasts’ argument to lm(), but typically a single constraint is used for each factor.
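As a reminder of this behaviour, a toy example of ours:

> f <- factor(c('a','a','b','b','c','c'))
> y <- c(1, 1.1, 2, 2.1, 3, 3.2)
> # treatment contrasts: level 'a' is the reference with implicit
> # coefficient zero; fb and fc are contrasts to 'a'
> coef(lm(y ~ f))  # (Intercept)=1.05, fb=1.00, fc=2.05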
Identification when the factors have many levels can be a more
complicated matter. The standard example in the econometrics literature
is the one found in Abowd, Kramarz, and Margolis
(1999), elaborated in Abowd, Creecy, and
Kramarz (2002). In this case there are two factors, one for
employees and one for firms. It may happen that one set of employees
move between one set of firms, whereas another disjoint set of employees
move between some other firms. There are no movements between these
mobility groups, hence coefficients from different groups can not be
compared. A useful construction for analysis of this problem is the
undirected, bipartite graph which has the factor levels as vertices, and
an edge between levels that occur in the same observation. The mobility
groups are then the connected components of this graph. It is shown in
Abowd, Creecy, and Kramarz (2002, Appendix
1) that for identification of the coefficients, it is sufficient
to introduce a single reference level in each of the disjoint mobility
groups, either a firm or an employee. This was previously established by
Eccleston and Hedayat (1974) in a
different context, and there is another argument in Gaure (2013b) using spectral graph theory. The
’comp’ column in the return value from getfe()
when using estimable functions from efactory() is a factor
enumerating these groups, i.e. the connected components.
efactory() chooses by default the level with the highest
number of observations as a reference level, and sets the coefficient to
0. Coefficients should never be compared across components; a coefficient is only meaningful relative to the reference level in the component it belongs to. For this reason, one may
choose to restrict attention to the largest component only, the one with
comp == 1, if this is feasible for the problem at hand.
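The component structure can be inspected directly with compfactor(). Here is a small constructed example of ours with two disjoint mobility groups:

> set.seed(4)
> # workers 1:5 move among firms 1:3, workers 6:10 among firms 4:6,
> # with no movement between the two groups
> id <- factor(c(sample(1:5, 50, replace=TRUE), sample(6:10, 50, replace=TRUE)))
> firm <- factor(c(sample(1:3, 50, replace=TRUE), sample(4:6, 50, replace=TRUE)))
> table(compfactor(list(id, firm)))  # two components of 50 observations each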
To the author’s knowledge, the identification problem when there are
more than two factors has not been solved in general.
efactory()’s behaviour with more than two factors is to
assume that the connected components of the first two factors are
sufficient for identification, and a single reference is used in each of
the remaining factors. lfe contains a probabilistic test for
estimability in this case, the one described in Gaure (2013b, Remark 6.2), and
getfe() will issue a warning if it finds an identification
problem. The test is also available as the function
is.estimable(). In theory, the test can fail only on a set
of measure zero. Specifically, the test uses the fact that an estimable
function must evaluate to the same value on all solutions of
equation @ref(eq:kaczmarz). Different solutions can be obtained by
running the Kaczmarz algorithm with different initial vectors. By
starting with the zero vector we obtain the solution with least
Euclidean norm. The test works by comparing the value of the candidate
estimable function on the least norm solution and a solution obtained by
drawing the initial vector \(\eta\) at
random. The test will only fail to find an identification problem if
\(\eta\) happens to lie in a
particular, but unknown, subspace of positive codimension, i.e. in a set
of Lebesgue measure zero. However, due to numerical inaccuracies, finite
arithmetic, and bad luck, the test may in practice fail to find an
identification problem even if there is one, with some very small
positive probability. It does not, however, report identification problems
when there are none. If there are identification problems which are
unaccounted for, then some, or all of the resulting coefficients are
non-estimable, i.e. they are meaningless.
Sometimes, an identification problem can be alleviated by changing
the order of the G() terms in the formula supplied to
felm(). To illustrate, consider a situation where we add a
third factor to the example in the Introduction, nkids, the
number of an individual’s offspring, with only 5 levels. If our formula
contains G(firm) + G(nkids) + G(id), it is likely that the graph associated with the first two factors, firm and nkids, is connected, and there can be many mobility groups which are unaccounted for. getfe() issues a warning, and the returned coefficients are non-estimable; there is no correct indication of which mobility groups the coefficients belong to, and therefore we do not know which coefficients can be meaningfully compared. If we
change the model specification to
G(id) + G(firm) + G(nkids), or
G(id) + G(firm) + nkids, the mobility groups are found, and
accounted for.
Another approach to the identification problem is found in Carneiro, Guimarães, and Portugal (2012). They
restrict attention to a subset of the dataset in which, by construction,
all elementary contrasts, i.e. differences between coefficients, are
estimable, as described by Weeks and Williams (1964). Such a partitioning of the dataset can be found
by constructing a graph with the observations as vertices, and with an
edge between two observations if they differ in at most a single factor.
The partitions of the dataset correspond to connected components of the
graph. They can be found by the function
compfactor(..., WW=TRUE), which returns a factor
enumerating the partitions. It can be used to select a subset of the
dataset prior to calling felm(). When there are only two
factors, this partitioning structure coincides with the mobility groups
discussed above. In some datasets, like the one in Torres, Portugal, Addison, and Guimarães (2013), the largest partition comprises almost the entire
dataset, but if it does not, such a selection may introduce bias in the
coefficients. An extreme example:
> set.seed(42)
> f1 <- factor(sample(50, 1000, replace=TRUE))
> f2 <- factor(sample(50, 1000, replace=TRUE))
> f3 <- factor(sample(50, 1000, replace=TRUE))
> ww <- compfactor(list(f1,f2,f3), WW=TRUE)
> head(table(ww))
ww
 1  2  3  4  5  6 
29 20 19 16 14 14 
The largest partition has 29 observations, fewer than the number of
levels, even though far more estimable functions exist. This can be seen
by computing the rank deficiency of the matrix \(D\) with the function
rankMatrix() from package Matrix. Indeed, two references in the 3 factors suffice:
> D <- t(do.call('rBind', lapply(list(f1,f2,f3), as, 'sparseMatrix')))
> ncol(D) - as.integer(rankMatrix(D))
[1] 2
The standard method of efactory(), to analyze the
connected components of the first two factors only, works well in this
particular case. It will find a reference among the first two factors,
and a reference in the last, and differences between coefficients within
each factor will be estimable:
> x <- rnorm(1000)
> y <- 3.14*x + log(1:50)[f1] + cos(1:50)[f2] + exp(sqrt(1:50))[f3] + rnorm(1000, sd=0.5)
> est <- felm(y ~ x + G(f1) + G(f2) + G(f3))
> coef(est)
x
3.139781
> is.estimable(efactory(est), est$fe)
[1] TRUE
This can happen because the construction in Weeks and D. R. Williams (1964, Theorem 1) does not find a maximal subset, only a subset which is guaranteed to have the desired property.
Yet another algorithm for finding estimable functions is described by Godolphin and Godolphin (2001), but lfe does not implement it.
The function efactory() in lfe is by default
called by getfe() to create an estimable function for a
factor structure. The default is to use reference levels as described
above, some other estimable functions are also available. However,
researchers may have other needs, so it is possible to supply a user
written estimable function to getfe(). We describe this
interface with an example, for a small dataset. The function takes as an
argument a vector whose length is the sum of the numbers of levels of the
factors, i.e. a solution to equation @ref(eq:kaczmarz), applies an
estimable function of choice, and returns the result. It is not
necessary to include a full set of estimable functions; if so desired,
our function may return just a single difference between two specific
factor levels, or even some nonlinear transformation thereof. It should,
however, evaluate to the same value on all the different solutions of
equation @ref(eq:kaczmarz).
We make a small dataset with 100 observations, a covariate
x and 3 factors with a handful of levels.
> set.seed(42)
> x <- rnorm(100)
> f1 <- factor(sample(4, 100, replace=TRUE))
> f2 <- factor(sample(5, 100, replace=TRUE))
> f3 <- factor(sample(6, 100, replace=TRUE))
> e1 <- sin(1:4)[f1] + 0.02*((1:5)^2)[f2] + 0.17*((1:6)^3)[f3] + rnorm(100)
> y <- 2.5*x + (e1-mean(e1))
> est <- felm(y ~ x + G(f1) + G(f2) + G(f3))
Then we create an estimable function which is the same as the one
provided by lm() when using treatment contrasts. The
addnames argument is there because in the event that the
estimable function is used in bootstrapping, adding names to very long
vectors would only contribute to exercising R’s memory management; hence
names should only be added when required. It is also possible to add an attribute called ‘extra’, which is a named list of vectors of the same length as the names, for adding additional information to the result of getfe(). This attribute is added by the estimable functions returned by efactory() to provide the ‘obs’, ‘fe’, ‘comp’ and ‘idx’ columns, but the content is not used for anything inside lfe.
> ef <- function(gamma, addnames) {
+   ref1 <- gamma[1]   # first level of f1
+   ref2 <- gamma[5]   # first level of f2
+   ref3 <- gamma[10]  # first level of f3
+   # put the intercept in the first coordinate
+   icpt <- ref1 + ref2 + ref3
+   # subtract the references for each factor;
+   # unlike the efactory() functions, we omit the zero coefficients
+   result <- c(icpt, gamma[2:4]-ref1, gamma[6:9]-ref2, gamma[11:15]-ref3)
+   if(addnames) {
+     names(result) <- c('(Intercept)',
+                        paste('f1', levels(f1)[2:4], sep=''),
+                        paste('f2', levels(f2)[2:5], sep=''),
+                        paste('f3', levels(f3)[2:6], sep=''))
+     attr(result, 'extra') <- list(fe=factor(c('icpt', rep('f1',3),
+                                               rep('f2',4), rep('f3',5))),
+                                   idx=factor(c(1, 2:4, 2:5, 2:6)))
+   }
+   result
+ }
We now check that our function is estimable:
> is.estimable(ef, list(f1,f2,f3))
[1] TRUE
The estimable function is supplied to getfe() via the
argument ef. In this example we also request bootstrapped
standard errors with the arguments se and bN,
we elaborate on this in the next section.
> getfe(est, ef=ef, se=TRUE, bN=1000)
                 effect   fe idx        se
(Intercept) -10.9016327 icpt   1 0.3077378
f12          -0.1265879   f1   2 0.2356162
f13          -0.7541019   f1   3 0.2896058
f14          -1.7409436   f1   4 0.2776542
f22           0.4611797   f2   2 0.3012931
f23           0.6852553   f2   3 0.2898361
f24           0.8467309   f2   4 0.3232411
f25           0.5886517   f2   5 0.2841049
f32           1.0898551   f3   2 0.3364884
f33           4.3490898   f3   3 0.3058420
f34          10.7505266   f3   4 0.3377505
f35          21.3832700   f3   5 0.3649107
f36          36.7369397   f3   6 0.3059049
getfe() performs no automatic addition of columns like ‘idx’ or ‘comp’; they have to be provided by the estimable function, the reason being that getfe() does not know the structure imposed by the user-written function. If bootstrapped standard errors are requested via the se argument, the ‘se’ column is added also for user-written functions. If you are certain that your function is estimable, you can add an attribute to it, attr(ef, 'verified') <- TRUE; this causes getfe() to skip the estimability test, and may save significant amounts of time with datasets for which convergence is slow. The functions produced by efactory() for one and two factors have this attribute set.
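For instance, with the ef() from the example above, we could write:

> attr(ef, 'verified') <- TRUE  # we vouch for estimability ourselves
> alpha <- getfe(est, ef=ef)    # the estimability test is now skipped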
The standard errors for \(\hat\beta\) are easily computed, and are identical to the standard errors from lm() if we were to run it with all the dummies. However, there is a minor difference. The degrees of freedom depend on the rank of the matrix \(D\), or \(D^t D\), or the trace of \(P\). The rank is easily computed for the cases with one or two factors, but requires a much more time-consuming approach for 3 or more factors. The default approach of felm() is to assume that the column rank deficiency of \(D\) is \(e-1\) when there are \(e > 2\) factors. This may result in too high a value for the rank, hence too low a value for the degrees of freedom, which will yield too high standard errors. In most real cases the author has witnessed, the
error is negligible. felm() has an argument
exactDOF which may be set to TRUE to activate
a more accurate computation of the rank. It does a sparse pivoted
Cholesky factorization of \(D^t D +
\epsilon I\) for some small \(\epsilon\) and finds the rank deficiency by
counting small pivots. This is not a perfect method, as noted by Higham (1990), but it seems to work well for the
matrices the author has encountered, and it is much faster than
rankMatrix() from package Matrix. To use rankMatrix() instead, one may specify exactDOF='rM'. If, for some reason, one happens to know the degrees of freedom in advance, they can be specified as a numeric, like exactDOF=12536819.
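For example, with the three-factor model from the example above:

> # exact rank via the sparse pivoted Cholesky factorization
> est <- felm(y ~ x + G(f1) + G(f2) + G(f3), exactDOF=TRUE)
> # or via rankMatrix() from package Matrix (slower)
> est <- felm(y ~ x + G(f1) + G(f2) + G(f3), exactDOF='rM')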
The standard errors for \(\hat\alpha\), the coefficients for the factor levels, can be estimated by bootstrapping. This requires resampling the residuals with replacement and doing the whole estimation over and over again, an operation which can be very time-consuming. It can be requested, as in the above example, by the se argument to getfe(); the number of samples can be specified in the bN argument. The common practice of resampling observations rather than residuals is not useful for this kind of model: it results in a different factor structure, hence possibly a different set of identified coefficients, in each sample.
lfe supports instrumental variables estimation through 2SLS, i.e. two-stage least squares (Wooldridge 2002, chap. 5).
Here is an example. We first create some covariates:
> set.seed(276709)
> x <- rnorm(10000)
> x2 <- rnorm(length(x))
> x3 <- rnorm(length(x))
Then we create some factors, and their effects:
> id <- factor(sample(2000, length(x), replace=TRUE))
> firm <- factor(sample(1300, length(x), replace=TRUE))
> id.eff <- rnorm(nlevels(id))
> firm.eff <- rnorm(nlevels(firm))
and a normally distributed error term u, and an outcome
y:
> u <- rnorm(length(x))
> y <- x + 0.5*x2 + id.eff[id] + firm.eff[firm] + u
We create a covariate Q which is correlated with the
other covariates, the instrument x3, and the error term,
and add it to the outcome:
> Q <- 0.3*x3 + x + 0.2*x2 + 0.5*id.eff[id] + 0.7*u + rnorm(length(x), sd=0.3)
> y <- y + 0.9*Q
Estimation is now carried out by specifying the instrumented variable equation with the iv argument of felm(). Only the instrument, in this case x3, is needed; the other covariates are added by felm().
> ivest <- felm(y ~ x + x2 + G(id) + G(firm) + Q, iv=Q ~ x3)
> summary(ivest)
Call:
felm(formula = y ~ x + x2 + G(id) + G(firm) + Q, iv = Q ~ x3)
Residuals:
      Min        1Q    Median        3Q       Max 
-6.036142 -0.903260  0.000759  0.913758  4.971614 

Coefficients:
         Estimate Std. Error t value Pr(>|t|)    
x         0.94963    0.03975   23.89   <2e-16 ***
x2        0.49567    0.01449   34.20   <2e-16 ***
`Q(fit)`  0.94297    0.03816   24.71   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.668 on 6717 degrees of freedom
Multiple R-squared: 0.8112   Adjusted R-squared: 0.7189
F-statistic: 8.791 on 3282 and 6717 DF, p-value: < 2.2e-16
felm() supports more than one instrumented variable, in
this case the iv argument must be a list of
formulas, one for each instrumented variable, with all the
instruments on the right hand side.
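For instance, a sketch with two hypothetical endogenous variables Q1 and Q2 and an additional instrument x4 (names of our choosing):

> ivest2 <- felm(y ~ x + x2 + G(id) + G(firm) + Q1 + Q2,
+               iv=list(Q1 ~ x3 + x4, Q2 ~ x3 + x4))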
The author has been made aware that both the G() syntax
and the handling of instrumental variables equations could be done in a
smoother fashion with multi-part formulas. Such a syntax is
not presently available in lfe.
It is a fact of life that both felm() and
getfe() may take quite a while to complete. What is more
unfortunate is that time to completion does not only depend on the size
of the dataset, but also on its structure. In particular, the
dependencies between the factors, beyond the pure identification
problems, have a huge impact. The rate of convergence for the general
method of alternating projections has been analyzed by, among others, Aronszajn (1950), Kayalar and Weinert (1988), Gearhart and Koshy (1989), Deutsch and Hundal (1997), Bauschke et al. (2003), and Badea, Grivaux, and Müller (2012). Their
results are in terms of the cosine \(c\) of generalized angles between the
subspaces corresponding to the projections \(P_i\). For two factors, the convergence
rate in operator norm is known to be
\[ \|(P_1 P_2)^n - P\| = c^{2n-1}, (\#eq:convrate) \]
where \(0 \leq c < 1\). The
convergence is linear in the case with two factors, but the rate depends
heavily on the structure of the factors. With 3 or more factors, the
convergence rate depends on both the structure of the factors and their
order in the iterations, in a quite complicated way (Deutsch and Hundal 1997, Theorem 2.7). The
theoretical convergence results are also valid for the Kaczmarz method
used by getfe(), as this is a special case of the method of
alternating projections (Deutsch and Hundal 1997, sec. 4). Note, however, that for our Kaczmarz step the number of projections is the number of observations, not the number of factors.
We do not offer general results which relate intuitive properties of the factors to the convergence rate, but here are some small examples to illustrate the complexity.
In our first example the factors f1 and f2
are independent:
> set.seed(54)
> x <- rnorm(100000)
> f1 <- sample(10000, length(x), replace=TRUE)
> f2 <- sample(300, length(x), replace=TRUE)
> y <- x + cos(f1) + log(f2+1) + rnorm(length(x), sd=0.5)
We time the estimation:
> system.time(est <- felm(y ~ x + G(f1) + G(f2)))
   user  system elapsed
  2.420   0.008   1.955
> system.time(alpha <- getfe(est))
   user  system elapsed
  0.256   0.008   0.263
This is a quite fast example; in general it is the author’s experience that datasets with this kind of full independence between the factors converge quite fast. Convergence is equally fast with 10,000 levels in the second factor.
But then we introduce a dependence which makes convergence slow. We
let the second factor be closely related to the first, with some
stochasticity, but we do not increase the number of levels. In this
example, the second factor f3 can only have 5 different
values for each value of f1.
> f3 <- (f1 + sample(5, length(x), replace=TRUE)) %% 300
> y <- x + cos(f1) + log(f3+1) + rnorm(length(x), sd=0.5)
> system.time(est <- felm(y ~ x + G(f1) + G(f3)))
   user  system elapsed
 34.624   0.000  18.804
> system.time(alpha <- getfe(est))
   user  system elapsed
  3.868   0.008   3.880
Execution time increases by an order of magnitude, even though the size of the dataset is the same as before. In this example, with only 300 levels in the second factor, we may as well encode it as ordinary dummies, reducing our model to the classical within groups estimator, with 300 covariates:
> system.time(est <- felm(y ~ x + G(f1) + factor(f3)))
   user  system elapsed
 10.340   0.832   5.379
> length(coef(est))
[1] 300
> system.time(alpha <- getfe(est))
   user  system elapsed
  0.192   0.000   0.192
This is far from being the whole story. A “small” change to the
second factor brings the execution time back down. We still have only 5
different values of the second factor f4 for each value of
f1, but they are spread irregularly apart:
> f4 <- (f1 + sample(5, length(x), replace=TRUE)^3) %% 300
> y <- x + cos(f1) + log(f4+1) + rnorm(length(x), sd=0.5)
> system.time(est <- felm(y ~ x + G(f1) + G(f4)))
   user  system elapsed
  2.564   0.000   2.081
> system.time(alpha <- getfe(est))
   user  system elapsed
  0.240   0.004   0.244
To better appreciate the complexity, we create two more examples which are similar to the previous one, but the randomly drawn numbers are spread regularly apart. The first one results in slow convergence:
> f5 <- (f1 + sample(seq(1,197,49), length(x), replace=TRUE)) %% 300
> y <- x + cos(f1) + log(f5+1) + rnorm(length(x), sd=0.5)
> system.time(est <- felm(y ~ x + G(f1) + G(f5)))
   user  system elapsed
 32.636   0.000  17.368
> system.time(alpha <- getfe(est))
   user  system elapsed
  3.972   0.000   3.975
The second one converges fast:
> f6 <- (f1 + sample(seq(1,201,50), length(x), replace=TRUE)) %% 300
> y <- x + cos(f1) + log(f6+1) + rnorm(length(x), sd=0.5)
> system.time(est <- felm(y ~ x + G(f1) + G(f6)))
   user  system elapsed
  2.548   0.000   2.073
> system.time(alpha <- getfe(est))
   user  system elapsed
  0.244   0.000   0.244
By comparing the “user” and “elapsed” times for felm()
in the above examples, it can be inferred that most of the time in the
fast examples is spent in bookkeeping outside the centring, otherwise
“user” time would be close to twice the “elapsed” time, since the actual
centring of y and x is done in parallel. This
means that the actual centring convergence rate differences are larger
than the reported differences in execution time. For a careful analysis
of centring times only, the centring process should be timed directly,
as mentioned in Gaure (2013b, top of
p. 17).
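A sketch of such a direct timing (our own code), applying demeanlist() to the variables from the slow example above:

> fl <- lapply(list(f1, f3), factor)
> system.time(cxy <- demeanlist(cbind(x, y), fl))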
The author has not succeeded in finding general intuitive guidelines for the convergence rate. However, with two factors, a relevant concept seems to be the bipartite graph where the vertices are factor levels and there is an edge between levels which are observed together. It is the connected components of this graph which determine estimability. The graph can be analyzed with the tools in package igraph (Csardi and Nepusz 2006) as follows:
> library(igraph)
> mkgraph <- function(flist) {
+   graph.adjacency(tcrossprod(do.call('rBind',
+                     lapply(flist, as, 'sparseMatrix'))) > 0,
+                   'undirected', diag=FALSE)
+ }
We make a list of the associated graphs:
> glist <- lapply(list(f2,f3,f4,f5,f6),
+ function(f) mkgraph(lapply(list(f1,f), factor)))
The graphs, except for the last, are connected:
> sapply(glist, no.clusters)
[1] 1 1 1 1 50
The average degrees, i.e. the average number of edges per vertex, show no convincing signs of being strongly related to the estimation time:
> sapply(glist, function(g) mean(degree(g)))
[1] 19.100301 8.404311 8.410719 8.400039 8.423925
A property of the graph which shows some promise of correlating with the convergence rate is the diameter. To get some intuition of what this is, consider the example with firms and employees. The associated graph has the firms and employees as vertices, and there is an edge between a firm and an employee if the employee has worked for the firm. A path between e.g. two employees Roy and Floyd can be constructed by finding a firm that Roy has worked for, then another employee Warshall of this firm, then another firm for Warshall, zigzagging between firms and their employees until we reach the employee Floyd. For each pair of vertices (i.e. firms and employees) in a mobility group, there is a shortest path. The length of the longest of the shortest paths among all vertex pairs is the graph’s diameter.
In these particular examples, it turns out that in the cases with fast convergence (the first, third, and fifth), the shortest paths between pairs of vertices are typically much shorter than in the slowly converging cases. We do not compute the diameter, as a typical fast algorithm for this, the Roy-Warshall-Floyd algorithm, has complexity of order \(O(n^3)\), where \(n\) is the number of vertices, i.e. 10,300 in our example. Below, for simplicity, we exclude the last graph; it is disconnected, hence of infinite diameter, though the diameters of its (comparably small) components are relevant. A component of this graph would typically contain only about \(300/50=6\) different levels of the second factor f6, so its diameter cannot possibly be very large.
> for(gr in glist[1:4])
+ print(fivenum(shortest.paths(gr, v=sample(V(gr),10), to=sample(V(gr),10))))
[1] 2 2 4 4 4
[1] 2 20 39 62 76
[1] 2 4 6 6 8
[1] 2 18 40 58 76
Also, the many variants of factor structures in Markussen and Røed (2012) (described in Gaure (2013b)) and subsequent ongoing work use factors which are interactions of other factors, in such a way that there are a few collinearities by design. This is not a conceptual problem when the factors are used as controls, but manually removing these known collinearities, by merging some levels into a common “reference” level, yields performance improvements of up to two orders of magnitude (Markussen 2013), a phenomenon which was not noted in Gaure (2013b). Incidentally, this also leads to shorter paths via the common reference level.
This suggests that, loosely speaking, datasets with six degrees of separation converge fast, whereas less well connected datasets converge slower. However, we cannot rule out other structural differences which may have an impact on the convergence rate; there is no such thing as “larger diameter, ceteris paribus” for graphs. It is possible that a diameter path could be used to construct a lower bound for the \(c\) in equation @ref(eq:convrate), but the author has not found a proof for such an assertion.
The convergence rate with more than two factors is more complicated: the theoretical results of Deutsch and Hundal (1997) are less complete, the rate is not a function of the cosines of pairs of subspaces, and even the order of the projections matters. In cases where only some of the factors have many levels, the remaining factors may be specified without G(), treating them as ordinary dummy-encoded covariates.
lfe utilizes two acceleration techniques for the alternating
projections method. In demeanlist(), which is used by
felm() to centre the covariates, the line search method in
Bauschke et al. (2003, Theorem 3.16) is
used, even though their Example 3.24 shows that in some special
cases with more than two factors, this method is in fact slower. For the
Kaczmarz method, random shuffling of the equations is used, as suggested
by Gaure (2013b).
felm() is thread-parallelized over the vectors to be centred, i.e. all variables in the model except those enclosed in G(). The number of processors is fetched with getOption('lfe.threads'), which is initialized upon loading of lfe from the environment variable LFE_THREADS or OMP_NUM_THREADS, or otherwise from a heuristic snatched from package multicore (Urbanek 2011). It may of course be set manually after loading the package. There is no benefit from specifying more threads than there are variables to centre. getfe() is not parallelized as such, but bootstrapping with getfe(..., se=TRUE) is.
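For example, to use four threads:

> options(lfe.threads=4)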
lfe is similar in function, but not in method, to the Stata (StataCorp. 2013) modules A2reg (Ouazad 2008) and felsdvreg (Cornelißen 2008). It uses the same algorithm as the Stata module reg2hdfe (Guimarães 2009), described by Guimarães and Portugal (2010) in terms of Gauss-Seidel iterations.
The package lfe contains methods for estimating ordinary least squares models with multiple factors which have too many levels for conventional methods like lm(). Such models may exhibit non-trivial identification problems. These are satisfactorily solved for the two-factor case, and the package also includes some partial solutions for the case with more factors. Instrumental variables regression via two-stage least squares is also supported. The convergence rate can be an issue for some datasets, and the present paper suggests some intuitive structural reasons for this.
The method employed by lfe is known as the method of alternating projections, a somewhat old and well-studied method which, to the author’s knowledge, has not previously been applied to the particular problem of linear regression with dummy variables.
I wish to thank the associate editor and two anonymous reviewers for helpful comments, as well as Bjørn-Helge Mevik for fruitful discussion.