Boston Housing data
We consider the Boston housing data to demonstrate the real data
application of the proposed R package SIQR. This dataset
contains the median value of houses (in $1000’s), medv, in 506 tracts in
Boston and 13 other socio-demographic related variables. This data has
been investigated by many studies. Heterogeneity and some non-linear
dependence of medv on predictor variables have been found by previous
researchers. The dataset is maintained at the StatLib library of
Carnegie Mellon University and can be found at the R built-in package
MASS.
We focus on the following four covariates: RM, the average number of
rooms per dwelling; TAX, the full-value property tax (in $) per $10,000;
PTRATIO, the pupil-teacher ratio by town; and LSTAT, percentage of the
lower status of the population as in (Opsomer and
Ruppert 1998), (K. Yu and Lu 2004),
and (Wu, Yu, and Yu 2010). Following
previous studies, we take logarithmic transformations on TAX and LSTAT
and center the dependent variable medv around zero.
We use the following codes to load data from MASS and
pre-process as discussed above. We fit a single-index quantile
regression with \(\tau =
0.25,0.50,0.75\) to the data and report fitted single-index
coefficients for each variable.
library(SIQR)
#load data from MASS
library(MASS)
medv<- Boston$medv
RM <- Boston$rm
logTAX <- log(Boston$tax)
PTRATIO <- Boston$ptratio
logLSTAT <- log(Boston$lstat)
X <- cbind(RM,logTAX,PTRATIO,logLSTAT)
y0 <- medv - mean(medv)
beta0 <- NULL
tau.vec <- c(0.25,0.50,0.75)
est.coefficient <- matrix(NA, nrow = length(tau.vec), ncol = 5)
est.coefficient[,1] <- tau.vec
for (i in 1:length(tau.vec)){
est <- siqr(y0,X,beta.initial = beta0, tau=tau.vec[i],maxiter = 30,tol = 1e-8)
est.coefficient[i,2:5] <- est$beta
}
colnames(est.coefficient) <- c("quantile tau",colnames(X))
est.coefficient
#> quantile tau RM logTAX PTRATIO logLSTAT
#> [1,] 0.25 0.3358285 -0.5243025 -0.06856117 -0.7795033
#> [2,] 0.50 0.3129182 -0.4294159 -0.06640472 -0.8445558
#> [3,] 0.75 0.2385613 -0.1933015 -0.07860687 -0.9484429
The estimated 0.25, 0.50, and 0.75 quantiles and their 95% pointwise
confidence bounds are plotted with the following codes and outputs.
est.tau25 <- siqr(y0,X,beta.initial = NULL, tau=0.25)
plot.siqr(est.tau25,bootstrap.interval = TRUE)
Figure 1: The R output of plot.siqr
with estimated 0.25 quantiles and the 95% pointwise confidence
bounds.
est.tau50 <- siqr(y0,X,beta.initial = NULL, tau=0.50)
plot.siqr(est.tau05,bootstrap.interval = TRUE)
Figure 2: The R output of plot.siqr
with estimated 0.50 quantiles and the 95% pointwise confidence
bounds.
est.tau75 <- siqr(y0,X,beta.initial = NULL, tau=0.75)
plot.siqr(est.tau75,bootstrap.interval = TRUE)
Figure 3: The R output of plot.siqr
with estimated 0.75 quantiles and the 95% pointwise confidence
bounds.
As the estimated single-index function curves are almost
monotonically increasing across different quantiles, variables that
contribute positively to the single index affect the response variable
(medv) positively. Based on the estimated coefficients and above plots,
we found that the number of rooms per house (rm) positively affects
different quantiles. This matches the intuition that people value large
spaces and multi-functional rooms. The property tax rate ln(tax) has a
negative impact on housing prices across different quantiles. However,
the influence of the tax rate is not significant at higher quantile
\(\tau\) = 0.75. That suggests the tax
rate may be less concerned for higher-income households, possibly due to
tax deduction towards their income tax. Both the pupil-teacher ratio
(ptratio) and the percentage of the lower (educational) status of the
population ln(lstat) show negative influences on housing values,
especially for the higher quantiles. It may suggest that potential
buyers prefer areas featuring solid educational resources for their
children and neighbors with higher education degrees and that preference
grows more vital for more expensive houses.
Simulation
We consider two simulation settings. In the first simulation example,
we use a sine-bump model with homoscedastic errors: \[\label{eq:sim1}
y=5 \sin \left(\frac{\pi \left( \textbf{ x}\mathbf{\beta}- A
\right)}{C-A} \right)+ 0.1 Z, (\#eq:sim1)\] where \(A=\frac{\sqrt{3}}{2}-\frac{1.645}{\sqrt{12}},
C=\frac{\sqrt{3}}{2}+\frac{1.645}{\sqrt{12}}\), \(\textbf{ x}\) is an \(n\times3\) design matrix that draws from an
independent uniform distribution with min of 0 and max of 1, and the
residual \(Z\) follows a standard
normal distribution. The true single-index parameter \(\mathbf{\beta}= (1,1,1)^{\intercal} /
\sqrt{3}\).
Table 1: Summary of parameter estimates for sine-bump
simulation example 1 of sample size \(n=400\). True \(\mathbf{\beta}= (1,1,1)^{\intercal} /
\sqrt{3}\). The sample mean, standard error (s.e.), and bias of
the parameter estimates of single-index coefficients from 200
replications.
|
mean |
0.5782 |
0.5727 |
0.5725 |
| \(\tau = 0.25\) |
s.e. |
0.0131 |
0.0281 |
0.0293 |
|
bias |
0.0009 |
-0.0046 |
-0.0048 |
|
mean |
0.5787 |
0.5755 |
0.5774 |
| \(\tau = 0.50\) |
s.e. |
0.0115 |
0.0105 |
0.0111 |
|
bias |
0.0014 |
-0.0018 |
0.0003 |
|
mean |
0.5803 |
0.5756 |
0.5757 |
| \(\tau = 0.75\) |
s.e. |
0.0119 |
0.0110 |
0.0118 |
|
bias |
0.0029 |
-0.0017 |
-0.0016 |
The single-index coefficients are estimated via a series of quantile
regressions with \(\tau = 0.25, 0.50,
0.75.\) Table 1 reports the mean, standard
error (s.e.), and bias for each parameter estimate with sample size
\(n=400\) over \(M=200\) replications on the simulation
example 1. One can see that the algorithm for our R package
SIQR is effective as the estimates are close to the true
values.
For demonstration purposes, we show codes to generate data from
(@ref(eq:sim1)) and fit the SIQR model using \(\tau = 0.50\) with 200 replications as
follows:
n <- 400
beta0 <- c(1, 1, 1)/sqrt(3)
n.sim <- 200
tau <- 0.50
data <- generate.data(n, true.theta=beta0, setting = "setting1",ncopy = n.sim)
sim.results.50 <- foreach(m = 1:n.sim,.combine = "rbind") %do% {
X <- data$X
Y <- data$Y[[m]]
est <- siqr(Y, X, beta.initial = c(2,1,0), tau=0.50,maxiter = 30,tol = 1e-8)
return(est$beta)
}
Note that this process has been repeated for the cases with \(\tau = 0.25, 0.75\). We obtain a box plot
of estimated single-index coefficients for \(\tau = 0.25, 0.50, 0.75\), respectively, by
applying the following code snippet.
boxplot(data.frame((sim.results.25)), outline=T,notch=T,range=1,
main = "Boxplots of Coefficient Estimates, tau = 0.25",horizontal = F)
Figure 4: The box plot of estimated single-index
coefficients for \(\tau = 0.25\) from
example 1.
boxplot(data.frame((sim.results.50)), outline=T,notch=T,range=1,
main = "Boxplots of Coefficient , tau = 0.50",horizontal = F)
Figure 5: The box plot of estimated single-index
coefficients for \(\tau = 0.50\) from
example 1.
boxplot(data.frame((sim.results.75)), outline=T,notch=T,range=1,
main = "Boxplots of Coefficient Estimates, tau = 0.75",horizontal = F)
Figure 6: The box plot of estimated single-index
coefficients for \(\tau = 0.75\) from
example 1.
Next, we consider a location-scale model as simulation example 2,
where both the location and the scale depend on a common index \(u=\textbf{ x}\mathbf{\beta}\). The
quantiles are “almost-linear-in-index" as in (K.
Yu and Jones 1998) when the single index u is close to zero:
\[\label{eq:sim2}
y=5 \cos (\textbf{ x}\mathbf{\beta})+ \exp(-(\textbf{
x}\mathbf{\beta})^2) + E, (\#eq:sim2)\] where \(\textbf{ x}\) is an \(n\times2\) design matrix that draws from an
independent normal distribution with a standard deviation of 0.25, and
the residual \(E\) follows an
exponential distribution with a mean 2. The single-index parameter \(\mathbf{\beta}= (1,2)^{\intercal} /
\sqrt{5}\).
The simulated data are generated with the following codes. The sample
size \(n = 400\) with 100 replications.
We only present the case when \(\tau =
0.50\) for demonstration purposes.
n <- 400
beta0 <- c(1, 2)/sqrt(5)
n.sim <- 100
tau <- 0.5
data <- generate.data(n, true.theta=beta0, setting = "setting3",ncopy = n.sim)
sim.results <- foreach(m = 1:n.sim,.combine = "rbind") %do% {
X <- data$X
Y <- data$Y[[m]]
est <- siqr(Y, X, beta.initial = NULL, tau=tau,maxiter = 30,tol = 1e-8)
est$beta
}
est.mean <- c(tau,apply(sim.results,2,mean))
names(est.mean) <- c("tau","beta1.hat","beta2.hat")
est.mean
est.mean <- cbind(p_vec,apply(sim_results,c(1,2),sd))
colnames(est.mean) <- c("quantile tau","X1","X2","X3")
est.mean
#> tau beta1.hat beta2.hat
#> 0.5 0.4515909 0.8917233
The average estimated single-index coefficients shown above are close
to the true single-index parameter \(\mathbf{\beta}= (1,2)^{\intercal} / \sqrt{5}
\approx(0.4472,0.8944)\). On top of that, the simulation standard
error is also reported as below:
est.se <- c(tau,apply(sim.results,2,sd))
names(est.se) <- c("tau","beta1.se.hat","beta1.se.hat")
est.se
#> tau beta1.se.hat beta1.se.hat
#> 0.5 0.02682211 0.01359602
Meanwhile, the following box plots show that the estimated
single-index coefficients are close to the true parameters with small
deviations.
boxplot(data.frame((sim.results)), outline=T,notch=T,range=1,
main = "Boxplots of Coefficient Estimates (100 replications)",horizontal = F)
Figure 7: The box plot of estimated single-index
coefficients for \(\tau = 0.50\) from
example 2.
Similarly, we plot the estimated quantiles and their 95% pointwise
confidence bounds with the provided plot function
plot.siqr. The observed data points are also plotted.
est.sim.50 <- siqr(data$Y[[1]],data$X,beta.initial = NULL, tau=0.5)
plot.siqr(est.sim.50,bootstrap.interval = TRUE)
Figure 8: The R output of plot.siqr
with estimated 0.50 quantiles and the 95% pointwise confidence bounds
from example 2.