Abstract
This paper introduces the R package HDSpatialScan. This package allows users to easily apply spatial scan statistics to real-valued multivariate data or both univariate and multivariate functional data. It also permits plotting the detected clusters and to summarize them. In this article the methods are presented and the use of the package is illustrated through examples on environmental data provided in the package.Spatial cluster detection methods are useful tools for objective
detection and localization of statistically significant aggregation of
events indexed in space. Examples of the applications of these methods
are numerous: in the field of epidemiology, these methods allow
epidemiologists to detect spatial clusters of disease cases and to
formulate etiological hypotheses; in the environmental sciences,
researchers can be led to search for particularly polluted geographical
areas, either by one pollutant in particular or by several pollutants
simultaneously. In astronomy, researchers may want to identify star
clusters from telescope image data.
Several cluster detection methods have been proposed in the literature.
In particular, spatial scan statistics (originally proposed by Martin Kulldorff and Nagarwalla (1995) and Martin Kulldorff (1997) for Bernoulli and
Poisson models) are powerful methods for detecting statistically
significant spatial clusters, which can be defined by an aggregation of
sites presenting an abnormal concentration (mean, etc) of an observed
variable, with a variable scanning window and in the absence of
pre-selection bias (objective detection of the cluster). Following on
from Kulldorff’s initial work, several researchers have adapted spatial
scan statistics to other spatial data distributions: exponential (Huang, Kulldorff, and Gregorio 2007), ordinal
(Jung, Kulldorff, and Klassen 2007),
normal (Martin Kulldorff, Huang, and Konty
2009), Weibull (Bhatt and Tiwari
2014), etc. Others use nonparametric approaches such as Jung and Cho (2015) and Cucala (2016) who respectively extend the
Wilcoxon-Mann-Whitney test for spatial scan statistics and for temporal
or spatial scan statistics. Note that in the case of spatial data the
two approaches are equivalent by generalizing the method of Jung and Cho (2015) to detect either high or low
clusters.
The applications of scan statistics are numerous. In the field of
epidemiology, Khan et al. (2021) detected
significant clusters of diabetes incidence in Florida between 2007 and
2010, which will help guide local health policies. Marciano et al. (2018) sought to detect spatial
clusters of leprosy incidence in a hyperendemic Brazilian municipality
between 2000 and 2005 and 2006 and 2010. The study showed a high
percentage of contact between people which facilitates the transmission
of the disease. Genin et al. (2020)
detected high-risk clusters of Crohn’s disease in France over the period
2007-2014. As the causes of this disease are still poorly understood,
the detection of spatial clusters of Crohn’s disease allows the
researchers to make hypotheses on possible risk factors, such as
high-social deprivation or high urbanization. In the context of
environmental science, the detection of clusters of symptomatic exposure
to pesticides in rural areas (Sudakin, Horowitz,
and Giffin 2002) would allow the monitoring and prevention of
pesticide-related diseases. Gao et al.
(2014) focused on the presence of iodine in drinking water in
Shandong Province, China. The detection of spatial clusters of iodine
presence in drinking water allows an improvement of the monitoring of
drinking water quality in these geographical areas. Finally in the
context of pollution data, (Wan et al.
2020) and (Shi, Liu, and Zhong
2021) respectively detected clusters of high concentrations of
\(\text{PM}_{2.5}\) in America and
China. Such results may allow local authorities to specifically monitor
these areas and make decisions to reduce pollution.
When multiple variables are observed simultaneously at each spatial
location, researchers may be interested in detecting spatial clusters
with anomalous values of all measured variables. In this context, Martin Kulldorff et al. (2007) proposed a
multivariate spatial scan statistic using a combination of independent
univariate scan statistics. However it fails to take into account the
potential correlations between the variables. A first spatial scan
statistic for multivariate data taking into account the correlations was
proposed by Cucala et al. (2017). Their
method is based on a multivariate normal probability model and a
likelihood ratio. Later, Cucala et al.
(2019) proposed a nonparametric spatial scan statistic for
multivariate data based on a multivariate Wilcoxon-Mann-Whitney
test.
Technological developments in measurement tools and data storage
capacity have yielded to the increasing use of sensors, cell phones and
more generally connected devices that collect data continuously or
almost continuously over time. This has led to the introduction of new
analysis methods for functional data (J. Ramsay
and Silverman 2005), as well as the adaptation of classical
statistical methods such as principal component analysis (Boente and Fraiman 2000; Berrendero, Justel, and Svarc
2011) or regression (Cuevas,
Febrero-Bande, and Fraiman 2002; Ferraty and Vieu 2002; Chiou and Müller
2007).
In the field of spatial scan statistics, Frévent
et al. (2021a) and Smida et al.
(2022) proposed new methods for univariate processes. However for
example, in environmental surveillance, numerous variables are
simultaneously measured, making a multivariate functional approach
necessary to detect environmental black-spots. These can be defined as
geographical areas characterized by elevated concentrations of multiple
pollutants. Although Smida et al. (2022)
only studied their approach in the univariate functional framework, they
suggest that it could also be adapted for multivariate processes. Frévent et al. (2021b) studied this adaptation
and also developed new efficient methods for multivariate functional
spatial scan statistics.
In R several packages provide spatial scan statistics implementations.
The best known is certainly the rsatscan
(Kleinman 2015) package which provides
functions to interface R and the SaTScan software (Martin Kulldorff 2021), allowing the latter to
be launched from R. It implements lots of univariate methods (ordinal,
Bernoulli, Poisson, …) but also the space-time spatial scan statistic
(M. Kulldorff et al. 1998) and the
multivariate extensions proposed by Martin
Kulldorff et al. (2007). The function kulldorff
implemented in the R package SpatialEpi
(Chen et al. 2018) also performs the
spatial scan statistics based on the Poisson and the Bernoulli models.
Other softwares were created to detect clusters such as ClusterSeer
(Greiling et al. 2012; Durbeck et al.
2012) which performs spatial, temporal and space-time clustering,
and TreeScan (Martin Kulldorff 2018) which
implements the tree-based scan statistic (Martin
Kulldorff, Fang, and Walsh 2003). We should also mention the R
package DCluster
(Gómez-Rubio, Ferrándiz-Ferragud, and
López-Quı́lez 2015) which implements the spatial scan statistics
for Poisson or Bernoulli models. The R package DClusterm
(Gómez-Rubio et al. 2019; Gomez-Rubio, Serrano,
and Rowlingson 2020) also implements a cluster detection method.
Briefly, it consists in considering a large number of generalized linear
models by including potential cluster indicators one by one, and then to
use a model selection procedure. The Shiny application SpatialEpiApp
(Moraga 2017a) and the R package SpatialEpiApp
(Moraga 2017b) allow the detection and
visualization of clusters by using the scan statistics implemented in
SaTScan. Finally the software FlexScan (Takahashi, Yokoyama, and Tango 2010) and the R
package rflexscan
(Otani and Takahashi 2021) implement the
spatial scan statistic using a scan window with a non pre-defined shape,
defined by Takahashi and Tango (2005).
Other R packages also allow clusters detection such as graphscan
(Loche et al. 2016) (the
cluster function), SPATCLUS (Dematteı̈, Molinari, and Daurès 2006) or scanstatistics
(Allévius 2018a, 2018b) for spatial or
space-time data. It should be noted that these last two packages are no
longer available on the CRAN (The Comprehensive R Archive Network)
repository. Although existing packages implement a large number of
statistical spatial scan models, none of them propose multivariate scan
models taking into account the potential correlations between variables
or scan models for functional data. Thus, we have developed the R
package HDSpatialScan for high-dimensional spatial scan
statistics. The latter allows on the one hand the detection of spatial
clusters in multivariate or functional data, and on the other hand,
their display on a map and the description of their characteristics.
This paper is organized as follows: The following section presents
the different models implemented in the R package
HDSpatialScan. Then, the implementation of the methods is
described and examples of use of the package are given. The last section
concludes the paper.
Let \(s_1, \dots, s_n\) be \(n\) different locations of an observation domain \(S \subset \mathbb{R}^2\) and \(X_1, \dots, X_n\) be the observations of a variable \(X\) in \(s_1, \dots, s_n\). Hereafter all observations are considered to be independent, which is a classical assumption in scan statistics. Three types of spatial data can be considered: either lattice data (the data are aggregated at the spatial level, e.g.: county), geostatistical data (the variable is defined on a continuous area and each individual measure corresponds to a unique fixed spatial location, e.g.: pollutant concentration measured by sensors over a region), or marked point data (each individual measure corresponds to a unique random spatial location, e.g.: height of the trees in a forest, the location of the trees is random).
Spatial scan statistics aim at detecting spatial clusters and testing their statistical significance. Hence, one tests a null hypothesis \(\mathcal{H}_0\) (the absence of a cluster) against a composite alternative hypothesis \(\mathcal{H}_1\) (the presence of at least one cluster \(w \subset S\) presenting abnormal values of \(X\)). For this purpose, a spatial scan statistic consists of two steps. The first one is a detection phase using a scanning window of variable size and shape. We will focus here on the approach of Martin Kulldorff and Nagarwalla (1995) which use a circular scanning window of variable center and radius, however it should be noted that other shapes can be considered (Martin Kulldorff et al. 2006; Cucala et al. 2013). An approach often advised is to limit the maximum size to half of the studied region since otherwise it would be like detecting a “negative cluster” in the areas outside the clusters covering almost all the studied region (Martin Kulldorff and Nagarwalla 1995). Then the scanning window allows to define a set of potential clusters \(\mathcal{W}\) by \[\label{eq:cluster} \mathcal{W} = \{ w_{i,j} \ / \ 1 \le |w_{i,j}| \le \frac{n}{2}, \ 1 \le i,j \le n \}, (\#eq:cluster)\] where \(w_{i,j}\) is the disc centered on \(s_i\) that passes through \(s_j\) and \(|w_{i,j}|\) corresponds to the number of sites in \(w_{i,j}\). Figure 1 illustrates the set of potential clusters defined with a circular scanning window with Equation @ref(eq:cluster) on a set of eight administrative areas in France.
Then the spatial scan statistic can be defined as the maximum of a concentration index over the set of potential clusters \(\mathcal{W}\). The second step is the determination of the statistical significance of the spatial scan statistic. For this, since the distribution of the scan statistic is intractable under \(\mathcal{H}_0\) due to the overlapping nature of \(\mathcal{W}\), a common approach, which will be considered here, is to use a Monte-Carlo method (see Section 2.4 for more details).
Here we consider the case where several continuous variables are
simultaneously observed in each spatial location: \(X = (X^{(1)}, \dots, X^{(p)})^\top\) is a
\(p\)-dimensional variable (\(p \ge 2\)). In this context the objective
is to identify multivariate spatial clusters that are aggregations of
sites in which \(X\) takes higher or
lower values (in terms of mean, median, etc.) than elsewhere. For
example one could observe the average concentrations of several
pollutants over a day: a vector can be associated with each site, each
element of which corresponds to the average concentration of one
pollutant. In this context a spatial cluster corresponds to a set of
sites under or overexposed to multiple pollutants. Different approaches
will be presented: a parametric method based on a Gaussian model and a
nonparametric one.
Figure 2 summarizes the different types
of multivariate data with examples, and provides guidelines on the
spatial scan statistics methods to be used for these data (and the
argument to use in the scan function of the package). More precisely, we
distinguish three types of spatial data: lattice data which are
aggregated data for example at the scale of the regions of a country,
geostatistical data which are defined on a continuous space (typically
temperature, sunshine, or atmospheric pressure) although they are
observed only at discrete sites, and marked point data for which the
location is random (for example the distribution of trees in a forest)
and we observe at each location the circumference and height of the tree
for example. To detect spatial clusters, in the case of Gaussian data we
will prefer the Gaussian approach (MG) and otherwise we will use the
nonparametric approach (MNP).
Cucala et al. (2017) proposed a
parametric spatial scan statistic for multivariate data based on a
multivariate normal model taking into account the correlations between
the variables.
The null hypothesis \(\mathcal{H}_0\),
corresponding to the absence of any cluster in the data, is the
following: \(\forall i \in \left[\!\left[ 1 ;
n \right]\!\right], \ X_i \sim \mathcal{N}_p(\mu, \Sigma)\) and
the alternative hypothesis \(\mathcal{H}_1^{(w)}\) associated with a
potential cluster \(w\) can be defined
as: \(\forall i \in \left[\!\left[ 1 ; n
\right]\!\right], \ X_i \sim \left\{ \begin{array}{ll}
\mathcal{N}_p(\mu_w, \Sigma_{w,w^\mathsf{c}}) & \text{ if } s_i \in
w \\
\mathcal{N}_p(\mu_{w^\mathsf{c}}, \Sigma_{w,w^\mathsf{c}}) & \text{
otherwise}
\end{array} \right.\).
Then we can compute the MLE estimates of \(\mu, \mu_w, \mu_{w^\mathsf{c}}, \Sigma\)
and \(\Sigma_{w,w^\mathsf{c}}\): \(\hat{\mu}, \widehat{\mu}_w,
\widehat{\mu}_{w^\mathsf{c}}, \widehat{\Sigma}\) and \(\widehat{\Sigma}_{w,w^\mathsf{c}}\), and we
can show that the log-likelihood ratio between these two hypotheses is
\[\begin{aligned}
\widehat{LLR^w} =
&-\frac{n}{2}\ln\bigg[\text{det}\Big(\sum_{\substack{i\\ s_i \in w}}
\left(X_i - \widehat{\mu}_w \right) \left(X_i - \widehat{\mu}_w
\right)^\top + \sum_{\substack{i \\ s_i \in w^\mathsf{c}}} \left(X_i -
\widehat{\mu}_{w^\mathsf{c}} \right) \left(X_i -
\widehat{\mu}_{w^\mathsf{c}} \right)^\top \Big)\bigg] \\
&+ \frac{n}{2}\ln\bigg[\text{det}\Big(\sum_{i=1}^{n} \left(X_i -
\hat{\mu} \right) \left(X_i - \hat{\mu} \right)^\top \Big)\bigg],
\end{aligned}\] where \(\displaystyle{
\hat{\mu}_g = \dfrac{1}{|g|} \sum_{i, s_i \in g} X_i }\) for
\(g \in \{ w, w^\mathsf{c} \}\) and
\(\displaystyle{ \hat{\mu} = \dfrac{1}{n}
\sum_{i=1}^n X_i }\).
Finally the log-likelihood ratio is used as a concentration index and
maximised over the set of potential clusters \(\mathcal{W}\).
Thus we can show that the multivariate Gaussian (MG) scan statistic is
\[\lambda_{\text{MG}} = \underset{w \in
\mathcal{W}}{\min} \ \text{det}\Big(\sum_{\substack{i\\ s_i \in w}}
\left(X_i - \widehat{\mu}_w \right) \left(X_i - \widehat{\mu}_w
\right)^\top + \sum_{\substack{i \\ s_i \in w^\mathsf{c}}} \left(X_i -
\widehat{\mu}_{w^\mathsf{c}} \right) \left(X_i -
\widehat{\mu}_{w^\mathsf{c}} \right)^\top \Big).\]
This test performs very well against Gaussian alternatives but faces
problems when the data is not normal, which is often the case when
dealing with environmental data exhibiting extreme values. For that
reason Cucala et al. (2019) developed a
nonparametric spatial scan statistic for multivariate data based on a
multivariate extension of the Wilcoxon-Mann-Whitney test for
multivariate data (Oja and Randles
2004).
In this context the null hypothesis \(\mathcal{H}_0\) can be rewritten as \(\mathcal{H}_0: X_1, \dots, X_n\) are
identically distributed, whatever the associated location.
Let \[\begin{array}{ccccl}
\text{sgn} & : & \mathbb{R}^p & \to & \mathbb{R}^p \\
& & x & \mapsto & \left\{
\begin{array}{cl}
||x||_2^{-1} x & \text{ if } x \neq 0 \\
0 & \text{ otherwise}
\end{array}
\right. \\
\end{array},\] then the multivariate ranks \(R_i\) are defined by \(\displaystyle{R_i = \dfrac{1}{n} \sum_{j=1}^{n}
\text{sgn}(A_X(X_i - X_j))}\) where the matrix \(A_X\) makes the ranks such that \(\displaystyle{\dfrac{p}{n}\sum_{i=1}^{n} R_i
R_i^\top = \dfrac{1}{n} \sum_{i=1}^{n} R_i^\top R_i I_p}\). Note
that this matrix can be easily computed using an iterative procedure.
Then the multivariate extension of the Wilcoxon-Mann-Whitney statistic
proposed by Oja and Randles (2004) is
\[U^2(w) = \dfrac{p}{c^2_X} \left[\ |w| \
||\bar{R}_w||_2^2 + |w^\mathsf{c}| \ ||\bar{R}_{w^\mathsf{c}}||_2^2 \
\right], \text{ where } \displaystyle{c^2_X = \dfrac{1}{n}\sum_{i=1}^n
R_i^\top R_i}.\]
Cucala et al. (2019) used \(U^2(w)\) as a concentration index to build the spatial scan statistic: the multivariate nonparametric (MNP) scan statistic is \(\lambda_{\text{MNP}} = \underset{w \in \mathcal{W}}{\max} \ U^2(w)\).
It should be noted that in the case \(p=1\), these statistics are respectively equivalent to the ones introduced by Martin Kulldorff, Huang, and Konty (2009) (which is equivalent to the scan statistic developed by Cucala (2014), UG), and Jung and Cho (2015) (UNP).
Here we consider the case where a continuous variable is observed in
each spatial location over time: \(\{ X(t), \
t \in \mathcal{T} \}\) is a real-valued stochastic process where
\(\mathcal{T}\) is an interval of \(\mathbb{R}\). In this context the objective
is to identify functional spatial clusters that are aggregations of
sites in which the curves are higher or lower than elsewhere. For
example, one can observe the concentration of an air pollutant over time
in different geographical areas. Then a cluster corresponds to an
aggregation of sites in which the concentration of the air pollutant is
higher or lower over the time than in the other spatial units. Several
methods will be considered: a parametric method based on a functional
ANOVA, a nonparametric approach using a Wilcoxon-Mann-Whitney test for
high-dimensional data, a distribution-free approach based on a pointwise
Student’s t-test and finally a pointwise rank-based method. On Gaussian
data, for non localized clusters in time all approaches show high power
and high true positive rates. However the performances of the
ANOVA-based method strongly decrease on non-normal data. For localized
clusters in time (that are aggregations of sites that take higher or
lower values for \(X\) only in a small
interval of time (an interval of five days over a study period of one
month for example)) the pointwise approaches should be favored.
Figure 3 summarizes the different
types of univariate functional data with examples, and provides
recommendations on the spatial scan statistics methods to be used for
these data (and the argument to use in the scan function of the
package). More precisely, we distinguish lattice functional data which
are aggregated functional data for example at the scale of the
administrative areas of a country (unemployment rate, percentage of the
population over 65, etc), geostatistical functional data which are
defined on a continuous space (typically temperature, sunshine, or
atmospheric pressure over time) although they are observed only at
discrete sites, and marked point data for which the location is random
(for example the distribution of trees in a forest) and we observe at
each location the circumference of the tree over time for example. To
detect spatial clusters, as mentioned before, in the case of Gaussian
data we will prefer the pointwise distribution-free functional approach
(DFFSS) and otherwise we will use the pointwise rank-based approach
(URBFSS).
Frévent et al. (2021a) supposed that
the process \(X\) takes values in a
semi-metric space, in particular in \(\mathcal{L}^2(\mathcal{T},\mathbb{R})\) and
proposed a parametric spatial scan statistic for functional data, based
on a functional ANOVA. Here the null hypothesis \(\mathcal{H}_0\) can be rewritten: \(\mathcal{H}_0: \forall w \in \mathcal{W}, \
\mu_{w} = \mu_{w^\mathsf{c}} = \mu_S\), and the alternative
hypothesis \(\mathcal{H}_1^{(w)}\)
associated with a potential cluster \(w\) can be defined as follows: \(\mathcal{H}_1^{(w)}: \mu_{w} \neq
\mu_{w^\mathsf{c}}\), where \(\mu_w\), \(\mu_{w^\mathsf{c}}\) and \(\mu_S\) stand for the mean functions in
\(w\), outside \(w\) and over \(S\), respectively. Cuevas, Febrero-Bande, and Fraiman (2004) and
Górecki and Smaga (2015) proposed the
following ANOVA test statistic: \[\displaystyle{F_n^{(w)} = \dfrac{
|w| \ ||\bar{X}_{w} - \bar{X}||_2^2 +
|w^\mathsf{c}| \ ||\bar{X}_{w^\mathsf{c}} - \bar{X}||_2^2}{
\dfrac{1}{n-2}
\left[\sum_{j, s_j \in w} ||X_j - \bar{X}_{w}||_2^2 + \sum_{j, s_j \in
w^\mathsf{c}} ||X_j - \bar{X}_{w^\mathsf{c}}||_2^2 \right]},}\]
where \(\displaystyle{\bar{X}_{g}(t) =
\dfrac{1}{|g|} \sum_{i, s_i \in g} X_i(t)}\) are empirical
estimators of \(\mu_g\) (\(g \in \{w, w^\mathsf{c}\}\)), \(\displaystyle{\bar{X}(t) = \dfrac{1}{n} \sum_{i =
1}^n X_i(t)}\) is the empirical estimator of \(\mu_S\) and \(\displaystyle{||x||_2^2 = \int_{\mathcal{T}}
x^2(t) \ \text{d}t}\).
Thus, Frévent et al. (2021a) proposed to
use \(F_n^{(w)}\) as a concentration
index and the proposed parametric functional spatial scan statistic
(PFSS) is \(\Lambda_{\text{PFSS}} =
\underset{w \in \mathcal{W}}{\max} \ F_n^{(w)}\).
This method gives high powers and F-measures on normal data but as in
the multivariate framework the parametric method faces problems when the
data is not normal. Smida et al. (2022)
proposed a nonparametric spatial scan statistic for functional data
based on a functional Wilcoxon-Mann-Whitney test (Chakraborty and Chaudhuri 2014).
Here \(X\) is a process of a smooth
Banach space \(\chi\), with a Gâteaux
differentiable norm \(||.||_{\chi}\).
Let denote \(P_w\) and \(P_{w^\mathsf{c}}\) the probability measures
of \(X\) in \(w\) and in \(w^\mathsf{c}\) respectively, then \(\mathcal{H}_0\) corresponds to: \(\mathcal{H}_0: \forall w \in \mathcal{W}, \ P_w =
P_{w^\mathsf{c}}\) and the alternative hypothesis associated with
a potential cluster \(w\) can be
rewritten as \(\mathcal{H}_1^{(w)}: P_w(X) =
P_{w^\mathsf{c}}(X-\Delta), \ \Delta \in \chi \backslash
\{0\}\).
Chakraborty and Chaudhuri (2014) defined
the sign function in the functional framework as \[\forall h \in \chi, \ \text{Sgn}_X(h) = \left\{
\begin{array}{cl} \underset{v \rightarrow 0^+}{\lim} \ \dfrac{|| X + vh
||_\chi - ||X||_\chi}{v} &\text{ if } X \neq 0 \\
0 &\text{ if } X = 0 \end{array} \right. .\] Then they
proposed the following test statistic: \[T_{WMW}(w) = \dfrac{1}{|w| |w^\mathsf{c}|}
\sum_{i, s_i \in w} \sum_{j, s_j \in w^\mathsf{c}} \text{Sgn}_{X_j -
X_i}.\] Under \(\mathcal{H}_0\),
if \(\dfrac{|w|}{n} \rightarrow \gamma \in
[0;1]\) as \(|w|, |w^\mathsf{c}|
\rightarrow \infty\), \(\sqrt{\dfrac{|w| |w^\mathsf{c}|}{n}}
T_{WMW}(w)\) converges weakly to a distribution that does not
depend on \(|w|\). Thus Smida et al. (2022) proposed to use \(U(w) = \bigg| \bigg| \sqrt{\dfrac{|w|
|w^\mathsf{c}|}{n}} T_{WMW}(w) \bigg| \bigg|\) as a concentration
index: the nonparametric functional scan statistic (NPFSS) is \(\Lambda_{\text{NPFSS}} = \underset{w \in
\mathcal{W}}{\max} \ U(w)\).
It should be noticed that although Smida et al.
(2022) only studied the performances of the NPFSS in the
univariate functional framework, their method is also applicable on
multivariate functional data as shown by Frévent
et al. (2021b).
Frévent et al. (2021a) also proposed to
combine the distribution-free spatial scan statistic for univariate data
proposed by Cucala (2014) and the max
statistic of Lin, Lopes, and Müller
(2021). They supposed that for each time \(t\), \(\mathbb{V}[X_i(t)] = \sigma^2(t)\) for all
\(i \in \left[\!\left[ 1 ; n
\right]\!\right]\). Then for each \(t\), the concentration index proposed by
Cucala (2014) to test \(\mathcal{H}_0: \forall w \in \mathcal{W}, \mu_w(t)
= \mu_{w^\mathsf{c}}(t) = \mu_S(t)\) was \[I^{(w)}(t) = \dfrac{|\bar{X}_w(t) -
\bar{X}_{w^\mathsf{c}}(t)|}{\sqrt{\hat{\mathbb{V}}[\bar{X}_w(t) -
\bar{X}_{w^\mathsf{c}}(t)]}},\] where \(\hat{\mathbb{V}}[\bar{X}_w(t) -
\bar{X}_{w^\mathsf{c}}(t)] = \widehat{\sigma^2}(t) \left[ \dfrac{1}{|w|}
+ \dfrac{1}{|w^\mathsf{c}|} \right]\),
\(\displaystyle{\widehat{\sigma^2}(t) =
\dfrac{1}{n-2} \left[ \sum_{i, s_i \in w} (X_i(t) - \bar{X}_w(t))^2 +
\sum_{i, s_i \in w^\mathsf{c}} (X_i(t) - \bar{X}_{w^\mathsf{c}}(t))^2
\right]}\).
Then the idea is to globalize the information by maximizing the previous
quantity over the time for each potential cluster \(w\), as suggested by Lin, Lopes, and Müller (2021): \[I^{(w)} = \underset{t \in \mathcal{T}}{\sup} \
I^{(w)}(t).\]
For cluster detection, as for the PFSS, the null hypothesis \(\mathcal{H}_0\) (the absence of cluster) is
defined as follows: \(\mathcal{H}_0: \forall w
\in \mathcal{W}, \ \mu_{w} = \mu_{w^\mathsf{c}} = \mu_S\). And
the alternative hypothesis \(\mathcal{H}_1^{(w)}\) associated with a
potential cluster \(w\) can be defined
as follows: \(\mathcal{H}_1^{(w)}: \mu_{w}
\neq \mu_{w^\mathsf{c}}\).
Frévent et al. (2021a) considered \(I^{(w)}\) as a concentration index and
maximized it over the set of potential clusters \(\mathcal{W}\) yielding to the following
distribution-free functional spatial scan statistic (DFFSS): \(\Lambda_{\text{DFFSS}} = \underset{w \in
\mathcal{W}}{\max} \ I^{(w)}\).
A pointwise approach based on ranks and on the nonparametric scan statistic for univariate data (Jung and Cho 2015) can be proposed in the univariate functional framework by adapting the approach of Frévent et al. (2021b).
For a time \(t\), Jung and Cho (2015) proposed to test \(\mathcal{H}_0: \forall w \in \mathcal{W}, F_{w,t}
= F_{w^\mathsf{c},t}\) where \(F_{w,t}\) and \(F_{w^\mathsf{c},t}\) are the cumulative
distribution functions of \(X(t)\) in
\(w\) and outside \(w\), by using the Wilcoxon rank-sum test
statistic. For a time \(t\) and a
potential cluster \(w\), the Wilcoxon
rank-sum test statistic is \(\displaystyle{W(t)^{(w)} = \sum_{i, s_i \in w}
R_i(t)}\) where \(R_i(t)\) is
the rank of \(X_i(t)\) in \(\{ X_1(t), \dots, X_n(t) \}\), using the
average rank in the case of tied observations.
Then the standardized version of this statistic is \[T(t)^{(w)} = \dfrac{W(t)^{(w)} -
\mathbb{E}[W(t)^{(w)}]}{\sqrt{\mathbb{V}[W(t)^{(w)}]}}\] where
\(\mathbb{E}[W(t)^{(w)}] =
\dfrac{|w|(n+1)}{2}\) and \(\mathbb{V}[W(t)^{(w)}] = \dfrac{|w| |w^\mathsf{c}|
(n+1)}{12}\) are the expected value and the variance of \(W(t)^{(w)}\) under \(\mathcal{H}_0\).
Jung and Cho (2015) proposed to minimize
the p-value associated with \(T(t)^{(w)}\) on the set of potential
clusters \(\mathcal{W}\). We propose to
adapt their approach by simply using \(|T(t)^{(w)}|\) as a pointwise
statistic.
In the context of cluster detection, the null hypothesis is defined as
\(\mathcal{H}_0\): \(\forall w \in \mathcal{W}, \ \forall t \in
\mathcal{T}, \ F_{w,t} = F_{w^\mathsf{c},t}\). The alternative
hypothesis \(\mathcal{H}_1^{(w)}\)
associated with a potential cluster \(w\) is \(\mathcal{H}_1^{(w)}\): \(\exists t \in \mathcal{T}, \ F_{w,t}(x) =
F_{w^\mathsf{c},t}(x-\Delta_t)\), \(\Delta_t \neq 0\).
As before, we propose to globalize the information over the time with
\(T^{(w)} = \underset{t \in \mathcal{T}}{\sup}
\ |T(t)^{(w)}|\) and to use this quantity as a concentration
index, yielding to the following univariate rank-based functional
spatial scan statistic (URBFSS): \(\Lambda_{\text{URBFSS}} = \underset{w \in
\mathcal{W}}{\max} \ T^{(w)}\).
Here we consider the case where several continuous variables are
observed simultaneously in each spatial unit over time: \(\{ X(t), \ t \in \mathcal{T} \}\) is a
\(p\)-dimensional vector-valued
stochastic process (\(p \ge 2\)) where
\(\mathcal{T}\) is an interval of \(\mathbb{R}\). The objective is to detect
multivariate functional spatial clusters that are aggregations of sites
in which the curves are higher or lower than elsewhere. For example we
can observe the concentration of several pollutants over time in
different locations. Thus at each location we observe several processes
(air pollutant concentrations) and these processes can be correlated. In
this context a cluster is an aggregation of sites overexposed or
underexposed to multiple pollutants over time. Several methods will be
presented: a parametric method based on a functional MANOVA, a
distribution-free approach based on a pointwise Hotelling \(T^2\)-test and finally a pointwise
rank-based method. On normal data, all approaches show high power and
high true positive rates for non localized clusters in time. However the
performances of the methods based on the MANOVA and the Hotelling \(T^2\)-test decrease on non-normal data. For
localized clusters in time the pointwise approaches should be favored,
especially the pointwise rank-based method on non-Gaussian data. By
localized clusters in time we mean aggregations of sites that take
higher or lower values for \(X\) only
in a small interval of time (an interval of five days over a study
period of one month for example).
Figure 4 summarizes the different
types of multivariate functional data with examples, and provides
guidelines on the spatial scan statistics methods to be used for these
data (and the argument to use in the scan function of the package). To
be more precise, we can distinguish lattice functional data which are
aggregated data for example at the scale of the regions of a country
(unemployment rate and fraction of the population that has not graduated
from high school, over time, for example), geostatistical functional
data which are defined on a continuous space (temperature, sunshine, and
atmospheric pressure over time) although they are observed only at
discrete sites, and marked point data for which the location is random
(for example the distribution of trees in a forest) and we observe at
each location the circumference and height of the tree over time for
example. As previously mentioned, to detect spatial clusters, in the
case of Gaussian data we will prefer the multivariate distribution-free
functional spatial scan statistic (MDFFSS) and for non-Gaussian data we
will use the pointwise rank-based approach (MRBFSS).
Here, the process \(X\) is supposed to take values in a semi-metric space, in particular the Hilbert space \(\mathcal{L}^2(\mathcal{T}, \mathbb{R}^p)\) of \(p\)-dimensional vector-valued square-integrable functions on \(\mathcal{T}\), equipped with the inner product \(\displaystyle{\langle X, Y \rangle = \int_{\mathcal{T}} X(t)^\top Y(t) \ \text{d}t}\).
Frévent et al. (2021b) proposed a
parametric scan statistic for multivariate functional data based on a
functional MANOVA Lawley–Hotelling trace test (Górecki and Smaga 2017).
In this context, the null hypothesis \(\mathcal{H}_0\) is \(\mathcal{H}_0: \forall w \in \mathcal{W}, \
\mu_{w} = \mu_{w^\mathsf{c}} = \mu_S\), where \(\mu_w\), \(\mu_{w^\mathsf{c}}\) and \(\mu_S\) stand for the mean functions in
\(w\), outside \(w\) and over \(S\), respectively. And the alternative
hypothesis \(\mathcal{H}_1^{(w)}\)
associated with a potential cluster \(w\) is \(\mathcal{H}_1^{(w)}: \mu_{w} \neq
\mu_{w^\mathsf{c}}\). Górecki and Smaga
(2017) presented the following adaptation of the Lawley-Hotelling
trace test statistic: \[\mathrm{LH}^{(w)} =
\text{Trace}(H_w E_w^{-1})\] where \(\displaystyle{H_w = |w| \int_{\mathcal{T}}
\left[\bar{X}_w(t) - \bar{X}(t) \right] \left[\bar{X}_w(t) - \bar{X}(t)
\right]^\top \ \text{d}t + |w^\mathsf{c}| \int_{\mathcal{T}}
\left[\bar{X}_{w^\mathsf{c}}(t) - \bar{X}(t) \right]
\left[\bar{X}_{w^\mathsf{c}}(t) - \bar{X}(t) \right]^\top \
\text{d}t}\)
and \(\displaystyle{E_w = \sum_{j, s_j \in w}
\int_{\mathcal{T}} \left[X_j(t) - \bar{X}_w(t) \right] \left[X_j(t) -
\bar{X}_w(t) \right]^\top \text{d}t + \sum_{j, s_j \in w^\mathsf{c}}
\int_{\mathcal{T}} \left[X_j(t) - \bar{X}_{w^\mathsf{c}}(t) \right]
\left[X_j(t) - \bar{X}_{w^\mathsf{c}}(t) \right]^\top
\text{d}t}\) with \(\displaystyle{\bar{X}_{g}(t) = \dfrac{1}{|g|}
\sum_{i, s_i \in g} X_i(t)}\) the empirical estimators of \(\mu_g(t)\) for \(g \in \{w, w^\mathsf{c}\}\) and \(\displaystyle{\bar{X}(t) = \dfrac{1}{n} \sum_{i =
1}^n X_i(t)}\) the empirical estimator of \(\mu_S(t)\).
Then Frévent et al. (2021b) considered
\(\mathrm{LH}^{(w)}\) as a
concentration index and proposed the parametric multivariate functional
spatial scan statistic (MPFSS): \(\Lambda_{\text{MPFSS}} = \underset{w \in
\mathcal{W}}{\max} \ \mathrm{LH}^{(w)}\).
In fact Górecki and Smaga (2017) proposed four test statistics using the matrices \(H_w\) and \(E_w\) to compare the mean functions in \(w\) and \(w^\mathsf{c}\): (1) the Lawley–Hotelling trace test statistic \(\text{LH}^{(w)} = \text{Trace}(H_w E_w^{-1})\), (2) the Pillai trace test statistic \(\text{P}^{(w)} = \text{Trace}(H_w(H_w+E_w)^{-1})\), (3) the Roy’s largest root test statistic \(\text{R}^{(w)} = \lambda_{\text{max}}(H_w E_w^{-1})\) where \(\lambda_{\text{max}}(H_w E_w^{-1})\) is the maximum eigenvalue of \(H_w E_w^{-1}\) and (4) the Wilks lambda test statistic \(\text{W}^{(w)} = \dfrac{\text{det}(E_w)}{\text{det}(H_w+E_w)}\).
Thus each of these quantities (or the opposite for the Wilks lambda test statistic) can be considered as a concentration index and maximized over \(\mathcal{W}\) which results in the following parametric multivariate functional spatial scan statistics:
\[\Lambda_{\text{LH}} = \underset{w \in \mathcal{W}}{\max} \ \text{LH}^{(w)}, \ \ \Lambda_{\text{P}} = \underset{w \in \mathcal{W}}{\max} \ \text{P}^{(w)}, \ \ \Lambda_{\text{R}} = \underset{w \in \mathcal{W}}{\max} \ \text{R}^{(w)}, \ \ \Lambda_{\text{W}} = \underset{w \in \mathcal{W}}{\min} \ \text{W}^{(w)}.\]
These four approaches are implemented in the package HDSpatialScan.
Frévent et al. (2021b) proposed a
distribution-free spatial scan statistic for multivariate functional
data which is the counterpart of the distribution-free spatial scan
statistic for univariate functional data developed by Frévent et al. (2021a). They supposed that for
each time \(t\), \(\mathbb{V}[X_i(t)] = \Sigma(t,t)\) for all
\(i \in \left[\!\left[ 1 ; n
\right]\!\right]\), where \(\Sigma\) is a \(p
\times p\) covariance matrix function.
Thus, as previously, in the context of cluster detection, the null
hypothesis \(\mathcal{H}_0\) can be
defined as follows: \(\mathcal{H}_0: \forall w
\in \mathcal{W}, \ \mu_{w} = \mu_{w^\mathsf{c}} = \mu_S\), where
\(\mu_w\), \(\mu_{w^\mathsf{c}}\) and \(\mu_S\) stand for the mean functions in
\(w\), outside \(w\) and over \(S\), respectively. And the alternative
hypothesis \(\mathcal{H}_1^{(w)}\)
associated with a potential cluster \(w\) can be defined as follows: \(\mathcal{H}_1^{(w)}: \mu_{w} \neq
\mu_{w^\mathsf{c}}\). Next, Qiu, Chen, and
Zhang (2021) proposed to compare the mean function \(\mu_w\) in \(w\) with the mean function \(\mu_{w^\mathsf{c}}\) in \(w^\mathsf{c}\) by using the following
statistic: \[T_{n,\text{max}}^{(w)} =
\underset{t \in \mathcal{T}}{\sup} \ T_n(t)^{(w)}\] where \(T_n(t)\) is a pointwise statistic defined
by the Hotelling \(T^2\)-test statistic
\[T_n(t)^{(w)} = \dfrac{|w|
|w^\mathsf{c}|}{n} \left(\bar{X}_w(t) - \bar{X}_{w^\mathsf{c}}(t)
\right)^\top \hat{\Sigma}(t,t)^{-1} \left(\bar{X}_w(t) -
\bar{X}_{w^\mathsf{c}}(t) \right).\] \(\bar{X}_w(t)\) and \(\bar{X}_{w^\mathsf{c}}(t)\) are the
empirical estimators of the mean functions defined previously, and
\(\displaystyle{\hat{\Sigma}(s,t) =
\dfrac{1}{n-2} \left[ \sum_{i, s_i \in w} \left(X_i(s) - \bar{X}_w(s)
\right) \left(X_i(t) - \bar{X}_w(t) \right)^\top + \sum_{i, s_i \in
w^\mathsf{c}} \left(X_i(s) - \bar{X}_{w^\mathsf{c}}(s) \right)
\left(X_i(t) - \bar{X}_{w^\mathsf{c}}(t) \right)^\top \right]}\)
is the pooled sample covariance matrix function.
Then Frévent et al. (2021b) proposed to
use \(T_{n,\text{max}}^{(w)}\) as a
concentration index and to maximize it over the set of potential
clusters \(\mathcal{W}\): the
multivariate distribution-free functional spatial scan statistic
(MDFFSS) is \(\Lambda_{\text{MDFFSS}} =
\underset{w \in \mathcal{W}}{\max}
\ T_{n,\text{max}}^{(w)}\).
Finally Frévent et al. (2021b) also
proposed to consider as a pointwise test statistic the multivariate
extension of the Wilcoxon rank-sum test statistic developed by Oja and Randles (2004) and detailed in
Section 2.1. They defined the pointwise
multivariate ranks as \(\displaystyle{R_i(t) =
\dfrac{1}{n} \sum_{j=1}^{n} \text{sgn}(A_X(t)(X_i(t) -
X_j(t)))}\) where the pointwise transformation matrix \(A_X(t)\) is so that \(\displaystyle{\dfrac{p}{n}\sum_{i=1}^{n} R_i(t)
R_i(t)^\top = \dfrac{1}{n} \sum_{i=1}^{n} R_i(t)^\top R_i(t)
I_p}\), and the sgn function is the same as in Section 2.1.
Then for each time \(t\), the pointwise
multivariate extension of the Wilcoxon rank-sum test statistic is
defined as \(\displaystyle{W(t)^{(w)} =
\dfrac{pn}{\sum_{i=1}^{n} R_i(t)^\top R_i(t)} \left[ \ |w| \
||\bar{R}_w(t) ||_2^2 + |w^\mathsf{c}| \ ||\bar{R}_{w^\mathsf{c}}(t)
||_2^2 \ \right]}\) where
\(\displaystyle{\bar{R}_{g}(t) =
\dfrac{1}{|g|} \sum_{i, s_i \in g} R_i(t) \ (g \in \{ w, w^\mathsf{c}
\})}\) .
In the context of cluster detection, the null hypothesis is defined as
\(\mathcal{H}_0\): \(\forall w \in \mathcal{W}, \ \forall t \in
\mathcal{T}, \ F_{w,t} = F_{w^\mathsf{c},t}\) where \(F_{w,t}\) and \(F_{w^\mathsf{c},t}\) correspond
respectively to the cumulative distribution functions of \(X(t)\) in \(w\) and outside \(w\). The alternative hypothesis \(\mathcal{H}_1^{(w)}\) associated with a
potential cluster \(w\) is \(\mathcal{H}_1^{(w)}\): \(\exists t \in \mathcal{T}, \ F_{w,t}(x) =
F_{w^\mathsf{c},t}(x-\Delta_t)\), \(\Delta_t \neq 0\).
Finally Frévent et al. (2021b) proposed to
globalize the information over the time with the quantity \(W^{(w)} = \underset{t \in \mathcal{T}}{\sup} \
W(t)^{(w)}\) and to use it as a concentration index to be
maximized over the set of potential clusters \(\mathcal{W}\). The multivariate rank-based
functional spatial scan statistic (MRBFSS) is then \(\Lambda_{\text{MRBFSS}} = \underset{w \in
\mathcal{W}}{\max} \ W^{(w)}.\)
Once the most likely cluster (MLC) is detected, its statistical significance must be evaluated. The distribution of the scan statistic \(\mathcal{S}\) (\(\mathcal{S}\) = \(\lambda_{\text{MG}}\), \(\lambda_{\text{MNP}}\), \(\Lambda_{\text{PFSS}}\), \(\Lambda_{\text{NPFSS}}\), \(\Lambda_{\text{DFFSS}}\), \(\Lambda_{\text{URBFSS}}\), \(\Lambda_{\text{MPFSS}}\), \(\Lambda_{\text{MDFFSS}}\) or \(\Lambda_{\text{MRBFSS}}\)) is intractable under \(\mathcal{H}_0\) due to the overlapping nature of \(\mathcal{W}\). Then we choose to obtain a large set of simulated datasets by randomly permuting the observations \(X_i\) in the spatial locations. This technique was already used in spatial scan statistics (Martin Kulldorff, Huang, and Konty 2009; Cucala et al. 2017).
Let \(M\) denote the number of
random permutations of the original dataset and \(\mathcal{S}^{(1)},\dots,\mathcal{S}^{(M)}\)
be the observed scan statistics on the simulated datasets. According to
Dwass (1957) the p-value for \(\mathcal{S}\) observed in the real data is
estimated by \[\hat{p} = \dfrac{1 +
\sum_{m=1}^M \mathbb{1}_{\mathcal{S}^{(m)} \ge
\mathcal{S}}}{M+1}.\] Finally, the MLC is considered to be
statistically significant if the associated \(\hat{p}\) is less than the type I
error.
According to Cucala et al. (2019) the MNP method tends to present a better power and higher true positive rates for non-Gaussian data than the MG one. Although the false positive rates are often higher for this approach than the MG one, it remains moderate. The same conclusions are true for the UG (Martin Kulldorff, Huang, and Konty 2009) and the UNP (Jung and Cho 2015) which are their particular counterparts in the case of a single variable. In the functional framework, the approaches that present the best results are the DFFSS and the URBFSS in the univariate context and the MDFFSS and the MRBFSS in the multivariate one (Frévent et al. 2021a, 2021b). The URBFSS and the MRBFSS tend to show higher powers and higher true positive rates although they detect more false positives than the DFFSS and the MDFFSS respectively. Table 1 summarizes the methods and their performances. The symbols \(\checkmark\) and X indicate respectively a high and a low performance on the criterion. If there is no symbol it means that for this criterion the approach offers medium performances. The terminology “localized clusters in time” in the functional cases refers to aggregations of sites that take higher or lower values for the process only in a small interval of time (an interval of five days over a study period of one month for example). Table 2 gives an idea of the computation time of the different scanning methods proposed by the package. It should be noted that the computation time of the different spatial scan statistics methods is dependent on the size of the datasets, and in particular a function of the number of sites, the number of observation times and the number of variables considered.
| Gaussian distribution | Non-Gaussian distribution | ||||||
| Method | Power | True positive rate | False positive rate | Power | True positive rate | False positive rate | |
| Univariate data | Univariate data | ||||||
| UG\(^a\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | X | X | \(\checkmark\) | |
| UNP\(^a\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |||
| Multivariate data | Multivariate data | ||||||
| MG\(^b\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | X | X | \(\checkmark\) | |
| MNP\(^b\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |||
| Functional univariate data | Functional univariate data | ||||||
| Non localized clusters in time | Non localized clusters in time | ||||||
| PFSS\(^c\) | \(\checkmark\) | X | \(\checkmark\) | ||||
| DFFSS\(^d\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |
| NPFSS\(^e\) | \(\checkmark\) | \(\checkmark\) | |||||
| URBFSS\(^f\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |||
| Localized clusters in time | Localized clusters in time | ||||||
| PFSS\(^c\) | X | X | X | X | |||
| DFFSS\(^d\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |
| NPFSS\(^e\) | X | X | |||||
| URBFSS\(^f\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |
| Functional multivariate data | Functional multivariate data | ||||||
| Non localized clusters in time | Non localized clusters in time | ||||||
| MPFSS\(^c\) | \(\checkmark\) | X | X | \(\checkmark\) | |||
| MDFFSS\(^d\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |||
| NPFSS\(^e\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | ||||
| MRBFSS\(^f\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |||
| Localized clusters in time | Localized clusters in time | ||||||
| MPFSS\(^c\) | X | X | X | X | \(\checkmark\) | ||
| MDFFSS\(^d\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | |||
| NPFSS\(^e\) | X | X | |||||
| MRBFSS\(^f\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
\(^a\) The Univariate Gaussian (UG)
and the Univariate Nonparametric (UNP) spatial scan statistics
\(^b\) The Multivariate Gaussian (MG)
and the Multivariate Nonparametric (MNP) spatial scan statistics
\(^c\) The Parametric Functional (PFSS)
and the Multivariate Parametric Functional (MPFSS) spatial scan
statistics
\(^d\) The Distribution-free Functional
(DFFSS) and the Multivariate Distribution-free Functional (MDFFSS)
spatial Scan Statistics
\(^e\) The Nonparametric Functional
(NPFSS) spatial Scan Statistic
\(^f\) The Univariate Rank-based
Functional (URBFSS) and the Multivariate Rank-based Functional (MRBFSS)
spatial Scan Statistics
| Method | Computation time (in s) | |||
| Mean | Standard deviation | Minimum | Maximum | |
| Univariate data | ||||
| UG\(^a\) | 4.31 | 0.18 | 4.01 | 4.77 |
| UNP\(^a\) | 3.07 | 0.12 | 2.77 | 3.5 |
| Multivariate data | ||||
| MG\(^b\) | 253.04 | 32.93 | 231.34 | 310.91 |
| MNP\(^b\) | 14.6 | 1.44 | 13.48 | 17.77 |
| Functional univariate data | ||||
| PFSS\(^c\) | 113.51 | 8.08 | 109.47 | 132.58 |
| DFFSS\(^d\) | 55.99 | 4.93 | 52.52 | 66.76 |
| NPFSS\(^e\) | 12.91 | 1.23 | 11.74 | 15.6 |
| URBFSS\(^f\) | 17.93 | 1.37 | 16.93 | 22.53 |
| Functional multivariate data | ||||
| MPFSS\(^c\) | 72.52 | 9.21 | 65.23 | 100.17 |
| MDFFSS\(^d\) | 182.95 | 21.54 | 166.27 | 227.84 |
| NPFSS\(^e\) | 22.95 | 1.7 | 21.77 | 28 |
| MRBFSS\(^f\) | 244.04 | 21.83 | 234.05 | 305.56 |
\(^a\) The Univariate Gaussian (UG)
and the Univariate Nonparametric (UNP) spatial scan statistics
\(^b\) The Multivariate Gaussian (MG)
and the Multivariate Nonparametric (MNP) spatial scan statistics
\(^c\) The Parametric Functional (PFSS)
and the Multivariate Parametric Functional (MPFSS) spatial scan
statistics
\(^d\) The Distribution-free Functional
(DFFSS) and the Multivariate Distribution-free Functional (MDFFSS)
spatial Scan Statistics
\(^e\) The Nonparametric Functional
(NPFSS) spatial Scan Statistic
\(^f\) The Univariate Rank-based
Functional (URBFSS) and the Multivariate Rank-based Functional (MRBFSS)
spatial Scan Statistics
The package HDSpatialScan provides a function
SpatialScan to compute all the spatial scan statistics. The
user chooses the method to apply by specifying the method
argument: "MG" and "MNP" apply respectively
the parametric and nonparametric spatial scan statistics approaches on
multivariate data. Their univariate counterparts (when \(p=1\)) can be computed with
"UG" and "UNP" respectively. Then
"PFSS", "DFFSS" and "URBFSS"
apply the parametric, the distribution-free and the new rank-based
functional approaches on univariate functional data, and
"MPFSS", "MDFFSS", and "MRBFSS"
are their multivariate counterparts. Finally "NPFSS"
applies the nonparametric spatial scan statistic for functional data
developed by Smida et al. (2022) on both
univariate and multivariate functional data.
Depending on the type of approach (univariate, multivariate,
functional univariate or functional multivariate), the data must be
formatted in a specific way. For univariate approaches, the data must be
a vector in which each element corresponds to a site. If the data is
individual and many individuals share the same site, the data can remain
in an individual format with one element of the vector per individual.
Then for real-valued multivariate methods or functional univariate
methods, the data must be a matrix in which each row corresponds to a
site (or an individual) and each column corresponds to a variable or an
observation time in the functional framework. For multivariate
functional methods the data must be a list in which each element is a
matrix corresponding to a site (or an individual). In the matrices, the
rows correspond to the variables and the columns to the observation
times. Note that the observation times must be the same for each site or
individual and they must be equally spaced for the methods
"NPFSS", "PFSS" and "MPFSS".
However if it is not the case of the raw data, they can be easily
transformed by smoothing the data (J. Ramsay and
Silverman 2005), by using for example the R package fda (J. O. Ramsay, Graves, and Hooker 2020).
The most important parameter is the method argument
which has already been presented previously and allows to choose the
spatial scan statistics to be applied. Note that you can choose one or
more methods. Supplying "MPFSS" automatically computes the
four strategies for the multivariate parametric functional spatial scan
statistic (the Lawley-Hotelling trace (LH), the Roy’s largest root (R),
the Pillai’s trace (P) and the Wilks’ lambda (W)). If you only want the
Lawley-Hotelling trace for example, you can simply supply
"MPFSS-LH". Although the Lawley-Hotelling trace test is the
most used statistic (Oja and Randles
2004), it should be noted that all these methods usually provide
very similar results.
The other arguments are data, sites_coord,
system, mini, maxi,
type_minimaxi, mini_post,
maxi_post, type_minimaxi_post,
sites_areas, MC, typeI,
nbCPU, variable_names and times.
Note that nbCPU will be ignored for the methods
"UG" and "UNP", variable_names is
ignored for the univariate and univariate functional scan statistics and
times is ignored for non-functional scan statistics.
The argument data, is the data vector, matrix or list on
which the approaches must be applied. MC and
typeI correspond respectively to the number of permutations
of the data while computing the statistical significance of the clusters
and the type I error i.e. a cluster is declared significant if its
estimated p-value is below this threshold.
The arguments sites_coord and system are
respectively a matrix of two columns corresponding to the coordinates of
each site or individual, and to the system of coordinates (“Euclidean”
or “WGS84”).
The sites_areas argument is optional and corresponds to the
areas of the sites (or the site of each individual if the data is
individual).
The argument nbCPU permits to do parallelization and the
arguments mini, maxi,
type_minimaxi, mini_post,
maxi_post, type_minimaxi_post are described
further below.
variable_names is simply the names of the variables (in the
same order as in the data) for multivariate or multivariate functional
scan statistics and times corresponds to the times of
observation, they must be numeric.
The clusters are computed automatically as circular clusters, so we
need to define a minimum and a maximum size for these clusters. That is
what we call “a priori filtering” and this allows to control
the computation time. Three types of a priori filtering are
possible through the argument type_minimaxi:
"sites/indiv" (the filtering is applied on the number of
sites or individuals in the potential clusters, it is the default
value), "area" (it is applied on the area of the clusters
and is available only if sites_areas is provided), or
"radius" (the radius of the clusters).
The arguments mini and maxi are then
respectively the minimum number of sites/individuals, or the minimal
area or radius and the maximum number of sites/individuals, or the
maximal area or radius. For the radius it is specified in km if
system is "WGS84" or in the user units if
system is "Euclidean".
It should be noted that this filtering can bias the p-values obtained
for the clusters. In order to perform a correct statistical inference,
Martin Kulldorff and Nagarwalla (1995)
recommended to consider a maximum size of half the study region. Thus
the default setting is to consider potential clusters comprising at
least one site and at most 50% of the sites (Equation @ref(eq:cluster)).
If you want to select clusters according to size (number of sites or
individuals), area or radius, it is better to select them a
posteriori among the detected clusters and if you really want to
decrease the computation time we recommend to increase the number of CPU
(with the argument nb_CPU). Changing the default settings
can allow the user to investigate whether there appear to be clusters in
a relatively quick first step, although the inference is biased, before
applying the scan procedure with the default settings for the a
priori filtering (50% of the studied region).
Sometimes after that the p-value of each potential cluster has been
computing, the user may want to retrieve only the significant clusters
that satisfy a certain size, area, or radius criteria. That is what we
call a posteriori filtering. The corresponding arguments are
mini_post, maxi_post and
type_minimaxi_post and their definitions are the same as
mini, maxi and type_minimaxi. If
the user only wants to obtain clusters meeting size criteria, this a
posteriori approach must be prioritized over the a priori
approach which gives biased results and must therefore be used with
great care.
The function SpatialScan returns a list of object of
class "ResScanOutput" which is composed of many elements.
The element sites_clusters is a list in which each element
corresponds to a significant cluster and contains the index of the sites
(or the individuals) included in this cluster. The clusters are listed
in their order of detection. The secondary clusters are defined
according to Martin Kulldorff (1997): they
correspond to potential clusters that also present large values for the
concentration index. Their p-values are calculated as if they were the
most likely cluster themselves which is a bit conservative since the
secondary clusters are compared with the most likely cluster of the
permutations (Martin Kulldorff 1997).
Finally, only clusters that are significant at the typeI
threshold and that do not overlap with a more likely cluster are
returned, and pval_clusters corresponds to the associated
p-values. The element centres_clusters corresponds to the
coordinates of the centres of each detected cluster and
radius_clusters is the radius of the clusters in km if
system is “WGS84” or in the user units otherwise.
areas_clusters corresponds to the areas of the clusters (in
the same units as sites_areas). Finally the system of
coordinates, the coordinates of the sites, the data and the name of the
scan procedure are recalled respectively in the elements
system, sites_coord, data and
method.
Depending on the type of the method (univariate, multivariate,
univariate functional or multivariate functional) the objects of class
"ResScanOutput" are also of class
"ResScanOutputUni", "ResScanOutputMulti",
"ResScanOutputUniFunct" or
"ResScanOutputMultiFunct". The objects of class
"ResScanOutputMulti" and
"ResScanOutputMultiFunct" also include the element
variable_names, and the objects of class
"ResScanOutputUniFunct" and
"ResScanOutputMultiFunct" include the element
time.
It is possible to plot the detected clusters by using the classical
plot function. Depending on the type
parameter, the package HDSpatialScan provides three different
types of plot.
The first one, "map", allows the user to plot a map of
the sites and draws the circles corresponding to the circular clusters.
The second one, "map2", plots the clusters in colors. For
these two types of plot the argument spobject which is the
spatial object corresponding to the sites, must be provided. If you do
not have this object you can use the third type "schema"
which simply draws a schema of the sites and the clusters, with the
argument system_conv which allows to correctly project the
coordinates. It must be entered as in the PROJ documentation (PROJ contributors 2021).
One may also want to get some features of one’s clusters.
The function summary allows to get a summary of the
clusters, either the mean and the standard deviation of each of the
variables (if many) if the argument type_summ is
"param", or the 25th percentiles, the medians and the 75th
percentiles if the argument type_summ is
"nparam". This function also provides the p-values, the
radius and the area if available (only if sites_areas is
provided) for each cluster detected.
Other interesting functions are plotCurves that allows to
display cluster curves (only in the functional case), and
plotSummary which displays the average (if
type = "mean") or the median (if
type = "median") curves in the clusters, outside and the
global mean or median curves in the functional case. For the
multivariate non-functional framework it displays a spider chart of
means or medians for each variable inside the cluster, outside, or in
all the area. Note that all these functions take an argument
only.MLC which allows to only plot or summarize the most
likely cluster (by setting only.MLC = TRUE). Finally the
print function shows the scan procedure used as well as the
number of clusters detected and their p-value.
To show the simplicity of use of the package, we will apply the different approaches on the environmental data provided in the package. It should be noted that the codes presented in this section represent a total computation time of about one hour on a regular laptop, using 7 cores.
We considered data provided by the French national air quality
forecasting platform PREV’AIR which is available in the package
HDSpatialScan. This lattice data consists in the daily
concentrations (from May 1, 2020 to June 25, 2020) in \(\mu g.m^{-3}\) of four pollutants for each
of the 169 cantons (administrative subdivisions of France) of
the Nord-Pas-de-Calais (a region in northern France)
characterized by spatial polygons and located by their center of gravity
\(s_1, \dots, s_{169}\): nitrogen
dioxide (\(\text{NO}_2\)), ozone (\(\text{O}_3\)) and fine particles \(\text{PM}_{10}\) and \(\text{PM}_{2.5}\) corresponding
respectively to particles whose diameter is less than \(10 \mu m\) and \(2.5 \mu m\). The package
HDSpatialScan provides the full data: fmulti_data
but also some reduced data for the univariate functional case which
consists in considering only the \(\text{NO}_2\) concentrations
(funi_data), and for the multivariate non-functional
framework (multi_data) which corresponds to the temporal
mean concentrations of the four pollutants over the study period.
The first step is to load the data:
library(HDSpatialScan)
data("map_sites")
data("multi_data")
data("funi_data")
data("fmulti_data")
The second step is to visualize the pollutants daily concentration curves in each canton and the spatial distributions of the temporal mean concentrations for each pollutant over the studied time period (Figures 5 and 6). This step allows us to see if sites seem to aggregate and therefore if launching a cluster detection is relevant, and if a temporal variation of the concentrations is visible, in which case a functional method will be more relevant than a multivariate approach summarizing each curve by its mean.
The maps in Figure 6 show a spatial heterogeneity of the average concentration for each pollutant. Thus spatial scan statistics seem to be suitable to highlight the presence of cantons-level spatial clusters of pollutants concentrations. Moreover since the curves in Figure 5 show a marked temporal variability during the period from May 1, 2020 to June 25, 2020 a functional approach is more appropriate. However for sake of completeness we will also perform a multivariate spatial scan statistic approach anyway. Since small clusters of pollution are more relevant for interpretation because the sources of the pollutants are very localized, we will consider an a posteriori filtering of maximum radius equal to 10 km.
First we will investigate a multivariate spatial scan statistic. In
this example the temporal component of the multivariate functional data
was suppressed by averaging the components over the time and we looked
for spatial clusters of the combination of the different air pollutants.
This will pick up areas of multiple exposure to pollutants or, on the
contrary, areas with little pollution. We first checked the normality of
each variable, which has been done by using a histogram and a qqplot.
Since the distribution of the pollutants temporal mean concentrations is
non-normal we decide to apply the MNP scan procedure. Here the system of
coordinates is “WGS84”, it must be filled with the argument
system. As explained in Section 3.1, (Martin
Kulldorff and Nagarwalla 1995) recommended to consider a maximum
size of half the study region for the potential clusters so we use this
a priori filtering with the parameters mini,
maxi and type_minimaxi: the potential clusters
are circular and they contain between 1 and 50% of the sites. Then as
noticed in Section 4.1, we will apply
an a posteriori filtering of maximum radius equal to 10 km
(arguments mini_post, maxi_post and
type_minimaxi_post). Here we only want to consider the
significant clusters at the 5% threshold. Thus we leave the
typeI parameter at its default value (0.05). However it
should be noted that it is possible to obtain all the clusters (the MLC
and the secondary clusters (Martin Kulldorff
1997)) by setting the typeI value at 1.
library(sp)
coords <- coordinates(map_sites)
res_mnp <- SpatialScan(method = "MNP", data = multi_data, sites_coord = coords,
+ system = "WGS84", mini = 1, maxi = nrow(coords)/2, type_minimaxi = "sites/indiv",
+ mini_post = 0, maxi_post = 10, type_minimaxi_post = "radius",
+ nbCPU = 7, MC = 99, variable_names = c("NO2", "O3", "PM10", "PM2.5"))$MNP
Once the scan procedure is completed, the plot function can be used.
For brevity, we only focus on the MLC and for the sake of completeness
we will show the use of the three possible visualizations of the
clusters. Since we have a spatial object map_sites we can
use the types "map" and "map2". However for
sake of completeness we also show the use of "schema" which
allows to display the clusters otherwise (Figure 7). For the latter, since the system of the
coordinates is “WGS84”, the plot function requires to complete the
parameter system_conv which allows to correctly project the
points. Here we choose the EPSG code 2154 corresponding to the Lambert
93 projection since the data is located in metropolitan France.
plot(x = res_mnp, type = "map", spobject = map_sites, only.MLC = TRUE)
plot(x = res_mnp, type = "map2", spobject = map_sites, only.MLC = TRUE)
plot(x = res_mnp, type = "schema", system_conv = "+init=epsg:2154", only.MLC = TRUE)
plot with the types “map” (panel a),
“map2” (panel b) and “schema” (panel c) for
the MNP scan procedure computed with the function
SpatialScan on the average concentrations of NO2, O3, PM10 and PM2.5. The first two types of
visualization require a spatial object corresponding to the spatial
units (here the 169 cantons of the Nord-Pas-de-Calais
region of northern France). The visualization option
“schema” allows in the other case to draw a schema of the
spatial units and the detected clusters. Here we focus the cluster
visualization on the most likely cluster which is located in the urban
area of Lille.
Finally users may want to get some summarized characteristics, such
as the quantiles of the variables. This can be achieved by using the
function summary with the argument type_summ
equal to "nparam" (for the quantiles):
summary(res_mnp, type_summ = "nparam", only.MLC = TRUE)
## $basic_summary
## Cluster 1
## p-value 0.001
## Radius 9.999
##
## $complete_summary
## Overall Inside cluster 1 Outside cluster 1
## Number of sites 169.000 12.000 157.000
## Q25 NO2 8.673 11.327 8.635
## Median NO2 9.183 11.721 9.075
## Q75 NO2 9.848 12.382 9.692
## Q25 O3 66.778 67.527 66.721
## Median O3 67.895 67.609 67.961
## Q75 O3 68.564 67.922 68.658
## Q25 PM10 16.397 17.483 16.205
## Median PM10 16.970 17.877 16.933
## Q75 PM10 17.372 17.962 17.266
## Q25 PM2.5 9.132 10.584 9.113
## Median PM2.5 9.833 10.678 9.790
## Q75 PM2.5 10.213 10.919 10.107
The user can also use the function plotSummary to
display the spider chart corresponding to the detected cluster
(Figure 8).
plotSummary(res_mnp, type = "median", only.MLC = TRUE)
plotSummary for the most likely cluster detected
on the average concentrations of \(\text{NO}_2\), \(\text{O}_3\), \(\text{PM}_{10}\) and \(\text{PM}_{2.5}\) by the MNP scan procedure
in northern France. The most likely cluster is characterized by larger
median concentrations of \(\text{NO}_2\), \(\text{PM}_{10}\) and \(\text{PM}_{2.5}\) than outside the
cluster.The MLC is located in the area of Lille. The summary and Figure 8 show that it is a cluster of overpollution (except for the pollutant \(\text{O}_3\)). This cluster is especially characterized by high concentrations of \(\text{NO}_2\) and \(\text{PM}_{2.5}\) which indicates pollution from road traffic and from the residential sector (auxiliary heating in particular). As the adverse health effects of air pollution (and their potential synergistic effect) are well established, such a result could inform local stakeholders about immediate interventions around the area of Lille to reduce the air pollution levels.
We have obtained some first results however the curves on Figure 5 present a marked temporal variability during the study period. Thus it could be interesting to apply functional spatial scan statistics.
Here we only consider the pollutant \(\text{NO}_2\). Applying a spatial scan
statistic for univariate functional data will thus allow to highlight
areas where the \(\text{NO}_2\)
concentration curves are abnormally high or, on the contrary abnormally
low. We choose to use the URBFSS scan procedure since it often presents
higher powers and true positive rates than the other univariate
functional methods as its multivariate counterpart MRBFSS (Frévent et al. 2021b). As mentioned in
Section 4.2 we decide to use the set of
potential clusters a priori in the Equation @ref(eq:cluster)
which corresponds to the recommended approach of (Martin Kulldorff and Nagarwalla 1995), and to
the default values of the parameters mini,
maxi and type_minimaxi in the scan functions.
We also set a maximum radius equal to 10 km a posteriori.
res_urbfss <- SpatialScan(method = "URBFSS", data = funi_data, sites_coord = coords,
+ system = "WGS84", mini = 1, maxi = nrow(coords)/2, type_minimaxi = "sites/indiv",
+ mini_post = 0, maxi_post = 10, type_minimaxi_post = "radius",
+ nbCPU = 7, MC = 99)$URBFSS
plot(res_urbfss, type = "map2", spobject = map_sites, only.MLC = TRUE)
plot with type = "map2" for
the URBFSS scan procedure computed using the concentrations of \(\text{NO}_2\) in the 169 cantons
of the Nord-Pas-de-Calais region of northern France over the
period from May 1, 2020 to June 25, 2020. Here we focus the cluster
visualization on the most likely cluster which is located in the urban
area of Lille.Again the MLC is located in the area of Lille (Figure 9).
For functional data another function is provided to give some
characteristics of the clusters: we can visualize the curves in the
cluster by adding the curve of the global median with the function
plotCurves. The function plotSummary allows to
visualize the median curves inside and outside the cluster (Figure 10): this is a cluster of overexposure to \(\text{NO}_2\), which indicates
traffic-related air pollution. Since exposure to pollution impacts
health negatively, these results can be used to intervene to reduce air
pollution.
plotCurves(res_urbfss, add_median = TRUE, only.MLC = TRUE)
plotSummary(res_urbfss, type = "median", only.MLC = TRUE)
plotCurves (left panel)
and plotSummary (right panel). The most likely cluster is
characterized by high concentration curves with a median concentration
curve higher than outside the cluster.
Now we consider the four pollutants together. To detect spatial clusters of the combination of the four pollutants considering all available information on the time period, we apply a spatial scan statistic for multivariate functional data. It will identify geographical areas in which one or more of the pollutant concentration curves are abnormally high or abnormally low. For the same reason that we have previously chosen to apply the URBFSS scan procedure, we use the MRBFSS in this context, with the same restrictions a priori and a posteriori as for the MNP and the URBFSS scan approaches.
res_mrbfss <- SpatialScan(method = "MRBFSS", data = fmulti_data, sites_coord = coords,
+ system = "WGS84", mini = 1, maxi = nrow(coords)/2, type_minimaxi = "sites/indiv",
+ mini_post = 0, maxi_post = 10, type_minimaxi_post = "radius",
+ nbCPU = 7, MC = 99, variable_names = c("NO2", "O3", "PM10", "PM2.5"))$MRBFSS
plot(res_mrbfss, type = "map2", spobject = map_sites, only.MLC = TRUE)
plot with
type = "map2" for the MRBFSS scan procedure computed using
the concentrations of \(\text{NO}_2\),
\(\text{O}_3\), \(\text{PM}_{10}\) and \(\text{PM}_{2.5}\) in the 169
cantons of the Nord-Pas-de-Calais region of northern
France over the period from May 1, 2020 to June 25, 2020. The most
likely cluster is located in the urban area of Lille.The detected cluster is exactly the same as before and is therefore located in the urban area of Lille (Figure 11).
Again we will display the curves in the cluster by adding the curve of the global median (Figure 12), as well as the median curves inside and outside the cluster which show that this is a cluster of high concentrations of \(\text{NO}_2\), \(\text{PM}_{10}\) and \(\text{PM}_{2.5}\) (Figure 13). As mentioned in Section 4.2, in environmental science it is well-known that \(\text{NO}_2\) and \(\text{PM}_{2.5}\) are more frequent in urban areas due to road traffic and population density so this is consistent with the cluster observed here. As the adverse health effects of air pollution and the combined effects of air pollutants are well established, this result could enable interventions by local authorities around the Lille area to reduce air pollution.
plotCurves(res_mrbfss, add_median = TRUE, only.MLC = TRUE)
plotSummary(res_mrbfss, type = "median", only.MLC = TRUE)
plotCurves. The most likely cluster is
characterized by high concentration curves of \(\text{NO}_2\), \(\text{PM}_{10}\) and \(\text{PM}_{2.5}\).plotSummary. The most likely cluster is
characterized by median concentration curves of \(\text{NO}_2\), \(\text{PM}_{10}\) and \(\text{PM}_{2.5}\) higher than outside the
cluster.In this article we presented the HDSpatialScan package. It
makes it very easy to apply the existing scan statistics developed for
multivariate data or functional data (univariate or multivariate), and
the new rank-based scan statistic for univariate functional data
presented in the Section 2. The potential
clusters considered are of variable size and circular. In further
updates of the package HDSpatialScan other shapes of scanning
window such as elliptical or rectangular shapes will be implemented. Our
package also allows to easily plot and summarize the detected clusters.
Then examples of applications of the functions of the package have been
shown. HDSpatialScan presents the advantage that all the scan
procedures are applied using the same function SpatialScan
and it uses the classical R functions plot,
print and summary which makes it very quick to
get started.