Abstract
This paper introduces a spatial point density algorithm designed to be explainable, meaningful, and efficient. Originally designed for military applications, this technique applies to any spatial point process where there is a desire to clearly understand the measurement of density and maintain fidelity of the point locations. Typical spatial density plotting algorithms, such as kernel density estimation, implement some type of smoothing function that often results in a density value that is difficult to interpret. The purpose of the visualization method in this paper is to understand spatial point activity density with precision and meaning. The temporal tendency of the point process, an extension of the point density methodology, is also discussed and displayed. Applications include visualization and measurement of any type of spatial point process. Visualization techniques integrate ggmap, with examples drawn from San Diego crime data.

The pointdensityP package is an example of a tailored package that was created to visualize and quantify spatiotemporal point data (Evangelista and Beskow 2018). The innovation in the presented methods lies in the implementation, simplification, and general improvements over other methods. Although designed for military analysis, the presented analysis and visualization methods work for any discrete point process that contains a spatial dimension. The point density method in this paper calculates event frequency intensity, or density, within the neighborhood of a given point, accounting for geospatial projection. Additionally, if the point process includes a temporal dimension, the temporal tendency can also be calculated. The temporal tendency is the average age of events within the neighborhood of each point, providing perspective on density trends over time.
The purpose of any type of map projection is simple: to understand a phenomenon within the context of our physical environment. The impetus for the work in this paper was the difficult-to-explain output of common spatial density or "hot spot" visualization techniques. Many of these spatial density approaches use a methodology that would be appropriate for a continuous-valued phenomenon but does not apply well to discrete phenomena. Many existing density visualization techniques apply a smoothing function that depicts activity where there should be none, essentially spreading the density across the area of interest without spatial specificity. Furthermore, many density approximation techniques, such as kernel density estimation, produce an output value that is not easily interpreted. All too often the output simply provides a relative comparison, with colors representing values that indicate a higher or lower density without providing meaningful quantification of the measured phenomenon.
Although the claimed merits of the methods in this paper include simplicity and ease of understanding, two aspects of this algorithm may not be immediately apparent without detailed description. The first involves the method of discretization and geospatial projection used to manage and measure the density. The second involves the data management necessary to implement this discretization and efficiently calculate the density. Two data management practices will be discussed: the original use of hash data structures and the more recent use of matrix-based data structures.
The pointdensity() algorithm applies elementary geometry with a geospatial projection in order to build the point density measurements. Figure 2 shows how the neighborhood radius, \(r\), is projected across a square grid. Geographers refer to this as a cylindrical or Mercator projection, where longitudinal lines appear to remain equidistant. The Mercator projection is used here for the sake of illustration; the algorithm itself uses a spherical projection and great circle distances to achieve maximal accuracy. As previously mentioned, the algorithm applies a simple binning rule that significantly reduces computations. Density is only collected at the intersections of the grid lines shown in the figure, referred to as grid points. Assume that the center of the circle represents the grid point closest to an event or binned collection of events. The density of every grid point within \(r\) will increase by the amount of density associated with the center grid point. The algorithm iterates north and south within \(r\) along the latitudes of the grid and calculates the tolerance, \(t\), of longitudinal grid lines that are within \(r\). The spherical Pythagorean theorem is used to find \(t\), providing an accurate binning tolerance irrespective of location on the globe (Veljan 2018). This binning tolerance ensures that only grid points within \(r\) increase in density.
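For reference, the spherical Pythagorean theorem states that for a right spherical triangle with legs \(a\) and \(b\) and hypotenuse \(c\), \(\cos c = \cos a \cos b\). Taking the hypotenuse to be the radius \(r\) and one leg to be the latitude offset \(\Delta\phi\) from the center grid point, the tolerance follows directly:

\[ \cos r = \cos(\Delta\phi)\cos t \quad \Longrightarrow \quad t = \arccos\!\left(\frac{\cos r}{\cos(\Delta\phi)}\right), \]

where all quantities are angular (great circle) distances; \(t\) is subsequently converted to degrees of longitude at the given latitude.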
Figure 3 shows a toy example of the point density method examining only three points. This image shows points A, B, and C, and the radius is four grid steps. Assume that these three points are near the equator, where the Mercator projection is a reasonable representation of true distance to scale. This figure shows the final density count for every point within the neighborhood radius of the three events.
The original hash-based data structure approach is shown with pseudo-code in Algorithm 1. The interested reader can find the implementation of the hash-based algorithm in pointdensityP version 0.2.1. Discussing the original use of hash data structures for the pointdensity() algorithm provides two insights: first, the hash-based approach offers an accessible explanation of the pointdensity() computational approach; second, comparing the hash-based method to the matrix-based method illustrates the significant difference in computing time between hash-based and matrix-based data retrieval. Hash data structures, or associative arrays, are ubiquitous in programming but are considered a relatively late arrival to R. However, many have pointed out that R's native environment data structures are inherently hash data structures and can be leveraged as such. Brown (2013) wrote an R package specifically designed to "fully implement" hash data structures in R. The use of hash data structures for the pointdensity() algorithm reduced unneeded computations and stored geographic data only within the neighborhood radius of event points. While the use of hash data structures improved the computational complexity from \(O(Gm)\) to \(O(n(2r)^2)\), the use of matrices and carefully indexed data proved to be an even faster solution. Both computing methods are explained in this paper.
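As a brief aside, the following minimal sketch shows how an R environment can serve as a hash table. It is a generic illustration of the idea, not the package's actual storage code, and the key format is hypothetical:

## environments hash keys to values with near constant-time lookup
density_hash <- new.env(hash = TRUE)

## store a density value keyed by a (hypothetical) grid point identifier
assign("32.70_-117.15", 5, envir = density_hash)

## retrieve it later; exists() guards against missing keys
if (exists("32.70_-117.15", envir = density_hash)) {
    d <- get("32.70_-117.15", envir = density_hash)
}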
Use of the hash-based data structure achieved the desired results, reducing computational complexity while returning an adequate approximation of density. However, the speed of the computation was not satisfactory for larger datasets. While retaining the binned density concept, the algorithm was reduced to one loop over the \(n\) event points using matrix and array facilities. This latter approach will be referred to as the matrix-based approach. The implementation of the matrix-based approach can be found in pointdensityP version 0.3.4.
The general process of the pointdensity() algorithm reduces a list of \(n\) points to a list of \(m\) grid points on a mesh. The calculation of density and temporal tendency remains relative to the mesh. Once the overall pointdensity() algorithm is complete, it is necessary to assign the density and temporal tendency back to the original list of data. The hash-based algorithm creates intuitive data structures and data retrieval to relate the list of \(n\) points to the list of \(m\) grid points. Unfortunately, the retrieval of information from the hash tables proved increasingly time consuming as the datasets grew. In order to overcome this problem while continuing to manage the requirement to relate the \(n\) original points to the \(m\) grid points, a careful sort and index process, using matrix-based facilities and the data.table package, replaced the hash-based storage and retrieval method (Dowle and Srinivasan 2017). A brief explanation of the improved computing approach follows.
The original list of \(n\) points, sorted by latitude and longitude, reduces to \(m\) points once every point is rounded to the nearest point on the mesh. All of the aggregation and sorting leverage functions from the data.table package. Each of the \(m\) points on the mesh carries an initial density of one or greater, which is retained throughout the algorithm. After the final density for each point is calculated, this initial density is used to re-expand the list to the original size of \(n\) points, in the original order, to return the final density count and temporal tendency.
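A minimal sketch of this reduction step follows, assuming a data.table named dt with lat, lon, and numeric date columns; the column names lat_r, lon_r, and date_sum are illustrative, while yy_count mirrors the field described later:

library(data.table)

dt <- data.table(lat  = c(32.710, 32.712, 33.050),
                 lon  = c(-117.150, -117.151, -117.290),
                 date = c(15000, 15010, 15100))
grid_size <- 0.01    ## mesh spacing in degrees latitude

## snap each event to the nearest mesh point
dt[, `:=`(lat_r = round(lat / grid_size) * grid_size,
          lon_r = round(lon / grid_size) * grid_size)]

## aggregate: each unique mesh point keeps its initial density (yy_count)
## and the sum of temporal values for later averaging
mesh <- dt[, .(yy_count = .N, date_sum = sum(date)), by = .(lat_r, lon_r)]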
Each of the \(m\) points is processed through a density calculation function. The function to calculate the local density surrounding an event point is named calc_density() in the pointdensityP package. The function starts by enumerating all latitudes within the radius of the event point, whose coordinates are annotated as lati and loni. These latitudes are stored in a vector:
lat.vec <- seq(lati - radius * grid_size, lati + radius * grid_size, grid_size)
The variable radius represents the radius measured in grid steps, and grid_size represents the size of the grid measured in degrees latitude. lati is the latitude of the grid point located closest to the event point. lat.vec is now an array of each latitude contained in the neighborhood. For each latitude, there is a longitudinal distance that represents the tolerance of the radius. This tolerance is computed using the spherical Pythagorean theorem, as previously discussed. lat.vec.t will contain the longitudinal tolerance for each latitude:
lat.vec.t <- acos(cos(radius * grid_size) / cos(lat.vec - lati))  ## spherical Pythagorean theorem
lat.vec.t <- lat.vec.t / cos(lat.vec * 2 * pi / 360)              ## convert to degrees longitude at each latitude
lat.vec.t <- round(lat.vec.t / grid_size, 0) * grid_size          ## snap to the nearest grid line
The final line of code above rounds the tolerance to the nearest longitude grid line on the mesh. The calculation of lon.vec occurs in the same manner as lat.vec, and lon.vec is then further expanded to a matrix, lon.mat. This matrix contains the longitudinal value of every point within a square mesh of (2 * radius + 1)^2 points, with the event point at the center. After subtracting loni from this matrix, the absolute value of the matrix contains the distance between the longitude of the event point and the longitude of every point in the mesh. Subtracting these absolute distances from lat.vec.t then returns a matrix in which every point outside of the radius appears as a negative value. After setting all negative elements to zero and all positive values to 1, the Hadamard product of this binary matrix and the original matrix of longitudinal values contained in lon.mat returns a matrix that contains a longitudinal value for every point within the radius and a value of 0 for every grid point outside of the radius. Combining these non-zero longitudinal values with the latitude vector returns a list of points within the radius of the event grid point. The final step of the local density calculation involves assigning the temporal value, or the sum of temporal values if more than one event rounds to the event grid point, and assigning the original density of the event point to every grid point in the neighborhood. These steps distribute the density and the temporal value of the event point across the neighborhood, completing the calc_density() function and creating a list of grid points prepared for aggregation without processing a for-loop in R.
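A condensed sketch of this masking step follows; lati, loni, and the uniform tolerance vector are illustrative stand-ins, since the package computes lat.vec.t per latitude as shown above:

radius <- 2; grid_size <- 0.1    ## radius in grid steps, grid size in degrees
lati <- 32.7; loni <- -117.1     ## hypothetical event grid point

## square mesh of longitudes, one row per latitude in the neighborhood
lon.vec <- seq(loni - radius * grid_size, loni + radius * grid_size, grid_size)
lon.mat <- matrix(lon.vec, nrow = 2 * radius + 1, ncol = 2 * radius + 1, byrow = TRUE)

## stand-in tolerance per latitude; the package derives this from the
## spherical Pythagorean theorem calculation shown earlier
lat.vec.t <- rep(radius * grid_size, 2 * radius + 1)

## negative entries mark points outside the radius; convert to a 0/1 mask
mask <- (lat.vec.t - abs(lon.mat - loni)) >= 0

## Hadamard (element-wise) product keeps only in-radius longitudes
in.radius <- lon.mat * mask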
As each of the \(m\) points is processed through calc_density(), the list of returned points is added to a data.table object, which is then aggregated, summing both the density and temporal value for each unique grid point. After processing each of the \(m\) points and aggregating the returned results, every grid point with a positive density exists in the data.table object. The grid points associated with one of the original \(n\) points are retained and re-sorted into their original order, and all other points are discarded. The initial density of every grid point is then added as an additional column to the data.table, which is then expanded to length \(n\) with the following command:
nfinal <- final.1[rep(seq(1, nrow(final.1)), final.1$yy_count)]
The data.table final.1 contains the grid point location, density, and sum of temporal values. The field yy_count contains the number of original points that were associated with the event's grid point location. After this expansion, a sorted list of \(n\) events exists that includes the density and temporal sum of every original point. The temporal sum is then divided by the density. The pointdensity() algorithm finally returns every original point location, a density value, and the temporal tendency of the point.
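The re-expansion relies on indexing with rep() and a count vector. The following toy illustration uses the same idiom with fabricated values; the column names count and sumdate are stand-ins for the density and temporal sum described above:

library(data.table)

## two mesh points, the first of which absorbed three original events
final.1 <- data.table(lat = c(32.70, 33.05), lon = c(-117.15, -117.29),
                      count = c(12, 4), sumdate = c(45000, 15100),
                      yy_count = c(3, 1))

## repeat each row yy_count times to recover the original n rows
nfinal <- final.1[rep(seq(1, nrow(final.1)), final.1$yy_count)]

## temporal tendency: average event date per neighborhood
nfinal[, dateavg := sumdate / count]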
Figure 4 displays the calculated density for the San Diego crime data (San Diego Regional Data Library 2015). Plotting only points at the locations where crimes occurred preserves spatial specificity, and the coloring provides an explainable measure of the density. Figure 4 results from the following commands:
SD_density <- pointdensity(df = SD_sub, lat_col = "lat", lon_col = "lon",
date_col = "date", grid_size = 0.1, radius = 1)
SD_density$count[SD_density$count > 10000] <- 10000
## creates discriminating color scale
map_base + geom_point(aes(x = lon, y = lat, colour = count), shape = 16, size = 2,
data = SD_density) + scale_colour_gradient(low = "green", high = "red")
The difference in speed between the hash-based approach and the newer matrix-based approach was significant. The matrix-based approach eliminates the need for information retrieval from large hash tables, the most expensive aspect of the hash-based approach. Table 1 compares pointdensityP version 0.2.1 (hash-based) against version 0.3.2 (matrix-based).
| data records | 10 | \(10^2\) | \(10^3\) | \(10^4\) | \(10^5\) | \(10^6\) |
|---|---|---|---|---|---|---|
| hash-based time (s) | 0.06 | 0.14 | 1.19 | 10.60 | 265.65 | 857.15 |
| matrix-based time (s) | 0.03 | 0.15 | 1.19 | 7.97 | 33.35 | 74.88 |
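Timings of this sort can be reproduced with system.time(); a sketch using the San Diego call shown earlier (results will of course vary with hardware, dataset, and package version):

system.time(pointdensity(df = SD_sub, lat_col = "lat", lon_col = "lon",
                         date_col = "date", grid_size = 0.1, radius = 1))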
A byproduct of the original pointdensity() algorithm was the concept of computing the temporal tendency for the events in the neighborhood of every event. Given a set of \(n\) events with a spatial and temporal dimension, it is possible to calculate an average event date for the entire dataset or any subset of the data. This is an elementary calculation; however, when combined with the neighborhood aggregation approach used in the pointdensity() algorithm, it is possible to expose previously undetected trends and patterns.
Figure 5 shows a toy example of two time series plots that span an equivalent period. The plot on the left will clearly have a smaller (older) temporal tendency due to the heavier density of events in the earlier periods. This is admittedly a simplistic measure for detecting differences in trends; however, it has proven to be a reliable technique in practice.
The temporal tendency represents the average age of all events within the neighborhood. It also serves as a naive measure of the relative temporal trend, increasing or decreasing, in the neighborhood. A neighborhood with an increasing trend will have a more recent temporal tendency, since a higher density of events occurred recently. A neighborhood with a decreasing trend will have an older temporal tendency, representing a higher density of events earlier in the time period.
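Formally, for a grid point \(g\) whose neighborhood \(N(g)\) contains events with dates \(t_i\), the temporal tendency is simply the mean event date:

\[ \bar{t}(g) = \frac{1}{|N(g)|} \sum_{i \in N(g)} t_i, \]

where \(|N(g)|\) is the neighborhood density. Larger values of \(\bar{t}(g)\) correspond to more recent activity.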
The spatial patterns that emerge from the temporal tendency projection provide the most compelling aspect of the temporal tendency calculation. Some tuning of the color scale is necessary to create visual discrimination. The following lines of code create the underlying temporal tendency plot and histograms shown in Figure 6:
SD_temp_tend <- SD_density[SD_density$dateavg > 0]
SD_temp_tend$dateavg[SD_temp_tend$dateavg < 14711] <- 14711
SD_temp_tend$dateavg[SD_temp_tend$dateavg > 14794] <- 14794
map_base + geom_point(aes(x = lon, y = lat, colour = dateavg), shape = 16, size = 2,
data = SD_temp_tend) + scale_colour_gradient(low = "green", high = "red")
Histograms result from the following:
x <- as.Date(SD$date)
hist(x, "weeks", format = "%d %b %y", freq = TRUE)  ## all of San Diego
mid_city <- subset(SD, lat > 32.69 & lon > -117.14 & lat < 32.79 & lon < -117.08)
x <- as.Date(mid_city$date)
hist(x, "weeks", format = "%d %b %y", freq = TRUE)  ## Mid-City subset
encinitas <- subset(SD, lat > 33 & lon > -117.32 & lat < 33.09 & lon < -117.27)
x <- as.Date(encinitas$date)
hist(x, "weeks", format = "%d %b %y", freq = TRUE)  ## Encinitas subset
Figure 6 shows the results of temporal tendency projection for the San Diego crime data. Regions of disparate activity become apparent with this projection, providing an opportunity to detect areas that are “heating up” compared to areas that are “cooling down”. The overall San Diego crime dataset shows an aggregate decreasing trend for 2007 - 2012. Box 1, Encinitas, shows a slightly increasing trend. Mid-City is a region with one of the strongest decreasing trends.
Simple linear regression is a common method to detect trends. The following code creates a linear model and summary of the model’s utility for any of the histograms discussed above:
res <- hist(x, "weeks", format = "%d %b %y", freq = TRUE)
xres <- res$breaks[1:(length(res$breaks) - 1)]  ## left endpoint of each weekly bin
summary(lm(res$counts ~ xres))                  ## model utility test for a linear trend
The simple linear regression estimate for the slope coefficient was significant for each of the three regions explored. The map and histograms in Figure 6 highlight the two sub-regions analyzed, Encinitas and Mid-City, as well as the San Diego super-set region that includes all of the data. The trends for each of these regions are apparent from the histograms, and the model utility test results shown in Table 2 provide statistically significant evidence that the slope coefficients are not zero, indicating that a trend exists. This test provides both evidence of the existence of a trend and a quantification of the expected rate of change. The San Diego crime rate in its entirety dropped by about one crime every two days during this period. Encinitas, lit up in red on the map, shows a more recent temporal tendency than crime across the entire San Diego region, indicating that crime in Encinitas is increasing relative to the region as a whole. The Encinitas histogram shows evidence of a slight increase, and the model utility test for the slope coefficient provides evidence that the slope is greater than zero. On the opposite end of the spectrum, Mid-City's crime rate decreased by about one crime every ten days. The green coloring on the map clearly highlights an older temporal tendency in this region, which is reinforced by the histogram and regression analysis.
| | Estimate | Std. Error | t statistic | Pr(>\|t\|) |
|---|---|---|---|---|
| San Diego Total | -0.5077 | 0.0204 | -24.89 | <2e-16 |
| Encinitas | 0.0023 | 0.0008 | 2.88 | 0.00435 |
| Mid-City | -0.0958 | 0.0037 | -25.81 | <2e-16 |
Note that the temporal tendency calculation is sensitive in areas of low density. In the map-based plot, a region with only a handful of events can appear alongside a region with far higher density, even though its temporal tendency rests on much less evidence. For this reason, it is often advisable to use the temporal tendency visualization alongside the point density visualization, and possibly to restrict the temporal tendency to areas that meet a minimum density threshold.
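A minimal way to impose such a threshold on the output described above, where the cutoff of 100 events is arbitrary and should be tuned to the data:

## retain only points whose neighborhood density meets a minimum threshold
SD_temp_tend <- SD_density[SD_density$count >= 100 & SD_density$dateavg > 0]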
Both the temporal tendency projection and the point density projection help illustrate regions of disparate activity. These are regions where event behavior differs significantly, oftentimes warranting investigation into underlying causes. The temporal tendency analysis provides a useful perspective on a commonly elusive dimension of spatio-temporal data: change. The most important insights involving spatio-temporal phenomena commonly relate to change over time.
Future directions for this work include the calculation, exploration, and visualization of other summary statistics that describe spatio-temporal behavior. For example, estimates for the variance and skew of the temporal behavior extend readily from the aggregation of squared and cubed temporal values. Analysis of these measurements could provide better fidelity on trends. Lastly, automating the detection of boundaries between regions of disparate activity could provide significant contributions in identifying cause and effect in spatio-temporal behavior.
The authors would like to express appreciation for the thoughtful comments and review of this article by the editors of the R Journal and Ian Irmischer. The opinions expressed in this article are the authors’ own and do not reflect the view of the United States Army or the United States government.