Comparing Hot Spots with Square and Hex Aggregation

The other day I was asked how a hot spot analysis would change if it were generated using points aggregated to a square grid versus points aggregated to a hexagon grid. One of the key differences between hexagons and squares is how neighbouring cells relate to the target cell. With a square grid, only the four cells in the cardinal directions share a boundary with the target cell, and their centres are all one cell width away. The next nearest cells touch only at a corner, and their centres are √2 cell widths away. In a hexagon grid, all six surrounding cells share a boundary with the target cell, and in every direction the neighbouring cell is one cell away.

Comparison_Neighbours

Nearest neighbours comparison between hexagon and square grids.
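The neighbour distances described above can be checked with a few lines of plain Python. This is only a sketch: it assumes cell centres spaced one unit apart within each grid, and the names square_offsets, hex_offsets and distances are made up for the illustration.

```python
import math

# Centre offsets of the eight cells surrounding a square cell (unit spacing).
square_offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                  if (dx, dy) != (0, 0)]

# Centre offsets of the six cells surrounding a hexagon whose neighbours'
# centres are one unit away, spaced every 60 degrees.
hex_offsets = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
               for a in range(0, 360, 60)]

def distances(offsets):
    """Return the sorted centre-to-centre distances, rounded for display."""
    return sorted(round(math.hypot(dx, dy), 4) for dx, dy in offsets)

square_dist = distances(square_offsets)
hex_dist = distances(hex_offsets)

print("Square: ", square_dist)   # four neighbours at 1.0, four at 1.4142
print("Hexagon:", hex_dist)      # six neighbours, all at 1.0
```

The distances are expressed in units of each grid's own centre spacing, so the two lists show the pattern of neighbour distances rather than absolute lengths.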

To investigate this a little further using the tools available in ArcGIS Pro I created a point file covering parts of California. I created the points so that some areas would have higher point densities relative to the rest of the study area. I am only going to be investigating point counts to keep this as simple as possible.

I used the Generate Tessellation tool to create both a square and a hexagon grid. I wanted the cells in both grids to have the same area, but when I entered 50 km^2 for both, the resulting polygon areas did not match. The hexagon cells lined up closely with the input area, whereas the square cells were larger than my input. I calculated a quick scaling factor and recreated the square grid. After checking, the areas for the two cell shapes were much closer.

Grid               Area
Hex                49999999.78
Original Square    57735026.83
Modified Square    50000000.95
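The post does not say how the scaling factor was worked out, so the following is only one plausible way to do it. It assumes the areas in the table are in square metres (50 km^2 = 50,000,000 m^2) and that the tool's output area scales linearly with the size that is entered. The variable names are illustrative.

```python
requested_km2 = 50.0                     # size originally entered in the tool
target_area = 50_000_000.0               # 50 km^2 in m^2 (assumed units)
original_square_area = 57_735_026.83     # measured area of the first square grid

# Ratio of the area we wanted to the area we actually got.
scale = target_area / original_square_area

# Size to enter so that the resulting squares come out at about 50 km^2,
# assuming the output area is proportional to the input size.
adjusted_km2 = requested_km2 * scale

print("Scaling factor: {0:.4f}".format(scale))
print("Adjusted input size: {0:.2f} km^2".format(adjusted_km2))
```

The factor comes out at about 0.866, which is very close to √3/2.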

Comparing Aggregation Between Hexagons and Squares

The next step was to use the Summarize Within tool to aggregate the points to the polygon layers. I did this for both the hexagon and square layers. I then used the Python window to check that the aggregation was correct and that I had the same number of points in both files. I loaded the original points and the Summarize Within results into NumPy arrays, then used NumPy functions to get a few summary values. The aggregation was a success: both grids counted all 23,000 points.

import arcpy
import numpy as np


points_in = r"C:\Users\ryan\Documents\Ryan\GIS\HotspotCompare\HotSpotCompare.gdb\CACensus2000_T1"
hex_in = r"C:\Users\ryan\Documents\Ryan\GIS\HotspotCompare\HotSpotCompare.gdb\HexTess_T1_Sum"
square_in = r"C:\Users\ryan\Documents\Ryan\GIS\HotspotCompare\HotSpotCompare.gdb\SquareTess_T1_Sum"

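# Field names are passed as a list; ('CID') is just a string, not a tuple
points_arr = arcpy.da.FeatureClassToNumPyArray(points_in, ['CID'])
points_length = len(points_arr) # number of points in the input file

# FeatureClassToNumPyArray returns a structured array, so pull out the field
# before casting; the np.float alias was removed in NumPy 1.24, so use float
hex_arr = arcpy.da.FeatureClassToNumPyArray(hex_in, ['Point_Count'])
hex_arr = hex_arr['Point_Count'].astype(float)
square_arr = arcpy.da.FeatureClassToNumPyArray(square_in, ['Point_Count'])
square_arr = square_arr['Point_Count'].astype(float)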

hex_count = hex_arr.sum()
square_count = square_arr.sum()

print("There are {0} points in the point file, {1} points in the hexagon aggregation and {2} points in the square aggregation".format(points_length, hex_count, square_count))
#There are 23000 points in the point file, 23000.0 points in the hexagon aggregation and 23000.0 points in the square aggregation
print(hex_count == square_count) # check that we have the same number of values
#True

hex_arr.mean()
# 8.8871715610510051
square_arr.mean()
# 8.9355089355089348
hex_arr.std()
# 31.183935138170636
square_arr.std()
# 31.747261532223131
hex_arr.max()
#798.0
square_arr.max()
# 885.0

 

I also looked at the mean, standard deviation and maximum for both the square and hexagon aggregation files. The values differ a little between the two methods but are close overall. Since the cell area was the same for both methods, the size of the polygons shouldn’t play a large part in the difference. However, where the grid starts and how the points fall into its cells will change the results. Below is an example comparing the two grid overlays in an area with a high density of points.

           Mean     Standard Dev.   Max
Hexagon    8.887    31.183          798
Square     8.935    31.747          885

Comparison_Point
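To illustrate how the grid origin alone can change the counts, here is a small sketch in plain Python. The six points, the cell size and the count_per_cell helper are all hypothetical and have nothing to do with the California data. The same cluster is binned twice into square cells of the same size, once with the grid starting at (0, 0) and once shifted by half a cell.

```python
from collections import Counter
import math

def count_per_cell(points, cell_size, origin=(0.0, 0.0)):
    """Count points per square cell; cells are keyed by (column, row)."""
    ox, oy = origin
    return Counter(
        (math.floor((x - ox) / cell_size), math.floor((y - oy) / cell_size))
        for x, y in points
    )

# Hypothetical cluster of six points straddling the line x = 10.
points = [(9.2, 5.0), (9.6, 5.5), (9.9, 4.8),
          (10.1, 5.2), (10.4, 5.1), (10.8, 4.9)]

aligned = count_per_cell(points, cell_size=10.0)                     # grid starts at (0, 0)
shifted = count_per_cell(points, cell_size=10.0, origin=(5.0, 0.0))  # grid shifted half a cell

print("Aligned grid max count:", max(aligned.values()))  # cluster split across two cells
print("Shifted grid max count:", max(shifted.values()))  # whole cluster in one cell
```

Both grids count all six points, but the maximum per cell doubles when the cluster happens to fall inside a single cell. The same mechanism can contribute to the different maximums seen for the hexagon and square grids.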

The following figure compares the results of the aggregation over the study area. I used graduated colours and Jenks natural breaks to create the interval scheme on the hexagon layer, and then applied the same breaks to the square layer. The comparison should be pretty close to apples to apples, however as noted above the maximum number of points per cell is higher using the square grid.

Comparison_Category

Comparing the Optimized Hot Spot Analysis

The next step was to run the Optimized Hot Spot Analysis tool to compare the results using the square and hexagon grids. I ran the tool using the point counts as the analysis field. The tool adds new fields to the output table, including GiZScore, GiPValue and Gi_Bin. I again used Python and NumPy to generate summary statistics for these fields.

## Looking at hot spot results


hex_hot = r"C:\Users\ryan\Documents\Ryan\GIS\HotspotCompare\HotSpotCompare.gdb\HexTess_T1_Hot"
square_hot = r"C:\Users\ryan\Documents\Ryan\GIS\HotspotCompare\HotSpotCompare.gdb\SquareTess_T1_Hot"

hex_hot_arr = arcpy.da.FeatureClassToNumPyArray(hex_hot, ('GiZScore', 'GiPValue', 'Gi_Bin'))
square_hot_arr = arcpy.da.FeatureClassToNumPyArray(square_hot, ('GiZScore', 'GiPValue', 'Gi_Bin'))

def return_summary(prefix, in_array_field):
  valmin = in_array_field.min()
  valmax = in_array_field.max()
  mean = in_array_field.mean()
  std = in_array_field.std()
  out_string = "{4} -- Min: {0} Max: {1} Mean: {2} Std: {3}".format(valmin, valmax, mean, std, prefix)
  return out_string

print(return_summary('GiZScore Hexbin', hex_hot_arr['GiZScore']))
print(return_summary('GiPValue Hexbin', hex_hot_arr['GiPValue']))
print(return_summary('Gi_Bin Hexbin', hex_hot_arr['Gi_Bin']))

print(return_summary('GiZScore Square', square_hot_arr['GiZScore']))
print(return_summary('GiPValue Square', square_hot_arr['GiPValue']))
print(return_summary('Gi_Bin Square', square_hot_arr['Gi_Bin']))

# GiZScore Hexbin -- Min: -1.3519384875223115 Max: 19.2595936282973 Mean: 0.04199593961031467 Std: 2.4176901664624966
# GiPValue Hexbin -- Min: 0.0 Max: 0.9976995535094042 Mean: 0.41308276187759796 Std: 0.24950084190176278
# Gi_Bin Hexbin -- Min: 0 Max: 3 Mean: 0.13562596599690882 Std: 0.6139149227966471
# GiZScore Square -- Min: -0.9121050657495412 Max: 26.793882363955976 Mean: 0.01605209523504999 Std: 1.9982895651380905
# GiPValue Square -- Min: 0.0 Max: 0.9988708234182447 Mean: 0.5634258781640121 Std: 0.21502490502060692
# Gi_Bin Square -- Min: 0 Max: 3 Mean: 0.06837606837606838 Std: 0.4407231695584317
Hexagon
        GiZScore    GiPValue    Gi_Bin
Min     -1.3519     0.0000      0.0000
Max     19.2596     0.9977      3.0000
Mean     0.0420     0.4131      0.1356
Std      2.4177     0.2495      0.6139

Square
        GiZScore    GiPValue    Gi_Bin
Min     -0.9121     0.0000      0.0000
Max     26.7939     0.9989      3.0000
Mean     0.0161     0.5634      0.0684
Std      1.9983     0.2150      0.4407

% Difference
        GiZScore    GiPValue    Gi_Bin
Min      48.2%      NA          NA
Max     -28.1%      -0.1%       0.0%
Mean    161.6%      -26.7%      98.4%
Std      21.0%      16.0%       39.3%
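The percentages in the % Difference table are consistent with taking the hexagon value relative to the square value, that is (hexagon - square) / square. The sketch below reproduces the GiZScore column from the unrounded values printed earlier; the pct_diff helper is a name made up for this example.

```python
def pct_diff(hexagon, square):
    """Percent difference of the hexagon value relative to the square value."""
    return (hexagon - square) / square * 100.0

# Unrounded GiZScore summary values from the printed output above.
hex_z = {"Min": -1.3519384875223115, "Max": 19.2595936282973,
         "Mean": 0.04199593961031467, "Std": 2.4176901664624966}
square_z = {"Min": -0.9121050657495412, "Max": 26.793882363955976,
            "Mean": 0.01605209523504999, "Std": 1.9982895651380905}

diffs = {name: pct_diff(hex_z[name], square_z[name]) for name in hex_z}

for name, value in diffs.items():
    print("{0}: {1:.1f}%".format(name, value))
```

Because the mean z-scores are both close to zero, a small absolute difference produces a very large percentage, so the 161.6% figure should be read with that in mind.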

The Optimized Hot Spot Analysis tool printed some messages when it ran. The summary statistics it reports are the same as those I calculated above. For the square grid there were 3 locational outliers, whereas the hexagon grid had none. The tool calculates an optimal fixed distance band: 14,143 feet for the square grid, based on peak clustering, and 21,675 feet for the hexagon grid, where no peak was found and the tool fell back to the average distance to 30 nearest neighbours. The larger band means the hexagon analysis considers a larger neighbourhood around each cell. The square grid produced 63 statistically significant output features, based on an FDR correction for multiple testing and spatial dependence, whereas the hexagon grid produced 125.

****
SQUARE

************************** Initial Data Assessment ***************************
Making sure there are enough weighted features for analysis....
- There are 2574 valid input features.
Evaluating the Analysis Field values....
- POINT_COUNT Properties:
Min:         0.0000
        Max:       885.0000
        Mean:        8.9355
        Std. Dev.:  31.7473
Looking for locational outliers....
- There were 3 outlier locations; these will not be used to compute the optimal fixed distance band.
***************************** Scale of Analysis ******************************
Looking for an optimal scale of analysis by assessing the intensity of clustering at increasing distances....
- The optimal fixed distance band is based on peak clustering found at 14143.0678 US_Feet
***************************** Hot Spot Analysis ******************************
Finding statistically significant clusters of high and low POINT_COUNT values....
- There are 63 output features statistically significant based on an FDR correction for multiple testing and spatial dependence.
*********************************** Output ***********************************
Creating output feature class: C:\Users\ryan\Documents\Ryan\GIS\HotspotCompare\HotSpotCompare.gdb\SquareTess_T1_Hot
- Red output features represent hot spots where high POINT_COUNT values cluster.
- Blue output features represent cold spots where low POINT_COUNT values cluster.
Completed script OptimizedHotSpotAnalysis...
****
HEX

************************** Initial Data Assessment ***************************
Making sure there are enough weighted features for analysis....
- There are 2588 valid input features.
Evaluating the Analysis Field values....
- POINT_COUNT Properties:
Min:         0.0000
        Max:       798.0000
        Mean:        8.8872
        Std. Dev.:  31.1839
Looking for locational outliers....
- There were no outlier locations found.
***************************** Scale of Analysis ******************************
Looking for an optimal scale of analysis by assessing the intensity of clustering at increasing distances....
- No optimal distance was found using this method.
Determining an optimal distance using the spatial distribution of features....
- The optimal fixed distance band is based on the average distance to 30 nearest neighbors: 21675.0000 US_Feet
***************************** Hot Spot Analysis ******************************
Finding statistically significant clusters of high and low POINT_COUNT values....
- There are 125 output features statistically significant based on an FDR correction for multiple testing and spatial dependence.
*********************************** Output ***********************************
Creating output feature class: C:\Users\ryan\Documents\Ryan\GIS\HotspotCompare\HotSpotCompare.gdb\HexTess_T1_Hot
- Red output features represent hot spots where high POINT_COUNT values cluster.
- Blue output features represent cold spots where low POINT_COUNT values cluster.
Completed script OptimizedHotSpotAnalysis...

 Here are the graphical results of the comparison.

Comparison_Hot

Overall the tool detected the same major hot spots using both grid methods. As the band size was larger in the hexagon method the hot spots are also larger in the map. Looking at the hexagon method there is a single cell hot spot between the two major ones on the left that was not detected in the square grid, and again I believe this will be related to the band size. 

This practical exploration of hot spot analysis and gridding methods has shown that the results differ depending on whether a hexagon or a square grid is used to aggregate the data. Part of the difference arises because the definition of a neighbour, and the distance to it, changes between the two grids. In addition, the number of points aggregated to each cell changes with the shape and location of the grid; the summary statistics showed that the mean and standard deviation were different. The Optimized Hot Spot Analysis tool calculated the fixed distance band automatically, and the band was larger for the hexagon grid. Although both grids produced similar hot spots, they were larger with the bigger band. The additional fields calculated by the hot spot analysis (GiZScore, GiPValue and Gi_Bin) describe the distribution of points, and additional research into them should reveal more about how the aggregation method changes the interpretation.

Investigating further whether the two sets of results are truly similar would involve learning more about how the hot spot analysis method works. Additionally, to make the comparison of the hot spots themselves closer to apples to apples, both gridding methods should be run using the same fixed distance band.
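As a starting point for that follow-up, the sketch below shows how both grids might be run through the standard Hot Spot Analysis (Getis-Ord Gi*) tool with one shared band. I have not run this. The output names HexTess_T1_HotFixed and SquareTess_T1_HotFixed are hypothetical, the choice of 21,675 feet as the shared band is arbitrary, and the helper functions are made up for the example. It assumes the positional parameter order of arcpy.stats.HotSpots and that the band is given in the units of the data's coordinate system.

```python
def fixed_band_args(in_features, out_features, band, field="Point_Count"):
    """Build the positional arguments for arcpy.stats.HotSpots."""
    return [
        in_features,             # aggregated grid from Summarize Within
        field,                   # analysis field
        out_features,            # output feature class (hypothetical name)
        "FIXED_DISTANCE_BAND",   # conceptualization of spatial relationships
        "EUCLIDEAN_DISTANCE",    # distance method
        "NONE",                  # standardization
        band,                    # shared distance band
        None,                    # self potential field (not used)
        None,                    # weights matrix file (not used)
        "APPLY_FDR",             # FDR correction, as in the optimized tool
    ]

def run_fixed_band(in_features, out_features, band):
    """Run Hot Spot Analysis with a fixed band; requires ArcGIS Pro."""
    import arcpy  # imported here so fixed_band_args can be used without ArcGIS
    return arcpy.stats.HotSpots(*fixed_band_args(in_features, out_features, band))

gdb = r"C:\Users\ryan\Documents\Ryan\GIS\HotspotCompare\HotSpotCompare.gdb"
shared_band = 21675.0  # arbitrary choice: the larger of the two bands found above

jobs = [
    (gdb + r"\HexTess_T1_Sum", gdb + r"\HexTess_T1_HotFixed"),
    (gdb + r"\SquareTess_T1_Sum", gdb + r"\SquareTess_T1_HotFixed"),
]

def main():
    """Run both grids with the same band (call this from the Python window)."""
    for in_fc, out_fc in jobs:
        run_fixed_band(in_fc, out_fc, shared_band)
```

With the band held constant, any remaining difference in the hot spots would come from the cell shape and the placement of the grid rather than from the scale of analysis.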