This tutorial is going to go over how to join point data to polygon data in R and show you how to aggregate the data that you are joining. When finished you will have the sum, average and count of the points within each polygon. This data can then be saved as a shapefile for use in GIS software such as ArcMap or QGIS or you can plot it in R using the basic plot functions of a package like choroplethr.
setwd('F:/Portfolio/Tutorials/R_Spatial_Join') # set up workin gdirectory with shapefiles
#packages I need for sure
library(rgdal) # used to load the shapefiles
library(dplyr) # because I wouldn't leave home without it
library(RColorBrewer) # color pallets
library(ggmap) # used to plot at the end
missGrid <- readOGR(".","Mississauga_Grid_f")
bus <- readOGR(".","Businesses_salesonly")
# quick plot to make sure the data looks ok and has the points in the right place
plot(bus, add = TRUE)
So now that all of the data is loaded we are going to use the over function. Over is part of the sp library and a great resource for information on how over works can be found at https://cran.r-project.org/web/packages/sp/vignettes/over.pdf.
So the way I think about over is that it takes in two pieces of spatial data and puts one on top of the other. For my example I am using a point and a polygon file. My goal is to overly the points on top of the polygons. Typically, more than one point will fall into a polygon so I want to aggregate this data. I will take the sum, mean and a count of the points in each polygon.
The first step is to add unique identifiers to each of the spatial data frames we just loaded. It is possible you already had unique ids in the data you loaded and that’s fine to use. The over function then overlays the polygon file on top of my point file. This might seem counter intuitive but I used this order for aggregating. Once we know which polygon each business belongs to we can use the polygons to aggregate.
I use dplyr to do all my data munging in R. I found it very easy to learn and very intuitive. Although my friends seem to disagree I find it reads like the English language. The results of the over function isn’t actually a spatial data frame, its just a regular data frame. That’s fine, we can aggregate using it either way. I add back the bus_ids to the table using a mutate. If you take a look at the busmis table you will see that it has the original Id that was in the shapefile, id_grid (the grid id which that business sits in) and id_bus (the id of the business). It doesn’t have any other data. In order to get back to our business data where the interesting information is stored I use a left join on the original spatial dataframe. You can see here that I use the @data slot reference. Since a spatial data frame is so complicated it actually has multiple slots to store data. The @data slot has all of the information stored in the spatial data frame, in a regular data frame format.
The next step is to aggregate the data. Since each business has been assigned to a polygon we can sum the sales for each polygon, calculate the mean value and calculate the count, or how many points were in each polygon. To do this I first group the data using group_by and the id_grid, then I use the summarise feature out of dplyr.
Now that all of the data has been aggregated we can join it back to the original grid spatial dataframe, again using the @data slot. The result of this is that each of the polygons now has its aggregated sales volume.
# Set a unique identifier for both of the data frames
missGrid@data <- mutate(missGrid@data, id_grid = as.numeric(rownames(missGrid@data)))
bus@data <- mutate(bus@data, id_bus = as.numeric(rownames(bus@data)))
# The businesses get the value of the grid they are on top of
busmis <- over(bus, missGrid)
# the order didn't change so re add id_bus to the new table
busmis <- mutate(busmis, id_bus = as.numeric(rownames(busmis)))
#Now join each original business to its grid location
busmis <- left_join(bus@data, busmis, by = c("id_bus" = "id_bus"))
## Now we can aggregate the data
busmisa <- busmis %>% group_by(id_grid) %>%
summarise(avgSales = mean(SALES_VOL), sumSales = sum(SALES_VOL), nSales = n()) %>%
## Now you want to join it back to the grid data for mapping
# we are joining it straight to the missGrid spatial data frame
missGrid@data <- left_join(missGrid@data, busmisa, by = c("id_grid" = "id_grid"))
Since all of this data is joined to a spatial data frame it can easily be plotted. Let’s have a quick look at the results of this join.
cols <- brewer.pal(n = 4, name = "Greys")
# wow that shit was intense, but in the end the cut didnt work with na values
# so I just threw in a dplyr filter to get that data out and figure out our scales
lcols <- cut(filter(missGrid@data, !is.na(nSales))$sumSales,
breaks = quantile(filter(missGrid@data, !is.na(nSales))$sumSales),
labels = cols)
plot(missGrid, col = as.character(lcols))