Creating ‘Donuts’ from a single polygon layer in ArcMap using ArcPy and Pandas

Ah, donuts. Everyone loves delicious fried sweet dough covered in magical toppings and served for breakfast, lunch or supper. Would anyone like a donut now? There is one job where you might love donuts more than most, and no, I am not thinking of detectives. I am thinking of GIS professionals. In the GIS world a donut is simply a polygon with a hole in it. A couple of examples where this might come into play are a lake polygon you want to remove islands from, or a circular building with an interior courtyard.

Creating donuts is fairly straightforward if you have two separate layers. The big polygon, or 'host' polygon, is the polygon you will be subtracting smaller polygons from. The subtractor polygons are the ones being removed from the host. In ArcMap this can be done simply using the Erase Tool (analysis), as long as your host polygons (Input Features) and your subtractor polygons (Erase Features) are in separate layers.
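If your data are already split like this, the whole operation is a single geoprocessing call. Here is a minimal sketch, assuming hypothetical feature classes for lakes and islands (note that the Erase tool requires an Advanced license):

import arcpy

# Hypothetical paths -- substitute your own host and subtractor layers
host = r"C:\data\demo.gdb\lakes"           # Input Features (host polygons)
subtractors = r"C:\data\demo.gdb\islands"  # Erase Features (subtractor polygons)
output = r"C:\data\demo.gdb\lakes_donuts"

# Remove the subtractor geometry from the host polygons
arcpy.Erase_analysis(host, subtractors, output)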

donut_EraseTool

This process becomes a little more complicated if the host and subtractor polygons are all in the same layer. Why would you have both sets of data in the same layer? It could happen when exporting a polygon layer from a CAD file, where all of the donuts might get filled in during the export process. Or if you are making a geological map and there are xenoliths within a host unit, you would want to remove them so you don't have overlapping polygons. There are many ways to end up with both host and subtractor polygons in the same layer. In this blog post I will discuss two methodologies to solve this problem using Python and arcpy, packaged as tools in an ArcGIS ArcMap toolbox.

The Donut Testing Input

To test the donut tools I will be working on, I created a simple input file as seen below. I created a feature class within a file geodatabase because it automatically calculates and updates the SHAPE_Area and SHAPE_Length fields for each polygon in the file, and I will use these fields in some of my analyses. I have also packaged these tools as a toolbox, and you can download the ready-to-use tools here.

donut_TestDataSet

Each of the polygons shown is a solid polygon, so the smaller polygons are simply sitting on top of larger polygons. I tried to create several different examples of potential situations in a donut file. The most basic case is a single polygon completely contained within a polygon. This is the ideal situation to be dealing with and it is the scenario that I have designed my tools to work on. Next there is the possibility of multiple subtractor polygons within a host polygon (top left). A similar but different situation is nested donut holes, or donuts within a donut (starting to get a little bit like Inception) as seen on the bottom right. Another case is nested polygons which are not completely contained (i.e., they overlap outside of the host polygon) as seen in the bottom left and top middle. Polygons which are just touching each other also need to be dealt with so that extra data is not created when it is not required (bottom middle). The most complicated case in my example is seen in the middle of the figure (polygons 2, 4 and 7) where a polygon with a donut also overlaps another polygon. 

Donut Union Tool

The first method uses the Union tool in ArcMap. The description of the Union tool is 'Computes a geometric union of the input features. All features and their attributes will be written to the output feature class.' The picture shown in the help documents looks like this:

donut_UnionText

So any overlapping polygons are split into the individual pieces that comprise the original. This seems like a good start for our donut tool. The only problem is that it ends up creating duplicate polygons. For a single host and a single subtractor polygon, the output file will contain the original host polygon minus the subtractor (successfully a donut!), the original subtractor polygon unchanged, and a new polygon representing the piece removed from the host, which happens to be identical to the original subtractor polygon. So although it has done the operation we wanted, it has created extra data!

Luckily there is another ArcMap tool which can fix this problem quite easily: the Delete Identical tool. It takes an input dataset and deletes records in a feature class or table which have identical values in a list of fields. It is also possible to use the SHAPE field (i.e., it will compare shape geometries). So this tool fixes our problem of additional identical polygons being created.
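As a minimal sketch, assuming a hypothetical feature class that already contains the union duplicates, comparing on the geometry field directly would look like this:

import arcpy

unioned = r"C:\data\demo.gdb\unioned"  # hypothetical feature class with duplicate polygons

# Passing the shape field tells the tool to compare geometries
# rather than attribute values when looking for identical records
arcpy.DeleteIdentical_management(unioned, ["Shape"])

The script at the end of this section takes a slightly different route and compares the SHAPE_Length and SHAPE_Area fields instead.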

donut_UnionTool

So let's have a look at the output created using our test input dataset. In the image below the original dataset labels are in black text and the donut dataset labels in yellow. Looking at our single subtractor, multiple subtractor and nested subtractor polygon examples, they all appear to have worked and created the desired output. The overhanging polygons which started as two polygons have each turned into three: the host, the part of the subtractor that overhangs, and the part of the subtractor contained within the host. This could be undesired behaviour. The nested polygon overhanging another polygon (IDs 2, 4, 7) has behaved similarly, with more polygons created than originally existed. Another thing to note from this example is that not every ID has remained the same. As the new polygons are added they are given a unique ID in sequence, and when the Delete Identical tool runs it is not picky about which ID to keep and which to lose. In addition, the creation of more polygons than we started with has led to more IDs being created.

Overall this methodology works for creating donut polygons. It works best in situations where the subtractor polygon is completely contained in the host polygon, and we have shown that it also works for multiple subtractor polygons and nested subtractor polygons. If the subtractor polygons overhang the host polygons, we begin to create more polygons than we started with. In addition, ID values are not preserved, although attribute table values should be.

The python code I wrote for this tool is below, and it is fairly short and simple.

####
# Donut Tool
# Ryan Ruthart
# ryanruthart.com

# Description
# This tool uses the Union tool to determine overlapping polygons in a layer. Duplicate versions of polygons
# will be created, which are then removed using the Delete Identical tool

######
# Inputs

import arcpy

donutLayer = arcpy.GetParameterAsText(0) # Input file to remove donuts
donutLayerUnion = arcpy.GetParameterAsText(1) # Output file with donuts removed
idField = arcpy.GetParameterAsText(2) # Unique ID field to base the analysis off of -- MUST be unique (not used directly in this script)

arcpy.Union_analysis(donutLayer, donutLayerUnion, "ALL", "", "NO_GAPS") # Use the union tool to create the donuts

# use the Delete Identical tool to remove the duplicates created through the union process
arcpy.DeleteIdentical_management(donutLayerUnion, ['SHAPE_Length', 'SHAPE_Area'])

Donut Spatial Join Tool

I wanted to figure out another way to perform the donut operation. My goal for this second attempt was to preserve the number of polygons in the file: if you started with 20 polygons there should be only 20 polygons in the output. I also wanted to maintain the polygon IDs: if your host polygon ID was 1 and the subtractor ID was 2, the output should preserve these IDs (and the associated data). Finally, I wanted to use arcpy search and update cursors, as well as try numpy and pandas, which are now included with the ArcGIS Python install.

The first step in the process was to create a list of all the polygons which contain other polygons. To calculate this I used a spatial join to connect the input polygon layer to itself using the 'one to many' option. This creates a new feature class with a record for every pair of intersecting polygons. Although this gave me all of the overlapping polygons, some extra data was collected along the way.
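Here is a minimal sketch of that self-join step, assuming a hypothetical feature class of polygons (the full script below writes the result to an in_memory workspace instead):

import arcpy

polys = r"C:\data\demo.gdb\polys"            # hypothetical input polygons
joined = r"C:\data\demo.gdb\polys_selfjoin"  # one output record per intersecting pair

# JOIN_ONE_TO_MANY writes a separate record for every join feature that
# intersects each target feature, producing the TARGET_FID and JOIN_FID
# columns used in the cleanup step described below
arcpy.SpatialJoin_analysis(polys, polys, joined, 'JOIN_ONE_TO_MANY')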

Using the input file I generated, an example of this is polygons 10 and 11. Each polygon overlaps itself, so the rows (10, 10) and (11, 11) are created. Additionally, polygon 10 overlaps 11, creating a new row, and polygon 11 overlaps 10, creating yet another. So from two polygons we now have four rows, and the problem only gets bigger with more overlapping polygons. This is fine as long as we clean the data.

Cleaning the Spatial Join Data

I don't actually need to work with the polygons themselves to clean the data; I just need to know which polygons overlap each other. To get the attribute data into a table I used the arcpy data access function FeatureClassToNumPyArray, then used the pandas DataFrame constructor to turn the result into a dataframe. There are likely other ways to do this, but I really wanted to do it in pandas to get some practice using it. The spatial join created TARGET_FID and JOIN_FID columns. I removed any rows where TARGET_FID = JOIN_FID, as these are self-joins. Then I sorted the TARGET_FID and JOIN_FID values within each row so that the lower number was always in TARGET_FID, which lets me find duplicate pairs like the one described above (10, 11 and 11, 10). I then created a combined column and dropped all duplicate values. Finally I used the pandas groupby function to group the rows by TARGET_FID, which gives me a group of all overlapping polygons, and from each group I created a list of the polygons it contains.
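Here is a minimal standalone sketch of that cleaning logic on a toy pair table (made-up IDs), written with current pandas method names rather than the older API used in the full script below:

import numpy as np
import pandas as pd

# Toy version of the spatial-join table: one row per overlap
pairs = pd.DataFrame({'TARGET_FID': [10, 10, 11, 11],
                      'JOIN_FID':   [10, 11, 10, 11]})

pairs = pairs[pairs['TARGET_FID'] != pairs['JOIN_FID']]  # drop self-joins
# Sort each row so the lower ID is always in TARGET_FID, turning (11, 10) into (10, 11)
pairs[['TARGET_FID', 'JOIN_FID']] = np.sort(pairs[['TARGET_FID', 'JOIN_FID']].values, axis=1)
pairs = pairs.drop_duplicates()  # one row left per overlapping pair

# Group by host ID to list everything each polygon overlaps
groups = pairs.groupby('TARGET_FID')['JOIN_FID'].apply(list)
print(groups)  # 10 -> [11]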

The next step was to figure out which polygon in each group is the largest. For my methodology I simply take the largest polygon (the host) and subtract every other overlapping polygon (the subtractors) from it. To do this I joined the TARGET_FID column back to the original dataset so I could access the SHAPE_Area column. I sorted by size, put the maximum value into a maxValue list, and put every other value (a minimum of one, but possibly more) into a lowValue list. These are parallel lists: the IDs of the host polygons and the IDs of each host's subtractor polygons, with matching indices corresponding to the same group.
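In plain Python the split for a single group looks something like this (IDs and areas are made up for illustration):

# Hypothetical group of overlapping polygon IDs and their SHAPE_Area values
areas = {2: 500.0, 4: 40.0, 7: 120.0}
group = [2, 4, 7]

ordered = sorted(group, key=areas.get, reverse=True)  # largest area first
host, subtractors = ordered[0], ordered[1:]
print(host, subtractors)  # 2 [7, 4]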

The final piece of the puzzle is to loop through these lists and actually subtract the data. First I copy the input layer to a new layer to avoid breaking the input dataset. Then, looping through the maxValue list, I use an arcpy SearchCursor to return the geometry of that row. Next I loop through the corresponding lowValue entries and subtract each of those polygons from the host polygon, using the difference method (available on all arcpy geometry objects). After each of the subtractor polygons has been removed from the host polygon I return the modified geometry, and an update cursor retrieves the host polygon and replaces its geometry with the modified version.
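To see the difference method in isolation, here is a minimal sketch using two throwaway squares built from made-up coordinates:

import arcpy

# Two simple squares; the inner one sits entirely inside the outer one
outer = arcpy.Polygon(arcpy.Array([arcpy.Point(0, 0), arcpy.Point(0, 10),
                                   arcpy.Point(10, 10), arcpy.Point(10, 0)]))
inner = arcpy.Polygon(arcpy.Array([arcpy.Point(4, 4), arcpy.Point(4, 6),
                                   arcpy.Point(6, 6), arcpy.Point(6, 4)]))

donut = outer.difference(inner)  # outer minus inner: a polygon with a hole in it
print(outer.area, donut.area)    # the donut is smaller by the area of the hole

The full script I wrote is listed below.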

####
# Donut Spatial Join Tool
# Ryan Ruthart
# ryanruthart.com

# Description
# This tool creates donuts from a single input polygon layer.
# It uses a spatial join to join the layer to itself to determine which polygons overlap.
# Then it does some data cleaning to remove duplicates.
# Then, out of each group of connected polygons, it determines the largest polygon and
# subtracts each of the smaller polygons contained within that polygon.

import arcpy
import pandas as pd
import numpy as np
import os


os.chdir(r'C:\Users\ryan\Documents\Ryan\GIS\DonutTools')

###########
# Functions

def returnFieldNames(fc):
# Function that returns the field names in a feature class
# The geometry field is dropped (by field type, so 'SHAPE' vs 'Shape' does not matter)
# so the result can be passed to FeatureClassToNumPyArray without any geometry tuples
  fieldnames = [f.name for f in arcpy.ListFields(fc) if f.type != 'Geometry']
  return fieldnames

def returnFeatureRow(fc, fieldNames, sqlQuery):
# Function that returns the feature row matching a SQL query in a feature class
# If more than one record matches, only the last one is returned
  featureRow = None # remains None if no record matches
  with arcpy.da.SearchCursor(fc, fieldNames, sqlQuery) as cursor:
    for row in cursor:
      featureRow = row
  return featureRow

def intersectGeoms(fc, idField, fieldNames, hostID, subtractIDs, sqlQuery):
# Function to subtract one or more polygons from a host polygon
# hostID identifies the large polygon; subtractIDs lists the polygons removed from it

  hostRow = returnFeatureRow(fc, fieldNames, sqlQuery.format(idField, hostID)) # return the host geometry
  intersectGeom = hostRow[-1]
  arcpy.AddMessage(hostRow)
  for i in range(len(subtractIDs)): # loop through each of the polygons to be subtracted
    subtractRow = returnFeatureRow(fc, fieldNames, sqlQuery.format(idField, subtractIDs[i]))
    intersectGeom = intersectGeom.difference(subtractRow[-1]) # subtract the polygon from the running geometry
  return intersectGeom

# Load the input and output layer
inLayer = arcpy.GetParameterAsText(0)
inField = arcpy.GetParameterAsText(1)
outLayer = arcpy.GetParameterAsText(2)
memoryFeature = r"in_memory\memoryJoin" # In-memory workspace to hold the spatial join result
# Note: a very large dataset could run out of memory here

# Perform a spatial join on the input layer to itself so we know which polygons are within each other
arcpy.SpatialJoin_analysis(inLayer, inLayer, memoryFeature,
  'JOIN_ONE_TO_MANY')


# Set up the SQL query to return one record at a time
sqlQuery = "{0} = {1}"

# Get the fieldnames of the input polygon
fieldNamesInLayer = returnFieldNames(inLayer)
# Create a numpy array of the input feature class data table
inLayerdf = arcpy.da.FeatureClassToNumPyArray(inLayer, fieldNamesInLayer)

# Load the field names and turn the data into a NumPy array for the joined table
fieldNamesJoinLayer = returnFieldNames(memoryFeature)
inLayerJoindf = arcpy.da.FeatureClassToNumPyArray(memoryFeature, fieldNamesJoinLayer)

########
# Now we have two dataframes, one with the original table data, and one with the joined table data
# now we can use the workflow I worked out in python and the numpy/pandas packages

jdf = pd.DataFrame(inLayerJoindf) # create a pandas dataframe
jdf = jdf[jdf['TARGET_FID']!=jdf['JOIN_FID']] # Remove polygons joined to themselves
jdfs = jdf[['TARGET_FID', 'JOIN_FID']] # take a subset with only these columns
jdfs = jdfs.apply(np.sort, axis = 1) # sort each row so the lower ID is always in TARGET_FID
jdfs = jdfs.sort(['TARGET_FID', 'JOIN_FID']) # sort the rows (sort_values in newer pandas)
jdfs['COMBINED'] = jdfs['TARGET_FID'].map(str) + '_' + jdfs['JOIN_FID'].map(str) # combine both IDs with a separator so pairs like (1, 234) and (12, 34) stay distinct
jdfs = jdfs.drop_duplicates(['COMBINED']) # drop duplicate rows

arcpy.AddMessage(jdfs)

# Group and create group table

groupeddf = jdfs.groupby('TARGET_FID')

joined = {}
for key in list(sorted(groupeddf.groups.keys())):
  datalist = []
  datalist.append(key)
  for idx in groupeddf.groups[key]:
    datalist.append(jdfs['JOIN_FID'].ix[idx]) # .ix is deprecated in newer pandas; use .loc
  joined[key] = datalist

# extract the largest polygon and the smaller polygons in each group

# Create the lists of max values and other values
# The output of this loop is two lists, maxValue and lowValue
# maxValue is always a single value, it is the maximum value in the group
# lowValue contains all values that are below the maxValue, it can contain 1 or more entries
odf = pd.DataFrame(inLayerdf) # load the original to a dataframe
maxValue = []
lowValue = []
for i in joined:
    tempdf = pd.DataFrame(joined[i])
    tempdf = pd.merge(tempdf, odf, left_on = 0, right_on = inField, how = 'left') # join back to the original table on the user-supplied ID field
    tempdf = tempdf.sort(['SHAPE_Area'], ascending = 0) # largest polygon first (sort_values in newer pandas)
    tempList = tempdf[inField].tolist()
    maxValue.append(tempList[0])
    lowValue.append(tempList[1:])

# create a copy of the input layer to the output layer
arcpy.CopyFeatures_management(inLayer, outLayer)

# This loop goes through the output data set and selects each of the largest objects in a group
# it then goes through each of the contained polygons and subtracts them from the host polygon
# after it has subtracted each writes the new geometry to the output layer
arcpy.AddMessage("UPDATING DATA")
for i in range(len(lowValue)):
    newGeom = intersectGeoms(inLayer, inField, ['SHAPE@'], maxValue[i], lowValue[i], sqlQuery) # get the new geometry
    with arcpy.da.UpdateCursor(outLayer, [inField, 'SHAPE@'], sqlQuery.format(inField, maxValue[i])) as cursor:
        arcpy.AddMessage(sqlQuery.format(inField, maxValue[i]))
        for row in cursor:
            row[-1] = newGeom
            cursor.updateRow(row) # commit the change

Examining the Output

The results from this tool and my test dataset are shown below. They are very different from those of the union approach. As you can see, there are the same number of polygons in my output file as in my input file. The IDs (and associated data) of the modified polygons also match the input data, so the two main goals of this method were a success. Single or multiple subtractor polygons within a host polygon worked as expected. Nested subtractor polygons also worked as we would hope. At first I did not think these would work out, but because a group is formed for each collection of overlapping polygons, they did. Looking at the example of polygons 16, 17 and 18 we can see why: first a group with 16, 17 and 18 is formed, then another group with 17 and 18. From the first group, 16 subtracts 17 and 18; from the second group, 17 subtracts 18. This creates the donut within a donut that we would expect (hope for). Pretty neat! The last case to look at is overhanging polygons. Using the union approach these created three polygons for every pair; using this approach we keep the same number of polygons. What happens is that the subtractor is removed from the host polygon, while the subtractor itself remains intact.

donut_SpatialJoinTool

Overall this approach was a great success. All of the use cases in my test polygon file worked out: single donuts, multiple donuts, nested polygons and overhanging polygons all behaved as we hoped. I also got to play around with arcpy search cursors and do some data munging with numpy and pandas. I hope this tutorial was useful to you and helps you solve the GIS donut problem. Thanks for reading, and feel free to reach out if you have any questions.