Module 3 — Spatial Data Integration

PAF 516 | Community Analytics

M3 Overview & Learning Materials

Spatial Joins, Buffer Analysis & Environmental Justice

Module Overview and Objectives

So far, your economic hardship index is built entirely from census variables — measures of who lives in a place. But neighborhood quality is also shaped by what is near a place. Proximity to environmental hazards, grocery stores, transit stops, or health clinics matters as much as demographic composition. This module teaches you to combine tabular census data with external point data using spatial operations.

The central challenge is that these datasets arrive in different formats, different coordinate reference systems, and different geographic units. You will learn to align them using spatial joins, buffer analysis, and coordinate transformations — the core operations of spatial data integration. Along the way, you will encounter a question every proximity analyst must answer: does it matter whether you draw your buffer at 0.5 miles or 1 mile? The answer is yes — and the reason has a name.

After completing this module, you will be able to:

Explain the role of coordinate reference systems (CRS) and transform data between projections using st_transform()
Perform point-in-polygon spatial joins using st_join() to count point features within census boundaries
Conduct buffer analysis using st_buffer() to measure proximity to spatial features
Recognize the Uncertain Geographic Context Problem (UGCoP) and explain how buffer distance choices affect analytical conclusions
Frame spatial analysis within an environmental justice context, connecting proximity to environmental burdens with socioeconomic hardship
Create an enriched economic hardship index that incorporates spatial access variables alongside census demographics

Lecture

The lecture notes cover coordinate reference systems, spatial joins, buffer analysis, the UGCoP, and the environmental justice framework.

Download the lecture notes: Spatial Data Integration — Lecture Notes (PDF)

Section 1: Coordinate Reference Systems

Before you can combine spatial datasets, they must share the same coordinate reference system (CRS). A CRS defines how locations on the curved Earth are represented on a flat surface. Two layers that cover the same place but use different CRS definitions will not align: sf refuses to join layers whose declared CRS differ, and layers with a missing or mislabeled CRS will produce nonsense results.

Geographic vs. Projected CRS

Geographic CRS (e.g., WGS 84 / EPSG:4326): Uses latitude and longitude in degrees. Good for global data and web mapping. Distances and areas are distorted because a degree does not correspond to a constant ground distance.
Projected CRS (e.g., NAD83 / UTM Zone 12N, EPSG:26912, in meters; or Arizona Central, EPSG:2868, in feet): Projects the globe onto a flat surface with linear distance units. Distances and areas are accurate within the projection’s valid zone. Essential for buffer analysis.

Practical Rule

Always check CRS with st_crs() before any spatial operation. If layers differ, use st_transform() to reproject. For operations requiring accurate distances — buffers, area calculations, nearest-neighbor searches — always project to an appropriate local CRS first.
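The rule above can be sketched in a few lines. This is a minimal example, not the lab code: the facilities object and its coordinates are made up for illustration, and EPSG:26912 is used only because it is the UTM zone named earlier in this section.

```r
library(sf)

# Hypothetical facility locations near Phoenix, stored as longitude/latitude
facilities <- st_as_sf(
  data.frame(id  = c("A", "B"),
             lon = c(-112.07, -111.94),
             lat = c(33.45, 33.42)),
  coords = c("lon", "lat"),
  crs = 4326
)

# Inspect the CRS before any spatial operation
st_crs(facilities)$epsg       # 4326
st_is_longlat(facilities)     # TRUE: coordinates are in degrees

# Reproject to NAD83 / UTM Zone 12N (EPSG:26912), whose units are meters
facilities_utm <- st_transform(facilities, 26912)

st_crs(facilities_utm)$epsg   # 26912
st_is_longlat(facilities_utm) # FALSE: coordinates are now linear units
```

When two layers are involved, st_transform(layer_a, st_crs(layer_b)) is a convenient way to match one layer to the other without typing the EPSG code.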

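Common mistake: Calling st_buffer() with dist = 1000 on data in EPSG:4326 does not reliably give a 1000-meter buffer. When sf uses planar geometry on unprojected data (for example, after sf_use_s2(FALSE)), the distance is read as 1000 degrees, not 1000 meters. At the latitude of Phoenix, 1 degree of longitude is roughly 93 km. Recent versions of sf with s2 enabled read the distance in meters, but the resulting buffer is an approximation. Always project first.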
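The size of a degree of longitude can be checked with a quick calculation. The sketch below assumes a spherical Earth with about 111.32 km per degree at the equator and an approximate Phoenix latitude of 33.45 degrees north. Both values are assumptions introduced here, so treat the result as a rough figure.

```r
# Approximate length of one degree of longitude at a given latitude,
# treating the Earth as a sphere (assumption: 111.32 km per degree at the equator)
lat_phoenix        <- 33.45      # degrees north, approximate
km_per_deg_equator <- 111.32

# Lines of longitude converge toward the poles, so scale by cos(latitude)
km_per_deg_lon <- km_per_deg_equator * cos(lat_phoenix * pi / 180)

round(km_per_deg_lon)            # 93
```

A buffer of even 1 degree would therefore cover most of a state, which is why a distance meant as meters must never be applied to degree coordinates.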

Section 2: Spatial Joins and Overlay Operations

A spatial join attaches attributes from one spatial layer to another based on geographic relationship — analogous to a SQL table join, but using location instead of a shared key column.

Point-in-Polygon Joins

The most common spatial join in community analytics: “How many facilities, events, or resources fall within each census tract or block group?” The st_join() function in the sf package performs this operation:

joined <- st_join(polygons, points, join = st_intersects)

After joining, group by the polygon ID and count matched points to produce a per-area count variable you can add to the economic hardship index.
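A minimal sketch of the join-then-count step follows. The two square polygons, the three points, and the GEOID and facility_id column names are hypothetical stand-ins for real block groups and facilities. One detail matters here: st_join() performs a left join by default, so a polygon with no points still appears once with NA attributes, and counting rows would wrongly give it a count of 1.

```r
library(sf)
library(dplyr)

# Helper that builds a square polygon from its lower-left corner
make_square <- function(xmin, ymin, size) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmin + size, ymin),
    c(xmin + size, ymin + size), c(xmin, ymin + size),
    c(xmin, ymin)
  )))
}

# Two hypothetical block groups in EPSG:26912 (meters)
polygons <- st_sf(
  GEOID = c("bg_1", "bg_2"),
  geometry = st_sfc(make_square(400000, 3700000, 1000),
                    make_square(401000, 3700000, 1000),
                    crs = 26912)
)

# Three hypothetical facilities: two inside bg_1, one outside both polygons
points <- st_sf(
  facility_id = c("f1", "f2", "f3"),
  geometry = st_sfc(st_point(c(400200, 3700200)),
                    st_point(c(400800, 3700700)),
                    st_point(c(405000, 3705000)),
                    crs = 26912)
)

# Left join: one row per polygon-point match, plus one NA row per empty polygon
joined <- st_join(polygons, points, join = st_intersects)

# Count non-missing matches so that empty polygons get 0 rather than 1
facility_counts <- joined %>%
  st_drop_geometry() %>%
  group_by(GEOID) %>%
  summarise(n_facilities = sum(!is.na(facility_id)), .groups = "drop")

facility_counts   # bg_1 has 2 facilities, bg_2 has 0
```

The resulting table can be merged back onto the census data by GEOID with left_join().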

Buffer Analysis

Buffers create a zone of specified distance around a feature. In community analytics, common uses include:

Identifying census block groups within 1 mile of a hazardous facility
Counting grocery stores within 0.5 miles of each neighborhood centroid
Measuring whether a census tract is within a hospital service area

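In R: st_buffer(points, dist = 1609) creates a buffer of roughly 1 mile (1 mile ≈ 1,609 meters) around each point, assuming a projected CRS whose units are meters. In a CRS measured in feet, the same call would produce a 1,609-foot buffer. The choice of distance is consequential and is addressed in Section 3.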
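As a small illustration of the buffer call, the sketch below builds a 1-mile buffer around a single made-up point and checks its area against the area of a circle. The facility location and the variable names are hypothetical. The CRS is assumed to be EPSG:26912, so distances are in meters.

```r
library(sf)

# One hypothetical facility in NAD83 / UTM Zone 12N (EPSG:26912, meters)
facility <- st_sfc(st_point(c(400000, 3700000)), crs = 26912)

# Confirm the data are projected before choosing a distance
st_is_longlat(facility)          # FALSE

# 1 mile expressed in the CRS's linear unit (meters)
one_mile_m <- 1609.344
buffer_1mi <- st_buffer(facility, dist = one_mile_m)

# Sanity check: the buffer is a many-sided polygon, so its area should be
# very close to, and slightly below, the area of a true circle
buffer_area_km2 <- as.numeric(st_area(buffer_1mi)) / 1e6
circle_area_km2 <- pi * one_mile_m^2 / 1e6
c(buffer = buffer_area_km2, circle = circle_area_km2)
```

A check like this catches unit mistakes early: a buffer whose area is off by orders of magnitude usually means the distance was given in the wrong unit.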

Intersection and Clipping

st_intersection() returns only the portions of features that overlap between two layers. This is useful for clipping a statewide dataset to your study area, or computing the fraction of a census tract that falls within a floodplain or service area.
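The fraction-of-area calculation can be sketched as follows. The tract and floodplain are made-up rectangles in EPSG:26912, chosen so the overlap is easy to verify by hand (a 500 m by 2,000 m strip of a 2,000 m by 2,000 m tract). The object and column names are hypothetical.

```r
library(sf)

# Hypothetical tract: a 2 km x 2 km square in EPSG:26912 (meters)
tract <- st_sf(
  GEOID = "tract_1",
  geometry = st_sfc(st_polygon(list(rbind(
    c(400000, 3700000), c(402000, 3700000),
    c(402000, 3702000), c(400000, 3702000),
    c(400000, 3700000)))), crs = 26912)
)

# Hypothetical floodplain covering the western 500 m of the tract
floodplain <- st_sf(
  zone = "flood",
  geometry = st_sfc(st_polygon(list(rbind(
    c(399000, 3699000), c(400500, 3699000),
    c(400500, 3703000), c(399000, 3703000),
    c(399000, 3699000)))), crs = 26912)
)

# Keep only the part of the tract that lies inside the floodplain.
# sf warns that attributes are assumed constant over each geometry;
# that is acceptable here because only the area is used.
overlap <- st_intersection(tract, floodplain)

# Share of the tract's area that falls in the floodplain
frac_in_floodplain <- as.numeric(st_area(overlap)) / as.numeric(st_area(tract))
frac_in_floodplain   # 0.25
```

Area shares like this assume that whatever is being measured is spread evenly across the tract, which is rarely true of population.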

Section 3: The Uncertain Geographic Context Problem (UGCoP)

In Module 2 we introduced the Modifiable Areal Unit Problem (MAUP) — the sensitivity of analysis results to how polygon boundaries are drawn. Buffer analysis introduces an analogous problem specific to proximity-based research: results change depending on the distance threshold you choose to define “near.”

This problem was formally identified and named by geographer Mei-Po Kwan in 2012 as the Uncertain Geographic Context Problem (UGCoP).

What UGCoP Says

Kwan’s argument is that in any proximity-based analysis, the spatial context we assign to individuals or geographic units — the “exposure zone” — is inherently uncertain. The researcher must choose a distance: 0.25 miles? 0.5 miles? 1 mile? 2 miles? There is rarely a principled theoretical basis for this choice, and different choices produce different results.

“The UGCoP arises because the actual geographic context that influences people’s behavior and health is uncertain and cannot be directly observed.” — Kwan (2012)

UGCoP vs. MAUP: Two Distinct Problems

	MAUP	UGCoP
What varies	Polygon boundaries (size and shape)	Buffer radius (distance threshold)
Mechanism	Aggregation changes summary statistics	Exposure zone definition changes who is “exposed”
Classic form	County vs. tract vs. block group	0.5-mile vs. 1-mile vs. 2-mile buffer
Named by	Openshaw & Taylor (1979)	Kwan (2012)
Applies to	Census polygon analysis	Point-based proximity analysis

Both MAUP and UGCoP are instances of the same underlying challenge: geographic context is a methodological choice, not a given, and that choice shapes results.

Implications for Buffer Analysis

When you report that a certain percentage of high-hardship block groups fall within 1 mile of an EPA facility, you are implicitly claiming that 1 mile is the relevant exposure distance. This claim deserves scrutiny:

Is there biological or behavioral evidence for the chosen threshold? (e.g., air dispersion models for pollution, walkability studies for food access)
Do results change substantially at 0.5 miles vs. 2 miles? If so, conclusions are threshold-sensitive
Is the threshold consistent with comparable studies? Maantay (2002) and the EPA EJScreen tool use varying buffers for different pollutant types

Best practice: Report results at multiple buffer distances and note where conclusions are stable vs. threshold-sensitive. A result that holds at 0.5, 1.0, and 1.5 miles is far more credible than one that only appears at a single distance.
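One way to run such a sensitivity check is to loop over several distances and record how many units count as exposed at each one. The sketch below is an illustration under stated assumptions: the four centroids and the single facility are made up, they sit 500 m, 1,200 m, 2,000 m, and 5,000 m from the facility, and exposure is defined as a centroid falling inside the buffer.

```r
library(sf)

# Hypothetical block-group centroids and one facility, EPSG:26912 (meters)
centroids <- st_sf(
  GEOID = c("bg_1", "bg_2", "bg_3", "bg_4"),
  geometry = st_sfc(st_point(c(400500, 3700000)),   # 500 m from the facility
                    st_point(c(401200, 3700000)),   # 1200 m
                    st_point(c(402000, 3700000)),   # 2000 m
                    st_point(c(405000, 3700000)),   # 5000 m
                    crs = 26912)
)
facility <- st_sfc(st_point(c(400000, 3700000)), crs = 26912)

miles           <- c(0.5, 1.0, 1.5)
meters_per_mile <- 1609.344

# For each distance, buffer the facility and count centroids that fall inside
sensitivity <- data.frame(
  buffer_miles = miles,
  n_exposed = sapply(miles, function(mi) {
    buf <- st_buffer(facility, dist = mi * meters_per_mile)
    sum(lengths(st_intersects(centroids, buf)) > 0)
  })
)

sensitivity   # 1, 2, and 3 exposed block groups at 0.5, 1.0, and 1.5 miles
```

In this toy case the exposed count triples between the smallest and largest buffer, so any conclusion about who is near the facility would be threshold-sensitive.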

UGCoP in the Lab

In Lab 3, Q2 asks you to change the buffer from 1 mile to 0.5 miles and compare results. This is a direct application of UGCoP: you are testing whether the conclusion that “certain block groups are near facilities” is sensitive to the distance threshold. If the enriched economic hardship index changes substantially at 0.5 miles, UGCoP is at work.

Section 4: Environmental Justice and Spatial Equity

The spatial operations in this module are foundational to environmental justice (EJ) research — the field that examines whether environmental burdens are disproportionately concentrated in low-income and minority communities.

The Environmental Justice Framework

EJ analysis connects where people live (census demographics) with what they are exposed to (environmental hazard data) using the tools you are learning:

Point-in-polygon joins: How many EPA-regulated facilities are within each census block group?
Buffer analysis: What fraction of residents live within 1 mile of a Superfund site?
Cumulative burden: Does proximity to hazards compound existing socioeconomic hardship — and does this burden fall disproportionately on communities of color?

The core finding of decades of EJ research (Pastor et al., 2001; Mohai & Saha, 2015) is that it does: environmental burdens are not randomly distributed across space but are systematically concentrated in disadvantaged communities.

Methodological Considerations

Every analytic choice in an EJ study carries implications for the conclusions:

Buffer distance choice: A 1-mile buffer catches more facilities than a 0.25-mile buffer, changing which communities appear “exposed” (UGCoP, Section 3)
Count vs. presence: Counting facilities vs. recording binary presence (in/out) can give different results, especially near large industrial complexes with multiple registrations
Data currency: EPA Toxics Release Inventory (TRI) data are updated annually, and historical siting (where facilities were built) differs from current exposure. Maantay (2002) and Pastor et al. (2001) both stress the importance of temporal context
Confounding by land use: Industrial zones concentrate both facilities and lower-cost housing, which independently attracts lower-income residents; separating siting effects from residential sorting requires longitudinal data (Mohai & Saha, 2015)
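The count vs. presence distinction in the list above is easy to see with a toy table. The values below are invented: bg_3 stands in for a block group next to one large industrial complex that holds several separate registrations.

```r
# Hypothetical per-block-group facility counts
bg <- data.frame(
  GEOID        = c("bg_1", "bg_2", "bg_3"),
  n_facilities = c(0, 1, 6)
)

# Binary presence: 1 if any facility is present, 0 otherwise
bg$any_facility <- as.integer(bg$n_facilities > 0)

# Rank block groups from most to least burdened under each measure
bg$rank_by_count    <- rank(-bg$n_facilities, ties.method = "min")
bg$rank_by_presence <- rank(-bg$any_facility, ties.method = "min")

bg
# By count, bg_3 ranks first and bg_2 second.
# By presence, bg_2 and bg_3 tie for first.
```

Neither measure is correct in general. Counts can overstate burden when one site holds many registrations, while presence hides real differences in the number of sources.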

EJScreen

The EPA’s EJScreen tool — now maintained by the Open Environmental Data Project after EPA removed it in 2025 — operationalizes cumulative burden analysis at the census block group level. It combines 11 environmental indicators (air quality, proximity to Superfund sites, wastewater discharge, etc.) with 6 demographic indicators to produce a composite score. The enriched economic hardship index you build in Lab 3 applies the same logic at a smaller scale.
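To make the idea of an enriched index concrete, here is one possible construction. It is a sketch under loud assumptions, not the method used in Lab 3 or by EJScreen: the variables, the values, the z-score standardization, and the equal weighting are all chosen here purely for illustration.

```r
# Hypothetical block-group data: two census-based measures and one spatial
# variable (facility count). All values are made up.
bg <- data.frame(
  GEOID          = c("bg_1", "bg_2", "bg_3", "bg_4"),
  pct_poverty    = c(10, 25, 40, 15),
  pct_unemployed = c(4, 8, 12, 6),
  n_facilities   = c(0, 1, 3, 0)
)

# Standardize each component so all share a common scale (mean 0, sd 1)
z <- scale(bg[, c("pct_poverty", "pct_unemployed", "n_facilities")])

# ASSUMPTION: equal weights. The lab may weight or combine components differently.
bg$hardship_census   <- rowMeans(z[, c("pct_poverty", "pct_unemployed")])
bg$hardship_enriched <- rowMeans(z)

# Compare how block groups order under the census-only and enriched versions
bg[order(-bg$hardship_enriched),
   c("GEOID", "hardship_census", "hardship_enriched")]
```

Comparing the two columns shows how much the spatial variable shifts each block group's score relative to the census-only index.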

Readings

Maantay, J. (2002). Mapping environmental injustices: Pitfalls and potential of geographic information systems in assessing environmental health and equity. Environmental Health Perspectives, 110(Suppl 2), 161–171. open access
- The seminal article on using GIS for environmental justice analysis. Examines how spatial methods reveal disproportionate environmental burdens while cautioning against pitfalls — including buffer distance choice and geographic unit selection.
Kwan, M. P. (2012). The uncertain geographic context problem. Annals of the Association of American Geographers, 102(5), 958–968. doi:10.1080/00045608.2012.687349
- Introduces the UGCoP — the spatial analog to MAUP for distance-based analysis. Argues that the geographic context assigned to individuals or areal units is inherently uncertain and that different context definitions produce different results. The theoretical foundation for buffer sensitivity analysis.
Pebesma, E. (2018). Simple features for R: Standardized support for spatial vector data. The R Journal, 10(1), 439–446. open access
- The technical reference for the sf package. Explains st_join(), st_buffer(), st_intersection(), st_transform(), and the simple features standard underlying all spatial operations in this course.
Pastor, M., Sadd, J., & Hipp, J. (2001). Which came first? Toxic facilities, minority move-in, and environmental justice. Journal of Urban Affairs, 23(1), 1–21. doi:10.1111/0735-2166.00072
- A landmark spatial analysis study using longitudinal data and buffer-based proximity measures to test whether hazardous facilities were disproportionately sited in minority communities. Findings support the disproportionate siting hypothesis.
Mohai, P., & Saha, R. (2015). Which came first, people or pollution? A review of theory and evidence from longitudinal environmental justice studies. Environmental Research Letters, 10(12), 125011. open access
- Reviews the methodological evolution of EJ spatial analysis, including the shift from unit-based to distance-based proximity measures. Directly relevant to this module’s buffer analysis workflow.
Walker, K. (2023). Analyzing US Census Data: Methods, Maps, and Models in R. CRC Press. Chapter 7: Spatial Analysis with US Census Data.
- The primary reference for this module. Covers spatial joins, distance and proximity analysis, and exploratory spatial data analysis using tidycensus and sf.
U.S. EPA. EJScreen: Environmental Justice Screening and Mapping Tool. https://screening-tools.com/epa-ejscreen (maintained by Open Environmental Data Project; original EPA page removed 2025)
- A national screening tool combining 11 environmental and 6 demographic indicators at the block group level. A real-world application of the spatial integration methods in this module.

R Package Documentation

sf package documentation — spatial operations vignette
tidycensus package — Census data with geometry
ggspatial package — scale bars and north arrows

Lab 3

The Lab 3 materials are on the course lab site.

Lab 3 Tutorial — Download the tutorial file, knit it to see the complete analysis, then run chunk by chunk to understand each step.
Lab 3 Assignment — Download the assignment file, rename it with your last name, complete the three questions, and submit to Canvas.

Yellowdig Discussion

Environmental justice research depends on spatial methods to connect where people live with what they are exposed to. Every methodological choice — buffer distance, geographic unit, data source — shapes the conclusions drawn.

Discussion prompt: Choose a real environmental or public health concern in a community you are familiar with (e.g., proximity to industrial sites, food deserts, lack of transit access, flood risk). Drawing on the readings and module concepts:

What spatial data sources would you combine with census data to measure this concern, and at what geographic scale? Why does scale choice matter?
How would you operationalize “proximity” — through point-in-polygon counts, buffer distances, or another measure? What are the trade-offs?
Drawing on Kwan (2012), how would the UGCoP affect your analysis? What buffer distance would you choose, and how would you justify it?

Key Terms

Term	Definition
Coordinate Reference System (CRS)	A system defining how spatial coordinates map to locations on Earth
Geographic CRS	CRS using latitude/longitude in angular degrees (e.g., WGS 84 / EPSG:4326)
Projected CRS	CRS projecting Earth onto a flat surface with linear distance units (e.g., UTM)
Spatial Join	Combining attributes from two spatial layers based on geographic relationship
Point-in-Polygon	Spatial operation identifying which polygon contains each point
Buffer	A zone of specified distance around a spatial feature
UGCoP	Uncertain Geographic Context Problem — sensitivity of proximity analysis results to the choice of buffer size or exposure zone definition (Kwan, 2012)
Exposure Zone	The geographic area within which an individual or unit is considered “exposed” to a feature or hazard
Buffer Sensitivity	The degree to which analysis results change when the buffer distance is varied
Enriched Index	A composite index combining demographic variables with spatial access or environmental burden variables
Environmental Justice	Field examining whether environmental burdens fall disproportionately on low-income or minority communities
Cumulative Burden	The combined effect of multiple environmental and socioeconomic disadvantages in one community