Lecture 7: Combining data — Attribute joins & spatial analysis

class: inverse, middle
# Lecture 7: Combining data — Attribute joins & spatial analysis
.footnote[GIScience I, Fall 2018]

---
## GIS as process

1. Locate entities in the world.

2. Represent the entities digitally.

3. Describe the entities.

4. **Process, integrate, analyze.**

5. Communicate the results.

6. Audience: Interpret, make decisions, give feedback.

---
## Attribute joins

* Join: Connecting two attribute tables using a common 'key' value.

* Creates a temporary relationship.
  * Joins can be made more permanent via creating new output of the combination, or copying values from one table's attribute field to the other.

.image-80[![https://pro.arcgis.com/en/pro-app/tool-reference/data-management/add-join.htm](media/GUID-C441B51F-B581-4743-A975-3EB04087838C-web.gif)]

---
## Attribute join relationships

* Can have a `one-to-one` or `many-to-one` relationship.

* Database table joins can technically include `one-to-many` and `many-to-many`, but GIS applications rarely allow these.
  * The spatial representation would be problematic, when features would require more than one representation (extra rows in table).

.left-column[
.image-120[![https://desktop.arcgis.com/en/arcmap/latest/manage-data/tables/about-joining-and-relating-tables.htm](media/GUID-6FF790F5-02F7-4CC5-83AC-17FF1D4EBDA4-web.png)]
]
.right-column[
.image-120[![https://desktop.arcgis.com/en/arcmap/latest/manage-data/tables/about-joining-and-relating-tables.htm](media/GUID-1786A5F6-AEB9-48D5-8071-C94115A4EEA0-web.png)]
]

---
## Spatial analysis

* Can be thought of as similar to statistical analysis, but including a 2-D coordinate plain.

* However assumptions of statistical analysis (normality, homogeneity, linearity, independence) are not valid here.

* Spatial analysis are the steps we undertake to convert raw spatial data into the useful information we've set out to get.
  * Add value in transformations, which bring out patterns & anomalies.

* When output combines dataset attributes, can be thought of as a spatial version of a join: joining datasets using the geometry as the key-field.
  * By the complex nature of geometry, has to use different method for linking than `dataset1.key = dataset2.key`.
  * The method by which the geometry are related is the *analysis*.

???
* Statistical assumptions:
  * Normaility: Somewhat normal distribution.
  * Homogeneity of variances: Data from multiple groups have the same variance.
  * Linearity: Data have a linear relationship.
  * Independence: Data are independent.

---
## Types of spatial analysis

Three main types:

1. Analysis based on location (one dimension).

2. Analysis based on distance (two dimensions).

3. Analysis based on area (three dimensions).

---
### 1. Analysis based on location

* Identifying **where** something exists or happens.

* The ability to compare different properties of the same place.
  * Discover relationships, correlations, 'hotspots'.

* Would be convenient if we had data model where every location was represented, with every property attributed directly on it.

* However spatial data models and their implementations tend to emphasize carrying a few properties for all places over a lot of properties for one place.
  * Raster data model limits to one.
  * Maintenance simplicity and "normalization" tends to limit how much one can 'hang' on a single dataset.
  * Horizontal vs. vertical integration of data.

* For that reason, we need ways to be able to collect the properties at a location: to bring all the attributes into a single feature.

???
* Normalization is the concept that a system shouldn't have more than one copy of a piece of data.
  * Using Duckweb as an example: your name should only be written in one place.
  * If another part of the system needs your name, it would just store an ID attribute (95 number) to connect one to the other.
  * That way if you ever changed your name, the change would automatically be available to all references, wih only a single update.

---
### 1a. 'Spatial' attribute join

* Spatial attribute join is the same as an attribute join, but in this case the key value defines a spatial relationship

* Example: Using a base feature class with cities' economic information; want to add some state-level info, e.g. unemployment.
  * So long as the cities have their state as an attribute (name, code, FIPS), should be joinable via their relationship.
  * **BE SURE** to not assume state attributes are equivalent to city attributes: properly document.

???
* 'Spatial join' is a somewhat problematic name. Spatial join commonly used to refer to spatial analysis tools that combine dataset attributes in general.

---
### 1b. Point-in-polygon operation

* Determining whether a point lies inside or outside a polygon (also: which polygon?).
* It's quite useful to apply areal properties to point locations.
* Probably the most-used operation for assigning attributes to features (even non-point!).
  * What boundaries does it lie in? City, district, Census, neighborhood?
  * What areal classifications does it lie in? Zoning, flood areas, wetlands.

.image-70[![http://giscommons.org/analysis/](media/5.12.gif)]

???
* Yes, even non-point features can use point-in-polygon, by 'pretending' to be a point (centroid).
* Though not as easy to compute as an attribute join, still quite simple to solve.

---
### 1c. Polygon overlay

* Similar to point-in-polygon, as two objects are involved.

* Combination of two or more sets of features, essentially creating a new set of features.

* Geometry is 'split' to define the spatial extent of the combinations.

* Slivers: When a boundary line occurs in multiple input datasets but isn't an exact match.
  * Oddly: better data collection almost always **causes** more of these.

![https://desktop.arcgis.com/en/arcmap/latest/tools/coverage-toolbox/identity.htm](media/GUID-E0535FDB-1F08-418C-B6E9-EB27FDC490D4-web.gif)

---
###1.c.1 Polygon Overlay and Boolean Realtionships/Set theory
* Between two polygons, there are [16 queryable Boolean relationships](http://www.geo.upm.es/postgrado/CarlosLopez/materiales/cursos/www.ncgia.ucsb.edu/giscc/units/u186/figures/figure1.gif)

![http://www.geo.upm.es/postgrado/CarlosLopez/materiales/cursos/www.ncgia.ucsb.edu/giscc/units/u186/figures/figure1.gif](media/figure1_BooleanRelationships.gif)

???
* Better data causes slivers: More data points on the line = more ever-tinier ones.

---
### 1d. Raster-on-raster overlay: Map algebra

* If cells are identical (or one is subdivision of other):
  * Overlay vastly simplified: raster indexes match or are convertible; no slivers

* If cells are not identical, would require a resampling.

.image-60[![http://gisgeography.com/map-algebra-global-zonal-focal-local/](media/Land-Surface-Temperature-Subtraction.png)]

???
* Resampling: converting a raster to a different cell grid, deriving new values using a method of interpolation.
  * Not covering resampling in this class.

---
### 2. Analysis based on distance

* How near or far features are from each other can indicate a spatial dependence (or lack thereof).

* The separation of features or events across an area can have meaning in itself.
  * Types of business near a school/stadium/freeway exit.
  * Where accidents occur the most.
  * Neighborhoods with few nearby grocery stores or restaurants.

.image-70[![https://www.wired.com/2017/03/first-national-noise-map-charts-americans-aural-misery/](media/SF-NEW-1.jpg)]
.footnote[US Dept. of Transportation, National Transportation Noise Map]

---
### 2a. Measuring distance/length

* Euclidean/Pythagorean (straight-line) distance.
  * Easy math; can use projected coordinate system coordinates.
  * Drawbacks: Distance doesn't include topography, Earth's curvature.
![http://www.gogeometry.com/pythagoras/right_triangle_formulas_facts.htm](media/pythagoras_cartesian.gif)

* Great circle (geodesic) distance.

* Network distance.
  * Requires special dataset(s): topology, traversal rules.
  * Holy grail of online maps/GIS.

* Note: Representation always underestimates: lacking topography & curvature.

???
* Euclidean: not for use over longer distances.
* Many GIS appplications do have both Pythagorean (coordinate system) and great circle (geodesic) distance.

---
### 2b. Buffering

* Builds new feature(s) by drawing boundary around areas within a given distance of the original feature(s).

* Simple-but-effective gateway to other forms of spatial analysis.
  * For points & polylines, increases the dimensionality.
    * Polygon-end of point-in-polygon operation.
    * Polygon overlay.

.image-70[![https://desktop.arcgis.com/en/arcmap/latest/tools/analysis-toolbox/buffer.htm](media/GUID-267CF0D1-DB92-456F-A8FE-F819981F5467-web.png)]

---
.left-column[
Buffer (straight-line or geodesic)
.image-110[![http://www-personal.umich.edu/~copyrght/image/solstice/win02/schlossb/visualaccess.htm](media/image003.gif)]
]
.right-column[
Service Area (network)
.image-110[![http://www-personal.umich.edu/~copyrght/image/solstice/win02/schlossb/visualaccess.htm](media/image006.gif)]
]

---
### 2c. Cluster detection

* Determining a consistency in the distance between a set of features (a pattern).

* Three possibilities:
  * Random pattern: features are located independently, all locations are equally likely to have a feature.
  * Clustered: Some locations are more likely to have features than others.
  * Dispersed: Presence of a feature makes other features less likely in the vicinity.

![https://desktop.arcgis.com/en/arcmap/latest/tools/spatial-statistics-toolbox/average-nearest-neighbor.htm](media/GUID-44631E02-01AF-4524-B552-4B4797876941-web.gif)

* Useful for points (or point-like) discrete features or events.

* Existence of clusters suggests causal factors.

---
### 3. Analysis based on area

The higher dimensionality here may suggest a jump in complexity, but that's not exactly the case.

* Entirely about the geometry/shape of the area in question.

* Comparison between features is to compare between properties about the geometry, not some difficult function.

* Projected coordinates come into play here.

---
### 3a. Measurement of area

* Measurement of the 2-dimensional projected space within the area/polygon.
  * Usually same units as coordinates (can convert).
* Be aware of projection distortion (equal-area).
* Another measurement of an area is the measurement of the perimeter of the area.
  * Including perimeter of 'donut holes'.
  * Simple to include, since most data models construct polygons as sets of closed polylines.

.left-column[
Area
.image-90[![https://www.mathsisfun.com/geometry/area-irregular-polygons.html](media/area-irregular-polygon3.gif)]
]
.right-column[
Perimeter
.image-90[![http://davidvs.net/math/geometry-concepts.shtml](media/perimeterpolygon.png)]
]

???
* Some GIS data models have automatically-updating length/perimeter & area attributes.
* Perimeter can be useful as a part of other measurements.

---
### 3b. Measurement of shape

* Can calculate mathematical formulas that describe the formation of the polygon.
* Example: Compactness (or thinness): How compact or spread-out (thin) a polygon is.
  * Generally, a relationship between the perimeter and area.
    * e.g. `(4π*area)/(perimeter²)` (1 = perfect circle, (approaching) 0 = ever more line-like).

*Texas 35th Congressional District:*
.image-70[![http://www.npr.org/sections/thetwo-way/2017/03/11/519839892/federal-court-rules-three-texas-congressional-districts-illegally-drawn](media/texas-35-25810048abae74e2926130ee177bb5dfd1859e54-s700-c85.png)]

---
### 3c. Centrality

* Spatial equivalents to the statistical mean and standard deviation?

* Center is 2-D equivalent of mean: a weighted average of X & Y coordinates represented by a point.

Mean center of values (U.S. population): The values add 'weight' to the coordinates.

.image-70[![https://en.wikipedia.org/wiki/Mean_center_of_the_United_States_population](media/US_Mean_Center_of_Population_1790-2010.png)]

---
#### Centroid

A mean of unweighted coordinates.

.left-column[
Feature centroid.

.image-100[![http://giscommons.org/earth-and-map-preprocessing/](media/3.17.gif)]
]

.right-column[
Sometimes true centroid falls outside polygon. Alternative: `visual center`, or `polygon label point`.

.image-100[![https://www.mapbox.com/blog/polygon-center/](media/28967581056_bc954a3174_o.png)]
]
---
#### Central Feature

Of all the features, which one most closey falls in the center of the X & Y coordinates of all features (median).

.image-100[![https://desktop.arcgis.com/en/arcmap/latest/tools/spatial-statistics-toolbox/central-feature.htm](media/GUID-10E5C859-6101-4D48-9DED-C0F2D4CFC8F2-web.png)]