class: inverse, middle # Lecture 7: Combining data — Attribute joins & spatial analysis .footnote[GIScience I, Fall 2018] --- ## GIS as process 1. Locate entities in the world. 2. Represent the entities digitally. 3. Describe the entities. 4. **Process, integrate, analyze.** 5. Communicate the results. 6. Audience: Interpret, make decisions, give feedback. --- ## Attribute joins * Join: Connecting two attribute tables using a common 'key' value. * Creates a temporary relationship. * Joins can be made more permanent via creating new output of the combination, or copying values from one table's attribute field to the other. .image-80[] --- ## Attribute join relationships * Can have a `one-to-one` or `many-to-one` relationship. * Database table joins can technically include `one-to-many` and `many-to-many`, but GIS applications rarely allow these. * The spatial representation would be problematic, when features would require more than one representation (extra rows in table). .left-column[ .image-120[] ] .right-column[ .image-120[] ] --- ## Spatial analysis * Can be thought of as similar to statistical analysis, but including a 2-D coordinate plain. * However assumptions of statistical analysis (normality, homogeneity, linearity, independence) are not valid here. * Spatial analysis are the steps we undertake to convert raw spatial data into the useful information we've set out to get. * Add value in transformations, which bring out patterns & anomalies. * When output combines dataset attributes, can be thought of as a spatial version of a join: joining datasets using the geometry as the key-field. * By the complex nature of geometry, has to use different method for linking than `dataset1.key = dataset2.key`. * The method by which the geometry are related is the *analysis*. ??? * Statistical assumptions: * Normaility: Somewhat normal distribution. * Homogeneity of variances: Data from multiple groups have the same variance. * Linearity: Data have a linear relationship. * Independence: Data are independent. --- ## Types of spatial analysis Three main types: 1. Analysis based on location (one dimension). 2. Analysis based on distance (two dimensions). 3. Analysis based on area (three dimensions). --- ### 1. Analysis based on location * Identifying **where** something exists or happens. * The ability to compare different properties of the same place. * Discover relationships, correlations, 'hotspots'. * Would be convenient if we had data model where every location was represented, with every property attributed directly on it. * However spatial data models and their implementations tend to emphasize carrying a few properties for all places over a lot of properties for one place. * Raster data model limits to one. * Maintenance simplicity and "normalization" tends to limit how much one can 'hang' on a single dataset. * Horizontal vs. vertical integration of data. * For that reason, we need ways to be able to collect the properties at a location: to bring all the attributes into a single feature. ??? * Normalization is the concept that a system shouldn't have more than one copy of a piece of data. * Using Duckweb as an example: your name should only be written in one place. * If another part of the system needs your name, it would just store an ID attribute (95 number) to connect one to the other. * That way if you ever changed your name, the change would automatically be available to all references, wih only a single update. --- ### 1a. 'Spatial' attribute join * Spatial attribute join is the same as an attribute join, but in this case the key value defines a spatial relationship * Example: Using a base feature class with cities' economic information; want to add some state-level info, e.g. unemployment. * So long as the cities have their state as an attribute (name, code, FIPS), should be joinable via their relationship. * **BE SURE** to not assume state attributes are equivalent to city attributes: properly document. ??? * 'Spatial join' is a somewhat problematic name. Spatial join commonly used to refer to spatial analysis tools that combine dataset attributes in general. --- ### 1b. Point-in-polygon operation * Determining whether a point lies inside or outside a polygon (also: which polygon?). * It's quite useful to apply areal properties to point locations. * Probably the most-used operation for assigning attributes to features (even non-point!). * What boundaries does it lie in? City, district, Census, neighborhood? * What areal classifications does it lie in? Zoning, flood areas, wetlands. .image-70[] ??? * Yes, even non-point features can use point-in-polygon, by 'pretending' to be a point (centroid). * Though not as easy to compute as an attribute join, still quite simple to solve. --- ### 1c. Polygon overlay * Similar to point-in-polygon, as two objects are involved. * Combination of two or more sets of features, essentially creating a new set of features. * Geometry is 'split' to define the spatial extent of the combinations. * Slivers: When a boundary line occurs in multiple input datasets but isn't an exact match. * Oddly: better data collection almost always **causes** more of these.  --- ###1.c.1 Polygon Overlay and Boolean Realtionships/Set theory * Between two polygons, there are [16 queryable Boolean relationships](http://www.geo.upm.es/postgrado/CarlosLopez/materiales/cursos/www.ncgia.ucsb.edu/giscc/units/u186/figures/figure1.gif)  ??? * Better data causes slivers: More data points on the line = more ever-tinier ones. --- ### 1d. Raster-on-raster overlay: Map algebra * If cells are identical (or one is subdivision of other): * Overlay vastly simplified: raster indexes match or are convertible; no slivers * If cells are not identical, would require a resampling. .image-60[] ??? * Resampling: converting a raster to a different cell grid, deriving new values using a method of interpolation. * Not covering resampling in this class. --- ### 2. Analysis based on distance * How near or far features are from each other can indicate a spatial dependence (or lack thereof). * The separation of features or events across an area can have meaning in itself. * Types of business near a school/stadium/freeway exit. * Where accidents occur the most. * Neighborhoods with few nearby grocery stores or restaurants. .image-70[] .footnote[US Dept. of Transportation, National Transportation Noise Map] --- ### 2a. Measuring distance/length * Euclidean/Pythagorean (straight-line) distance. * Easy math; can use projected coordinate system coordinates. * Drawbacks: Distance doesn't include topography, Earth's curvature.  * Great circle (geodesic) distance. * Network distance. * Requires special dataset(s): topology, traversal rules. * Holy grail of online maps/GIS. * Note: Representation always underestimates: lacking topography & curvature. ??? * Euclidean: not for use over longer distances. * Many GIS appplications do have both Pythagorean (coordinate system) and great circle (geodesic) distance. --- ### 2b. Buffering * Builds new feature(s) by drawing boundary around areas within a given distance of the original feature(s). * Simple-but-effective gateway to other forms of spatial analysis. * For points & polylines, increases the dimensionality. * Polygon-end of point-in-polygon operation. * Polygon overlay. .image-70[] --- .left-column[ Buffer (straight-line or geodesic) .image-110[] ] .right-column[ Service Area (network) .image-110[] ] --- ### 2c. Cluster detection * Determining a consistency in the distance between a set of features (a pattern). * Three possibilities: * Random pattern: features are located independently, all locations are equally likely to have a feature. * Clustered: Some locations are more likely to have features than others. * Dispersed: Presence of a feature makes other features less likely in the vicinity.  * Useful for points (or point-like) discrete features or events. * Existence of clusters suggests causal factors. --- ### 3. Analysis based on area The higher dimensionality here may suggest a jump in complexity, but that's not exactly the case. * Entirely about the geometry/shape of the area in question. * Comparison between features is to compare between properties about the geometry, not some difficult function. * Projected coordinates come into play here. --- ### 3a. Measurement of area * Measurement of the 2-dimensional projected space within the area/polygon. * Usually same units as coordinates (can convert). * Be aware of projection distortion (equal-area). * Another measurement of an area is the measurement of the perimeter of the area. * Including perimeter of 'donut holes'. * Simple to include, since most data models construct polygons as sets of closed polylines. .left-column[ Area .image-90[] ] .right-column[ Perimeter .image-90[] ] ??? * Some GIS data models have automatically-updating length/perimeter & area attributes. * Perimeter can be useful as a part of other measurements. --- ### 3b. Measurement of shape * Can calculate mathematical formulas that describe the formation of the polygon. * Example: Compactness (or thinness): How compact or spread-out (thin) a polygon is. * Generally, a relationship between the perimeter and area. * e.g. `(4π*area)/(perimeter²)` (1 = perfect circle, (approaching) 0 = ever more line-like). *Texas 35th Congressional District:* .image-70[] --- ### 3c. Centrality * Spatial equivalents to the statistical mean and standard deviation? * Center is 2-D equivalent of mean: a weighted average of X & Y coordinates represented by a point. Mean center of values (U.S. population): The values add 'weight' to the coordinates. .image-70[] --- #### Centroid A mean of unweighted coordinates. .left-column[ Feature centroid. .image-100[] ] .right-column[ Sometimes true centroid falls outside polygon. Alternative: `visual center`, or `polygon label point`. .image-100[] ] --- #### Central Feature Of all the features, which one most closey falls in the center of the X & Y coordinates of all features (median). .image-100[]