**Introduction**

This project explains variations in home values across the United States. Home values are known to be related to the percent of the population with a bachelor’s degree (BA), the percent who live in poverty, the percent who live in rural areas and the percent who are white. This analysis explores how a home’s location can also be an important variable in determining its value.

**Summary**

Home values are most associated with the percent of the population with a BA and the location of homes near spatial clusters of high home values. See below for a more detailed explanation about the results.

**Data and Software**

Data: United States Census county level data.

Software: QGIS and GeoDa.

**Overview of Geo-spatial Regression Analysis**

Geo-spatial regression analysis has three basic steps. The first is to map the variables of interest to explore how their values vary across space. The second is to test for spatial dependence (clustering in space) of the dependent variable (in this case, home values). The third is to incorporate the spatial dependence (spatial lag) of the dependent variable in the regression model and interpret the results. A step-by-step analysis is provided below.
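For reference, the spatial lag specification used in the third step is conventionally written as shown here. The notation is the standard textbook form and is an assumption, not something taken from the software output: $y$ is the vector of county home values, $W$ is the spatial weights matrix, $\rho$ is the spatial lag coefficient, $X$ holds the independent variables with coefficients $\beta$, and $\varepsilon$ is the error term.

$$
y = \rho W y + X \beta + \varepsilon
$$

Setting $\rho = 0$ recovers the classic regression model without a spatial lag.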

**Step 1: Mapping the Dependent Variable**

The first step of any geo-spatial regression analysis is to map the variables to explore spatial clustering. This analysis will be conducted at the county level. The dependent variable in the regression model is home values and this will be mapped first:

The map shows pronounced clustering of high home values (dark blue: $167,260 to $993,900) on the East and West Coasts, and close to large cities and their suburban areas (e.g., Chicago, Washington DC). Clustering of high home values also exists on Florida’s coast. Lower home values exist primarily in the Midwest, Southwest, Appalachia and the Deep South. The clustering of home values suggests that neighboring home values can be used as an independent variable in a regression model. However, statistical tests need to be evaluated before using the variable in a model.

**Step 2: Mapping the Independent Variables**

It is important to explore the extent to which the independent variables are clustered throughout the country. A randomly distributed variable indicates low spatial variance and a non-random distribution indicates high spatial variance (i.e., spatial clustering).

*Percent BA*

The percent of people with a Bachelor’s degree is highly clustered in the Northeast and the West. This is likely due to the availability of higher education and a higher rate of industrial jobs in these regions.

*Percent Poverty*

Poverty is mostly clustered in the Northeast, upper Midwest, the West Coast and most urban areas.

*Percent Rural*

The variable that is most randomly distributed is percent rural. Apart from the cities, the West Coast and parts of the Midwest, the rural population is spread more or less randomly throughout the country.

*Percent White*

Most counties across the United States have a high percent of white residents (above 92%), except in the South, where the share drops below 92%. This is likely due to the large African American population in the region.

**Step 3: Evaluating Spatial Clustering**

The objective of this step is to show statistical support for a non-random distribution of home values across the country, and to measure how strong the relationship is between local home values and neighboring home values.

In geo-spatial analysis, a non-random distribution is tested by calculating Moran’s I. A high positive, statistically significant value indicates a non-random, clustered distribution of home values (high spatial dependence). A value near zero indicates a random distribution of home values (low spatial dependence), and a negative value indicates dispersion, where high and low values tend to alternate.
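To make the statistic concrete, here is a minimal sketch of global Moran’s I in plain Python. The six-unit "chain" geography, the helper names and the sample values are all hypothetical and chosen only for illustration; the actual analysis was run in GeoDa on county data.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a dict-of-dicts weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # S0 is the sum of all spatial weights.
    s0 = sum(w for row in weights.values() for w in row.values())
    # Weighted cross-products of deviations between each unit and its neighbors.
    cross = sum(w * dev[i] * dev[j]
                for i, row in weights.items()
                for j, w in row.items())
    denom = sum(d * d for d in dev)
    return (n / s0) * (cross / denom)


def chain_weights(n):
    """Binary contiguity weights for n units arranged in a line (toy geography)."""
    return {i: {j: 1.0 for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}


weights = chain_weights(6)
clustered = [1, 1, 1, 10, 10, 10]    # similar values sit next to each other
alternating = [1, 10, 1, 10, 1, 10]  # high and low values alternate

print(morans_i(clustered, weights))    # positive: clustering
print(morans_i(alternating, weights))  # negative: dispersion
```

The clustered pattern yields a positive statistic and the alternating pattern a negative one, which matches the interpretation given in the text.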

To determine what type of spatial dependence is significant, first and second order weighted Moran’s I values are calculated and compared. A highly significant first order test indicates that adjacent neighbors (first order neighbors) should be incorporated in the model. A significant second order test (a test for the effect of being a neighbor of a neighbor) indicates that the model should include a second order weight.
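The difference between first and second order neighbors can be sketched as follows. The four "counties" and the function name are hypothetical. This sketch also assumes that the second order set excludes first order neighbors; the text does not say which convention the original weights used.

```python
def second_order_neighbors(first_order):
    """Return neighbors-of-neighbors, excluding the unit itself and its
    first order neighbors (an assumed convention, see lead-in)."""
    second = {}
    for unit, nbrs in first_order.items():
        reach = set()
        for nb in nbrs:
            reach |= first_order[nb]  # everything adjacent to a neighbor
        second[unit] = reach - nbrs - {unit}
    return second


# Toy geography: four counties in a row, A - B - C - D.
first_order = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
}

second_order = second_order_neighbors(first_order)
for county in sorted(second_order):
    print(county, sorted(second_order[county]))
```

Each set of neighbors would then be turned into its own weights matrix, and Moran’s I computed once per matrix for comparison.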

*Moran’s I Calculation*

Both the first and second order calculations were highly statistically significant (p<<0.05). The Moran’s I plots show the correlation between home values (x-axis) and lagged home values (y-axis). The slope is high for the fitted lines indicating high spatial dependence. However, the first order Moran’s I is higher than the second order and therefore a first order spatial dependence lag will be incorporated in the regression model.
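The link between the scatter plot slope and Moran’s I can be checked numerically. The sketch below assumes row-standardized weights (each county’s neighbor weights sum to one), under which the slope of lagged values regressed on values equals Moran’s I. The home values and the six-unit chain geography are hypothetical.

```python
def row_standardize(neighbors):
    """Give each neighbor an equal weight so every row sums to one."""
    return {i: {j: 1.0 / len(nbrs) for j in nbrs} for i, nbrs in neighbors.items()}


def spatial_lag(values, weights):
    """Weighted average of neighboring values for each unit."""
    return [sum(w * values[j] for j, w in weights[i].items())
            for i in range(len(values))]


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def morans_i(values, weights):
    """Global Moran's I from the textbook formula."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(w for row in weights.values() for w in row.values())
    cross = sum(w * dev[i] * dev[j]
                for i, row in weights.items() for j, w in row.items())
    return (n / s0) * cross / sum(d * d for d in dev)


# Hypothetical home values (in $1000s) for six units arranged in a line.
home_values = [120, 135, 150, 310, 290, 330]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
weights = row_standardize(neighbors)

lagged = spatial_lag(home_values, weights)
slope = ols_slope(home_values, lagged)
statistic = morans_i(home_values, weights)
print(slope, statistic)  # the two numbers agree up to floating-point error
```

This is why a steeper fitted line in the Moran scatter plot can be read directly as stronger spatial dependence.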

**Step 4: Classic Regression Model**

To further evaluate spatial dependence, a classic regression model is specified using all of the independent variables except for the spatial lag. The model diagnostics are then evaluated, and if they show spatial dependence, the spatial lag will be included in subsequent models. Here are the results:

| Variable | Coefficient | Std. Error | z-value | Probability |
|---|---|---|---|---|
| % Rural | 54 | 39.00 | 1.38 | 0.1659 |
| % Poverty | 2323 | 233.08 | 9.97 | 0.0000 |
| % BA | 7970 | 185.67 | 42.92 | 0.0000 |
| % White | 212 | 48.10 | 4.42 | 0.0001 |

The results show that percent rural is not statistically significant (p > 0.05). The other variables, however, are highly statistically significant (p << 0.05). Each one percentage point increase in the population with a BA is associated with a $7,970 increase in home value, and each one percentage point increase in the white population of a county is associated with a $212 increase. Interestingly, each one percentage point increase in poverty is associated with a $2,323 increase in home value. High poverty rates in counties with high home values (see the maps above) could partially explain this positive correlation.
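As a rough check on how the test statistics relate to the other columns, each one is the coefficient divided by its standard error. The sketch below recomputes them from the reported coefficients and standard errors. The two-sided p-value uses a standard normal approximation, which is an assumption: the software may use a t distribution, and the table values are rounded, so small differences from the printed figures are expected.

```python
import math


def z_value(coefficient, std_error):
    """Test statistic: the coefficient measured in units of its standard error."""
    return coefficient / std_error


def two_sided_p(z):
    """Two-sided p-value under a standard normal approximation (assumed)."""
    return math.erfc(abs(z) / math.sqrt(2))


# (coefficient, standard error) pairs as reported for the classic model.
reported = {
    "% Rural": (54, 39.00),
    "% Poverty": (2323, 233.08),
    "% BA": (7970, 185.67),
    "% White": (212, 48.10),
}

for name, (coef, se) in reported.items():
    z = z_value(coef, se)
    print(f"{name}: z = {z:.2f}, p = {two_sided_p(z):.4f}")
```

A coefficient that is small relative to its standard error, as with percent rural, produces a statistic close to zero and a large p-value.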

In addition, the spatial diagnostics show that the Moran’s I error value (55.86) is very high and highly statistically significant (p << 0.05). These results indicate that the spatial lag should be included in the model.

**Step 5: Regression Model with Spatial Lag**

Now that the statistical tests have shown that the first order spatial lag variable is significant, it is incorporated in the model:

| Variable | Coefficient | Std. Error | z-value | Probability |
|---|---|---|---|---|
| H. Value Spat Lag | 0.70 | 0.01 | 63.23 | 0.0000 |
| % Rural | 49 | 25.06 | 1.94 | 0.0511 |
| % Poverty | 491 | 151.04 | 3.25 | 0.0011 |
| % BA | 4560 | 134.98 | 33.78 | 0.0000 |
| % White | -40 | 31.21 | -1.29 | 0.1968 |

The results show that after adding the spatial lag, percent white is no longer statistically significant (p > 0.05), and percent rural is now marginally significant (p = 0.051).

The spatial lag coefficient is highly statistically significant (p << 0.05) and is positively associated with home values. Controlling for the other variables in the model, each additional dollar of average home value in first order neighboring counties is associated with a $0.70 increase in the local county’s home value. In other words, every additional $1,000 in average home value among first order neighbors is associated with a $700 increase locally.
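The arithmetic behind this interpretation can be sketched with three hypothetical counties. Only the 0.70 coefficient comes from the results table; the county names, home values and helper functions are made up for illustration. The sketch reads the coefficient as a direct association, as the text does, and ignores any feedback between neighboring counties.

```python
def neighbor_average(values, neighbors, unit):
    """Spatial lag for one unit: the mean value of its first order neighbors."""
    nbrs = neighbors[unit]
    return sum(values[n] for n in nbrs) / len(nbrs)


def lag_contribution(rho, lagged_value):
    """Part of the predicted home value attributed to the spatial lag term."""
    return rho * lagged_value


RHO = 0.70  # spatial lag coefficient reported in the table

# Hypothetical counties and home values in dollars.
home_values = {"A": 150_000, "B": 200_000, "C": 250_000}
neighbors = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}

lag_a = neighbor_average(home_values, neighbors, "A")

# Raise every neighbor of county A by $1,000, which raises their average by $1,000.
bumped = {name: value + 1_000 if name in neighbors["A"] else value
          for name, value in home_values.items()}
lag_a_bumped = neighbor_average(bumped, neighbors, "A")

change = lag_contribution(RHO, lag_a_bumped) - lag_contribution(RHO, lag_a)
print(round(change, 2))  # about 700 dollars
```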

In addition, each one percentage point increase in the population with a BA is associated with a $4,560 increase in home value, and each one percentage point increase in the rural population is associated with a $49 increase. Interestingly, each one percentage point increase in poverty is associated with a $491 increase in home value. Again, high poverty rates in counties with high home values could partially explain this positive correlation.