Education: The Achievement Gap

While a graduate student at UC Berkeley, I took a statistics course in linear and logistic regression that introduced me to the achievement gap in secondary schools. For the analysis, I used the linear and logistic regression packages in Stata and a survey of eighth-graders from the National Education Longitudinal Study.

I was interested in understanding the potential for students to enter and succeed in a STEM field. To explore this, I ran two regressions. The first was a linear regression with math test scores as the response variable and socio-economic status, future plans, locus of control, self-concept, race and gender as the explanatory variables. The second was a logistic regression with the same explanatory variables as the linear regression but with being “held back” as the response variable. Locus of control can be interpreted as a measure of control over “life chances” and self-concept as a measure of self-esteem.

For both the linear and the logistic regression I ran two models: one with only the response variable and the explanatory variables, and one that added interactions. For continuous variables that were not initially significant, I added higher-order polynomial terms to test whether there was a non-linear relationship between the response and the explanatory variable. I used backward elimination to drop explanatory variables with p>0.05, and I checked R squared after each variable was dropped to ensure that it did not change substantially.
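The backward elimination step can be sketched in Python as follows. This is only an illustration, not the Stata procedure used in the analysis: `backward_eliminate` and the `fit_pvalues` callback are hypothetical names, the callback stands in for refitting the model and reading off the p-values, and dropping one variable per refit (the least significant one) is an assumption. The p-values in the toy example are made up.

```python
from typing import Callable, Dict, List


def backward_eliminate(
    variables: List[str],
    fit_pvalues: Callable[[List[str]], Dict[str, float]],
    alpha: float = 0.05,
) -> List[str]:
    """Drop the least significant variable until all remaining p-values <= alpha.

    fit_pvalues is a hypothetical callback: given the variables currently in
    the model, it refits the regression and returns one p-value per variable.
    """
    kept = list(variables)
    while kept:
        pvalues = fit_pvalues(kept)
        worst = max(kept, key=lambda name: pvalues[name])
        if pvalues[worst] <= alpha:
            break  # everything left is significant at the chosen level
        kept.remove(worst)  # assumption: one variable is dropped per refit
    return kept


# Toy stand-in for a model fit; these p-values are made up for illustration.
def toy_fit(current: List[str]) -> Dict[str, float]:
    made_up = {"ses": 0.001, "locus_of_control": 0.010, "self_concept": 0.400}
    return {name: made_up[name] for name in current}


selected = backward_eliminate(["ses", "locus_of_control", "self_concept"], toy_fit)
print(selected)
```

In a real analysis the callback would wrap the actual regression fit, and the R squared check would be made after each drop.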

The results of the regression analysis are discussed below. They show that math test scores and success in school are strongly associated with plans to attend college, socio-economic status and race. The paper with the complete analysis, including hypotheses and graphs, can be found here: Achievement Gap Paper.

1. Linear Regression


The largest effect on mean standardized math scores came from future plans. There was no significant difference in mean scores between children who planned to finish high school and those who did not. The differences grew for children who planned on attending a vocational school or college: their mean scores were on average 2.27 and 2.37 points higher, respectively, than those of children who would not finish high school. For students who planned on finishing college or on pursuing a post-college degree, mean math scores were on average 6.3 and 8.05 points higher, respectively.

The regression also revealed strong associations between math test scores and socio-economic status and race. A one-unit increase in socio-economic status (range -2.894 to 1.854, sd=0.8) was associated with an increase of approximately 3.75 points in the estimated mean standardized math score, and a one-unit increase in locus of control (range -3 to 1.52, sd=0.7) with an increase of 1.84 points. Race was also significantly associated with math scores. On average, controlling for the other explanatory variables, African-Americans, Native Americans and Hispanics had estimated mean standardized scores 5.5, 3.7 and 2.1 points lower than whites, respectively. Asian Pacific Islanders, on the other hand, had an estimated mean standardized score 2.4 points higher than whites. Males scored 1.3 points higher than females. Self-concept was not significant and was dropped from the model.
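Because socio-economic status and locus of control are measured on different scales, it can help to restate their coefficients per standard deviation. The short Python sketch below only multiplies the reported coefficients by the reported standard deviations; the function and variable names are hypothetical.

```python
def effect_per_sd(coefficient: float, sd: float) -> float:
    """Change in the response for a one-standard-deviation change in a predictor."""
    return coefficient * sd


ses_per_sd = effect_per_sd(3.75, 0.8)    # socio-economic status
locus_per_sd = effect_per_sd(1.84, 0.7)  # locus of control

print(round(ses_per_sd, 2))
print(round(locus_per_sd, 2))
```

On this scale, a one-standard-deviation increase in socio-economic status corresponds to roughly 3 points, and a one-standard-deviation increase in locus of control to roughly 1.3 points.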


I interacted race and future plans with socio-economic status and locus of control. The only interactions that turned out to be significant were the one between African-American and socio-economic status and the one for Native Americans attending college. The African-American by socio-economic status interaction showed that on average, controlling for other explanatory variables, the estimated increase in mean standardized score for each one-unit increase in socio-economic status was 1.8 points smaller for African-American children than for white children.
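To make the interaction concrete, the Python sketch below computes the socio-economic status slope for each group as the main effect plus the interaction coefficient. The interaction coefficient of -1.8 is the reported one. The main-effect value of 3.75 is borrowed from the model without interactions purely as an illustrative assumption, since the coefficient in the interaction model may differ. The function name is hypothetical.

```python
def ses_slope(main_effect: float, interaction: float, in_group: bool) -> float:
    """Slope of math score on socio-economic status for one group.

    With an interaction term, the slope is main_effect for the reference
    group and main_effect + interaction for the interacted group.
    """
    return main_effect + (interaction if in_group else 0.0)


MAIN_EFFECT = 3.75  # assumption: taken from the model without interactions
INTERACTION = -1.8  # reported African-American x socio-economic status term

white_slope = ses_slope(MAIN_EFFECT, INTERACTION, in_group=False)
african_american_slope = ses_slope(MAIN_EFFECT, INTERACTION, in_group=True)

print(round(white_slope, 2))
print(round(african_american_slope, 2))
```

Under that assumption the slope for African-American children would be about 1.95 points per unit of socio-economic status, compared with 3.75 for white children.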

Checking Model Assumptions

I checked for multicollinearity and heteroscedasticity, and I checked that the errors were normally distributed and that the relationship between the response variable and the continuous explanatory variables was linear. The models did not violate any of the linear regression assumptions.

2. Logistic Regression

The logistic regression showed that being held back a grade is positively associated with being male and African-American, and negatively associated with planning to attend higher-level education after graduating from high school. Being held back was also negatively associated with socio-economic status and locus of control. Controlling for other variables, a one-unit increase in socio-economic status reduced the odds of being held back by 47% and a one-unit increase in locus of control reduced the odds by 25%. Controlling for other variables, planning to attend education after high school reduced the odds of being held back by 37% compared to students not planning on a post-high-school education. Controlling for all other variables, males had 59% greater odds of being held back than females.
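The percentages above are changes in odds, not in probability. As a rough illustration, the Python sketch below converts each reported percent change in odds into an odds ratio and the corresponding logistic regression coefficient, and then applies it to a baseline probability. The 20% baseline is a made-up number for illustration, not a figure from the data, and the function names are hypothetical.

```python
import math


def odds_ratio_from_percent(percent_change: float) -> float:
    """Convert a percent change in odds (e.g. -47 or +59) to an odds ratio."""
    return 1.0 + percent_change / 100.0


def apply_odds_ratio(probability: float, odds_ratio: float) -> float:
    """Scale the odds implied by a probability and convert back to a probability."""
    odds = probability / (1.0 - probability)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Percent changes in odds reported in the text.
reported = {
    "socio-economic status (+1 unit)": -47,
    "locus of control (+1 unit)": -25,
    "post-high-school plans": -37,
    "male": 59,
}

BASELINE = 0.20  # hypothetical baseline probability of being held back

for name, percent in reported.items():
    ratio = odds_ratio_from_percent(percent)
    coefficient = math.log(ratio)  # log-odds coefficient implied by the ratio
    new_probability = apply_odds_ratio(BASELINE, ratio)
    print(f"{name}: OR={ratio:.2f}, coef={coefficient:.3f}, p={new_probability:.3f}")
```

Under that made-up 20% baseline, a 47% reduction in odds lowers the probability to about 11.7%, which is a smaller change than a 47% drop in probability would be.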

All variables retained in the final model were significant at the 5% level. Variables that were not significant at that level, such as self-concept and some of the race variables, were dropped from the model. I interacted socio-economic status with the race variables and none of the interactions were significant. After discovering that self-concept was not significant, I added quadratic and cubic polynomial terms, but they were not significant either and were dropped from the model.