Time Series Analysis of Social Interaction


For this project I conducted a time series analysis of how frequently people visited relatives from 1974 through 2010. I expected that, on average, socializing with relatives changed during this period because people increasingly live farther from the region where they were raised, because of changes in income, and because of generally changing patterns in family dynamics.


Net of all other factors and at any point in time, the analysis revealed that having children increases social time with relatives and living away from one’s hometown decreases it.

Data and Software

Data: General Social Survey

Software: R

Description of Variables

Dependent Variable

Socializing with Relatives (socrel): Frequency of spending a social evening with relatives. Indexed from 1 to 7, 7=almost daily, 6=several times a week, 5=several times a month, 4=once a month, 3=several times a year, 2=once a year, 1=never.

Independent Variables

Distance (dist): Calculated by taking the absolute value of the difference between the region of the United States where the respondent was living at age 16 and the region of interview. Regions are coded 1. New England, 2. Middle Atlantic, 3. East North Central, 4. West North Central, 5. South Atlantic, 6. East South Central, 7. West South Central, 8. Mountain, 9. Pacific.

Real Income (realinc): Real income of respondent in US dollars.

Children (childs): Number of children respondent has had.

Hours Watch TV (tvhours): On an average day, the number of hours the respondent watches TV.

Socializing at Bar (socbar): Frequency of attending a bar. Indexed from 1 to 7, 1=Never, 7=Daily

Marital Status (marital): Marital status, indexed as a continuous variable, 1=Married, 2=Widowed, 3=Divorced, 4=Separated, 5=Never married

Year: From 1974 through 2010. Years were interpolated for 1979, 1981, 1992, and every other year from 1995 through 2009.
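The two data-preparation steps described above (the distance variable and the filled-in years) can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the project. The function names are hypothetical, the example values are made up, and linear interpolation is an assumption, since the interpolation method is not stated.

```python
def region_distance(region_at_16, region_now):
    """Absolute difference between two region codes (1-9)."""
    return abs(region_at_16 - region_now)


def interpolate_missing_years(yearly_means, first_year, last_year):
    """Fill unsurveyed years by linear interpolation (assumed method).

    first_year and last_year are expected to be surveyed years, so every
    gap has a surveyed year on both sides.
    """
    known_years = sorted(yearly_means)
    filled = {}
    for year in range(first_year, last_year + 1):
        if year in yearly_means:
            filled[year] = yearly_means[year]
            continue
        # Nearest surveyed years on either side of the gap.
        before = max(y for y in known_years if y < year)
        after = min(y for y in known_years if y > year)
        weight = (year - before) / (after - before)
        filled[year] = yearly_means[before] + weight * (
            yearly_means[after] - yearly_means[before]
        )
    return filled


# Hypothetical yearly means of socrel, for illustration only.
example = {1978: 4.0, 1980: 5.0}
filled = interpolate_missing_years(example, 1978, 1980)
print(region_distance(9, 1), filled[1979])
```

Because the region codes are labels rather than measured distances, the difference between codes is only a rough proxy for how far someone has moved.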


Step 1: Correlation between Dependent and Independent Variables

The first step in a time series analysis is to calculate the correlation between the dependent variable (in this case frequency of visiting relatives) and the independent variables.

           socrel     dist  realinc     year   childs  tvhours  marital   socbar
socrel       1.00    -0.10     0.60     0.62    -0.25     0.11     0.46     0.04
dist        -0.10     1.00     0.22     0.49    -0.66    -0.15     0.56     0.22
realinc      0.60     0.22     1.00     0.69    -0.63    -0.48     0.52    -0.04
year         0.62     0.49     0.69     1.00    -0.82     0.00     0.94     0.09
childs      -0.25    -0.66    -0.63    -0.82     1.00     0.34    -0.83    -0.05
tvhours      0.11    -0.15    -0.48     0.00     0.34     1.00     0.12     0.14
marital      0.46     0.56     0.52     0.94    -0.83     0.12     1.00     0.00
socbar       0.04     0.22    -0.04     0.09    -0.05     0.14     0.00     1.00

Since socializing at a bar is not significantly correlated with socializing with relatives, it was not included in the model.
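As a rough illustration of how each entry in the correlation matrix is obtained, and of why r=0.04 is not significant, here is a minimal Python sketch (the project itself used R). The function names are hypothetical. The sample size of 37 is an assumption based on one observation per year from 1974 through 2010.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def r_to_t(r, n):
    """t statistic for testing r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# socbar vs. socrel: r = 0.04, assuming 37 yearly observations.
t_socbar = r_to_t(0.04, 37)
print(round(t_socbar, 2))
```

A t statistic this close to zero is far below the usual critical value of roughly 2, which is consistent with dropping socbar from the model.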

Step 2: Plots

It is always a good idea to start an analysis by plotting the association between the dependent and independent variables.


The graph shows a slight decrease in time spent with relatives as distance increases (r=-0.10). I expected the association to be stronger. One reason for the weak association may be that 78% of the respondents took the survey in the region in which they lived when they were 16 (for these cases dist=0).



The graph shows that as real average income increases, spending time with relatives increases as well (r=0.60).


Number of Children

The graph shows that as the number of children increases, time spent with relatives decreases (r=-0.25). One possible reason is that people who are occupied with their children have less time to spend with other relatives.


Marital Status

The graph shows that as the marital status index increases (moving from married toward never married), people spend more time visiting relatives (r=0.46).



The graph shows that on average the trend of visiting relatives increases from 1974 to 2010. The frequency of visiting relatives in 1974 on average is 4.6, and it dips to a low of 4.4 in 1989 and then increases steadily from 4.5 in 1998 to just above 4.7 in 2010.


Step 3: The Basic Model

The focus of the analysis was to explore the association between socializing with relatives and distance living away from relatives over time. I ran an initial model with just distance as the independent variable and it was not statistically significant. However, when I added year as a second independent variable, the model produced these results:

Coefficients   Estimate   Std. Error   t-value   Probability
Intercept        -6.720         2.11     -3.18   0.0032 **
dist             -0.801         0.25     -3.16   0.0033 **
year              0.006         0.00      5.40   5.2e-06 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.0632 on 34 degrees of freedom, Multiple R-squared: 0.466, Adjusted R-squared: 0.435, F-statistic: 14.9 on 2 and 34 DF, p-value: 2.3e-05

The model shows that, net of time, as a person moves one region away from their hometown, the index of time spent with relatives decreases by 0.80. Year is also statistically significant, but the effect size is small: each additional year increases the index by 0.006.

One of the most important technical issues in time series analysis is serial correlation. If serial correlation exists, the model will be inaccurate: standard errors and p-values may overstate statistical confidence. Serial correlation can be tested using the Durbin-Watson test:

lag   Autocorrelation   D-W Statistic   p-value
1              0.4902          0.7917     0.000
2              0.1859          1.3404     0.026
3              0.2431          1.1725     0.014

The D-W statistic is not close to 2 and p<0.05 for all three lags, so there is significant autocorrelation in the residuals of the model.
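To make the statistic concrete, the Durbin-Watson value at a given lag is the sum of squared differences between residuals that many steps apart, divided by the sum of squared residuals. Below is a minimal Python sketch of that formula. It is not the R routine that produced the table above, the function name is hypothetical, and the residuals are made-up values chosen so the arithmetic is easy to follow.

```python
def durbin_watson(residuals, lag=1):
    """Durbin-Watson statistic for residuals compared `lag` steps apart."""
    numerator = sum(
        (residuals[t] - residuals[t - lag]) ** 2
        for t in range(lag, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Made-up residuals: alternating signs (negative autocorrelation)
# versus a constant run (strong positive autocorrelation).
alternating = [1, -1, 1, -1]
constant = [1, 1, 1, 1]
print(durbin_watson(alternating), durbin_watson(constant))
```

Values near 2 indicate little first-order autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation.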

Step 4: Multivariable Model

Although the simpler model above had significant serial correlation, I explored adding the remaining independent variables to the model:

Coefficients    Estimate   Std. Error   t-value   Probability
Intercept         -13.12    3.033e+00    -4.325   0.000139 ***
Income         9.881e-06    3.734e-06     2.646   0.012520 *
Childs         5.298e-01    1.212e-01     4.371   0.000122 ***
Year           8.356e-03    1.469e-03     5.686   2.7e-06 ***
Distance      -4.568e-01    2.047e-01    -2.231   0.032811 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.04812 on 32 degrees of freedom, Multiple R-squared: 0.7091, Adjusted R-squared: 0.6727, F-statistic: 19.5 on 4 and 32 DF, p-value: 3.252e-08


The model was tested for multicollinearity using the variance inflation factor (VIF):

The VIF for the final model:

realinc     childs      year        dist
1.774162    3.333982    3.933339    1.429554

None of the VIFs is higher than 10, so although there is some collinearity, it is not severe.
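One way to see what a VIF measures: the VIF of each predictor equals the corresponding diagonal entry of the inverse of the predictors' correlation matrix. The Python sketch below illustrates that identity with a hypothetical two-predictor example, where the result reduces to 1/(1 - r^2). It is not the R call used for the values above, and the helper names are made up for illustration.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (perfect collinearity)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif_from_correlations(corr):
    """VIFs are the diagonal of the inverse predictor correlation matrix."""
    inverse = invert(corr)
    return [inverse[i][i] for i in range(len(corr))]


# Hypothetical example: two predictors correlated at r = 0.6.
vifs = vif_from_correlations([[1.0, 0.6], [0.6, 1.0]])
print(vifs)  # each VIF is 1 / (1 - 0.6**2)
```

With uncorrelated predictors the correlation matrix is the identity and every VIF is 1, which is the baseline against which the cutoff of 10 is judged.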

The Durbin-Watson test showed first-order serial correlation. I also checked the variables in my model for unit roots using the Dickey-Fuller test (with drift). All of them have unit roots at lag order 4, and a couple (income and childs) do not have a unit root at lag 0. This indicates that the coefficients in the multivariable model above are not reliable and must be corrected with a method such as cointegration. Although these results are an improvement over the more basic model, the model needs to be corrected.

Step 5: Final Model

I first attempted to correct the model using a first difference time series regression, but none of the variables was statistically significant (p>>0.05).
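For readers unfamiliar with first differencing, the idea is to replace each series by its year-to-year change before running the regression. A minimal Python sketch, with a hypothetical function name and made-up values:

```python
def first_difference(series):
    """Return the change from each observation to the next."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


# Made-up yearly values, for illustration only.
levels = [1, 3, 6]
changes = first_difference(levels)
print(changes)
```

Differencing removes a shared trend from the variables but also shortens each series by one observation, which leaves less information for estimating the coefficients.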

The final step was to eliminate serial correlation using an autoregressive integrated moving average (ARIMA) model.

Here are the results for the final model:

Coefficients    Estimate   Probability
Intercept        -14.176   0.000146 ***
Income         9.657e-06   0.012520 *
Childs             0.482   0.000147 ***
Year               0.009   2.23e-06 ***
Distance          -0.597   0.047811 *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
sigma^2 estimated as 0.00164: log likelihood = 65.95, aic = -117.


All of the coefficients are statistically significant (p<0.05). For each additional child a person has, the index of time spent with relatives increases by 0.482, net of all other factors and at any point in time. In addition, living one region farther from one’s hometown decreases the index by 0.597. For income and year the effect size is much smaller. For every $1 increase in income, the index increases by a mere 9.657e-06 (about 0.097 for every $10,000), and each additional year increases the index by 0.009.


The Ljung-Box test was used to test for serial correlation:

data: resid(arima.001)

X-squared = 15.22, df = 20, p-value = 0.7637

This test shows no evidence of remaining autocorrelation in the residuals (p>>0.05), so the coefficient estimates and their p-values are more reliable than those of the earlier models.
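The Ljung-Box statistic and its p-value can be sketched in a few lines. The Python code below is an illustration, not the R routine used for the output above, and its function names are hypothetical. It computes Q from the residual autocorrelations and converts a chi-squared statistic to a p-value using a closed form that is valid only for an even number of degrees of freedom, which covers the df = 20 reported here.

```python
import math


def autocorrelation(x, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denominator = sum((v - mean) ** 2 for v in x)
    numerator = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return numerator / denominator


def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(residuals)
    return n * (n + 2) * sum(
        autocorrelation(residuals, k) ** 2 / (n - k)
        for k in range(1, max_lag + 1)
    )


def chi2_sf_even_df(x, df):
    """Upper-tail chi-squared probability; closed form for even df only."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even df")
    half = x / 2
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Reported test result: X-squared = 15.22 on 20 degrees of freedom.
p_value = chi2_sf_even_df(15.22, 20)
print(round(p_value, 4))
```

Feeding in the reported statistic and degrees of freedom should give a p-value close to the 0.7637 shown in the output above.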