global basedir http://personalpages.manchester.ac.uk/staff/mark.lunt 
global datadir $basedir/stats/6_LinearModels2/data

sysuse auto, clear
regress weight foreign
* 1.1 Foreign vehicles are, on average, about 1000 lbs lighter than US vehicles
* The difference is significant: Stata displays p = 0.000, i.e. p < 0.001
regress weight i.foreign
* 1.2 This makes no difference at all: foreign is already coded 0/1, so i.foreign creates the same single indicator
ttest weight, by(foreign)
* 1.3 the mean difference and standard error are exactly the same 
* (except for the minus sign)
graph box weight, over(foreign)
graph export graph1.eps, replace
* 1.4 There is a wider spread of weights for Domestic cars compared to Foreign cars, i.e. greater variance
by foreign: summ weight
* 1.5 the SD is much higher for Domestic (~700) compared to Foreign (~430)
hettest
* 1.6 The difference in variance is significant, so the constant-variance assumption is violated. Therefore, a standard linear model is inappropriate
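* Possible follow-up, not part of the original solution: a minimal sketch of
* standard ways to compare the two groups without assuming equal variances.
* Variance-ratio F test as a second check on the difference in spread
sdtest weight, by(foreign)
* Welch's t-test, which allows a separate variance in each group
ttest weight, by(foreign) unequal
* Same point estimate as the regression above, but with
* heteroskedasticity-robust (sandwich) standard errors
regress weight i.foreign, vce(robust)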


use $datadir/soap, clear
graph box appearance, over(operator)
graph export graph2.eps, replace

* 1.7 Operator 3 has the highest scores: 25% of scores are above 9
sort operator
by operator: summ appearance
regress appearance i.operator
* 1.9 Yes: the overall F test (Prob > F = 0.0000) tests the null hypothesis that all operators have the same mean score.
* 1.10 p = 0.0000 as displayed, i.e. p < 0.0001
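* Sketch, added for illustration and not part of the original solution: the
* overall F test in the regression header can be reproduced explicitly.
* testparm tests that all operator coefficients are jointly 0
testparm i.operator
* oneway fits the equivalent one-way ANOVA and should report the same F
oneway appearance operator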
* 1.11 Operator 1 is the baseline: there is no line for operator 1
lincom _cons + 2.operator
* 1.12 This is the same as the mean for operator 2 already given by summ above
lincom 2.operator - 3.operator
* 1.13 Yes: t = -6.04, p < 0.001
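* Sketch, added for illustration and not part of the original solution:
* instead of typing one lincom per comparison, margins lists the fitted mean
* for every operator and pwcompare tests every pairwise difference.
* Both use the regress appearance i.operator fit still held in memory.
margins operator
pwcompare operator, effects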

use $datadir/cadmium, clear
scatter capacity age
graph export graph3.eps, replace

regress capacity age
* 2.2 The regression coefficient for age is negative, showing that capacity decreases as age increases. 
gen cap1 = capacity if exposure == 1
gen cap2 = capacity if exposure == 2
gen cap3 = capacity if exposure == 3
scatter cap1 cap2 cap3 age
graph export graph4.eps, replace

regress capacity i.exposure
* 2.3 It's borderline, p = 0.09
regress capacity age i.exposure
testparm i.exposure
* 2.4 After adjusting for age, there are no significant differences between groups
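* Possible check, not part of the original solution: if the exposure groups
* differ in age, part of the unadjusted difference in capacity may simply
* reflect age. A minimal sketch comparing the age distribution by group:
tabstat age, by(exposure) statistics(mean sd n)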
predict ppred, xb
gen ppred1 = ppred if exposure == 1
gen ppred2 = ppred if exposure == 2
gen ppred3 = ppred if exposure == 3
scatter cap1 cap2 cap3 age || line ppred1 age || line ppred2 age || /* 
*/      line ppred3 age
graph export graph5.eps, replace

regress capacity i.exposure##c.age
testparm i.exposure#c.age
* 2.5 Yes, the slopes in the different exposure groups are different
predict ipred, xb
gen ipred1 = ipred if exposure == 1
gen ipred2 = ipred if exposure == 2
gen ipred3 = ipred if exposure == 3
scatter cap1 cap2 cap3 age || line ipred1 age || line ipred2 age || /* 
*/      line ipred3 age
graph export graph6.eps, replace

* 2.6 The least steep is in the baseline (least exposed group)
* The steepest is in the most exposed group
* Slope of capacity on age in exposure group 3: baseline slope plus interaction
lincom age + 3.exposure#c.age
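* Sketch, added for illustration and not part of the original solution: the
* age slope in every exposure group can be obtained in one step from the
* interaction model still in memory, rather than one lincom per group.
margins exposure, dydx(age)
* The same calculation by hand for group 2
lincom age + 2.exposure#c.age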

use $datadir/hald, clear
sw regress y x1 x2 x3 x4, pe(0.05)
* 3.1 x1 & x4 are retained
sw regress y x1 x2 x3 x4, pr(0.05)
* 3.2 This time x1 & x2 are retained
sw regress y x1 x2 x3 x4, pe(0.05) pr(0.0500005)
* 3.3 This is the same as the backwards model
corr x*
* 3.4 Correlation between x2 & x4 is -0.97
* 3.5 x2 & x4 are very strongly correlated: they contain the same information, so they are largely interchangeable
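* Possible illustration, not part of the original solution: if x2 and x4 carry
* nearly the same information, swapping one for the other alongside x1 should
* change the fit only modestly. Compare R-squared across the two models.
regress y x1 x2
regress y x1 x4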
regress y x1 x2 x3 x4
* 3.6 The F statistic says that the model is very highly significant: data like these would be extremely unlikely if all coefficients were 0
* 3.7 98% of the variance is explained
* 3.8 None of the coefficients is individually significant, because the strong correlations between the predictors inflate their standard errors
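* Sketch, added for illustration and not part of the original solution:
* variance inflation factors quantify how much collinearity inflates the
* variance of each coefficient in the full model fitted above. A common rule
* of thumb (an assumption here, not from the original) treats VIF > 10 as a
* warning sign.
estat vif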

use $datadir/growth, clear
scatter weight week
graph export graph7.eps, replace

* 4.1 The line does not look quite straight: there appears to be some curvature
regress weight week
cprplot week
* 4.2 There is definitely curvature around the line
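* Optional extras, not part of the original solution: a lowess smooth on the
* component-plus-residual plot makes the curvature easier to judge, and a
* residual-versus-fitted plot should show the same systematic pattern.
cprplot week, lowess
rvfplot, yline(0)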
gen week2 = week * week
regress weight week week2
* 4.3 week2 is very highly significant (p < 0.001)
predict pred2, xb
twoway scatter weight week || line pred2 week
graph export graph8.eps, replace

* 4.4 The quadratic prediction curve fits the data very well
gen week3 = week2*week
regress weight week week2 week3
* 4.5 week3 is not significant
corr week*
* 4.6 Correlation between week and week2 is 0.97
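* Possible remedy, not part of the original solution: centring week before
* squaring usually reduces the correlation between the linear and quadratic
* terms considerably. The names cweek and cweek2 are hypothetical, chosen here
* for illustration.
summarize week, meanonly
gen cweek = week - r(mean)
gen cweek2 = cweek^2
corr cweek cweek2
* The fitted curve is the same as with week and week2; only the
* parameterisation (and hence the coefficient for the linear term) changes
regress weight cweek cweek2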