global basedir http://personalpages.manchester.ac.uk/staff/mark.lunt 
global datadir $basedir/stats/7_binary/data

use $datadir/epicourse, clear
tab hip_p sex, co
* 1.1 Prevalence is 9.84% in men, 15.23% in women
tab hip_p sex, co chi2
* 1.2 The difference in prevalence between men and women is highly significant
cs hip_p sex, or
* 1.3 Confidence interval for the odds ratio is (1.37, 1.97)
* 1.4 The odds ratio and the relative risk are very similar
* 1.5 Yes, the confidence interval does not contain 0, which is the null hypothesis risk difference
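* Illustration for 1.4 (an added sketch, not part of the original solution):
* approximate relative risk and odds ratio recomputed from the rounded
* prevalences quoted in 1.1, assuming women are treated as the exposed group.
* These come out at roughly 1.55 and 1.65; expect small rounding differences
* from the cs output.
display "RR ~ " %5.2f 0.1523/0.0984
display "OR ~ " %5.2f (0.1523/(1-0.1523))/(0.0984/(1-0.0984))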
logistic hip_p sex
* 1.6 The odds ratio is exactly the same as that produced by cs
* 1.7 The confidence intervals are the same to 3 decimal places (the methods used to calculate them differ, but generally give very similar results)
egen agegp = cut(age), at(0 30(10)100)
label define age 0 "<30" 30 "30-39" 40 "40-49" 50 "50-59"
label define age 60 "60-69" 70 "70-79" 80 "80-89" 90 "90+", modify
label values agegp age
tab agegp hip_p, chi2
* 2.1 Yes: chi2 is highly significant
logistic hip_p age sex 
* 2.2 Yes: p < 0.001 (displayed by Stata as 0.000)
* 2.3 Odds of hip pain are multiplied by 1.03 (i.e. increase by about 3%) for each year increase in age
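* Illustration for 2.3 (an added sketch, not part of the original solution):
* odds ratios act multiplicatively, so a 10-year age difference corresponds
* to roughly 1.03^10, about 1.34, when the rounded per-year odds ratio is used.
* The second line uses the stored coefficient from the model above to get the
* unrounded figure.
display 1.03^10
display exp(10*_b[age])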
logistic hip_p i.sex##c.age
* 2.4 No: the interaction term i.sex#c.age is not significant (p=0.118)
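* Illustration for 2.4 (an added sketch, not part of the original solution):
* after the interaction model above, the per-year odds ratio for age within
* each sex can be recovered with lincom. This assumes sex is coded 0/1, as
* the later "if sex == 1" and "if sex == 0" conditions suggest.
lincom age, or
lincom age + 1.sex#c.age, or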
logistic hip_p sex i.agegp
* 2.5 Odds for a man aged 50-59 are 7.74 times the odds for a man aged less than 30
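* Added sketch (not part of the original solution): a joint Wald test of all
* the age-group indicators, as a possible complement to reading off the
* individual odds ratios from the model above.
testparm i.agegp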
logistic hip_p age sex 
estat gof
* 3.1 Yes. However, this is not really appropriate, since there are so many covariate patterns. It would be better to use only 10 groups
estat gof, group(10)
* 3.1 In this case, there is evidence that the predicted and observed values differ more than can be explained by random variation
lroc
graph export graph1.eps, replace
logistic hip_p i.agegp sex 
estat gof
estat gof, group(10)
* 3.3 Yes, this model is adequate
lroc
graph export graph2.eps, replace
gen age2= age*age
logistic hip_p age age2 sex 
estat gof, group(10) table
* 3.5 Yes, the coefficient for age2 is highly significant, and there is
* no longer evidence of lack of fit.
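* Illustration for 3.5 (an added sketch, not part of the original solution):
* the age at which the fitted quadratic in the log odds turns over is
* -b_age / (2 * b_age2). This is a maximum only if the age2 coefficient is
* negative, which is an assumption here and should be checked in the output.
display -_b[age]/(2*_b[age2])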
lroc
graph export graph3.eps, replace
* 3.6 The area under the curve with this model is similar to that from the
*     model using age as a categorical predictor.
predict p
predict db, dbeta
scatter db p
graph export graph4.eps, replace
* 4.1 No, there are no points that are obvious outliers
* However, there are 4 points that may be worth checking
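* Added sketch (not part of the original solution): one possible way to see
* which covariate patterns give the largest dbeta values. dbeta is computed
* per covariate pattern, so duplicates are dropped before listing. The choice
* of listing the top 4 simply mirrors the "4 points" mentioned above.
preserve
keep age sex p db
duplicates drop
gsort -db
list in 1/4
restore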
predict d, ddeviance
scatter d p
graph export graph5.eps, replace
* 4.2 Again, there is no evidence of any outliers
scatter p age
graph export graph6.eps, replace
* 4.3 The two lines are the predicted prevalences in men and women
graph twoway scatter p age || lowess hip_p age if sex == 1 || lowess hip_p age if sex == 0
graph export graph7.eps, replace
* 4.4 The fit is good for men, but poor for women over 80
* The quadratic model is reasonable for men, but not for women
use $datadir/chd, clear
sort agegrp
by agegrp: egen agemean = mean(age)
by agegrp: egen chdprop = mean(chd)
label var agemean "Mean age"
label var chdprop "Proportion of subjects with CHD"
scatter chdprop agemean
graph export graph8.eps, replace
logistic chd age
* 5.1 Odds ratio is about 1.12 per year
predict p
predict db, dbeta
predict d, ddeviance
scatter db p
graph export graph9.eps, replace
* 5.3 Yes, there is one influential point, with db ~ 0.25
summ db, detail
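* Added sketch (not part of the original solution): list the influential
* covariate pattern before refitting. The 0.2 cut-off mirrors the condition
* used in the next command; missing values of db are excluded explicitly
* because Stata treats missing as larger than any number.
list age chd p db if db > 0.2 & !missing(db)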
logistic chd age if db < 0.2
* 5.4 The effect on the odds ratio is small: a very slight increase to 1.13