This exercise uses the College data set from An Introduction to Statistical Learning by
Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.
It contains a number of variables for 777 different
universities and colleges in the US.
The variables are:
• Private : Public/private indicator
• Apps : Number of applications received
• Accept : Number of applicants accepted
• Enroll : Number of new students enrolled
• Top10perc : New students from top 10% of high school class
• Top25perc : New students from top 25% of high school class
• F.Undergrad : Number of full-time undergraduates
• P.Undergrad : Number of part-time undergraduates
• Outstate : Out-of-state tuition
• Room.Board : Room and board costs
• Books : Estimated book costs
• Personal : Estimated personal spending
• PhD : Percent of faculty with Ph.D.’s
• Terminal : Percent of faculty with terminal degree
• S.F.Ratio : Student/faculty ratio
• perc.alumni : Percent of alumni who donate
• Expend : Instructional expenditure per student
• Grad.Rate : Graduation rate
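Several of the columns above are raw counts or dollar amounts, so comparisons between private and public colleges usually need derived rates first. The following is a minimal sketch of that step using pandas on a few made-up rows, not the real data. The names `toy`, `add_derived_rates`, `AcceptRate`, and `ExpendToTuition` are hypothetical, and the "Yes"/"No" coding of Private is an assumption about how the file is encoded.

```python
import pandas as pd

# Made-up rows for illustration only; the real data set has 777 rows.
# Assumption: Private is coded "Yes"/"No".
toy = pd.DataFrame(
    {
        "Private": ["Yes", "Yes", "No", "No"],
        "Apps": [1000, 2000, 5000, 8000],
        "Accept": [800, 1000, 4000, 7200],
        "Outstate": [20000, 25000, 10000, 8000],
        "Expend": [10000, 20000, 6000, 6000],
        "Grad.Rate": [70, 90, 50, 60],
    }
)


def add_derived_rates(df):
    """Return a copy of df with two derived columns (hypothetical names)."""
    out = df.copy()
    # Share of applicants who were accepted.
    out["AcceptRate"] = out["Accept"] / out["Apps"]
    # Instructional expenditure per student relative to out-of-state tuition.
    out["ExpendToTuition"] = out["Expend"] / out["Outstate"]
    return out


# Group means, one row per value of Private.
summary = (
    add_derived_rates(toy)
    .groupby("Private")[["Outstate", "AcceptRate", "Grad.Rate", "ExpendToTuition"]]
    .mean()
)
print(summary)
```

With the real data, `toy` would be replaced by the loaded College table; means are only one possible summary, and medians or box plots may be more robust to outliers.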
This exercise aims to:
1. Produce a comparative analysis of private and public colleges, e.g. tuition, acceptance rate, graduation rate, and instructional expenditure as a percentage of tuition.
2. Demonstrate how to use statistical methods such as logistic regression, LDA, QDA, and KNN
to predict whether a school is public or private from the available data.
3. Combine the available data with school rating data (to be scraped from USNews with Python) and use it to predict the ratings. Again, different models will be tried and compared to find the best one.
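The model comparison in the second goal could be set up along the following lines. This is a sketch under stated assumptions, not the exercise's actual code: it uses scikit-learn, synthetic features in place of the College predictors, and 5-fold cross-validated accuracy as the comparison metric. The helper `compare_models` and the choice of k=5 for KNN are hypothetical.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption: 1 = private, 0 = public).
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
# Three made-up features whose means are shifted for class 1.
X = rng.normal(size=(n, 3)) + y[:, None] * 1.5

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
}


def compare_models(X, y, models, folds=5):
    """Return the mean cross-validated accuracy for each model."""
    scores = {}
    for name, model in models.items():
        # Standardize inside the pipeline so scaling is fit on training folds only.
        pipeline = make_pipeline(StandardScaler(), model)
        fold_scores = cross_val_score(pipeline, X, y, cv=folds, scoring="accuracy")
        scores[name] = fold_scores.mean()
    return scores


scores = compare_models(X, y, models)
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Scaling matters most for KNN, which is distance-based; keeping it inside the pipeline avoids leaking test-fold information into the scaler. For the real data, `X` would hold the numeric College columns and `y` the Private indicator.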