Introduction
For this week’s take-home lab, you will work on the same dataset used in the Week 4/5 Take-Home Labs. You will solve the same problem studied in this week’s in-class lab, but on a much larger and more interesting dataset. The file UCI_Credit_Card.csv contains 30,000 consumer records with 24 variables. You can read a detailed description of the fields at the following website:
https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
The UCI description does not match the data exactly for some variables:
• Marriage: the documented levels for marital status are 1 = married; 2 = single; 3 = others. However, the data contain levels 0, 1, 2, and 3. Treat 0 as unknown.
• Education: the documented levels are 1 = graduate school; 2 = university; 3 = high school; 4 = others. However, the data contain levels 1 to 6. Treat 5 and 6 as unknown.
• X6-X11 (repayment status): the documented measurement scale is -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above. However, many records have the value -2. Treat -2 as unknown as well.
Treat every unknown value as NA.
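The recoding rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names (MARRIAGE, EDUCATION, PAY_0, PAY_2 to PAY_6) are assumed from the usual header of UCI_Credit_Card.csv and are not given in this handout, which refers to the repayment fields only as X6-X11. The helper name recode_unknowns is hypothetical, and None stands in for NA.

```python
# Codes this handout says to treat as unknown, keyed by column name.
# ASSUMPTION: column names follow the usual UCI_Credit_Card.csv header;
# check them against your own copy of the file.
UNKNOWN_CODES = {
    "MARRIAGE": {0},
    "EDUCATION": {5, 6},
    **{col: {-2} for col in ("PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6")},
}


def recode_unknowns(row):
    """Return a copy of one record with unknown codes replaced by None (NA)."""
    cleaned = dict(row)
    for col, codes in UNKNOWN_CODES.items():
        value = cleaned.get(col)
        # int() also accepts the strings produced by csv.DictReader, e.g. "-2".
        if value is not None and int(value) in codes:
            cleaned[col] = None
    return cleaned


# Tiny hand-made records for illustration, not taken from the real file.
sample = [
    {"MARRIAGE": 0, "EDUCATION": 2, "PAY_0": -1},
    {"MARRIAGE": 1, "EDUCATION": 6, "PAY_0": -2},
]
cleaned_sample = [recode_unknowns(r) for r in sample]
```

The same function can be applied to each row read with csv.DictReader. Columns that are absent from a record are left untouched.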
Your task is to build the best possible model for predicting whether or not a consumer will default on their credit card payment for the next month (the last column in the dataset).
Assignment
Perform the following tasks:
• Conduct a training/test split of the data, holding out 20% of the records as a test dataset.
• Fit the best random forest (RF) model you can (consider feature selection, etc.) to the data to predict consumer default.
• Then plot ROC curves for the logistic regression, SVM, KNN, CART, and RF models, and compare their performance.
• Compute the AUC for the logistic regression, SVM, KNN, CART, and RF models, and compare their performance.
• Provide a summary and discussion of your work in written form (.docx or .pdf) that includes the following:
o Q1 Summarize the model/feature selection process you used to fit your RF model.
o Q2 Provide a summary of the fitted RF model (i.e., model summary).
o Q3 Provide a performance evaluation of the fitted RF model using a confusion matrix.
o Q4 How well do you think the fitted RF model works on this dataset?
o Q5 Using ROC curves and AUC, which of the logistic regression, SVM, KNN, CART, and RF models performs best on the dataset overall?
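As a rough illustration of two steps in the task list, the sketch below shows a reproducible 80/20 split of row indices and a direct computation of AUC from predicted scores. It is not the required method: the function names, the seed, and the pairwise AUC formula (the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter, with ties counted as one half) are choices made for this example, and it uses only the Python standard library.

```python
import random


def train_test_split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and hold out test_fraction of them as the test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels are 0/1 (1 = default); scores are predicted default probabilities.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# 30,000 records, as stated in the introduction: 24,000 training and 6,000 test rows.
train_idx, test_idx = train_test_split_indices(30000)
```

Computing this value on the held-out rows for each of the five models gives numbers that can be compared directly. The pairwise loop is quadratic in the number of test rows, which is acceptable for a 6,000-row test set, though a library routine would normally be faster.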