Distinction Task 10.1D: Model evaluation metrics

Task description:

Model evaluation and selection techniques are among the most important tools in a data scientist’s toolbox. So far, we have introduced many model evaluation methods and metrics, such as GridSearchCV, cross_val_score, the confusion matrix, precision, recall and F-score. In reality, classification problems rarely have balanced classes, and false positives and false negatives often have very different consequences. We need to understand what these consequences are and pick an evaluation metric accordingly, so that we select the right model for the given dataset.
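To see why the choice of metric matters on imbalanced classes, here is a small sketch in plain Python. The counts are hypothetical and chosen for illustration only; they are not taken from creditcard.csv.

```python
# Hypothetical confusion-matrix counts for a heavily imbalanced test set.
tp, fn = 60, 40        # 100 actual positives (the rare class)
fp, tn = 50, 99_850    # 99,900 actual negatives

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)          # of the flagged cases, how many were real
recall = tp / (tp + fn)             # of the real cases, how many were caught
f1 = 2 * precision * recall / (precision + recall)

# A model that always predicts the majority class finds no positives at all,
# yet its accuracy is almost the same as the model above.
baseline_accuracy = (fp + tn) / total
baseline_recall = 0 / (tp + fn)

print(f"accuracy={accuracy:.4f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"always-negative baseline: accuracy={baseline_accuracy:.4f} recall={baseline_recall:.1f}")
```

Accuracy barely separates the two models, while recall makes the 40 missed positives visible. That is why the task asks you to score on recall.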

In this task, you are given the dataset “creditcard.csv” used in practical10. Based on the code example provided in practical10, try to find the “best” classification model by comparing the evaluation metrics, especially the recall rates, produced by the knn, decision tree and random forest models.

You are given:

• Dataset: creditcard.csv

• thresholds = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]

• Parameter grid (param_grid):

For knn, n_neighbors = [1, 2, 3, 4, 5]

For decision tree, max_depth = [3, 4, 5, 6, 7]

For random forest, n_estimators = [5, 10, 20, 50]

• GridSearchCV(model_classifier(random_state=0), {param: param_grid}, cv=5, scoring='recall')

• Any other parameters you choose to set
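The settings listed above can be sketched as follows. This is a minimal illustration, not the practical10 code: the data is a synthetic stand-in created with make_classification, and the helper name search_best_params is hypothetical. Note that KNeighborsClassifier does not accept a random_state argument, so it is constructed without one here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic, roughly balanced data standing in for the undersample split.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = (
    train_test_split(X, y, test_size=0.3, random_state=0)
)

# Parameter grids exactly as given in the task.
searches = {
    "knn": (KNeighborsClassifier(), {"n_neighbors": [1, 2, 3, 4, 5]}),
    "decision tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [3, 4, 5, 6, 7]}),
    "random forest": (RandomForestClassifier(random_state=0), {"n_estimators": [5, 10, 20, 50]}),
}


def search_best_params(model, param_grid, X_train, y_train):
    """Run a 5-fold grid search scored on recall and return the fitted search."""
    grid = GridSearchCV(model, param_grid, cv=5, scoring="recall")
    grid.fit(X_train, y_train)
    return grid


best_params = {}
for name, (model, param_grid) in searches.items():
    grid = search_best_params(model, param_grid, X_train_undersample, y_train_undersample)
    best_params[name] = grid.best_params_
    print(f"{name}: best params {grid.best_params_}, mean CV recall {grid.best_score_:.3f}")
```

With the default refit=True, grid.best_estimator_ is already refitted on the full training data with the best parameter, so it can be used directly for prediction.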

You are asked to:

• use the train and test sets split in practical10 (X_train, X_test, y_train, y_test, and X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample)

• use grid search with cross-validation (cv=5) to fit the knn, decision tree and random forest models, respectively, to the undersample data

• find and print the best parameter for each model (knn, decision tree and random forest) on the X_train_undersample dataset

• for each model, build a classifier using the best parameter found, predict on both test sets (X_test_undersample and X_test), and plot the confusion matrix for each of the two predictions

• for each model, plot recall matrices for the different thresholds on the undersample dataset

• for each model, plot the precision-recall curve for the undersample dataset
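The threshold and precision-recall steps can be sketched as below. The data is a synthetic stand-in, and the random forest with n_estimators=20 is a placeholder rather than a tuned result. The sketch computes the numbers behind the plots and leaves the plotting itself out.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic balanced data in place of the undersample split.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # estimated probability of the positive class

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
recalls = []
for t in thresholds:
    # Predict positive only when the estimated probability reaches the threshold.
    y_pred = (proba >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    recalls.append(recall_score(y_test, y_pred))
    print(f"threshold={t:.1f} tn={tn} fp={fp} fn={fn} tp={tp} recall={recalls[-1]:.3f}")

# Precision and recall at every distinct score, ready to plot as recall (x) against precision (y).
precision, recall, pr_thresholds = precision_recall_curve(y_test, proba)
```

Raising the threshold can only shrink the set of predicted positives, so recall never increases from one threshold to the next. For the plots, scikit-learn's ConfusionMatrixDisplay and PrecisionRecallDisplay are one option, as is plotting the arrays directly with matplotlib.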

Note: It is very likely that the best parameters found for the undersample dataset will not work well for the whole skewed dataset, which is normal. The ideal solution is to use GridSearchCV to find the best parameters for the whole skewed dataset and then use them to build a new classifier for that dataset. However, this takes too long on an office or home computer because of the size of the whole skewed dataset and the resources required. If conditions allow, you are encouraged to try it. In this task, we will mainly work with the undersample dataset.
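For reference, random undersampling can be sketched with NumPy as below. The practical10 split is not reproduced in this brief, so this shows one common approach under that assumption; the toy arrays and the 20-to-980 class ratio are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy skewed labels: 1 marks the minority class (hypothetical data).
y = np.array([1] * 20 + [0] * 980)
X = rng.normal(size=(y.size, 4))

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Keep every minority row and draw an equal number of majority rows without replacement.
sampled_majority_idx = rng.choice(majority_idx, size=minority_idx.size, replace=False)
undersample_idx = rng.permutation(np.concatenate([minority_idx, sampled_majority_idx]))

X_undersample, y_undersample = X[undersample_idx], y[undersample_idx]
print(X_undersample.shape, y_undersample.mean())  # (40, 4) 0.5
```

The undersample set is balanced but discards most majority rows, which is one reason parameters tuned on it may not transfer to the whole skewed dataset.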

It is also recommended that you define functions for searching for the best parameters, plotting curves and matrices, etc., as each model will use similar code to produce its output.
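One possible shape for such a reusable function is sketched below: it takes a fitted model and a dictionary of named test sets and returns the confusion matrix and recall for each. The function name evaluate_on_test_sets, the synthetic data and the max_depth=4 placeholder are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def evaluate_on_test_sets(model, test_sets):
    """Return {name: (confusion matrix, recall)} for an already fitted model."""
    results = {}
    for name, (X_eval, y_eval) in test_sets.items():
        y_pred = model.predict(X_eval)
        cm = confusion_matrix(y_eval, y_pred, labels=[0, 1])
        results[name] = (cm, recall_score(y_eval, y_pred))
    return results


# Assumption: synthetic skewed data (about 10% positives) in place of creditcard.csv.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Balanced subset of the test set: every positive plus an equal number of negatives.
rng = np.random.default_rng(0)
pos_idx = np.flatnonzero(y_test == 1)
neg_idx = rng.choice(np.flatnonzero(y_test == 0), size=pos_idx.size, replace=False)
balanced_idx = np.concatenate([pos_idx, neg_idx])

model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
results = evaluate_on_test_sets(
    model,
    {"whole": (X_test, y_test), "balanced": (X_test[balanced_idx], y_test[balanced_idx])},
)
for name, (cm, rec) in results.items():
    print(name, cm.tolist(), f"recall={rec:.3f}")
```

Because the balanced subset in this sketch keeps every positive row, recall is identical on both sets, and the difference shows up only in the false-positive count. The same function can be called once per model with the classifier built from its best parameter.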

The sample output shown below is for demonstration purposes only. Yours may differ from it.

Sample output

Hint

Decision tree: A decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is also one way to display an algorithm that contains only conditional control statements. Decision trees are commonly used in operations research, specifically in decision analysis, to help i...
