Practice Problems

Exercises

1. A classification model's misclassification rate on the validation data is a better measure of the model's predictive ability on new (unseen) data than its misclassification rate on the training data. Explain whether this statement is accurate and why that is so.

2. The first step in data mining procedures according to SAS and IBM/SPSS is to "sample" the data. Sampling here refers to dividing the data available for analysis into at least two parts: a training data set and a validation data set. Why do both SAS and IBM/SPSS recommend this as a first step? What are the risks of ignoring this procedural requirement?

3. How do "structured" and "unstructured" data differ? Which is the more prevalent form of data? How would the following be classified: numbers in an Excel spreadsheet, a thousand text files, a thousand video images, and a thousand audio files?

4. In the Universal Bank classification model estimated with XLMiner, the software produced the validation data set lift chart shown.

How is the naive model displayed in this diagram? What does the other line in the diagram represent?

5. Some data mining algorithms work so well that they have a tendency to overfit the training data. What does the term overfit mean, and what difficulties does overlooking it cause for the data scientist?

6. The validation data set confusion matrix for the Universal Bank data classification model is shown.

How many records were in the validation data set? How many of these records were correctly classified by the algorithm? How many records were incorrectly classified? What is the "misclassification rate" for the entire validation data set? Would you predict that the misclassification rate for the training data set would be higher or lower on average than the rate you calculated for the entire validation data set?

7. Show the computation for the misclassification rate of this confusion matrix.

8. In the Universal Bank data in this chapter, only 10 percent of the records represented customers who had taken out a personal loan (the target variable). If we were to score a new customer based upon the attributes we used in the algorithm, we would be accurate in the prediction about 90 percent of the time if we always scored the individual as "not accepting a personal loan" because that indeed is what most customers have done in the past. Why not accept being correct 90 percent of the time with this very simple decision rule?

9. Data has the characteristic of "nonrivalry." What is nonrivalry and why is it important to realize that data has this characteristic?

10. The lift chart and the confusion matrix are both standard diagnostic tools used to evaluate a data mining algorithm. Don't the two measures display the same information? Explain any differences between the two measures.
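As background for the lift chart questions above, here is a minimal sketch of how the two lines on a cumulative lift (gains) chart can be computed. The scores and outcomes are made-up illustrative values, not the Universal Bank data, and the function names are hypothetical; this is one common way to build the curves, not necessarily how XLMiner does it.

```python
def cumulative_gains(scores, actuals):
    """Cumulative count of actual positives after ranking records by score, highest first."""
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    gains = []
    running = 0
    for _, actual in ranked:
        running += actual
        gains.append(running)
    return gains


def naive_gains(actuals):
    """Expected cumulative positives if records are taken in random order (the naive line)."""
    rate = sum(actuals) / len(actuals)
    return [rate * (i + 1) for i in range(len(actuals))]


# Hypothetical validation records: predicted probability and true class (1 = positive).
scores = [0.95, 0.80, 0.70, 0.40, 0.35, 0.30, 0.20, 0.15, 0.10, 0.05]
actuals = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

model_curve = cumulative_gains(scores, actuals)
naive_curve = naive_gains(actuals)

# Lift after the top two ranked records: positives found by the model
# divided by positives expected from random selection.
lift_top_two = model_curve[1] / naive_curve[1]

for n, (m, b) in enumerate(zip(model_curve, naive_curve), start=1):
    print(f"top {n:2d} records: model={m}  naive={b:.1f}")
```

Plotting `model_curve` and `naive_curve` against the number of records gives the two lines. The further the model line sits above the naive line, the better the ranking, which is information about ordering that a confusion matrix at a single cutoff does not show.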

chapter-8

Hint

A confusion matrix, also known as an error matrix in machine learning and specifically in statistical classification, is a table often used to describe the performance of a classification model (the "classifier") on a set of test data for which the true values are known. It allows visualization of the performance of an algorit...
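To make the hint concrete, the following is a small sketch of how a misclassification rate can be read off a two-class confusion matrix. The four counts are hypothetical, chosen only for illustration; they are not the Universal Bank figures referred to in the exercises, and the class and method names are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    """Counts for a two-class problem; rows are actual classes, columns are predicted."""
    true_pos: int   # actual positive, predicted positive
    false_neg: int  # actual positive, predicted negative
    false_pos: int  # actual negative, predicted positive
    true_neg: int   # actual negative, predicted negative

    @property
    def total(self) -> int:
        return self.true_pos + self.false_neg + self.false_pos + self.true_neg

    def correct(self) -> int:
        """Records on the main diagonal."""
        return self.true_pos + self.true_neg

    def misclassified(self) -> int:
        """Records off the main diagonal."""
        return self.false_neg + self.false_pos

    def misclassification_rate(self) -> float:
        return self.misclassified() / self.total


# Hypothetical validation results (not taken from the text).
cm = ConfusionMatrix(true_pos=60, false_neg=40, false_pos=50, true_neg=850)

print("records:", cm.total)
print("correct:", cm.correct())
print("misclassified:", cm.misclassified())
print(f"misclassification rate: {cm.misclassification_rate():.3f}")
```

With these made-up counts, the rate is the off-diagonal total divided by all records, (40 + 50) / 1000. The same arithmetic applies to any two-class confusion matrix once its four cells are known.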
