
1. In this question you will use a machine learning model to predict whether a passenger on the Titanic would have survived its sinking, given a set of observed features. The dataset ('Titanic.csv', available on Brightspace) includes information about 891 passengers (each row represents one person), with the following features for each: 

• Pclass: ticket class (1 = 1st class (upper), 2 = 2nd class (middle), 3 = 3rd class (lower)) 

• Sex: passenger's sex 

• Age: passenger's age in years 

• SibSp: # of siblings/spouses aboard 

• Parch: # of parents/children aboard 

• Fare: ticket fare 

• Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) 

• Survived: If passenger survived (1 = survived, 0 = not survived) 

Specifically, you will compare the performance of a Naive Bayes classifier (using the e1071 package) and a Decision Tree classifier (using the tree package) for this task by reporting the confusion matrix and the ROC curve (using the ROCR package). 

Before you start, make sure you have installed the e1071, tree, and ROCR packages. 

STEP 1: Preprocessing the dataset. 

(a) Load the provided data into R using the read.csv function. Ensure that the columns Pclass, Sex and Embarked are of class factor (HINT: lapply). Assuming your data frame is named df, show the output of executing the command str(df). Only two lines of R code. 
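A minimal sketch of part (a), assuming 'Titanic.csv' is in the working directory and the data frame is named df:

```r
# Load the data into a data frame
df <- read.csv("Titanic.csv")
# Convert the three categorical columns to factors in one call (the lapply hint)
df[c("Pclass", "Sex", "Embarked")] <- lapply(df[c("Pclass", "Sex", "Embarked")], factor)
str(df)
```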

(b) Find the total number of NAs in each column. Then, replace each NA in the Age column (only) by setting them equal to the median of the non-NA values in the column. No more than 2 lines of code. 
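Part (b) can be sketched in two lines, assuming df is the data frame from step (a):

```r
# Count the NAs in each column
colSums(is.na(df))
# Replace NAs in Age with the median of the observed (non-NA) ages
df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE)
```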

STEP 2: Partition the dataset into training and testing data. 

(c) Create a training set composed of 75% of the rows selected randomly, with a testing set composed of the remaining 25% (use the createDataPartition function in the caret package, setting the argument p to the appropriate proportion, and retrieve the element Resample1 from the result to get the randomly selected indices). No more than 3 lines of code. 
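A sketch of part (c); the seed value here is arbitrary and only fixes the random split for reproducibility:

```r
library(caret)
set.seed(42)  # arbitrary seed so the split is reproducible
idx <- createDataPartition(df$Survived, p = 0.75)$Resample1
train <- df[idx, ]
test  <- df[-idx, ]
```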

STEP 3: Learn the models using the training data. 

(d) Use the naiveBayes function in the e1071 package to learn a classifier that determines whether a passenger's survival is "1" or "0". Only one line of R code. 
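One possible line for part (d); note that naiveBayes treats a factor response as a classification target, so Survived is converted inside the formula (an assumption, since step 1 only made Pclass, Sex, and Embarked factors):

```r
library(e1071)
# Fit Naive Bayes on all remaining features, with Survived as a factor class label
nb_model <- naiveBayes(as.factor(Survived) ~ ., data = train)
```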

(e) Use the tree function in the tree package to learn a decision tree classifier to determine a passenger's survival. Only one line of R code. 
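Part (e) is analogous; tree also requires a factor response to grow a classification (rather than regression) tree:

```r
library(tree)
# Fit a classification tree for survival on the same training data
dt_model <- tree(as.factor(Survived) ~ ., data = train)
```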

STEP 4: Evaluate model performance. 

(f) Report the Confusion matrix for both trained models. What percentage of the test data was correctly classified for each model? No more than 7 lines of code.
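A sketch of part (f), assuming the models and the train/test split from the previous steps:

```r
nb_pred <- predict(nb_model, test)                 # predicted class labels
dt_pred <- predict(dt_model, test, type = "class")
nb_cm <- table(Predicted = nb_pred, Actual = test$Survived)
dt_cm <- table(Predicted = dt_pred, Actual = test$Survived)
nb_cm; dt_cm
# Accuracy = correctly classified (the diagonal) over all test rows
sum(diag(nb_cm)) / sum(nb_cm); sum(diag(dt_cm)) / sum(dt_cm)
```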

(g) Use the ROCR package to create a single ROC plot showing the two classifiers (Naive Bayes in blue and decision tree in red). Make sure that plots are properly colored and labeled. Based on the Area Under the Curve measure, which of the two classifiers works better for the given data? No more than 10 lines of code. 
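A sketch of part (g). ROCR's prediction function needs a numeric score for the positive class, so each model's predicted probability of class "1" is extracted first (naiveBayes via type = "raw", tree via its default probability matrix):

```r
library(ROCR)
nb_scores <- predict(nb_model, test, type = "raw")[, "1"]
dt_scores <- predict(dt_model, test)[, "1"]
nb_roc <- performance(prediction(nb_scores, test$Survived), "tpr", "fpr")
dt_roc <- performance(prediction(dt_scores, test$Survived), "tpr", "fpr")
plot(nb_roc, col = "blue", main = "ROC: Naive Bayes vs. Decision Tree")
plot(dt_roc, col = "red", add = TRUE)
legend("bottomright", c("Naive Bayes", "Decision Tree"), col = c("blue", "red"), lty = 1)
# AUC for each classifier; the larger value indicates the better model here
performance(prediction(nb_scores, test$Survived), "auc")@y.values[[1]]
performance(prediction(dt_scores, test$Survived), "auc")@y.values[[1]]
```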

In the previous steps, some decisions that we made were arbitrary. That raises some questions: 

• Is using the median age to replace missing values in the Age column, like we did in step 1, an appropriate choice? 

• Why not split the data into 80% for training and the remaining 20% for testing, instead of a 75%/25% combination as in step 2? 

In the following items, we will consider how to approach these questions by performing some comparison analyses. For that, we will use the Decision Tree method, but the same analyses could be performed for Naive Bayes or any other supervised learning method. 

(h) Compare the performance of the classifiers by varying the training/testing data proportion to 25%/75% and 50%/50% (vs. 75%/25%, which was used above). No more than 6 lines of code. 

1. random selection of 25% of the rows for training, and the other 75% for testing 

2. random selection of 50% of the rows for training, and the other 50% for testing
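The two splits above can be sketched with a small helper function. The name eval_split is hypothetical, and the sketch assumes caret and tree are loaded and df has been preprocessed as in step 1:

```r
# Refit the decision tree at a given training proportion p and
# return its accuracy on the held-out rows
eval_split <- function(p) {
  idx <- createDataPartition(df$Survived, p = p)$Resample1
  fit <- tree(as.factor(Survived) ~ ., data = df[idx, ])
  pred <- predict(fit, df[-idx, ], type = "class")
  mean(pred == df[-idx, "Survived"])
}
sapply(c(0.25, 0.50), eval_split)
```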

(i) For each of the two new data partitions, train a separate decision tree. 

(j) Report the confusion matrix and the ROC plot for the model predicted with each of the different partitions: 25%/75% vs. 50%/50% vs. 75%/25% (original partition). What percentage of the test data was correctly classified for each partition level? Which one performed best for each of the tested metrics? Given this analysis, what partition level of the dataset would you pick in this case?
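Parts (i) and (j) extend the same idea: one tree per partition, each evaluated with a confusion matrix and an ROC curve on its own test set. A sketch, again assuming df is preprocessed and caret, tree, and ROCR are loaded; the seed is reset each iteration only so the three runs are comparable:

```r
ps <- c(0.25, 0.50, 0.75)
for (p in ps) {
  set.seed(1)                                   # arbitrary, keeps runs comparable
  idx  <- createDataPartition(df$Survived, p = p)$Resample1
  fit  <- tree(as.factor(Survived) ~ ., data = df[idx, ])
  test <- df[-idx, ]
  print(table(Predicted = predict(fit, test, type = "class"), Actual = test$Survived))
  roc <- performance(prediction(predict(fit, test)[, "1"], test$Survived), "tpr", "fpr")
  plot(roc, col = match(p, ps), add = (p != ps[1]))
}
legend("bottomright", paste0(ps * 100, "% train"), col = seq_along(ps), lty = 1)
```

Note that smaller training fractions leave more data for testing but give the tree less to learn from, which is exactly the trade-off part (j) asks you to weigh.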

Hint
Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. They are among the simplest Bayesian network models, yet when coupled with kernel density estimation they can achieve high accuracy.
