2. Using the same Titanic dataset from above, you will conduct a clustering analysis on a mix of nominal and interval data types and investigate different distance measures. Use the preprocessed data with the following adjustments:
• Convert the Pclass column back to a numeric type, retaining each value's corresponding level.
df[, "Pclass"] = as.numeric(as.character(df[, "Pclass"]))
• Replace the first level (empty string) in column Embarked with “U”.
levels(df[, "Embarked"])[1] = "U"
Be sure you have installed the required packages:
library(cluster)    # for computing clustering, pam, gower
library(factoextra) # for elegant ggplot2-based data visualization
library(magrittr)   # for piping: %>%
library(Rtsne)      # for t-SNE plot
library(dplyr)      # for data cleaning
library(caret)      # for one-hot encoding
(a) Using a statistical method (stacked bar chart, correlation, linear regression, or ANOVA), investigate the effect of categorical column Sex on the dependent column Survived. Is there a significant association between these two columns? Why? Explain your reasoning.
(b) A categorical variable consists of discrete values that don't have an ordered relationship. One-hot encoding is the process of converting a categorical variable into multiple variables, each with a value of 1 or 0. Read this reference one-hot-encoding-in-r and perform a one-hot encoding on the Sex and Embarked columns.
(c) Using the get_dist() function from package factoextra, compute the Jaccard dissimilarity for the converted nominal columns Sex and Embarked. The formula for the Jaccard coefficient can be found at https://www.ims.uni-stuttgart.de/documents/team/schulte/theses/phd/algorithm.pdf. Note that the one-hot encoding from above will result in a 6-column data frame where the first 2 columns are the results for column Sex and the last 4 columns are the results for column Embarked. Compute the Jaccard dissimilarity for columns Sex and Embarked separately.
(d) Using the get_dist() function from package factoextra, compute the Euclidean and Kendall distances for the numeric columns (remember to exclude column Survived).
(e) Choose the optimal number of clusters.
1. Read this article k-medoids and write down one drawback of k-means clustering that was mentioned.
2. Fill in the missing parts of the following code, which uses the fviz_nbclust function in package factoextra to find the optimal number of clusters for a k-medoid clustering based on the total within-cluster sum of squares (Elbow method).
## compute the weighted sum of the three distance matrices;
## every original column gets an equal weight of 1/8
my.d = 0.75*d.interval.kd + 0.125*d.sex + 0.125*d.eb
## recombine the numeric and categorical data
my_data = cbind.data.frame(interval.data, nominal.onehot)
fviz_nbclust(x = missing1, FUNcluster = missing2, method = missing3, diss = missing4)
3. Use one sentence to explain how the Elbow method works.
(f) Conduct k-medoid clustering using the function pam from package cluster with k = 2. Assuming cluster 1 corresponds to passengers who did not survive and cluster 2 to those who survived, calculate the percentage of correctly assigned people according to your clustering result.
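As a hedged illustration of the ideas behind parts (b) and (c), the sketch below one-hot encodes a small made-up factor and computes the Jaccard dissimilarity between its rows. It is not the assignment's solution: the toy data are invented, and base R's model.matrix() and dist() are swapped in here for the caret and get_dist() calls the exercise asks for. It relies on the fact that, for 0/1 vectors, method = "binary" in stats::dist() is the Jaccard dissimilarity, i.e. 1 minus the number of positions where both rows are 1 divided by the number of positions where at least one row is 1.

```r
# Toy data invented for illustration; this is NOT the Titanic data.
toy <- data.frame(Embarked = factor(c("S", "C", "Q", "S")))

# One-hot encode: dropping the intercept (- 1) gives one 0/1 column per level.
onehot <- model.matrix(~ Embarked - 1, data = toy)
print(onehot)

# Jaccard dissimilarity between rows: 1 - |A and B| / |A or B|.
# For 0/1 data this is what method = "binary" in stats::dist() computes.
d <- as.matrix(dist(onehot, method = "binary"))
print(d)
```

For a single one-hot encoded variable, two rows with the same category are at distance 0 and two rows with different categories are at distance 1, which is one reason to compute the dissimilarity for each nominal variable separately before weighting.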