Using the same Titanic dataset from above, you will conduct a clustering analysis

2. Using the same Titanic dataset from above, you will conduct a clustering analysis on a mix of nominal and interval data types and investigate different distance measures. Use the preprocessed data with the following adjustment:

• Convert the Pclass column back to a numeric type, retaining each value's corresponding level.

df[, c("Pclass")] = as.numeric(as.character(df[, c("Pclass")]))

• Replace the first level (empty string) in column Embarked with “U”.

levels(df[, c("Embarked")])[1] = "U"

Be sure you have installed the required packages:

library(cluster)    # for computing clustering, pam, gower

library(factoextra) # for elegant ggplot2-based data visualization

library(magrittr)   # for piping: %>%

library(Rtsne)      # for t-SNE plot

library(dplyr)      # for data cleaning

library(caret)      # for one-hot encoding
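The two adjustments listed above (the Pclass conversion and the Embarked relabeling) can be tried on toy factors first. This is only a sketch: `pclass` and `embarked` are hypothetical stand-ins for the corresponding columns of df, and the values are made up for illustration.

```r
# Toy factor standing in for df$Pclass (illustrative values only)
pclass <- factor(c("3", "1", "2", "3"))

# Going through as.character() recovers the printed values rather than
# relying on the factor's internal level codes
pclass_num <- as.numeric(as.character(pclass))

# Toy factor standing in for df$Embarked; "" sorts first among the levels
embarked <- factor(c("", "S", "C", "Q"))

# Rename the first (empty-string) level to "U"
levels(embarked)[1] <- "U"
```

For Pclass the level codes happen to coincide with the values 1, 2, 3, but the as.character() route stays correct for factors whose labels are not consecutive integers starting at 1.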

(a) Using a statistical method (stacked bar chart, correlation, linear regression, or ANOVA), investigate the effect of categorical column Sex on the dependent column Survived. Is there a significant association between these two columns? Why? Explain your reasoning.
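One possible way to approach part (a), sketched on a hypothetical toy data frame rather than the Titanic data: tabulate Sex against Survived (the counts behind a stacked bar chart) and run a one-way ANOVA. The data frame `toy` and its values are assumptions made purely for illustration.

```r
# Hypothetical toy data; NOT the Titanic values
toy <- data.frame(
  Sex      = factor(c("female", "female", "female", "female",
                      "male", "male", "male", "male")),
  Survived = c(1, 1, 1, 0, 0, 0, 0, 1)
)

# Contingency table: these counts are what a stacked bar chart displays
# (barplot(t(tab)) would draw it)
tab <- table(toy$Sex, toy$Survived)

# Row proportions: share of each outcome within each sex
props <- prop.table(tab, margin = 1)

# One-way ANOVA of Survived on Sex; a small p-value for the F test
# would suggest an association between the two columns
fit <- aov(Survived ~ Sex, data = toy)
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
```

On the real data, the same calls would be applied to df; whether the association is significant depends on the p-value obtained there.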

(b) A categorical variable consists of discrete values that don't have an ordered relationship. One-hot encoding is the process of converting a categorical variable into multiple variables, each with a value of 1 or 0. Read this reference, one-hot-encoding-in-r, and perform a one-hot encoding on the Sex and Embarked columns.
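As a minimal sketch of what one-hot encoding produces, the following uses base R's model.matrix() on a hypothetical toy data frame. The exercise loads caret for this step, so dummyVars() followed by predict() is presumably the intended route; base R is used here only to keep the example dependency-free.

```r
# Hypothetical toy data standing in for the Sex and Embarked columns
toy <- data.frame(
  Sex      = factor(c("male", "female", "female", "male")),
  Embarked = factor(c("S", "C", "U", "Q"), levels = c("U", "C", "Q", "S"))
)

# "- 1" drops the intercept so that every level gets its own 0/1 column
sex_onehot <- model.matrix(~ Sex - 1, data = toy)
emb_onehot <- model.matrix(~ Embarked - 1, data = toy)

# Six columns in total: 2 for Sex followed by 4 for Embarked
nominal_onehot <- cbind(sex_onehot, emb_onehot)
```

Each row contains exactly one 1 among the Sex columns and exactly one 1 among the Embarked columns.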

(c) Using the get_dist() function from package factoextra, compute the Jaccard dissimilarity for the converted nominal columns Sex and Embarked. The formula for the Jaccard coefficient can be found at https://www.ims.uni-stuttgart.de/documents/team/schulte/theses/phd/algorithm.pdf. Note that the one-hot encoding from above will result in a 6-column data frame where the first 2 columns are the results for column Sex and the last 4 columns are the results for column Embarked. Compute the Jaccard dissimilarity for columns Sex and Embarked separately.
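To see what the Jaccard dissimilarity looks like on one-hot data, here is a small sketch using base R's dist() with method = "binary", which computes the Jaccard distance on 0/1 rows. It assumes that get_dist() forwards method = "binary" to stats::dist(); check the factoextra documentation before relying on that. The matrix `sex_onehot` holds toy values.

```r
# Rows are passengers, columns are one-hot indicators (toy values)
sex_onehot <- rbind(
  p1 = c(female = 1, male = 0),
  p2 = c(female = 1, male = 0),
  p3 = c(female = 0, male = 1)
)

# Jaccard distance: among positions where at least one row has a 1,
# the proportion where only one of the two rows has a 1
d_sex <- dist(sex_onehot, method = "binary")

# Full symmetric matrix for inspection
m <- as.matrix(d_sex)
```

For a one-hot encoding of a single categorical variable, the distance between two rows is 0 when the categories match and 1 when they differ, which is why Sex and Embarked are handled separately.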

(d) Using the get_dist() function from package factoextra, compute the Euclidean and Kendall distances for the numeric columns (remember to exclude column Survived).
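The two distances in part (d) can be sketched in base R as follows. This assumes that the Kendall distance meant here is the correlation-based dissimilarity 1 minus Kendall's tau between rows, which is how get_dist() is documented to treat its correlation methods. The matrix `interval` contains toy values, not Titanic records.

```r
# Toy numeric data; Survived is deliberately not included
interval <- rbind(
  p1 = c(Pclass = 1, Age = 30, Fare = 80),
  p2 = c(Pclass = 3, Age = 22, Fare = 8),
  p3 = c(Pclass = 2, Age = 40, Fare = 20)
)

# Standardize columns so that no single variable dominates the distance
x <- scale(interval)

# Euclidean distance between rows
d_euc <- dist(x, method = "euclidean")

# Kendall correlation distance between rows: 1 - tau
d_kd <- as.dist(1 - cor(t(x), method = "kendall"))
```

Both results are objects of class "dist", so they can be combined with other dissimilarities of the same size.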

(e) Choose the optimal number of clusters.

1. Read this article, k-medoids, and write down one drawback of k-means clustering that it mentions.

2. Fill in the missing parts of the following code, which uses the fviz_nbclust function in package factoextra to find the optimal number of clusters for k-medoid clustering, based on the total within-cluster sum of squares (Elbow method).

## compute the weighted sum of three distance matrices;
## weights are equal (each original column carries 1/8 weight)

my.d = 0.75*d.interval.kd + 0.125*d.sex + 0.125*d.eb

## recombine the numeric and categorical data

my_data = cbind.data.frame(interval.data, nominal.onehot)

fviz_nbclust(x = missing1, FUNcluster = missing2, method = missing3, diss = missing4)

3. Use one sentence to explain how the Elbow method works.
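The idea behind the Elbow method can be illustrated without fviz_nbclust by computing a cost for each k by hand. The sketch below runs pam() on simulated toy data and uses the total dissimilarity of points to their medoid as the cost, a medoid-based stand-in for the within-cluster sum of squares. It is not the code the exercise asks you to complete, and all names in it are hypothetical.

```r
library(cluster)  # pam()

set.seed(1)
# Toy 2-D data with three well-separated groups (illustrative only)
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)
d <- dist(x)
dm <- as.matrix(d)

# For each k, sum the dissimilarity between every point and its medoid
total_cost <- sapply(1:6, function(k) {
  fit <- pam(d, k = k, diss = TRUE)
  medoid_of_point <- fit$id.med[fit$clustering]
  sum(dm[cbind(seq_len(nrow(dm)), medoid_of_point)])
})
```

Plotting total_cost against k shows a sharp drop up to k = 3 and only small gains afterwards; the bend in that curve is the "elbow".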

(f) Conduct k-medoid clustering using the pam function from package cluster with k = 2. Assuming cluster 1 corresponds to those who did not survive and cluster 2 to those who survived, calculate the percentage of correctly assigned people according to your clustering result.
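The accuracy calculation in part (f) can be sketched on simulated data as follows. The points and the `survived` labels are hypothetical; on the real data, the dissimilarity passed to pam() would be the weighted matrix built in part (e).

```r
library(cluster)  # pam()

set.seed(42)
# Toy data: two well-separated groups of 10 points each
x <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)
survived <- rep(c(0, 1), each = 10)  # hypothetical labels

# k-medoid clustering on a precomputed dissimilarity
fit <- pam(dist(x), k = 2, diss = TRUE)

# Mapping stated in the exercise: cluster 1 -> 0 (not survived),
# cluster 2 -> 1 (survived)
predicted <- fit$clustering - 1

# Percentage of observations whose mapped cluster matches the label
accuracy <- mean(predicted == survived) * 100
```

Cluster numbers carry no inherent meaning, so a very low percentage on real data may simply indicate that the two labels are swapped relative to the assumed mapping.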

Hint

ANOVA (analysis of variance) is a statistical tool that splits the observed aggregate variability in a data set into two parts: 1. systematic factors and 2. random factors. Systematic factors have a statistical influence on the given data set, whereas random factors do not. Th...
