Question 4
How well is the data correlated?
The “Iris flower data set” is a well-known data set used in many machine learning and statistical modelling exercises. You can find details on this data set on Wikipedia. The first plot on the Wikipedia page, titled “Scatterplot of the data set”, is easily produced in R using the command shown in listing 2.
# Generate a scatter plot matrix of the four measurements
plot(iris[, -5], pch = 21, col = 1, bg = c(2:4)[iris$Species],
     main = 'Iris Data (red = setosa, green = versicolor, blue = virginica)')
#
# iris[, -5] selects all but the last column (Species) of the iris data set
# pch is the shape of the symbol representing a data point
# col is the outline color of the symbol
# bg is the background (fill) color of the symbol, here chosen per species
# main is the title of the plot
Listing 2: R code to generate a scatter plot of the Iris data set
The Iris data set ships with the base installation of R and is automatically available in the variable called iris. You can look at the first six rows of the data set by using the R command shown in figure 2.
The data contained in the column titled Sepal.Width can be viewed using the R command shown in figure 3.
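Commands of the following kind would do both things. This is a minimal sketch, assuming the standard head() function and $ column access; the variable names first_rows and sepal_width are illustrative only and are not taken from the figures.

```r
# The iris data frame is part of base R (datasets package), so nothing needs loading.

# First six rows of the data set (six is the default for head())
first_rows <- head(iris)
print(first_rows)

# A single column, returned as a numeric vector
sepal_width <- iris$Sepal.Width
print(sepal_width)
```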
(i)
Briefly, and in your own words, explain what a measure of correlation between two variables tells you.
(ii)
Generate a matrix of correlation values for the Iris data set. Answer all of the following:
• State which measure of correlation you chose to calculate
• Provide a basic geometric interpretation of your selected measure of correlation
• Why would one use your selected measure of correlation?
(iii)
State which pair of flower measurements from step (ii) gave the best correlation, and which pair gave the worst.
(iv)
For the pair of flower measurements that gave the best correlation in step (iii), write R code to calculate a confidence interval for that correlation. Briefly describe how the code works and what the confidence interval means.
(v)
Write R code to perform a hypothesis test using a shuffling approach. For the pair of flower measurements that gave the worst correlation in step (iii), test whether the population correlation is different from zero. Briefly describe the code used and what the result means. State the hypotheses used and any other relevant details.
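One possible way to approach steps (ii) to (v) is sketched below. It is not a model answer and rests on several assumptions: Pearson's correlation is the chosen measure, "best" and "worst" are read as the largest and smallest absolute correlation, the confidence level is 95%, and 10,000 shuffles are used. All variable names (measurements, cor_matrix, pairs_df, and so on) are illustrative.

```r
# Keep the four numeric measurements only (column 5 is the Species factor)
measurements <- iris[, -5]

# (ii) Correlation matrix; Pearson is an assumed choice (it is also cor()'s default)
cor_matrix <- cor(measurements, method = "pearson")
print(round(cor_matrix, 3))

# (iii) List each distinct pair once (upper triangle) and rank by |r|
upper <- upper.tri(cor_matrix)
pairs_df <- data.frame(
  var1 = rownames(cor_matrix)[row(cor_matrix)[upper]],
  var2 = colnames(cor_matrix)[col(cor_matrix)[upper]],
  r    = cor_matrix[upper],
  stringsAsFactors = FALSE
)
best  <- pairs_df[which.max(abs(pairs_df$r)), ]  # assumed meaning of "best"
worst <- pairs_df[which.min(abs(pairs_df$r)), ]  # assumed meaning of "worst"
print(best)
print(worst)

# (iv) Confidence interval for the strongest correlation.
# For Pearson's r, cor.test() builds the interval via Fisher's z transformation.
best_test <- cor.test(iris[[best$var1]], iris[[best$var2]],
                      method = "pearson", conf.level = 0.95)
print(best_test$conf.int)

# (v) Shuffling (permutation) test for the weakest correlation.
# H0: population correlation = 0;  H1: population correlation != 0
set.seed(1)                        # arbitrary seed, for reproducibility only
n_shuffles <- 10000                # assumed number of shuffles
x <- iris[[worst$var1]]
y <- iris[[worst$var2]]
observed_r <- cor(x, y)

# Shuffling y breaks any pairing with x, which simulates the null hypothesis
shuffled_r <- replicate(n_shuffles, cor(x, sample(y)))

# Two-sided p-value: share of shuffled correlations at least as extreme as observed
p_value <- mean(abs(shuffled_r) >= abs(observed_r))
print(observed_r)
print(p_value)
```

The p-value is then compared with the chosen significance level (for example 0.05). A small p-value suggests the observed correlation would be unusual if the true correlation were zero.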