Question 4

How well is the data correlated?

The “Iris flower data set” is a well-known data set used for various machine learning and statistical modeling exercises. You can find some details on this data set on Wikipedia. The very first plot on the Wikipedia page, titled “Scatterplot of the data set”, is easily produced in R using the command shown in listing 2.

    # Generate a scatter plot
    plot(iris[, -5], pch = 21, col = 1, bg = c(2:4)[iris$Species],
         main = "Iris Data (red = setosa, green = versicolor, blue = virginica)")
    #
    # iris[, -5] selects all but the last column of the iris data set
    # pch is the shape of the symbol representing a data point
    # col is the outline color of the symbol
    # bg is the background or fill color of the symbol
    # main is the title of the plot

Listing 2: R code to generate a scatter plot of the Iris data set

The Iris dataset is provided in the base version of R and is located in the variable called iris; this data set is automatically available. You can look at the first six rows of the data set by using the R command shown in figure 2.

Data contained in the column titled Sepal.Width can be viewed using the R command shown in figure 3.
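Figures 2 and 3 are not reproduced in this text. The commands they show are presumably along the following lines; this is an assumption based on the descriptions above, not a copy of the figures.

```r
# The built-in iris data frame is available without loading any package.
first_rows <- head(iris)        # head() returns the first six rows by default
print(first_rows)

sepal_width <- iris$Sepal.Width # the Sepal.Width column as a numeric vector
print(sepal_width)
```

The `$` operator extracts a single column by name; `iris[, "Sepal.Width"]` would give the same vector.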

(i) Briefly and in your own words, define what a measure of correlation between two variables tells you.

(ii) Generate a matrix of correlation values for the Iris data set. Answer all of the following:

• State what measure of correlation you chose to calculate

• Provide a basic geometric interpretation of your selected measure of correlation

• Why would one use your selected measure of correlation?
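One possible starting point for step (ii) is sketched below. It assumes Pearson's product-moment correlation is the chosen measure (Spearman or Kendall could be selected through the `method` argument of `cor` instead). The second half illustrates one geometric reading of Pearson's r: it equals the cosine of the angle between the two mean-centred data vectors. The variable names are illustrative only.

```r
# Pearson correlation matrix of the four numeric columns (Species is dropped).
cor_matrix <- cor(iris[, -5], method = "pearson")
print(round(cor_matrix, 3))

# Geometric view: centre each variable on its mean, then take the cosine
# of the angle between the two centred vectors.
x <- iris$Petal.Length - mean(iris$Petal.Length)
y <- iris$Petal.Width - mean(iris$Petal.Width)
cosine <- sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# The cosine matches the corresponding entry of the correlation matrix.
print(all.equal(cosine, cor_matrix["Petal.Length", "Petal.Width"]))
```

A cosine near 1 or -1 means the centred vectors point in nearly the same or opposite directions (a strong linear relationship), while a cosine near 0 means they are close to perpendicular.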

(iii) State which two different flower measurements from step (ii) gave the best correlation, and which two gave the worst.
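A minimal sketch of how the strongest and weakest pairs could be picked out programmatically. It assumes "best" means the largest absolute correlation and "worst" the smallest absolute correlation, which is one reasonable reading of the question; it also assumes Pearson's correlation.

```r
cor_matrix <- cor(iris[, -5])

# Work with absolute values and keep each pair once by blanking out
# the diagonal and the lower triangle.
abs_cor <- abs(cor_matrix)
abs_cor[lower.tri(abs_cor, diag = TRUE)] <- NA

# Locate the largest and smallest remaining entries.
best  <- which(abs_cor == max(abs_cor, na.rm = TRUE), arr.ind = TRUE)
worst <- which(abs_cor == min(abs_cor, na.rm = TRUE), arr.ind = TRUE)

best_pair  <- c(rownames(abs_cor)[best[1, "row"]],
                colnames(abs_cor)[best[1, "col"]])
worst_pair <- c(rownames(abs_cor)[worst[1, "row"]],
                colnames(abs_cor)[worst[1, "col"]])

print(best_pair)
print(worst_pair)
```

Reading the signed values back from `cor_matrix` for these two pairs shows whether each relationship is positive or negative.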

(iv) In the case of the flower measurements that provided the best correlation in step (iii), write R code to calculate a confidence interval corresponding to the best correlation. Briefly describe how the code works and what the confidence interval means.
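One way to approach step (iv) is sketched here, assuming the strongest pair turns out to be Petal.Length and Petal.Width and that a 95% interval for Pearson's correlation is wanted (both are assumptions to check against your own matrix). The built-in `cor.test` reports an interval based on the Fisher z-transformation; the manual calculation underneath repeats the same steps so the mechanics are visible.

```r
x <- iris$Petal.Length
y <- iris$Petal.Width

# Built-in route: cor.test() returns a confidence interval for Pearson's r.
result <- cor.test(x, y, method = "pearson", conf.level = 0.95)
ci <- result$conf.int
print(ci)

# Manual route: transform r to the z scale, build a normal interval there,
# then transform the end points back to the correlation scale.
r  <- cor(x, y)
n  <- length(x)
z  <- atanh(r)              # Fisher z-transformation
se <- 1 / sqrt(n - 3)       # approximate standard error on the z scale
ci_manual <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
print(ci_manual)
```

The interval is a range of plausible values for the population correlation: if the sampling were repeated many times, about 95% of intervals built this way would be expected to contain the true value. A bootstrap interval, built by resampling rows with replacement, would be an alternative to the Fisher approach.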

(v) Write R code to perform a hypothesis test using a shuffling approach. In the case of the flower measurements that provided the worst correlation in step (iii), test whether the population correlation is different to zero. Briefly describe the code used and what the result means. Show the hypotheses used and any other relevant details.
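The shuffling approach in step (v) can be sketched as a permutation test. The sketch assumes the weakest pair is Sepal.Length and Sepal.Width, tests H0: the population correlation is zero against H1: it is not zero, and uses an arbitrary seed and an arbitrary 10,000 shuffles; all of these choices are illustrative.

```r
set.seed(1)  # arbitrary seed so the run is repeatable

x <- iris$Sepal.Length
y <- iris$Sepal.Width
observed <- cor(x, y)

# Shuffling y breaks any pairing with x, so each shuffled correlation
# is a draw from the distribution expected when H0 is true.
n_shuffles <- 10000
shuffled <- replicate(n_shuffles, cor(x, sample(y)))

# Two-sided p-value: share of shuffled correlations at least as extreme
# as the observed one. The +1 terms count the observed arrangement itself.
p_value <- (sum(abs(shuffled) >= abs(observed)) + 1) / (n_shuffles + 1)

print(observed)
print(p_value)
```

A small p-value (for example, below a chosen significance level of 0.05) would be evidence against H0; a large one means the observed correlation is consistent with what random pairing produces.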

Hint

A correlation measure describes the relationship between two variables: when one variable increases, does the other tend to increase as well? If both increase in tandem, the correlation is called positive, and a strong correlation gives support to say these are not just chance correlations but evidence of some underlyin...
