Part 3 – “Real world” testing
a) Load new test data from the “real world” EmailSamples50000.csv.
b) For each of your models, using the optimised parameters you identified in Part 2, run the classifier on the EmailSamples50000.csv test data.
c) For each optimised model, produce a confusion matrix and report the following:
i. Sensitivity (the detection rate for actual malware samples)
ii. Specificity (the detection rate for actual non-malware samples)
iii. Overall Accuracy
d) Write a brief statement giving a final recommendation on which model to use and why you chose that model over the others.
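As a minimal sketch of step c), the three measures can be read off a confusion matrix built with base R's table(). The toy vectors below stand in for the real labels and predictions, and the class names "Yes" (malware) and "No" (non-malware) are assumptions; substitute whatever labels your data uses.

```r
# Toy labels standing in for the real data. In the assignment, `actual` would
# come from EmailSamples50000.csv and `predicted` from a tuned model.
# The class names "Yes" (malware) and "No" (non-malware) are assumptions.
actual    <- factor(c("Yes", "Yes", "Yes", "No", "No", "No", "No", "No"),
                    levels = c("No", "Yes"))
predicted <- factor(c("Yes", "Yes", "No", "No", "No", "No", "Yes", "No"),
                    levels = c("No", "Yes"))

# Confusion matrix: rows are predictions, columns are the true classes.
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

tp <- cm["Yes", "Yes"]  # malware correctly flagged
tn <- cm["No", "No"]    # non-malware correctly passed
fp <- cm["Yes", "No"]   # non-malware wrongly flagged
fn <- cm["No", "Yes"]   # malware missed

sensitivity <- tp / (tp + fn)        # detection rate for actual malware
specificity <- tn / (tn + fp)        # detection rate for actual non-malware
accuracy    <- (tp + tn) / sum(cm)   # share of all samples classified correctly

print(c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy))
```

Fixing the factor levels on both vectors keeps the table 2 x 2 even if a model never predicts one of the classes. If you already use the caret package, its confusionMatrix() function reports the same measures; set its positive argument to the malware class so that sensitivity refers to malware.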
What to Report
You must do all of your work in R.
1. Submit a single report containing:
a. A brief description of your three selected supervised learning algorithms.
b. For each algorithm:
i. The optimised parameters for the algorithm.
ii. A confusion matrix on the test set of the MalwareSamples.csv data showing the accuracy of the algorithm with the optimised parameters.
iii. A confusion matrix showing the accuracy of the algorithm on the ‘real world’ EmailSamples50000.csv data.
iv. A short description of the accuracy, sensitivity and specificity of the optimised algorithm when applied to the ‘real world’ data.
c. A short paragraph explaining your chosen algorithm and parameters and why you chose them over the alternatives, written in language appropriate for an educated software developer without a background in math.
Note: At the end you will present your findings for the 3 algorithms, showing 2 confusion matrices for each (1 for the MalwareSamples dataset and 1 for the EmailSamples dataset). You will also present a description of the accuracy, sensitivity and specificity of each of the 3 algorithms.
2. If you use any external references in your analysis or discussion, you must cite your sources.
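For the comparison of the 3 algorithms requested in item 1, one possible layout is a single summary table with one row per model. The sketch below assumes you have already extracted the four confusion-matrix counts for each model; the model names and all counts are made-up placeholders, not results from the assignment data.

```r
# Hypothetical confusion-matrix counts for three models on the same test set.
# Model names and numbers are placeholders, not results from the real data.
counts <- data.frame(
  model = c("model_a", "model_b", "model_c"),
  tp = c(40, 45, 30),     # malware correctly flagged
  fn = c(10, 5, 20),      # malware missed
  fp = c(20, 60, 5),      # non-malware wrongly flagged
  tn = c(930, 890, 945)   # non-malware correctly passed
)

# One row per model with the three measures the report asks for.
summary_table <- data.frame(
  model       = counts$model,
  sensitivity = counts$tp / (counts$tp + counts$fn),
  specificity = counts$tn / (counts$tn + counts$fp),
  accuracy    = (counts$tp + counts$tn) /
                (counts$tp + counts$fn + counts$fp + counts$tn)
)

print(summary_table)
```

In these placeholder numbers the model with the highest overall accuracy also has the lowest sensitivity, which can happen when malware is rare in the sample. That is one reason to weigh all three measures, not accuracy alone, in the final recommendation.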