Task 1: Analysing Bank Data
We will be doing some analytics on real data from a Portuguese banking institution. The data is stored in a semicolon (“;”) delimited format.
The data is supplied with the assignment at the following locations:
Small version: Task_1/Data/bank-small.csv
Full version: Task_1/Data/bank.csv
The data has the following attributes
Here is a small example of the bank data that we will use to illustrate the subtasks below. Only a subset of the attributes is listed in this example; see the table above for the description of all attributes:
Each subtask begins with a tag, [Hive] or [Spark RDD], that specifies which tool you should use for it.
a) [Hive] Report the number of clients of each job category. Write the results to “Task_1a-out”. For the above small example data set you would report the following (output order is not important for this question):
"blue-collar" 1
"entrepreneur" 1
"management" 2
"services" 1
"technician" 3
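The expected counts above can be checked by hand with a short plain-Python sketch. This is not Hive code; it only mirrors the group-and-count logic that subtask a asks for. The names `SAMPLE_JOBS` and `count_by_job` are illustrative, and the job values are taken from the small example with the surrounding double quotes dropped for readability.

```python
from collections import Counter

# Job column of the eight clients in the small example
# (assumption: surrounding double quotes dropped for readability).
SAMPLE_JOBS = [
    "technician", "services", "technician", "entrepreneur",
    "management", "technician", "management", "blue-collar",
]


def count_by_job(jobs):
    """Return {job: number of clients}, i.e. a group-by followed by a count."""
    return dict(Counter(jobs))


job_counts = count_by_job(SAMPLE_JOBS)
for job, n in sorted(job_counts.items()):
    print(job, n)
```

A Hive query for the real data would express the same idea as a grouped count over the job column.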
b) [Hive] Report the average yearly balance for all people in each education category. Write the results to “Task_1b-out”. For the small example data set you would report the following (output order is not important for this question):
"primary" 10.0
"secondary" 286.6666666666667
"tertiary" 1031.3333333333333
"unknown" 1506.0
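To see where the averages above come from, the following plain-Python sketch reproduces them from the (education, balance) pairs of the small example. It is only a check of the arithmetic under that assumption, not the Hive solution; `SAMPLE` and `average_balance` are illustrative names.

```python
from collections import defaultdict

# (education, balance) pairs of the eight clients in the small example.
SAMPLE = [
    ("primary", 10),
    ("secondary", 829), ("secondary", 29), ("secondary", 2),
    ("tertiary", 2143), ("tertiary", 929), ("tertiary", 22),
    ("unknown", 1506),
]


def average_balance(rows):
    """Return {education: mean balance} by accumulating a sum and a count per key."""
    totals = defaultdict(lambda: [0, 0])  # education -> [sum, count]
    for education, balance in rows:
        totals[education][0] += balance
        totals[education][1] += 1
    return {edu: s / n for edu, (s, n) in totals.items()}


averages = average_balance(SAMPLE)
for education in sorted(averages):
    print(education, averages[education])
```

For example, the "secondary" group holds the balances 829, 29 and 2, so its average is 860 / 3.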
c) [Spark RDD] Group balance into the following three categories:
a. Low: -infinity to 500
b. Medium: 501 to 1500
c. High: 1501 to +infinity
Report the number of people in each of the above categories. Write the results to “Task_1c-out” in text file format. For the small example data set you should get the following results (output order is not important in this question):
(High,2)
(Medium,2)
(Low,4)
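The bucketing rule can be sketched as a small function in plain Python. This is a sketch of the logic only, assuming balances are integers (so nothing falls between 500 and 501); it does not use Spark. In a Spark RDD solution a function like the hypothetical `balance_category` below would typically be applied with `map` and the results aggregated with `reduceByKey`.

```python
from collections import Counter

# Balances of the eight clients in the small example.
BALANCES = [10, 829, 29, 2, 2143, 929, 22, 1506]


def balance_category(balance):
    """Map an integer balance to Low (<= 500), Medium (501..1500) or High (>= 1501)."""
    if balance <= 500:
        return "Low"
    if balance <= 1500:
        return "Medium"
    return "High"


category_counts = Counter(balance_category(b) for b in BALANCES)
for category, n in category_counts.items():
    print((category, n))
```

Checking the boundary values 500, 501, 1500 and 1501 explicitly is a cheap way to catch off-by-one mistakes in the comparisons.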
d) [Spark RDD] Sort all people in ascending order of education. For people with the same education, sort them in descending order by balance. This means that all people with the same education should appear grouped together in the output. For each person report the following attribute values: education, balance, job, marital, loan. Write the results to “Task_1d-out” in text file format (multiple parts are allowed). For the small example data set you would report the following:
("primary",10,"technician","married","no")
("secondary",829,"services","divorced","yes")
("secondary",29,"technician","divorced","yes")
("secondary",2,"entrepreneur","single","no")
("tertiary",2143,"management","married","yes")
("tertiary",929,"technician","married","yes")
("tertiary",22,"management","divorced","no")
("unknown",1506,"blue-collar","married","no")
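The two-level ordering above can be expressed with a single composite sort key: the education string ascending, then the negated balance so that larger balances come first. The plain-Python sketch below illustrates this on the small example; it is not a Spark program, and `SAMPLE`, `sort_key` and `ordered` are illustrative names. A Spark RDD solution could pass a similar key function to `sortBy`, assuming the balance has been parsed as a number.

```python
# (education, balance, job, marital, loan) rows of the small example,
# deliberately listed in an unsorted order (quotes dropped for readability).
SAMPLE = [
    ("tertiary", 22, "management", "divorced", "no"),
    ("secondary", 29, "technician", "divorced", "yes"),
    ("unknown", 1506, "blue-collar", "married", "no"),
    ("tertiary", 2143, "management", "married", "yes"),
    ("primary", 10, "technician", "married", "no"),
    ("secondary", 2, "entrepreneur", "single", "no"),
    ("tertiary", 929, "technician", "married", "yes"),
    ("secondary", 829, "services", "divorced", "yes"),
]


def sort_key(row):
    """Education ascending; negating the balance makes it sort descending."""
    education, balance = row[0], row[1]
    return (education, -balance)


ordered = sorted(SAMPLE, key=sort_key)
for row in ordered:
    print(row)
```

Negating the balance only works because it is numeric; if the balance were still a string, the comparison would be lexicographic and the order would be wrong.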