Part 3: in this part, you need to complete a data transformation task.
The dataset we used for the third assignment on data visualization has many problems in terms of data quality and needs pre-processing. We focus on solving one problem in this assignment. The dataset used for this part is "PRC_data_breach.csv", which contains selected columns in the original data. You need to clean the last column, "state", in the dataset.
Problem statement:
Because the data is collected from different sources, and different sources and data contributors enter the state data differently. The problem is that some states are labeled with two-letter codes, some with full state names. For example, as shown in the screenshot, both California and CA are values of the state variable. Although they are the same states, the software interprets them as two distinct values/states. The purpose of this part of the assignment is to pre-process data and change all the values in the state column labeled two-letter codes (also called alpha code) to full state names.
The dataset containing the two-letter codes for states and the corresponding full state names can be downloaded from the following website: https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=53971. Please click the link to browse the website to understand the data. The code to scrape the data table into R has been included in your R-file (also listed in the following). Please run the code in the R file to extract the dataset, which will be renamed as "state_code" and loaded to your environment.
state_code< read_html("https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=53971")%>%
html_nodes("table")%>%
html_table()%>%
simplify()%>%
first()%>%
clean_names()
Requirements:
Based on the datasets provided, please clean the “PRC_data_breach” dataset by replacing all the values in the state column labeled two-letter codes (also called alpha code) with full state names. Please name your cleaned dataset “PRC_data_breach_cleaned”. In your cleaned PRC_data_breach_cleaned dataset, a new column named "state_cleaned" should be created to replace the old "state" column. The state_cleaned column contains the right values, with the alpha code values been replaced by the state full names. All other variables in the dataset should remain unchanged.
The following task breakdown guides you to finish this data transformation and pre-processing task. The breakdown only provides descriptive guidance. You need to find the correct functions to use in each step, and flexibly handle the details to achieve the required goal. Please write the codes using the Tidyverse approach (i.e, using pipes). You are not restricted to writing only one pipeline for each step, but fewer intermediate results are always desired.
Step 1: Split the dataset into two based on the criteria: the number of characters in the state column is 2. This criterion can help to separate data with and without problems. Store two datasets in the environment as dataset1 and dataset2, respectively. [hint: nchar() is a function that counts the number of characters of a string value.]
Step 2: Assuming that dataset1 is the dataset with the number of characters equal to 2. you need to transform dataset1 so that the values for the state variable are labeled with full state names. You can accomplish this by joining another dataset state_code.
Step 3: After you complete step 2, you can combine dataset1 and dataset2. Rename the new dataset as "PRC_data_breach_cleaned". Please watch the condition for combining datasets so that your datasets are combined correctly and also make sure your dataset satisfies the requirements.
Students succeed in their courses by connecting and communicating with an expert until they receive help on their questions
Consult our trusted tutors.