Because the data is collected from different sources, and different sources
Ask Expert

Be Prepared For The Toughest Questions

Practice Problems

Because the data is collected from different sources, and different sources

Part 3: in this part, you need to complete a data transformation task.

The dataset we used for the third assignment on data visualization has many problems in terms of data quality and needs pre-processing. We focus on solving one problem in this assignment. The dataset used for this part is "PRC_data_breach.csv", which contains selected columns in the original data. You need to clean the last column, "state",  in the dataset. 

Problem statement:

Because the data is collected from different sources, and different sources and data contributors enter the state data differently. The problem is that some states are labeled with two-letter codes, some with full state names. For example, as shown in the screenshot, both California and CA are values of the state variable. Although they are the same states, the software interprets them as two distinct values/states. The purpose of this part of the assignment is to pre-process data and change all the values in the state column labeled two-letter codes (also called alpha code) to full state names.


The dataset containing the two-letter codes for states and the corresponding full state names can be downloaded from the following website: https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=53971. Please click the link to browse the website to understand the data. The code to scrape the data table into R has been included in your R-file (also listed in the following). Please run the code in the R file to extract the dataset, which will be renamed as "state_code" and loaded to your environment.  

state_code< read_html("https://www23.statcan.gc.ca/imdb/p3VD.pl?Function=getVD&TVD=53971")%>%

  html_nodes("table")%>%

  html_table()%>%

  simplify()%>%

  first()%>%

                clean_names()

Requirements:

Based on the datasets provided, please clean the “PRC_data_breach” dataset by replacing all the values in the state column labeled two-letter codes (also called alpha code) with full state names. Please name your cleaned dataset “PRC_data_breach_cleaned”. In your cleaned PRC_data_breach_cleaned dataset, a new column named "state_cleaned" should be created to replace the old "state" column. The state_cleaned column contains the right values, with the alpha code values been replaced by the state full names. All other variables in the dataset should remain unchanged.

The following task breakdown guides you to finish this data transformation and pre-processing task. The breakdown only provides descriptive guidance. You need to find the correct functions to use in each step, and flexibly handle the details to achieve the required goal. Please write the codes using the Tidyverse approach (i.e, using pipes). You are not restricted to writing only one pipeline for each step, but fewer intermediate results are always desired.

Step 1: Split the dataset into two based on the criteria: the number of characters in the state column is 2. This criterion can help to separate data with and without problems. Store two datasets in the environment as dataset1 and dataset2, respectively. [hint: nchar() is a function that counts the number of characters of a string value.]

Step 2: Assuming that dataset1 is the dataset with the number of characters equal to 2. you need to transform dataset1 so that the values for the state variable are labeled with full state names.  You can accomplish this by joining another dataset state_code. 

Step 3: After you complete step 2, you can combine dataset1 and dataset2. Rename the new dataset as "PRC_data_breach_cleaned". Please watch the condition for combining datasets so that your datasets are combined correctly and also make sure your dataset satisfies the requirements.

Hint
BusinessData transformation is the process of changing data from one format to another, usually from a source system's format to a destination system's needed format. Most data integration and data management operations, such as data wrangling and data warehousing, need data transformation. ...

Know the process

Students succeed in their courses by connecting and communicating with
an expert until they receive help on their questions

1
img

Submit Question

Post project within your desired price and deadline.

2
img

Tutor Is Assigned

A quality expert with the ability to solve your project will be assigned.

3
img

Receive Help

Check order history for updates. An email as a notification will be sent.

img
Unable to find what you’re looking for?

Consult our trusted tutors.

Developed by Versioning Solutions.