Q4 OpenRefine
OpenRefine is a Java application and requires Java JRE to run. Download and install Java if you do not have it (you can verify by typing ‘java -version’ in your computer’s terminal or command prompt).
a. Watch the videos on OpenRefine’s homepage for an overview of its features. Then, download and install OpenRefine release 3.3. Do not use version 3.4 (which is in beta status).
b. Import Dataset
● Run OpenRefine and point your browser at 127.0.0.1:3333.
● We use a products dataset from Mercari, derived from a Kaggle competition (Mercari Price
Suggestion Challenge). If you are interested in the details, visit the data description page.
We have sampled a subset of the dataset provided as "properties.csv".
● Choose "Create Project" → This Computer → properties.csv". Click "Next".
● You will now see a preview of the data. Click "Create Project" at the upper right corner.
c. Clean/Refine the data
NOTE: OpenRefine maintains a log of all changes. You can undo changes. Use the "Undo/Redo" button at the upper left corner. Follow the exact output format specified in every part below.
i. Select the category_name column and choose ‘Facet by Blank’ (Facet → Customized Facets → Facet by blank) to filter out the records that have blank values in this column. Provide the number of rows that return True in Q4Observations.txt. Exclude these rows.
Output format and sample values:
i.rows: 500
ii. Split the column category_name into multiple columns without removing the original column. For example, a row with “Kids/Toys/Dolls & Accessories” in the category_name column would be split across the newly created columns as “Kids”, “Toys” and “Dolls & Accessories”. Use the existing functionality in OpenRefine that creates multiple columns from an existing column based on a separator (i.e., in this case ‘/’) and does not remove the original category_name column. Provide the number of new columns that are created by this operation, excluding the original category_name column.
Output format and sample values:
ii.columns: 10
NOTE: There are many possible ways to split the data. While we have provided one way to accomplish this in step ii, some methods could create columns that are completely empty. In this dataset, none of the new columns should be completely empty. Therefore, to validate your output, we recommend that you verify that there are no columns that are completely empty, by sorting and checking for null values.
iii. Select the column name and apply the Text Facet (Facet → Text Facet). Cluster by using (Edit Cells → Cluster and Edit …) this opens a window where you can choose different “methods” and “keying functions” to use while clustering. Choose the keying function that produces the smallest number of clusters under the “Key Collision” method. Click ‘Select All’ and ‘Merge Selected & Close’. Provide the name of the keying function and the number of clusters that was produced.
Output format and sample values:
iii.function: fingerprint, 200
NOTE: Use the default Ngram size when testing Ngram-fingerprint.
v. Replace the null values in the brand_name column with the text “Unknown” (Edit Cells - > Transform). Provide the General Refine Evaluation Language (GREL) expression used.
Output format and sample values:
iv.GREL_categoryname: endsWith("food", "ood")
v. Create a new column high_priced with the values 0 or 1 based on the “price” column with the following conditions: if the price is greater than 90, high_priced should be set as 1, else 0. Provide the GREL expression used to perform this.
Output format and sample values:
v.GREL_highpriced: endsWith("food", "ood")
vi. Create a new column has_offer with the values 0 or 1 based on the item_description column with the following conditions: If it contains the text “discount” or “offer” or “sale”, then set the value in has_offer as 1, else 0. Provide the GREL expression used to perform this. Convert the text to lowercase before you search for the terms.
Output format and sample values:
vi.GREL_hasoffer: endsWith("food", "ood")
Deliverables: Submit the following files to Gradescope:
● properties_clean.csv : Export the final table as a comma-separated values (.csv) file.
● changes.json : Submit a list of changes made to file in json format. Use the “Extract Operation History” option under the Undo/Redo tab to create this file.
● Q4Observations.txt : A text file with answers to parts c.i, c.ii, c.iii, c.iv, c.v, c.vi. Provide each answer in a new line in the exact output format specified. Your file’s final formatting should result in a .txt file that has each answer on a new line followed by one blank line (to help visually separately the answers)
Students succeed in their courses by connecting and communicating with an expert until they receive help on their questions
Consult our trusted tutors.