Remove all rows which reference infrequently occurring words from docwords
Ask Expert

Be Prepared For The Toughest Questions

Practice Problems

Remove all rows which reference infrequently occurring words from docwords

Task 4: Creating co-occurring words from Bag of Words data

Using the same data as that for task 3 perform the following subtasks:

a) [spark SQL] Remove all rows which reference infrequently occurring words from docwords. Store the resulting dataframe in Parquet format at “../frequent_docwords.parquet” and in CSV format at “Task_4a-out”. An infrequently occurring word is any word that appears less than 1000 times in the entire corpus of documents. For the small example input file the expected output is:

3,1,1200

3,2,702

3,3,600

5,3,2000

5,2,200

1,1,1000

1,3,100

b) [spark SQL] Load up the Parquet file from “../frequent_docwords.parquet” which you created in the previous subtask. Find all pairs of frequent words that occur in the same document and report the number of documents the pair occurs in. Report the pairs in decreasing order of frequency. The solution may take a few minutes to run.

- Note there should not be any replicated entries like

o (truck, boat) (truck, boat)

- Note you should not have the same pair occurring twice in opposite order. Only one of the following should occur:

o (truck, boat) (boat, truck)

Save the results in CSV format at “Task_4b-out”. For the example above, the output should be as follows (it is OK if the text file output format differs from that below but the data contents should be the same):

boat, motorbike, 2

motorbike, plane, 2

boat, plane, 1

For example the following format and content for the text file will also be acceptable (note the order is slightly different, that is also OK since we break frequency ties arbitrarily):

(2,(plane, motorbike))

(2,(motorbike, boat))

(1,(plane, boat))

Hint
Computera) A DataFrame is an information structure that coordinates information into a 2-layered table of lines and sections, similar to a bookkeeping sheet. DataFrames are quite possibly the most well-known datum structures utilized in present-day information examination since they are an adaptable and natural approach to putting away and working with information...

Know the process

Students succeed in their courses by connecting and communicating with
an expert until they receive help on their questions

1
img

Submit Question

Post project within your desired price and deadline.

2
img

Tutor Is Assigned

A quality expert with the ability to solve your project will be assigned.

3
img

Receive Help

Check order history for updates. An email as a notification will be sent.

img
Unable to find what you’re looking for?

Consult our trusted tutors.

Developed by Versioning Solutions.