SQLite is a lightweight, serverless, embedded database that can easily handle

Be Prepared For The Toughest Questions

Practice Problems

Q2 SQLite

SQLite is a lightweight, serverless, embedded database that can easily handle multiple gigabytes of data. It is one of the world’s most popular embedded database systems. It is convenient to share data stored in an SQLite database — just one cross-platform file which does not need to be parsed explicitly (unlike CSV files, which have to be parsed).

You will modify the given Q2_SQL.py file by adding SQL statements to it. We suggest that you consider testing your SQL locally on your computer using interactive tools to speed up testing and debugging, such as DB Browser for SQLite (https://sqlitebrowser.org).

NOTE: You must only use a version of Python ≥ 3.7.0 and < 3.8 for this question. This question has been developed, tested for these versions. You must not use any other versions (e.g., Python 3.8). The Autograder is using SQLite release 3.22.

NOTE: Do not modify the import statements, everything you need to complete this question has been imported for you. You may not use other libraries for this assignment.

NOTE: A Sample class has been provided for you to see some sample SQL statements, you can turn off this output by changing the global variable SHOW to False. This must be set to false before uploading to Gradescope.

GTusername - Please update the method GTusername with your credentials

NOTE: For the questions in this section, you must only use INNER JOIN when performing a join between two tables. Other types of joins may result in incorrect results.

a. Create tables and import data.

i. Create two tables (via two separate methods, part_ai_1 and part_ai_2, respectively in Q2_SQL.py) named movies and movie_cast with columns having the indicated data types:

1. movies

1. id (integer)

2. title (text)

3. score (real)

2. movie_cast

1. movie_id (integer)

2. cast_id (integer)

3. cast_name (text)

4. birthday (text)

5. popularity (real)

ii. Import the provided movies.csv file into the movies table and movie_cast.csv into the movie_cast table

1. You will write Python code that imports the .csv files into the individual tables. This will include looping though the file and using the ‘INSERT INTO’ SQL command. Only use relative paths while importing files since absolute/local paths are specific locations that exist only on your computer and will cause the auto-grader to fail.

iii. Vertical Database Partitioning. Database partitioning is an important technique that divides large tables into smaller tables, which may help speed up queries. For this question you will create a new table cast_bio from the movie_cast table (i.e., columns in cast_bio will be a subset of those in movie_cast) Do not edit the movie_cast table. Be sure that when you insert into the new cast_bio that the values are unique. Please read this page for an example of vertical database partitioning.

cast_bio

1. cast_id (integer)

2. cast_name (text)

3. birthday (date)

4. popularity (real)

b. Create indexes. Create the following indexes for the tables specified below. This step increases the speed of subsequent operations; though the improvement in speed may be negligible for this small database, it is significant for larger databases.

i. movie_index for the id column in movies table

ii. cast_index for the cast_id column in movie_cast table

iii. cast_bio_index for the cast_id column in cast_bio table

c. Calculate a proportion. Find the proportion of movies having a score > 50 and that has ‘war’ in the name. Treat each row as a different movie. The proportion should only be based on the total number of rows in the movie table. Format all decimals to two places using printf(). Do NOT use the ROUND() function as it does not work the same on every OS.

Output format and sample value:

7.70

d. Find the most prolific actors. List 5 cast members with the highest number of movie appearances that have a popularity > 10. Sort the results by the number of appearances in descending order, then by cast_name in alphabetical order.

Output format and sample values (cast_name,appearance_count):

Harrison Ford,2

e. Find the highest scoring movies with the smallest cast. List the 5 highest-scoring movies that have the fewest cast members. Sort the results by score in descending order, then by number of cast members in ascending order, then by movie name in alphabetical order. Format all decimals to two places using printf().

Output format and sample values (movie_title,movie_score,cast_count):

Star Wars: Holiday Special,75.01,12

War Games,58.49,33

f. Get high scoring actors. Find the top ten cast members who have the highest average movie scores. Format all decimals to two places using printf().

▪ Sort the output by average score in descending order, then by cast_name in alphabetical order.

▪ Do not include movies with score <25 in the average score calculation.

▪ Exclude cast members who have appeared in two or fewer movies.

Output format and sample values (cast_id,cast_name,average_score):

8822,Julia Roberts,53.00

g. Creating views. Create a view (virtual table) called good_collaboration that lists pairs of actors who have had a good collaboration as defined here. Each row in the view describes one pair of actors who appeared in at least 3 movies together AND the average score of these movies is >= 40.

The view should have the format:

good_collaboration(

cast_member_id1,

cast_member_id2,

movie_count,

average_movie_score)

For symmetrical or mirror pairs, only keep the row in which cast_member_id1 has a lower numeric value. For example, for ID pairs (1, 2) and (2, 1), keep the row with IDs (1, 2). There should not be any “self pair” where the value of cast_member_id1 is the same as that of cast_member_id2.

Remember that creating a view will not produce any output, so you should test your view with a few simple select statements during development. One such test has already been added to the code as part of the auto-grading.

Optional Reading: Why create views?

i. Find the best collaborators. Get the 5 cast members with the highest average scores from the good_collaboration view, and call this score the collaboration_score. This score is the average of the average_movie_score corresponding to each cast member, including actors in cast_member_id1 as well as cast_member_id2. Format all decimals to two places using printf().

• Sort your output by collaboration_score in descending order, then by cast_name alphabetically.

Output format (cast_id,cast_name,collaboration_score):

2,Mark Hamil,99.32

1920,Winoa Ryder,88.32

h. SQLite supports simple but powerful Full Text Search (FTS) for fast text-based querying (FTS documentation). Import movie overview data from the movie_overview.csv into a new FTS table called movie_overview with the schema:

movie_overview

▪ id (integer)

▪ overview (text)

NOTE: Create the table using fts3 or fts4 only. Also note that keywords like NEAR, AND, OR and NOT are case sensitive in FTS queries.

i. Count the number of movies whose overview field contains the word ‘fight’. Matches are not case sensitive. Match full words, not word parts/sub-strings. e.g., Allowed: ‘FIGHT’, ‘Fight’, ‘fight’, ‘fight.’. Disallowed: ‘gunfight’, ‘fighting’, etc.

Output format:

ii. Count the number of movies that contain the terms ‘space’ and ‘program’ in the overview field with no more than 5 intervening terms in between. Matches are not case sensitive. As you did in h(i)(1), match full words, not word parts/sub-strings. e.g., Allowed: ‘In Space there was a program’, ‘In this space program’. Disallowed: ‘In space you are not subjected to the laws of gravity. A program.’, etc.

Output format:

submission.py

Hint

ComputerA database management system refers to a software system that uses a method that is standard to store and organize data. The system is usually designed to manage a database. There are several types of database systems namely centralized, commercialized, distributed, operational, end-user, and relational database....

Select Deadline for Completion

4 Days

3 Days

2 Days

1 Day

1 to 15 Hours

Know the process

Students succeed in their courses by connecting and communicating with
an expert until they receive help on their questions

Unable to find what you’re looking for?

Consult our trusted tutors.

Ask a Question

Be Prepared For The Toughest Questions

Practice Problems

Related questions

Know the process

Submit Question

Tutor Is Assigned

Receive Help

Unable to find what you’re looking for?