EX07 - Data Wrangling


Overview

In Exercise 07, you will write some additional utility functions for wrangling data from a raw CSV data file into representations more conducive to computing. In the process, you will gain experience and comfort working with dict and list data structures.

To complete this exercise, you must watch the materials from Tuesday’s lecture, where Kris provides the code for several of the functions you need for this assignment.

In this exercise, you will move through a very common first set of steps when working with a new data set:

  1. Read the data
  2. Transform it to be in a “shape” that is easier to work with
  3. Preview and select just the parts of the dataset you are interested in
  4. Run analyses (simple ones, in this notebook)
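The four steps above can be sketched in plain Python. This is only an illustration: the sample rows, the column names, and the helper names read_rows and to_columns are hypothetical, and the functions you will actually write are described in the notebook.

```python
import csv
import io

# Hypothetical sample data standing in for the real CSV file.
RAW_CSV = """date,subject_age,outcome
2015-01-01,33,citation
2015-01-01,25,warning
2015-01-02,41,citation
"""


def read_rows(source: io.StringIO) -> list[dict[str, str]]:
    """Step 1: read each CSV row as a dict keyed by column name."""
    return [dict(row) for row in csv.DictReader(source)]


def to_columns(rows: list[dict[str, str]]) -> dict[str, list[str]]:
    """Step 2: reshape row-oriented data into a column-oriented table."""
    table: dict[str, list[str]] = {}
    for row in rows:
        for name, value in row.items():
            table.setdefault(name, []).append(value)
    return table


rows = read_rows(io.StringIO(RAW_CSV))
table = to_columns(rows)

# Step 3: preview the first two rows of just the columns of interest.
wanted = ("date", "outcome")
preview = {name: values[:2] for name, values in table.items() if name in wanted}

# Step 4: a simple analysis, counting how often each outcome appears.
counts: dict[str, int] = {}
for outcome in table["outcome"]:
    counts[outcome] = counts.get(outcome, 0) + 1
```

Storing the table by column (a dict of lists) makes it easy to pick out a few columns or to loop over a single column, which is why the reshaping step comes before previewing and analysis.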

The sample data set provided alongside this exercise is police stop data from Durham, as compiled by the Stanford Open Policing Project.

There are two files you will be working in: data_utils.py and data_wrangling.ipynb. The first is where you will write your function definitions. You can copy your functions from the CSV file reading lesson (LS23) into this file. The Jupyter notebook will guide you through the exercise and lead you through a simple analysis. In the notebook, all you need to do is run the cells. You will get some baseline credit just for running them and moving through the steps as instructed.

Assignment Outline

  • head – 30 Points Autograded

  • select – 25 Points Autograded

  • concat – 15 Points Autograded

  • Did you run the notebook cells and save the values of outputs before submission? – 10 Points Handgraded

  • Style, Linting, Typing – 20 Points Autograded

This means anything in the notebook that is not included in this outline was either provided to you in one of Kris’ videos or is optional!

0. Pull the skeleton code

You will find the starter files needed by “pulling” from the course workspace repository. Before beginning:

  1. Be sure you are in your course workspace. Open the file explorer and you should see your work for the course. If you do not, open your course workspace through File > Open Recent.
  2. Open the Source Control View by clicking the graph icon (three circles connected by lines) in your sidebar, or by opening the Command Palette and searching for Source Control.
  3. Click the ellipsis in the Source Control pane and select “Pull” from the drop-down menu. This will begin the pulling process from the course repository. It should silently succeed.
  4. Return to the File Explorer pane and open the exercises directory. You should see it now contains another directory named ex07. If you expand that directory, you will see the starter files: data_utils.py and data_wrangling.ipynb. Additionally, you will see a CSV file in the data/ directory with the sample traffic stops from one week in Durham in 2015.
  5. If you do not see the ex07 directory, try once more, but in step 3 select “Pull From” and then select origin.

1. Starter Files

Note: this exercise makes use of functions that were first introduced in LS23 on March 8th: CSV file Reading. You will use the functions defined in data_utils.py from that lesson and add additional functions to it in this exercise. This is accidentally referenced as the lecture on “10/19” in part 0.0.

Correction: In Part 1, “Displaying Tabular data with the tabulate 3rd Party Library”, there is an incorrect data type in the code cell declaring universities. It should be: dict[str, list[str]]
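As an illustration of that type, a declaration might look like the sketch below. Only the annotation dict[str, list[str]] comes from the correction; the column names and values are hypothetical and are not the contents of the notebook cell.

```python
# Each key is a column name (str); each value is that column's entries (list[str]).
# The contents here are made up for illustration.
universities: dict[str, list[str]] = {
    "school": ["UNC", "Duke", "NCSU"],
    "city": ["Chapel Hill", "Durham", "Raleigh"],
}

# In a well-formed column-oriented table, every column has the same length.
column_lengths = {len(values) for values in universities.values()}
is_rectangular = len(column_lengths) == 1
```

The annotation says the dictionary maps column names to lists of strings, which is the column-oriented shape used throughout this exercise.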

Your work in this exercise will be completed across two files:

  1. data_utils.py - This is the Python module where you will implement utility functions for working with data.

  2. data_wrangling.ipynb - This is the Jupyter Notebook file that makes use of the utility functions.

The descriptions of the functions you will need to implement, as well as example code that makes use of those functions once they are correctly implemented, can be found in the data_wrangling.ipynb file.

If your screen is large enough, you are encouraged to open these files side-by-side in VSCode by dragging the tab of one to the right side of VSCode so that it changes to a split pane view. Closing your file explorer can help give you additional horizontal space.

Be sure to save your work in data_utils.py before reevaluating cells in data_wrangling.ipynb.

Autograding

BEFORE SUBMITTING YOUR ASSIGNMENT, BE SURE TO SAVE YOUR WORK IN YOUR JUPYTER NOTEBOOK!!!

Log in to Gradescope and select the assignment named “EX07 - Data Wrangling”. You’ll see an area to upload a zip file. To produce a zip file for autograding, return to Visual Studio Code.

If you do not see a Terminal at the bottom of your screen, open the Command Palette and search for “View: Toggle Integrated Terminal”.

To produce a zip file for ex07, type the following command (all on a single line):

python -m tools.submission exercises/ex07

In the file explorer pane, look for the zip file named “21.mm.dd-hh.mm-exercises-ex07.zip”. The “mm”, “dd”, and so on, are timestamps with the current month, day, hour, and minute. If you right click on this file and select “Reveal in File Explorer” on Windows or “Reveal in Finder” on Mac, the zip file’s location on your computer will open. Upload this file to Gradescope to submit your work for this exercise.
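To make the naming pattern concrete, the sketch below builds a file name of that shape from a timestamp. It is not the actual tools.submission code: the function zip_name is hypothetical, and it assumes a two-digit year and a 24-hour clock.

```python
from datetime import datetime


def zip_name(when: datetime, target: str = "exercises/ex07") -> str:
    """Build a name shaped like yy.mm.dd-hh.mm-exercises-ex07.zip (assumed pattern)."""
    stamp = when.strftime("%y.%m.%d-%H.%M")  # two-digit year, 24-hour clock (assumption)
    return f"{stamp}-{target.replace('/', '-')}.zip"


# Example with a fixed, made-up timestamp: March 10, 2021 at 14:05.
example = zip_name(datetime(2021, 3, 10, 14, 5))
```

If you have produced several zip files, the timestamp at the front of each name tells you which one is the most recent.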

Autograding will take a few moments to complete. For this exercise there will be points manually graded for style – using meaningful variable names and snake_case. If there are issues reported, you are encouraged to try to resolve them and resubmit. If for any reason you aren’t receiving full credit and aren’t sure what to try next, come visit us in office hours!

Contributor(s): Kaki Ryan, Izzy Ford, Kris Jordan