Data cleaning


Introduction

This is an exercise about how to clean data using the Python library pandas! Data cleaning is vitally important - you can have great skill in manipulating data, but you won’t get any good results unless the data you’re working with is in a suitable form. You will NEED TO VISIT THIS LINK to learn about data cleaning in pandas. This is a tutorial notebook file meant to show you function and syntax for data cleaning. If you’re uncomfortable with pandas itself, check out the pandas tutorial and exercise on the Explore tab of the course website.

If you run into trouble, please email Jesse (jessew13@email.unc.edu) and Kyle (kjs20@email.unc.edu)!

Task 0:

Import pandas and numpy. Download this file and save it to the data directory. Save the path of the file as a named constant. Using read_csv, read the csv file into a pandas DataFrame. Use the head function to get an idea of the data (you can also open up the file with Excel).

Task 1:

Check that the “id” values are all unique first, and let those values index the DataFrame. Set the optional inplace parameter to True to cause the changes to be made directly to the DataFrame. Use head to check that the DataFrame is now indexed by “id”.

Task 2:

Assume there is no use for the indices_to_drop, latitude, and longitude columns - drop them (set the inplace optional parameter to True). How can you check that you succeeded?

Task 3:

Notice that we have now indexed the DataFrame by the numbers in the id column, but they’re out of order. Look up how to sort the rows of a pandas DataFrame by a column and do so (remember to apply changes inplace). Try to find a suitable website yourself, as it’s a VERY valuable skill to have, but if you need help finding a good website, you can click here

SOS - the resident with ticket id 22209583 has reported a gas leak in their room.

Task 4:

Print all the information associated with ticket id 22209583. Print out only the host_name. Use position-based indexing to find the same information.

Task 5: regex

Extract the rows with host_name values starting only with the letter J. Use this cheat sheet for some useful regex expressions! You will probably also need to Google how to use regex in pandas and Python. Then apply .dropna (remember to set inplace to True) to the dataset. Check your work!

Contributor(s): Jesse Wei, Kyle Sorensen