End of Chapter Exercises

Chapter 2: Data Manipulation in Python: Pandas, Polars, Dask, and Modin

Objective:

Delve deeper into data manipulation with Python's most widely used libraries. Explore the functionality of Pandas and get a first look at alternatives such as Polars, Dask, and Modin.

Instructions:

1. Data Cleaning with Pandas:

  • Load a dataset of your choice (preferably one from the previous assignment) into your Colab notebook or a new Python script.
  • Identify and handle missing values in the dataset.
  • Remove any duplicate rows and columns, if they exist.
  • Clean the column names (for example, trim whitespace and use a consistent case).
  • Document the changes made to the dataset, either in a markdown cell if using a notebook or in comments if using a script.
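The cleaning steps above can be sketched as follows. This is a minimal illustration, not a required solution: the sample DataFrame and its column names are hypothetical stand-ins for your own dataset, and filling gaps with the median or a placeholder is only one of several reasonable ways to handle missing values.

```python
import pandas as pd

# Hypothetical sample data standing in for your own dataset.
raw = pd.DataFrame(
    {
        " Patient ID ": [1, 2, 2, 3],
        "Age (years)": [34.0, None, None, 51.0],
        "City": ["Boston", "Austin", "Austin", None],
        "City Copy": ["Boston", "Austin", "Austin", None],
    }
)

# 1. Clean column names: trim, lowercase, and use underscores as separators.
df = raw.copy()
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(r"[^0-9a-z]+", "_", regex=True)
    .str.strip("_")
)

# 2. Count missing values per column before changing anything.
missing_before = df.isna().sum()

# 3. Drop columns whose values repeat an earlier column, then drop duplicate rows.
df = df.loc[:, ~df.T.duplicated()]
df = df.drop_duplicates().reset_index(drop=True)

# 4. Handle missing values (assumed strategy: median for numbers,
#    an explicit placeholder for text).
df["age_years"] = df["age_years"].fillna(df["age_years"].median())
df["city"] = df["city"].fillna("unknown")

print(missing_before)
print(df)
```

Whether to fill or drop missing values depends on the dataset, so record the choice you make in your markdown cell or comments.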

2. Data Transformation:

  • Create new columns based on existing ones (e.g., if you have a 'birth_date' column, create an 'age' column).
  • Aggregate data using groupby and compute summary statistics.
  • Use pivot tables or cross-tabulations for multi-dimensional analysis.
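The transformation steps above might look like the sketch below. The data, column names, and the fixed reference date are all hypothetical and chosen only so the example is reproducible; with a real dataset you would likely use today's date and your own grouping columns.

```python
import pandas as pd

# Hypothetical sample data; replace with your own dataset.
people = pd.DataFrame(
    {
        "birth_date": ["1990-05-17", "1985-11-02", "2001-01-30", "1978-07-09"],
        "department": ["sales", "sales", "support", "support"],
        "region": ["east", "west", "east", "east"],
        "salary": [52000, 61000, 43000, 58000],
    }
)
people["birth_date"] = pd.to_datetime(people["birth_date"])

# Assumed fixed reference date so the ages do not change from run to run.
reference_date = pd.Timestamp("2024-01-01")

# New column: age in whole years. Subtract one year when the birthday
# has not yet occurred in the reference year.
had_birthday = (people["birth_date"].dt.month < reference_date.month) | (
    (people["birth_date"].dt.month == reference_date.month)
    & (people["birth_date"].dt.day <= reference_date.day)
)
people["age"] = (
    reference_date.year
    - people["birth_date"].dt.year
    - (~had_birthday).astype(int)
)

# Aggregate with groupby and compute summary statistics.
summary = people.groupby("department")["salary"].agg(["mean", "min", "max", "count"])

# Pivot table: mean salary by department (rows) and region (columns).
pivot = people.pivot_table(
    index="department", columns="region", values="salary", aggfunc="mean"
)

# Cross-tabulation: number of people in each department/region pair.
counts = pd.crosstab(people["department"], people["region"])

print(summary)
print(pivot)
print(counts)
```

The pivot table leaves a missing value where a department/region pair has no rows, while the cross-tabulation reports a count of zero for the same pair.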

3. Introduction to Alternative Libraries:

  • Read about Polars, Dask, and Modin in Chapter 2.
  • Load your dataset using Polars and Modin.
  • Compare the load times and write your observations in a markdown cell or in your script.
    • See Chapter 2 - Adv Exercises for an example that uses the time module (import time) to capture the start time and end time.
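One way to capture load times is sketched below using time.perf_counter from the standard library. The helper name time_load and the generated CSV are hypothetical; the sketch only times the libraries that are installed in your environment and skips the rest. Note that a tiny file like this one mostly measures start-up overhead (Modin in particular initializes its engine on first use), so run the comparison on your real dataset and repeat it a few times before drawing conclusions.

```python
import csv
import importlib
import tempfile
import time
from pathlib import Path


def time_load(loader, path):
    """Call loader(path) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = loader(path)
    elapsed = time.perf_counter() - start
    return result, elapsed


load_times = {}

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical CSV file standing in for your own dataset.
    csv_path = Path(tmp_dir) / "sample.csv"
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "value"])
        writer.writerows((i, i * 0.5) for i in range(10_000))

    # Each of these modules exposes a read_csv function.
    candidates = [
        ("pandas", "pandas"),
        ("polars", "polars"),
        ("modin", "modin.pandas"),
    ]
    for label, module_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            print(f"{label}: not installed, skipping")
            continue
        try:
            _, elapsed = time_load(module.read_csv, str(csv_path))
        except Exception as exc:  # e.g. Modin without an execution engine
            print(f"{label}: could not load ({exc})")
            continue
        load_times[label] = elapsed

for label, elapsed in sorted(load_times.items(), key=lambda item: item[1]):
    print(f"{label}: {elapsed:.4f} s")
```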

4. Submission:

  • Create a new GitHub repository named datasci_2_manipulation in your GitHub account.
  • Organize your GitHub repository so that it contains:
    • A "datasets" folder containing the dataset you worked on.
    • Your Colab notebook or Python script.
  • Submit the link to your GitHub repository.

Tip: Pandas is powerful, but it holds the whole dataset in memory and runs most operations on a single core, so explore the alternative libraries when you need to handle larger datasets efficiently.