khsu2000 / nants

EECS 189 Team .N.A.N.T.S. Project T Final: Basic Data Cleaning

Learning Objectives

Data cleaning is an essential step in applied machine learning. Unlike the idealized datasets students encounter in academic settings, real data can have missing values, be poorly formatted, or contain outliers and other problematic data points. From interviews with industry professionals, Team AWGSJ reported that 70% of the work in machine learning is dedicated to data collection, cleaning, and visualization. The goal of our project is to teach students commonly used data preprocessing techniques, covering the following topics:

  • Visualizing data with Seaborn [Optional Review Topic]
  • Normalizing and standardizing numerical data
  • One-hot encoding categorical data
  • Processing text fields with regex
  • Handling imbalanced class distributions
  • Filling null/missing values
  • Removing outliers (without OMP)
  • Image data augmentation
  • Data whitening

Note that visualizing data with Seaborn is an optional topic that can be skipped, since this topic should in principle be covered by another project. This topic has been explicitly marked as optional in the slides, notes, quiz, and assignment. However, since data cleaning and visualization are closely tied together, we have decided to include it as a review topic students can revisit if they choose to.
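
To give a flavor of several of the topics listed above, here is a minimal, library-free sketch of filling missing values, normalizing, standardizing, one-hot encoding, regex-based text cleaning, and outlier removal. It is not taken from the assignments, which may well use pandas, scikit-learn, or other libraries; every function name below is hypothetical, and the median fill and the 1.5 × IQR outlier rule are only common defaults assumed for illustration.

```python
import re
import statistics


def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to [0, 1]. Assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale to zero mean and unit population standard deviation.

    Assumes the values are not all equal (otherwise the deviation is zero).
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def one_hot(labels):
    """Encode each label as a 0/1 vector; columns follow sorted label order."""
    categories = sorted(set(labels))
    rows = [[int(label == c) for c in categories] for label in labels]
    return categories, rows


def clean_price(text):
    """Pull a number out of a messy price string such as ' $1,234.50 '."""
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


def remove_outliers_iqr(values, k=1.5):
    """Keep only values within k * IQR of the first and third quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if q1 - k * iqr <= v <= q3 + k * iqr]


# Example usage on small, made-up inputs.
ages = fill_missing([22, None, 35, 41, 29])
scaled_ages = min_max_normalize(ages)
standard_ages = standardize(ages)
categories, encoded = one_hot(["red", "green", "red"])
price = clean_price(" $1,234.50 ")
trimmed = remove_outliers_iqr([10, 12, 11, 13, 12, 95])
```

Each helper takes a plain list and returns a new one, so the steps can be chained in whatever order a dataset calls for. In practice the order matters: missing values usually need to be filled before scaling, and outliers are often removed before computing the mean and standard deviation used for standardization.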

Directory Navigation

Listed below are our deliverables for this project, as well as a short description accompanying them.

  • Assignments: A directory containing all assignments created for this course. Each assignment is kept in its own subdirectory with two IPython notebooks: a solution version, whose notebook name always ends with the suffix "Sol", and a blank version for students. Any .csv data files an assignment requires are stored in the same subdirectory. The assignments are independent of each other and can be completed in any order, but all of them assume the student has already read the slides and notes. Each subdirectory also includes an Objectives.md file describing how the assignment accomplishes a learning objective.
  • Notes: A directory containing a PDF file of the notes, as well as a zip file with the LaTeX source.
  • Slide-Deck: A directory containing a PDF file of the slide-deck, as well as a .pptx file for an editable version of the slides.
  • Quiz Questions: A directory containing two PDF files, one with just the quiz questions and one with both the quiz questions and solutions.

Additionally, we have included a Notes Visuals directory that contains the image files and IPython notebooks that were used to generate visuals used in the slides and notes.
