afroCoderHanane / cmc-csci145-math166

Data Mining

CSCI145 / MATH166: Data Mining

Important links:

  1. Data Mining vs Machine Learning vs Artificial Intelligence vs Statistics
  2. What do data scientists get paid?

About the Instructor

Name:          Mike Izbicki (call me Mike)
Email:         mizbicki@cmc.edu
Office:        Adams 216
Office Hours:  See Issue #69
Zoom:          See Issue #70
Webpage:       https://izbicki.me
Research:      Machine Learning (see izbicki.me/research.html for some past projects)

Fun facts:

  1. grew up in San Clemente (~1 hr south of Claremont)
  2. 7 years in the Navy
    1. nuclear submarine officer, personally converted >10g of uranium into pure energy
    2. worked at National Security Agency (NSA)
    3. left Navy as a conscientious objector
  3. PhD/postdoc at UC Riverside
  4. taught in DPRK (i.e. North Korea)

About the Course

General Information:

  1. This is the theory course for CMC's Data Science major
  2. It prepares you for industry or graduate school
    1. Especially for machine learning technical interviews
    2. No SQL in this course => that's CSCI143 Big Data

Learning Objectives:

  1. See the Jupyter notebook

  2. Exposure to research-level data mining

    1. Understand the latest algorithms... but algorithms get outdated fast.

    2. The real goal is to teach you how to read research-level papers and math so that you can understand future techniques by yourself

  3. Major concepts

    1. Techniques
      1. Eigen-methods for data mining
      2. Logistic regression
      3. Kernel methods
      4. Neural networks
      5. word2vec
      6. Small amount of deep learning (transformers, CNNs, etc.)
    2. Math
      1. Bias/variance trade-off
      2. VC Dimension theorem (fundamental theorem of statistical learning)
      3. Regularization (L1, L2, elastic net, weight decay, early stopping, etc.)
      4. Optimization algorithms (gradient descent, stochastic gradient descent, Adam, etc.)
    3. Programming:
      1. Writing code that is easy to deploy
    4. Focus on text/web/social media examples
  4. Ethical implications of data mining

    Pet peeve: You can't fully understand the ethics if you don't understand the technical details

  5. Apply data mining libraries (PyTorch, scikit-learn, Gensim, spaCy, etc.)

    1. Teaching you how to use these libraries is NOT the primary goal of the course
    2. In-person class time will focus on the math, and I'm expecting you can figure out how to use the libraries on your own

Prerequisite knowledge:

  1. linear algebra
    1. eigenvectors
  2. computation
    1. big-O analysis
    2. git
    3. downloading/using Python libraries
  3. statistics
    1. super basic probability
    2. exposure to linear/logistic regression helpful but not required

Textbook:

I will provide all the reference material for this class. You don't have to buy anything.

  1. Learning from Data by Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin

    I am providing you all a free copy. It is yours to keep forever if you'd like (or you can return it to me at the end of the semester and I'll pass it on to future students). Feel free to highlight/take notes/etc in it as if it were your own book, because it is.

  2. Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz and Shai Ben-David

    Freely available from Shalev-Shwartz's website

  3. Lots of research papers / lecture notes

Grades:

Category                          Percent   Approximate Date
Projects                          30        Every 2-3 weeks
Quizzes                           0
Midterm 1 (PageRank)              15        Week 03
Midterm 2 (Learning from Data)    15        Week 08
Midterm 3 (Text mining)           15        Week 13
Final                             25

Projects:

  1. 4-7 projects

  2. All of them must be completed on the lambda server (i.e. using ssh+bash+vim)

    The lambda server has 80 CPUs and 8 GPUs.

  3. I'm expecting almost everyone will get full credit, and these will act as a "grade boost"

Quizzes:

  1. There will be 1 quiz per midterm testing definition memorization.
  2. I will give you the quiz before you take it.
  3. They are not worth any points, but you must get 100% on the quiz or you will fail the class.
  4. Unlimited retakes, but each retake deducts 1% from your final grade in the class.

Midterms:

  1. No programming, only math
  2. Take home, unlimited time, open note
  3. Very hard exams. (Historically, average in the 70s. No curve.)

Final:

  1. Oral exam
  2. The purpose is to help prepare you for interviews.
  3. The last week of class will be dedicated to prep.
  4. Your grade on the final exam can replace your lowest midterm grade, if that would improve your overall grade in the class.
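
To make the grading arithmetic concrete, here is a minimal sketch of how the pieces above might combine. The function name `course_grade` is hypothetical, and two details are assumptions rather than stated policy: that the final's percentage is substituted directly for the lowest midterm's percentage, and that each quiz retake subtracts one percentage point from the overall grade.

```python
# Category weights (percent of the course grade), taken from the grade table.
PROJECT_WEIGHT = 30
MIDTERM_WEIGHT = 15  # each of the three midterms
FINAL_WEIGHT = 25


def course_grade(projects, midterms, final, quiz_retakes=0):
    """Return the overall course percentage.

    projects, final: scores on a 0-100 scale.
    midterms: three scores on a 0-100 scale.
    quiz_retakes: total number of quiz retakes across the semester.
    """
    midterms = list(midterms)
    # All midterms carry the same weight, so the lowest one is the best
    # candidate for replacement, and replacing it helps only if the
    # final score is higher.
    lowest = min(range(len(midterms)), key=lambda i: midterms[i])
    if final > midterms[lowest]:
        midterms[lowest] = final  # assumption: direct substitution

    total = (
        PROJECT_WEIGHT * projects
        + MIDTERM_WEIGHT * sum(midterms)
        + FINAL_WEIGHT * final
    ) / 100

    # Assumption: each retake costs one percentage point.
    return total - quiz_retakes


if __name__ == "__main__":
    # Full project credit, midterms of 70/80/90, an 85 on the final,
    # and one quiz retake.
    print(course_grade(100, [70, 80, 90], 85, quiz_retakes=1))
```

In this example the 85 on the final replaces the 70 on the lowest midterm before the weights are applied.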

This is a hard class.

  1. The material is intrinsically hard

    1. Very few people find linear algebra, statistics and programming to ALL be easy subjects, and this class combines them all
    2. There's a reason people who understand this material get paid big salaries at FAANG
  2. You will have to read the required references.

    Not all the material will be covered in lectures. That's intentional: it gives you practice reading research-level data mining texts.

  3. Comments from previous students:

    1. Holy fucking shit this was a hard class. I had no idea there was so much god damned fucking math involved in a CS class. You should warn students about that.

    2. I spent 20+ hours per week on this class, and still only got a B. The class is too hard and you should make it easier.

    Unfortunately, I can't remove the math from this class, and I can't make the class easier. Otherwise, you wouldn't be learning the material needed to pass a technical interview / get a good job / go to grad school.

NOTE: In all of my other courses, I include required reading/watching tasks to learn about CS/DS culture. This course doesn't have these tasks because there is already a LOT of textbook reading that you will have to complete.

Late Work Policy:

You lose 20% on projects for each day late. It is still typically better to submit a correct assignment late than an incorrect one on time.

If you collaborate with other students, you get an automatic 2-day extension on any project.
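
As a rough illustration of the late policy, the sketch below computes a project score after the penalty. The function name `late_project_score` is hypothetical, and the details are assumptions: the penalty is taken as 20 percentage points of the earned score per day, the score cannot go below zero, and the collaboration extension is treated as two penalty-free days.

```python
def late_project_score(raw_score, days_late, collaborated=False):
    """Return the project score after the late penalty is applied.

    raw_score: score the submission would earn if it were on time.
    days_late: whole days past the original deadline.
    collaborated: True if the 2-day collaboration extension applies.
    """
    # Assumption: the extension simply shifts the deadline by two days.
    grace_days = 2 if collaborated else 0
    penalized_days = max(0, days_late - grace_days)

    # Assumption: 20 percentage points per day, floored at zero.
    percent_kept = max(0, 100 - 20 * penalized_days)
    return raw_score * percent_kept / 100


if __name__ == "__main__":
    # A correct project submitted one day late ...
    print(late_project_score(100, 1))
    # ... versus a half-correct project submitted on time.
    print(late_project_score(50, 0))
```

Under these assumptions, a fully correct project submitted one day late still scores higher than a half-correct project submitted on time, which matches the advice above.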

Collaboration Policy:

You are encouraged to discuss all labs and projects with other students, subject to the following constraints:

  1. you must be the person typing in all code for your assignments, and
  2. you must not copy another student's code.

You may use any online resources you like as references.

Basically, I'm trusting you all to be adults. You are ultimately responsible for ensuring you learn the material! So do what will help you learn best.

WARNING: All material in this class is cumulative. If you work "too closely" with another student on an assignment, you won't understand how to complete subsequent assignments, and you will quickly fall behind. You should view collaboration as a way to improve your understanding, not as a way to do less work.

Accommodations for Disabilities

I've tried to design the course to be as accessible as possible for all students. If you need any further accommodations (even if you don't have an officially recognized disability), please ask.

I want you to succeed and I'll make every effort to ensure that you can.
