ORF 350 --- Spring 2019
Analysis of Big Data
Basic info
Course description: The amount of data in our world has been exploding, and analyzing large data sets is a central challenge in society. This course introduces the statistical principles and computational tools for analyzing big data. Topics include statistical modeling and inference, model selection and regularization, scalable computational algorithms, and more.
The course has two main learning objectives:
- Develop a data modeling toolkit (statistical methodology, computational algorithms, understanding of what to apply when)
- Become comfortable analyzing data sets
Both objectives will be pursued using Jupyter notebooks, with a focus on Python and R.
Prerequisites: ORF 245 (Statistics) and ORF 309 (Probability and Stochastic Systems).
Instructor: Miklos Z. Racz
Lecture time and location: MW 8:30 - 9:50 am, 006 Friend Center
Office hours: M 10:00 am - 12:00 pm, 204 Sherrerd Hall
Teaching Assistants (AIs):
- Samy Jelassi, office hours: Tu 4:30 - 6:30 pm, 005 Sherrerd
- Suqi Liu, office hours: W 7:00 - 9:00 pm, 005 Sherrerd
- Thomas Pumir, office hours: F 9:00 - 11:00 am, 005 Sherrerd
Precepts:
- P1: Tu 3:30 - 4:20 pm, Friend 108; Samy Jelassi
- P2: Tu 7:30 - 8:20 pm, Friend 009; Suqi Liu
- P3: Th 7:30 - 8:20 pm, Sherrerd 001; Thomas Pumir
Grading and course policies
Grading: There will be homework problem sets throughout the semester (approximately weekly), as well as a midterm and a take-home final exam.
Your final score is a combination of your performance in these, with the following breakdown:
- HW 50%
- midterm 20%
- final 30%
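As a worked illustration of the breakdown above, here is a minimal Python sketch that combines the three components. It assumes every score is on a 0-100 scale and that the problem sets are weighted equally once the lowest score is dropped (per the homework policy); the syllabus states neither assumption, and the function name `final_score` is hypothetical.

```python
def final_score(homework, midterm, final_exam):
    """Combine scores (each assumed to be on a 0-100 scale) with the syllabus weights."""
    if len(homework) < 2:
        raise ValueError("need at least two homework scores to drop the lowest")
    # Drop the single lowest homework score, then average the rest.
    # Equal weighting of problem sets is an assumption, not a stated policy.
    kept = sorted(homework)[1:]
    hw_avg = sum(kept) / len(kept)
    # Weights from the syllabus: HW 50%, midterm 20%, final 30%.
    return 0.5 * hw_avg + 0.2 * midterm + 0.3 * final_exam


# Example with made-up scores: the 70 is dropped, so the homework average is 90.
score = final_score([90, 70, 100, 80], midterm=85, final_exam=75)
print(round(score, 2))
```

With these made-up numbers the homework average is 90, so the total is 0.5 * 90 + 0.2 * 85 + 0.3 * 75 = 84.5.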
Final info: Take-home final exam, details posted on Piazza
Homework and collaboration policy:
Please be considerate of the grader and write solutions neatly. Unreadable solutions will not be graded.
Please follow the instructions on the problem set regarding submitting your homework online via Blackboard.
Please write your name, your Princeton email, and the names of any students you discussed the problems with on the first page of your HW.
No late homework will be accepted. Your lowest homework score will be dropped.
You should first attempt to solve homework problems on your own.
You are encouraged to discuss any remaining difficulties in study groups of two to four people.
However, you must write up the solutions on your own and you must never read or copy the solutions of other students.
Similarly, you may use books or online resources to help solve homework problems, but you must always credit all such sources in your writeup, and you must never copy material verbatim.
Advice: do the homeworks! The best way to understand the material is to solve many problems and analyze many data sets. In particular, the homeworks are designed to help you learn the material along the way.
Email policy: For questions about the material, please come to office hours.
For general interest questions, please post to the course Piazza page.
This facilitates quick and efficient communication with the class.
Please use email only for emergencies and administrative or personal matters.
Please include "ORF 350" in the subject line of any email about the course.
Resources
Recommended text:
- Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning (Second Edition), 2009. [book webpage (including pdf)]
Piazza: Think of the course Piazza page as a Q&A wiki for the course; use it for questions and discussions.
Schedule
Classes begin on Monday, February 4.
- Week 1: Introduction and overview; intro to Jupyter; math review; maximum likelihood estimation
- Week 2: Linear regression
- Week 3: High-dimensional regression (ridge, lasso)
- Weeks 4 & 5: Classification (naive Bayes, support vector machines, logistic regression, decision trees, random forests)
- Week 6: Midterm; bagging; boosting
- Week 7: Clustering (k-means, hierarchical clustering, spectral clustering)
- Week 8: Dimensionality reduction (PCA)
- Weeks 9 & 10: Intro to deep learning (convolutional neural networks)
- Weeks 11 & 12: Additional topics (depending on time): causal inference, multiple testing problem (false discovery rate), network data
- April 24: William Pierson Field Lecture by Wayne Tai Lee, titled "Data science in classic industries vs digital companies"