ORF 350 --- Spring 2019

Analysis of Big Data

Basic info

Course description: The amount of data in our world has been exploding, and analyzing large data sets is a central challenge in society. This course introduces the statistical principles and computational tools for analyzing big data. Topics include statistical modeling and inference, model selection and regularization, scalable computational algorithms, and more.

The course has two main learning objectives:

Develop a data modeling toolkit (statistical methodology, computational algorithms, understanding of what to apply when)
Become comfortable analyzing data sets

To achieve the latter, there will be numerous assignments that require analyzing data sets.
These will use Jupyter notebooks, with a focus on Python and R.

Prerequisites: ORF 245 (Statistics) and ORF 309 (Probability and Stochastic Systems).

Instructor: Miklos Z. Racz
Lecture time and location: MW 8:30 - 9:50 am, 006 Friend Center
Office hours: M 10:00 am - 12:00 pm, 204 Sherrerd Hall

Teaching Assistants (AIs):

Samy Jelassi, office hours: Tu 4:30 - 6:30 pm, 005 Sherrerd
Suqi Liu, office hours: W 7:00 - 9:00 pm, 005 Sherrerd
Thomas Pumir, office hours: F 9:00 - 11:00 am, 005 Sherrerd

Precepts:

P1: Tu 3:30 - 4:20 pm, Friend 108; Samy Jelassi
P2: Tu 7:30 - 8:20 pm, Friend 009; Suqi Liu
P3: Th 7:30 - 8:20 pm, Sherrerd 001; Thomas Pumir

Grading and course policies

Grading: There will be homework problem sets throughout the semester (approximately weekly), as well as a midterm and a take-home final exam.
Your final score is a combination of your performance in these, with the following breakdown:

HW 50%
midterm 20%
final 30%
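
As an illustration of the arithmetic only, here is a minimal Python sketch of how the breakdown above could combine with the policy that the lowest homework score is dropped. It is not the official grade computation. The function name final_score is hypothetical, and the sketch assumes that every score is on a 0-100 scale and that the remaining homeworks count equally; the syllabus does not state either of these.

```python
def final_score(homework, midterm, final_exam):
    """Combine component scores (each assumed to be on a 0-100 scale)."""
    if len(homework) < 2:
        raise ValueError("need at least two homework scores to drop the lowest")
    # Drop the single lowest homework score, then average the rest.
    # Assumption: the remaining homeworks are weighted equally.
    kept = sorted(homework)[1:]
    hw_avg = sum(kept) / len(kept)
    # Weights from the breakdown: HW 50%, midterm 20%, final 30%.
    return 0.5 * hw_avg + 0.2 * midterm + 0.3 * final_exam


# Made-up scores, for illustration only.
example = final_score([90, 80, 100, 60], midterm=70, final_exam=85)
print(round(example, 2))
```

With these made-up scores, the 60 is dropped, the homework average is 90, and the combined score comes to about 84.5.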

Midterm info: Monday, March 11, in class

Final info: Take-home final exam, details posted on Piazza

Homework and collaboration policy:
Please be considerate of the grader and write solutions neatly. Unreadable solutions will not be graded.
Please follow the instructions on the problem set regarding submitting your homework online via Blackboard.
Please write your name, Princeton email, and the names of other students you discussed with on the first page of your HW.
No late homework will be accepted. Your lowest homework score will be dropped.

You should first attempt to solve homework problems on your own.
You are encouraged to discuss any remaining difficulties in study groups of two to four people.
However, you must write up the solutions on your own and you must never read or copy the solutions of other students.
Similarly, you may use books or online resources to help solve homework problems, but you must always credit all such sources in your writeup, and you must never copy material verbatim.

Advice: do the homeworks! The best way to understand the material is to solve many problems and analyze many data sets. In particular, the homeworks are designed to help you learn the material along the way.

Email policy: For questions about the material, please come to office hours.
For questions of general interest, please post to the course Piazza page.
This facilitates quick and efficient communication with the class.
Please use email only for emergencies and administrative or personal matters.
Please include "ORF 350" in the subject line of any email about the course.

Resources

Recommended text:

Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning (Second Edition), 2009.
A PDF is available on the book's webpage.

Piazza: The course has a Piazza page.
Think of this as a Q&A wiki for the course; use it for questions and discussions. For more details, see Piazza.

Schedule

Classes begin on Monday, February 4.

Week 1: Introduction and overview; intro to Jupyter; math review; maximum likelihood estimation
Week 2: Linear regression
Week 3: High-dimensional regression (ridge, lasso)
Weeks 4 & 5: Classification (naive Bayes, support vector machines, logistic regression, decision trees, random forests)
Week 6: Midterm; bagging; boosting
Week 7: Clustering (k-means, hierarchical clustering, spectral clustering)
Week 8: Dimensionality reduction (PCA)
Weeks 9 & 10: Intro to deep learning (convolutional neural networks)
Weeks 11 & 12: Additional topics (depending on time): causal inference, multiple testing problem (false discovery rate), network data
April 24: William Pierson Field Lecture by Wayne Tai Lee, titled "Data science in classic industries vs digital companies"

Note: this plan is subject to change depending on how we progress throughout the semester.


Miklos Z. Racz

miklos dot racz at northwestern dot edu
