Stats 306B: Methods for Applied Statistics: Unsupervised Learning

Lester Mackey, Stanford University, Spring 2014

Final Project Guidelines

Your final project will give you experience carrying out applied research in unsupervised learning. An ideal project will begin with a compelling question (Are there multiple molecular subtypes of breast cancer?) or a pernicious problem (Lloyd’s algorithm for k-means is highly susceptible to suboptimal stationary points) and end with a thorough empirical analysis.

Along the way, you will review relevant literature, identify appropriate data sources, select appropriate means of evaluation, and either develop novel methodology for your problem or deploy and comprehensively evaluate existing methodology for your new application.
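To make the k-means example above concrete, here is a minimal sketch of how Lloyd's algorithm can settle at a suboptimal stationary point. It is a toy illustration, not a reference implementation: the four-point dataset, the two initializations, and the helper names `sq_dist` and `lloyd` are all assumptions chosen for this example.

```python
# Toy illustration (hypothetical data and helper names): Lloyd's algorithm
# converges to different stationary points depending on initialization.

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, centers, max_iter=100):
    """Run Lloyd's algorithm from the given initial centers.

    Returns the final centers and the k-means cost (sum of squared
    distances from each point to its nearest center).
    """
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: label each point with its nearest center.
        labels = [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:  # stationary point reached
            break
        centers = new_centers
    cost = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, cost

# Corners of a 4 x 1 rectangle; the best 2-clustering splits left from right.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

bad_centers, bad_cost = lloyd(points, [(0, 0), (0, 1)])    # both seeds on the left edge
good_centers, good_cost = lloyd(points, [(0, 0), (4, 0)])  # one seed per short edge

print(bad_cost)   # 16.0 (top/bottom split, a suboptimal stationary point)
print(good_cost)  # 1.0  (left/right split, the optimum)
```

Both runs stop because the assignment and update steps no longer change anything, yet the first run's cost is sixteen times larger. This is the kind of behavior that careful initialization or multiple restarts are meant to address.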

Your project goal is to make a significant and empirically demonstrable contribution to the resolution of a meaningful problem or question in the world. We recommend following one of two problem-driven paths. The first path involves developing new unsupervised learning methodology for an existing problem or application that has no fully satisfactory solution. The second involves tackling a new problem or application with existing methodology; in this case, you should identify one or more questions without satisfactory answers in your chosen domain and explore how unsupervised methodology can help you answer those questions. You may draw inspiration from particular datasets, but your focus should rest not on the data itself but on the questions about the world that you can answer with that data. For an excellent example of unsupervised learning applied creatively to a new domain, see the Nutrition topic area below.

While a substantial theoretical component is not required for this project, many empirically successful approaches are motivated or backed by good theory.

You may work alone or in a group of two; the standards for a group project will be twice as high.

We strongly encourage you to come to office hours to discuss your project ideas, progress, and difficulties with the course staff.

Milestones

Associated with the final project are four deliverables spaced throughout the quarter.

Please use LaTeX and the Neural Information Processing Systems (NIPS) conference paper LaTeX style file to format your report and milestone deliverables.

See the course homepage for the deadline associated with each milestone.

I: Project Proposal

By this first milestone, you should have selected a question or problem of interest, identified relevant data sources, begun exploring the literature surrounding the question, and discussed your ideas with the course staff.

Your project proposal deliverable is a report of one-half to one page describing the question or problem you intend to tackle, why this question is important or interesting, prior work on this problem, what data you intend to use in your analyses, and the principal challenges that you anticipate.

If you would like to receive feedback about particular aspects of your proposal, please indicate this in your submission.

A printed copy of the project proposal should be submitted at the start of lecture on the day of the deadline.

II: Progress Report

By this second milestone, you should have some initial results to share; for example, you may have implemented and evaluated the performance of existing algorithms on your dataset and task of interest, or you may have conducted an initial study with simulated data to better understand the properties of certain methods.

Your progress report deliverable is a write-up of no more than 2 pages (not including references) describing what you have accomplished so far and, briefly, what you intend to do in the remainder of the term. You should be able to reuse the text of this milestone in your final report.

A printed copy of the progress report should be submitted at the start of lecture on the day of the deadline.

III: Poster Presentation

You will present your work in a class poster session at the end of the quarter. Your poster should highlight the problem and its motivation, your approach and principal contributions, the results of your experiments, and any major take-aways for the future. At some point during the session, a representative of the course staff will approach your poster, and you will have only 2 minutes to present the highlights of your work and findings. Please practice your summary in advance.

Your poster should be no more than 48 inches high and no more than 42 inches wide (a common poster size is 30 inches high x 36 inches wide). Any poster style is acceptable. Here is an example poster template (from UC Berkeley) that you may be able to adapt to your own purposes. We will provide poster displays and thumbtacks.

IV: Final Report

Your final project report (not including acknowledgements and references) should be no longer than 8 pages and should follow a typical conference style (with abstract, introduction, etc.). The write-up should clearly define your problem or question of interest, review relevant past work, and introduce and detail your approach. A comprehensive empirical evaluation should follow, along with an interpretation of your results. Any elucidation of the theoretical properties of the methods under consideration is also welcome.

If this work was done in collaboration with someone outside of the class (e.g., a professor), please describe their contributions in an acknowledgements section.

The final report PDF file should be emailed, along with a PDF copy of your poster, to stats306b-spr1314-staff@lists.stanford.edu by the deadline. No hard copy is needed.

Potential Data Sources

If you are not working with data of your own, you may find this extensive list of publicly available datasets to be a useful resource.

Project Inspiration

Your project problem / topic / application area should be driven by your research interests and may even stem from your existing research efforts.

For inspiration, you might turn to Google Scholar, to recent conference proceedings and journals, or to any of the topic areas and references listed below.

The list that follows is by no means comprehensive, and your project topic need not be drawn from it.

If you have suggestions for additional topic areas or references to include, please contact the course staff.

Near-optimal initialization

Model selection

Scalable unsupervised learning

Dealing with missing data

Sparse / interpretable unsupervised learning

Small-variance asymptotics for latent variable models

Subspace clustering

Method of moments + spectral decomposition for latent-variable models

Unsupervised feature learning

Contrastive learning

Bayesian nonparametrics

Semisupervised learning

Nutrition

Image and video segmentation

Genomics

Health and medicine

Natural language processing

Computer vision

Graphics

Robotics

Finance