Clustering: Science or Art? Towards Principled Approaches

A NIPS 2009 Workshop, December 11, 2009
The Hilton Whistler Resort & Spa


Organizers:
Shai Ben-David | Ulrike von Luxburg | Avrim Blum | Isabelle Guyon | Robert C. Williamson | Reza Bosagh Zadeh | Margareta Ackerman

Clustering is one of the most widely used techniques for exploratory data analysis. In the past five decades, many clustering algorithms have been developed and applied to a wide range of practical problems. There has also been very exciting theoretical work, proving guarantees for algorithms and developing new frameworks for analysis.

Yet in many ways we are only beginning to understand some of the most basic issues in clustering. While there have been some remarkable successes, we believe more is possible. In particular, work addressing issues that are independent of any specific clustering algorithm, objective function, or specific data generative model, is still in its infancy.

In his famous Turing award lecture, Donald Knuth states about Computer Programming that: "It is clearly an art, but many feel that a science is possible and desirable''. In the case of clustering, we believe that an even better and deeper science than what we currently offer is possible and highly desirable.

Goals of the Workshop

This workshop aims at initiating a dialog between theoreticians and practitioners, aiming to bridge the theory-practice gap in this area. The workshop will be built along three main questions:

  1. FROM THEORY TO PRACTICE:
    Which abstract theoretical characterizations / properties / statements about clustering algorithms exist that can be helpful for practitioners and should be adopted in practice?
  2. FROM PRACTICE TO THEORY:
    What concrete questions would practitioners like to see addressed by theoreticians? Can we identify de-facto practices in clustering in need of theoretical grounding? Which obscure (but seemingly needed or useful) practices are in need of rationalization?
  3. FROM ART TO SCIENCE:
    In contrast to supervised learning, where there is general consensus on how to assess the quality of an algorithm, the frameworks for analyzing clustering are only beginning to be developed and clustering is still largely an art. How can we progress towards a deeper understanding of the space of clustering problems and objectives, including the introduction of falsifiable hypotheses and properly designed experimentation? How could one set up a clustering challenge to compare different clustering algorithms? What could be scientific standards to evaluate a clustering algorithm in a paper?

The workshop will also serve as a follow up meeting to the NIPS 2005 “Theoretical  Foundations of clustering” workshop,  a venue for the different research groups working on these issues to take stock, exchange view points and discuss the next challenges in this ambitious quest for theoretical foundations of clustering.

Text version of Call for Contributions

Schedule

Morning session
7:30 - 8:15 Introduction - Presentations of different views on clustering by the workshop organizers.

8:15 - 9:15 Non-standard Approaches

  • Marcello Pelillo - What is a cluster: Perspectives from game theory (30 min) [pdf]
  • Armen E. Allahverdyan, Aram Galstyan, Greg Ver Steeg - Clustering with prior information (30 min) [pdf]

9:15 - 9:30 Coffee Break

9:30 - 10:30 Evaluating clustering: the human factor and particular applications

  • Joshua M. Lewis - Finding a better k: a psychophysical investigation of clustering (30 min) [pdf]
  • Sajib Dasgupta, Vincent Ng - Single data, multiple clusterings (15 min)[pdf]
  • Nima Aghaeepour, Alireza Hadj Khodabakhshi, Ryan R. Brinkman - Empricial study of cluster evaluation metrics (15 min) [pdf]

10:30 - 11:00 Deepayan Chakrabarti (invited talk) - Clustering applications at Yahoo!

Afternoon session
3:30 - 4:30 Hierarchical Clustering

  • Gunnar Carlsson, Facundo Memoli - Some ideas for formalizing clustering (30min) [pdf]
  • Margareta Ackerman, Shai Ben-David, David Loker - Characterization of linkage based clustering (30 min) [pdf]

4:30 - 4:45 Coffee Break

4:45 - 5:45 Information Theoretic Approaches

  • Joachim M. Buhmann - Information theoretic model selection in clustering (30 min) [pdf]
  • Yevgeny Seldin, Naftali Tishby - PAC-Bayesian approach to formulation of clustering objectives (30 min) [pdf]

5:45 - 6:30 Panel discussion

Accepted Papers

What is a Cluster? Perspectives from Game Theory [pdf]
Marcello Pelillo

Clustering with Prior Information [pdf]
Armen E. Allahverdyan, Aram Galstyan, Greg Ver Steeg

Finding a Better k: A psychophysical investigation of clustering [pdf]
Joshua M. Lewis

Single Data, Multiple Clusterings [pdf]
Sajib Dasgupta, Vincent Ng

An Empirical Study of Cluster Evaluation Metrics using Flow Cytometry Data [pdf]
Nima Aghaeepour, Alireza Hadj Khodabakhshi, Ryan R. Brinkman

Some ideas for formalizing clustering schemes [pdf]
Gunnar Carlsson, Facundo Memoli

A Characterization of Linkage-Based Clustering: An Extended Abstract [pdf]
Margareta Ackerman, Shai Ben-David, David Loker

Information theoretic model selection in clustering [pdf]
Joachim M. Buhmann

A PAC-Bayesian Approach to Formulation of Clustering Objectives [pdf]
Yevgeny Seldin, Naftali Tishby

Opinion Papers

These papers will form a basis for discussion sessions during the workshop

Clustering: Science or Art? [pdf]
Isabelle Guyon, Ulrike von Luxburg, Robert C. Williamson
NIPS 2009 Workshop on Clustering Theory, Vancouver, Canada, December 2009

Thoughts on Clustering [pdf]
Avrim Blum
NIPS 2009 Workshop on Clustering Theory, Vancouver, Canada, December 2009

Organizers

Shai Ben-David is a CS professor at the University of Waterloo, Canada.

Avrim Blum is a professor of CS at Carnegie Mellon University.

Ulrike von Luxburg is a Senior Research Scientist at the Max Plank Institute in Tubingen, Germany.

Isabelle Guyon is an independent engineering consultant, working from California.

Reza Bosagh Zadeh is a graduate student at Carnegie Mellon University.

Margareta Ackerman is a graduate student at the University of Waterloo.

Robert C. Williamson is the Scientific Director of NICTA and a Professor in the Research School of Information Sciences and Engineering at the Australian National University.

This workshop is supported by the PASCAL Network of Excellence.

Contact Organizers

Email: nips09 at clusteringtheory.org