CS221

Final Project

Due Aug 18th, 11:59pm.

Written by Chris Piech, Jeff Jacobs.

Final Research Project. Formalize a real life problem and apply an AI algorithm.

Your goal for the CS221 final project is to take a real life problem and apply an AI algorithm. We provide a few project ideas of real life problems, or you can solve one of your choosing.

The assumption is that you will start working on the final project Aug 6th and finish on Aug 17th. Having said that we set the due date to Aug 18th at midnight in case there are any last minute problems. The expectations for how much you are supposed to do are based off of how much work you are supposed to put into a 4 unit course over an 11 day period. It comes out to almost the exact same number of hours as a programming assignment. For this project you can work in teams of up to 3 people. The expectation of amount of work scales with number of students in your group.

This project, more than any other assignment in CS221 is a chance to go past what is asked of you. There are two main ways to go above and beyond:

You can design a novel model (either model a new problem, or come up with a new angle on an old problem)
You can implement a solver that was not implemented in class.

Some people will go (way) above and beyond. That does not change the expectations for everyone else.

You are encouraged to kill two birds with one stone (do something that will be useful for you in the future). You will be evaluated on the work you do between today and Aug 18th. You can use any programming language you want (matlab, R, C++, Java are all reasonable). We will only commit to supporting python.

Project Proposal: Submit a project proposal by August 8th at 11:59pm (PDT).

Due Date: The final project is due August 18th at 11:59pm (PDT).

July 30th - Start thinking about your project
Aug 8th - Project description + teams
Aug 13th - Personal milestone
Aug 18th - Final submission (no late days)

Paper with the following sections:

Abstract
Intro
Methods
Results
Conclusion

The format of the paper is up to you, but if you want you can use the ACM templates. Here is a link to the word and latex templates.

We are going to enforce a 5 page limit (not including references).

For the project proposal, please include the following: The aim of your investigation, the group members, how you plan to get your dataset (if applicable) and your initial hypothesis.

Feel free to also include any progress you have made towards your final submission.

Do not fabricate results! You can get full credit for an experiment that doesn't match your hypothesis.

Item	Exceeds Expectations	Meets Expectations	Progressing	Below Expectations	Weight
Justification	A really interesting problem.	A justified problem.	Superficial justification.	No explicit justification.	15%.
Hypothesis	A hypothesis based on previous research.	A reasonable, falsifiable hypothesis.	A vague or non-falsifiable hypothesis.	No explicit hypothesis.	10%.
Model a real world problem	A novel formalization.	A clear and reasonable formalization.	Either unclear or unreasonable.	Both unclear and unreasonable.	40%.
Choose and run an algorithm	Implements the algorithm or uses a hard to set up algorithm.	Choose and run a reasonable algorithm. Give a few sentence description.	Chooses an algorithm, but is unable to get it working.	Selects a non-applicable algorithm.	10%
Evaluate your results	Clever analysis, e.g. bootstrapping or relation to other variables, p-value.	Show numerical results that give an idea of how well your idea worked	Numeric results that do not test your methodology. Non numeric results.	Incorrect.	20%
Propose a next step	Moving conclusion.	Anything reasonable.	Does not incorporate the results of the experiment.	Not explicitly stated.	5%

You can gain up to 5% extra credit or lose up to 5% based on writing quality.
If you have a project that you think doesn't fit our rubrics, let us know.

Here is a list of some Python implementations of algorithms we have discussed in class and/or that you may find useful for your projects. This list is being updated. Please ask the staff for tool recomendations if your problem is not listed here:

Deterministic Search Problem
- Dijkstra's Algorithm
- Bellman-Ford
Markov Decision Problem
- Expectimax: Your code!
- Policy Iteration
CSP
Linear Classifier
- SVM
Clustering
- Hierarchical
- K-Means: Your k-means code!
General
- Scikit Learn
- NLTK

We have developed a few ideas for final projects based on actual research being done by Stanford CS professors, and have culled even more ideas from the web (mostly from University of Edinburgh, University of Illinois, Chicago, and Carnegie Mellon University). If you choose one of these then you won't need to worry about whether the data set you choose is "up to snuff" :)

Joseph Williams, a research in the school of education is particularly interested in working with CS221 students. See Joseph's proposals here.

Sentiment Analysis of IMDB Reviews

Description: This data set was compiled by Stanford Linguistics Professor Chris Potts and Stanford CS PhD student Andrew Maas. It was used for their paper "Learning Word Vectors for Sentiment Analysis." There are 25000 training examples and 25000 test examples, each of which is a textual review of a movie on IMDB. Some preprocessing (namely, bag of words encoding) has already been performed on the raw text for you.
Size
- 25000 training examples, 25000 test examples
- 80 MB compressed
- Approximately 480 MB uncompressed
Task: Examine the reviews to get a feel for what types of words/sentences/phrases tend to indicate positive and negative sentiment. For example, the word "great" could indicate a positive sentiment towards the movie, unless it is part of the sentence "This movie was a great big disaster." Develop a classifier from the training data which tries to predict the score of a review based on its text. As a baseline, compare your classifier's error against the error you would incur if you simply guessed the mean score for every review regardless of its text. Your classifier should be able to greatly exceed this :)
Challenges: Textual parsing has not been taught in this course, but the Stanford NLTK package should contain everything you need to perform simple (and complex if you're ambitious!) processing of the text in this dataset.

USPS Handwritten Digit Recognition

Description: This data set was compiled by a team at SUNY Buffalo for a project sponsored by the US Postal Service, and is described in the paper "A Database for Handwritten Text Recognition Research". There are 4649 training examples and 4649 test examples, in conveniently preprocessed MATLAB (.m and .mat) format.
Size
- 4649 training examples, 4649 test examples
- 8.3 MB compressed
- 18.3 MB uncompressed
Task: Load the training examples into MATLAB via the provided .m file and examine some of the handwritten digits, to get a feel for what they look like. Use the pixel intensity values of the digit images to classify what digit is represented by the image. As a naive baseline, compare your classifier's accuracy with the accuracy you would get by randomly guessing a digit from 0 to 9 regardless of the actual image contents. Again, your classifier should be able to greatly exceed this method's accuracy.
Challenges: We have been using Python for the programming projects so using MATLAB may pose a challenge, but feel free to come to us for help!

Particle Physics Data Set

Description: This data set was used in the KDD Cup 2004 data mining competition. The training data is from high-energy collision experiments. There are 50 000 training examples, describing the measurements taken in experiments where two different types of particle were observed. Each training example has 78 numerical attributes.
Size
- 50 000 training examples, 100 000 test examples
- 78 numerical attributes
- 147 MB as uncompressed text
References:
- KDD Cup 2004: Results and Analysis
- Anti-matter detection: Particle Physics Model for KDD Cup 2004
- KDD Physics Task - Discussion of Modeling Approaches
- Descriptions of more competition entries are available in SIGKDD Explorations 6(2) (see Reports section)
Task: Perform exploratory data analysis to get a good feel for the data and prepare the data for data mining. Train at least two classifiers to distinguish between two types of particle generated in high-energy collider experiments. The original competition asked participants to provide four separate sets of predictions, optimising separately the accuracy, area under the ROC curve, cross-entropy, and q-score. Software to calculate these measures can be downloaded from the competition website.
Challenges: No labels are given to the attributes to help interpret them. There is missing data for 8 of the attributes (with out-of-range values of 999 and 9999 used as placeholders).

Physiological Data Set

Description: This data set was made available for the Physiological Data Modeling Contest at ICML 2004. The data was collected from subjects using BodyMedia wearable body monitors while performing their usual activities. These monitors record acceleration, heat flux, galvanic skin response, skin temperature, and near-body temperature. The training data set includes several sessions for each of multiple subjects, with measurements stored each minute during a session. The test data set includes further sessions from the same subjects, as well as sessions recording measurements from new subjects who did not feature in the training data. Each record in the data includes an annotation code giving information about the kind of activity that the subject was performing at that time. Participants in the competition were asked to train classifiers to apply two of these annotation codes to the test data, and also to train a classifier to identify subjects as men or women (this information is given in the training data sequences).
Size:
- About 10 000 hours of training data, 12 000 hours of test data
- One record per minute in a session
- 16 fields in each record, including 9 fields of physiological data
- 138 MB as uncompressed text
References:
- A Simplified Bayesian Approach for Context and Characteristic Identification Based on Demographic Variables by S. Amin
- PDMC entry by W.-H. Lin and A. Hauptmann
- Bayesian Methods for Diagnosing Physiological Conditions of Human Subjects from Multivariate Times Series Sensor Data by M. Kayaalp
- Whitepapers on more competition entries are available on the PDMC website
Task: Perform exploratory data analysis to get a good feel for the data and prepare the data for data mining. Train at least two different classifiers to detect entries in the test data corresponding to two annotated states in the training data. Train a classifier to predict the gender of the subjects in the test data. (You may wish to focus on only a subset of the predictive tasks.)
Challenges: Only a small proportion of the training data corresponds to the two annotation states of interest, so there are many more negative than positive examples. Much of the data is not annotated (the annotation field contains zero).

Brain-Computer Interface data set

Description: This data set was used in the BCI Competition III (dataset V). Using a cap with 32 integrated electrodes, EEG data were collected from three subjects while they performed three activities: imagining moving their left hand, imagining moving their right hand, and thinking of words beginning with the same letter. As well as the raw EEG signals, the data set provides precomputed features obtained by spatially filtering these signals and calculating the power spectral density.
Size:
- 31216 records in training data, 10464 in test data
- Each record has 96 continuous values and a numerical label
- 63 MB as uncompressed text
Reference:
- On the need for on-line learning in brain-computer interfaces by J. del R. Millán
Task: Perform exploratory data analysis to get a good feel for the data and prepare the data for data mining. Train at least two different classifiers to assign class labels to the test data to indicate which activity the subject was performing while the data were collected.
Challenges: This data set represents time series of EEG readings. A baseline approach could be based on the given precomputed features. It might also be possible to train a classifier on a window of some size around each time step. Both of these approaches ignore the fact that the data is really a time series; one might consider using an explicit time-series model such as a Hidden Markov Model.

Prediction of Gene/Protein Localization Data Set

Description: This dataset was used in the 2001 kdd cup data mining competition. There were in fact two tasks in the competition with this dataset, the prediction of the "Function" attribute, and prediction of the "Localization" attribute. Here we focus on the latter (this is somewhat easier as genes can have many functions, but only one localization, at least in this dataset). The dataset provides a variety of details about the several genes of one particular type of organism. The main dataset, (the downloadable files are Genes_relation.{data,test}) contains row data of the following form:

Gene ID, Essential, Class, Complex, Phenotype, Motif, Chromosome Number, Function, Localization.

The first attribute is a discrete variable corresponding to the gene (there are 1243 gene values). Also the remaining 8 attributes consist of discrete variables, most of them related to the proteins coded by the gene, e.g. the "Function" attribute describes some crucial functions the respective protein is involved in, and the "Localization" is simply the part of the cell where the protein is localized. In addition to the data of the above form, there are also data files (Interactions.relations.{data,test}) which contains information about interactions between pairs of genes.
Size
- Gene_relation files: 6275 examples (4346 training, 1929 test), 9 categorical attributes.
- Interaction_relation files: 1806 records, 2 attributes (one categorical; one numerical)
- 1 MB
References:
- Talk overview slides about this problem and also the winner presentation in the KDD 2001 competition can be found on-line.
- See also Answers to Questions of General Interest from Question Period 1 and Answers to Questions of General Interest from Question Period 2
Task: Perform exploratory data analysis to get a good feel for the data and prepare the data for data mining. The task in this dataset is to make predictions of the attribute "Localization". Compare at least 2 different classifiers. One other possible comparison is to compare performance with or without the use of the interactions data. One possible classifier that handles missing data easily (but does not use the interaction data) could be a belief network that has learned relationships between the Essential, Class, Complex, Phenotype, Motif, Chromosome Number, and Localization attributes.
Challenges: This dataset is a great challenge. From data mining point of view the important challenge is to find a way to efficiently use the Interaction_relation data files, which is not obvious. Another issue is that there is a high proportion of missing variables in the Genes_relation data.

Prediction of Molecular Bioactivity for Drug Design: Binding to Thrombin dataset

Description:This dataset was used in the 2001 kdd cup data mining competition. It was produced by DuPont Pharmaceuticals Research Laboratories and concerns drug design. Drugs are typically small organic molecules. The first step in the discovery of a new drug is usually to identify and isolate the receptor to which it should bind (in this case this is the thrombin site), followed by testing many small molecules for their ability to bind to the target site. Some molecules are able to bind the site, so there are "active" while other remain "inactive". It would be interesting to learn how to separate active from inactive molecules. This dataset provides data of these two classes of drugs (active and inactive).
Size:
- 2545 data points: 1909 for training, 636 for testing
- 139,351 binary attributes, 2 classes
- 694 MB
References:
- Talk overview slides about this problem and also the winner presentation in the KDD 2001 competition can be found on-line.
Task: Carefully read all the information given in kdd cup 2001 compeptition about this data. Perform exploratory data analysis to get a good feel for the data and prepare the data for data mining. The task is to learn a classifier using the training set that predicts the behavior of a drug (active or inactive). Note that the number of attributes is much larger than the number of training examples, thus an efficient classifier should use feature reduction. Train and compare at least two classifiers. You can check your answers on the test set by looking at the corresponding separate file which can be downloaded from the kdd cup 2001 site.
Challenges: This is a difficult data set. Firstly there is a great imbalance between the two class; only 42 examples belong to the active classes from the total 1909 training examples. The larger data mining challenge, however, concers the huge number of binary attributes (139,351). Selecting "good" features will be the most important part of developing an good classifier.

The 4 Universities dataset

Description: This data set contains WWW-pages collected from computer science departments of various universities in January 1997 by the World Wide Knowledge Base (WebKb) project of the CMU text learning group. The 8,282 pages were manually classified into 7 classes: 1) student, 2) faculty, 3) staff, 4) department, 5) course, 6) project and 7) other. For each class the data set contains pages from the four universities: Cornell, Texas, Washington, Wisconsin and 4,120 miscellaneous pages from other universities. The files are organized into a directory structure, one directory for each class. Each of these seven directories contains 5 subdirectories, one for each of the 4 universities and one for the miscellaneous pages. These directories in turn contain the Web-pages.
Size:
- 8,282 webspages, 7 classes
- 60.8 MB
References:
- Text Classification from Labeled and Unlabeled Documents using EM (2000) by Kamal Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell.
Task: Prepare the data for mining and perform an exploratory data analysis (these steps will probably not be independent). The data mining task is to classify the texts according to the 7 classes. You should compare at least 2 different classifiers. Since each university's web pages have their own idiosyncrasies, it is not recommended to do training and testing on pages from the same university. We recommend training on three of the universities plus the misc collection, and testing on the pages from a fourth, held-out university (four-fold cross validation). An additional topic might be to look at labelled/unlabelled data, as in the reference.
Challenges: An important challenge from web mining point of view will be the preprocessing of the dataset. Since the data are html files you have to remove all the irrelevant text information, such as html commands etc. and convert the rest of the text into a bag-of-words format. See help on the 4 Universities Data Set web page about doing this with rainbow.

Internet advertisements dataset

Description: These data are from the paper Learning to remove Internet advertisements . The dataset represents a set of possible advertisements on Internet pages. The attributes encode the geometry of the image (if available) as well as phrases occuring in the URL, the image's URL and alt text, the anchor text, and words occuring near the anchor text. There are two class labels: advertisement ("ad") and not advertisement ("nonad"). The interesting about this data is that someone might wish to filter the webpages from irrelevant advertisements, as part of some preproccesing procedure (e.g. useful for subsequent classifcation of the website).
Size:
- 3279 (2821 nonads, 458 ads)
- 1558 attributes (3 continuous, the rest binary)
- 10 MB
References:
- Learning to remove Internet advertisements (1999) by Kushmerick, N..
Task: Prepare the data for mining and perform an exploratory data analysis. The data mining task is to predict whether an image is an advertisement ("ad") or not ("nonad"). As you are not given an explicit training/test split you need to decide on a reasonable way of assessing performance. You should perform feature reduction in order to significantly reduce the number of features. Consider at least two different classifiers.
Challenges: There is an inbalance of the number of data per each class. Also the number of attributes is very high compared to the size of the dataset, which suggests that efficient feature reduction is very important. One or more of the three continuous features are missing in 28% of the data.

The Reuters-21578 text dataset

Description: This is a very often used test set for text categorisation tasks. It contains 21578 Reuters news documents from 1987. They were labeled manually by Reuters personnel. Labels belong to 5 different category classes, such as 'people', 'places' and 'topics'. The total number of categories is 672, but many of them occur only very rarely. Some documents belong to many different categories, others to only one, and some have no category. Over the past decade, there have been many efforts to clean the database up, and improve it for use in scientific research. The present format is divided in 22 files of 1000 documents delimited by SGML tags (here is as an example one of these files). Extensive information on the structure and the contents of the dataset can be found in the README file. In the past, this dataset has been split up into training and test data in many different ways. You should use the 'Modified Apte' split as described in the README file.
Size:
- 21578 documents; according to the 'ModApte' split: 9603 training docs, 3299 test docs and 8676 unused docs.
- 27 MB
References: This is a popular dataset for text mining experiments. The aim is usually to predict to which categories of the 'topics' category class a text belongs. Different splits into training ,test and unused data have been considered. Previous use of the Reuters dataset includes:
- Towards Language Independent Automated Learning of Text Categorization Models (1994) by C. Apte, F. Damerau and S. M. Weiss: This paper tests a rule induction method on the Reuters data. This is where the 'Apte' split of the data was introduced.
- An Evaluation of Statistical Approaches to Text Categorization (1997) by Y. Yang: This paper contains a comparison of 14 different classification methods on 6 different datasets (or at least 6 different splits over 2 datasets).
- Inductive learning algorithms and representations for text categorization (1998) by S. T. Dumais, J. Platt, D. Heckerman and M. Sahami: 5 different learning algorithms for text categorisation are compared. The dataset they use is the 'Modified Apte' split which you will also use.
Task: Carefully read the README file provided by Lewis to get an idea what the data are about. Select the documents as specified in the description of the 'Modified Apte' split. Prepare the data for mining and perform an exploratory data analysis (these steps will probably not be independent). The data mining task is to classify the texts according to the categories in the 'topics' field. You should compare at least 2 different classifiers. An extra task could be document clustering.
Challenges: An important challenge will be the preprocessing of the dataset. The file is delimited by SGML tags, and the text is just plain text format. For any text mining task, this will have to be converted into bag-of-words format. Apart from this, you will have to deal with texts that belong to a varying number of categories. Most classification programs can only take one category per case.

The charitable donations dataset

Description: This dataset was used in the 1998 kdd cup data mining competition. It was collected by PVA, a non-profit organisation which provides programs and services for US veterans with spinal cord injuries or disease. They raise money via direct mailing campaigns. The organisation is interested in lapsed donors: people who have stopped donating for at least 12 months. The available dataset contains a record for every donor who received the 1997 mailing and did not make a donation in the 12 months before that. For each of them it is given whether and how much they donated as a response to this. Apart from that, data are given about the previous and the current mailing campaign, as well as personal information and the giving history of each lapsed donor. Also overlay demographics were added. See the documentation and the data dictionary for more information.
Size:
- 191779 records: 95412 training cases and 96367 test cases
- 481 attributes
- 236.2 MB: 117.2 MB trainin data and 119 MB test data
References:
- A short overview of the kdd cup 1998 results is available on-line, together with short descriptions of the methods used by the winners, and those who came third.
- Learning and Making Decisions When Costs and Probabilities are Both Unknown (2001) by B. Zadrozny and C. Elkan: The authors use this dataset as an example of a situation where misclassification costs depend on the individual. Their approach is more advanced and better described then the ones of the kdd cup winners.
Task: Carefully read the information available about the dataset. Perform exploratory data analysis to get a good feel for the data and prepare the data for data mining. It will be important to do good feature and case selection to reduce the data dimensionality. The data mining task is in the first place to classify people as donors or not. Try at least 2 different classifiers, like for example logistic regression or Naive Bayes. As an extra, you can go on to predict the amount someone is going to give. A good way of going about this is described in Zadrozny and Elkan's paper. The success of a solution can then be assessed by calculating the profits of a mailing campaign targetting all the test individuals that are predicted to give more than the cost of sending the mail. The profits when targetting the entire test set is $10,560.
Challenges: This is definitely not an easy dataset. To start with, some of the attributes have quite a lot of missing values, and there are some records with formatting errors. An important issue is feature selection. There are far too many features, and it will be necessary to select the most relevant ones, or to construct your own features by combining existing ones (the kdd cup winners claim that the secret of their success lies in good feature selection). Also case selection will be important: the training set is huge (95,412 cases), but contains only 5% positive examples. Finally, building a useful model for this dataset is made more difficult by the fact that there is an inverse relationship between the probability to donate and the amount donated.

The Caravan Insurance Data

Description: This dataset was used for the Coil 2000 data mining competition. It contains customer data for an insurance company. The feature of interest is whether or not a customer buys a caravan insurance. Per possible customer, 86 attributes are given: 43 socio-demographic variables derived via the customer's ZIP area code, and 43 variables about ownership of other insurance policies.
Size:
- 9822 records: 5822 training records and 4000 test records
- 86 attributes
- 1.7 MB
References:
- The solution papers of the 29 contestants can be downloaded. Here are local copies of the solutions of the winner of the prediction task and the winner of the description task.
- Magical Thinking in Data Mining: Lessons from CoIL Challenge 2000 by Charles Elkan
Task: The data mining task is to predict whether someone will buy a caravan insurance policy. You should first do some exploratory data analysis. Visualising the data should give you some insight into certain particularities of this dataset. Then prepare the data for data mining. It will be important to select the right features, and to construct new features from existing ones, as is described in the paper of the prediction competition winner. Try out at least 2 different data mining algorithms, and compare the use of mere feature selection with intelligent feature construction. As an extra, you could try to do the second task laid out in the Coil competition: to derive information about the profile of a typical caravan insurance buyer.
Challenges: Like for the kdd cup data, feature selection and extraction will be very important. This can only be done properly after you have spend a considerable amount of time getting to know the data. And also like in the kdd cup data, the data are unbalanced: only 5 to 6% of the customers in the training data set actually buy the insurance policy. There are no missing or noisy data.

The yeast S. cerevisiae gene expression vectors

Description: These are the data from the paper Support Vector Machine Classification of Microarray Gene Expression Data. For 2467 genes, gene expression levels were measured in 79 different situations (here is the raw data set). Some of the measurements follow each other up in time, but in the paper they were not treated as time series (although to a certain extend that would be possible). For each of these genes, it is given whether they belong to one of 6 functional classes (class lables on-line). The paper is concerned with classifying genes in into 5 of these classes (one class is unpredictable). The data contain many genes that belong to other functional classes than these 5, but those are not discernable on the basis of their gene expression levels alone.
Size:
- 2467 genes
- 79 measurements, 6 class labels
- 1.8 MB: 1.7 MB measurement data and 125 KB labels
References:
- Support Vector Machine Classification of Microarray Gene Expression Data (1999) by M. P. S. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. S. Furey, M. Ares Jr. and D. Hausslerhref (local copy): This is the original paper from which the data were obtained. It uses SVM's to classify the genes, and compares this to other methods like decision trees. A good description of difficulties with the data can also be found here.
- Cluster analysis and display of genome-wide expression patterns (1998) by M. B. Eisen, P. T. Spellman, P. O. Brown and D. Botstein: This paper describes clustering of genes. The results of this paper showed that the 5 different classes Brown et Al. are trying to predict more or less cluster together. So it indicated that these classes were discernable based on the gene expression levels. This was the basis for the selection of these 5 functional classes for the SVM classification task.
Task: Read the data descriptions in the SVM paper and do exploratory data analysis to understand the characteristics of this dataset. The data mining task is to predict whether a gene belongs to one of the 5 functional classes, based on its expression levels. Try at least two different classification algorithms. The low frequency of the smallest classes will probably pose specific problems. You can also do clustering like performed by Eisen et Al..
Challenges: This dataset is quite noisy and contains a rather high number of missing values. Furthermore, it is very unbalanced: there are only a few positive examples of each of the 5 classes, most of the genes don't belong to any of them. Finally, there are some genes that belong to a certain class, but have different expression levels, and there are genes that don't belong to the class they share prediction level patterns with. These cases will unavoidably lead to false negatives and positives. An overview of these difficult cases can be found in SVM classification paper.

The colon cancer data

Description: This dataset is similar to the yeast gene expression dataset: it contains expression levels of 2000 genes taken in 62 different samples. For each sample it is indicated whether it came from a tumor biopsy or not. Numbers and descriptions for the different genes are also given. This dataset is used in many different research papers on gene expression data. It can be used in two ways: you can treat the 62 samples as records in a high-dimensional space, or you can treat the genes as records with 62 attributes.
Size:
- 2000 genes
- 62 samples
- 1.9 MB data, 529 KB names, 207 bytes labels
References:
- Tissue Classification with Gene Expression Profiles (2000) by A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer and Z. Yakhini: This paper describes classification of tissues on the colon cancer and the leukemia (see below) datasets. It also describes how gene selection can be done.
- Coupled Two-Way Clustering Analysis of Gene Microarray Data (2001) by G. Getz, E. Levine and E. Domany: This paper exploits the fact that the gene expression dataset can be viewed in two ways. The authors describe a way of alternating between clustering in the gene domain and in the sample domain. This method should give insight into which genes are defining for sample classifications (and possibly vice versa).
- Gene expression data analysis (2000) by A. Brazma and J. Vilo: An overview of the research in the new domain of microarray data analysis. Much of the work described here makes use of the colon cancer and/or the leukemia dataset.
Task: First perform exploratory data analysis to get familiar with the data and prepare them for mining. The data mining task is to classify samples as cancerous or not. Compare at least two different classification algorithms. You will have to deal with issues arising from the fact that there are many attributes and only a small number of samples. Some classifiers will be more robust to this than others. Some ideas about how to deal with this can be found in the papers refered to above (and the feature selection paper referenced below). As an extra you can perform clustering, in the two different domains (genes and samples). The tissue classification paper describes a way of using clustering for classification: the parameters of the unsupervised learning procedure are defined in a supervised way to make the clusters correspond to classes.
Challenges: The data are quite noisy, due to sample contamination. The real challenge, however, is the shape of the data matrix. When the genes are treated as attributes, the dimensionality of the feature space is very high compared to the number of cases. It will be important to avoid overfitting. Use simple classifiers, or select the most predictive genes. Also, the number of cases is very low, which means that splitting into a training and a test set is not really a good option (although it has been done for the very similar leukemia dataset, as described in the gene expression analysis overview paper and in the feature selection paper referenced below). When combining feature selection with cross-validation, be careful not to use the classifier's test data during the feature selection phase.

The leukemia data set

Description: The leukemia data set contains expression levels of 7129 genes taken over 72 samples. Labels indicate which of two variants of leukemia is present in the sample (AML, 25 samples, or ALL, 47 samples). This dataset is of the same type as the colon cancer dataset and can therefore be used for the same kind of experiments. In fact, most of the papers that use the colon cancer data also use the leukemia data.
Size:
- 72 samples, split into 38 training and 34 test samples
- 7129 genes
- 3.8 MB
References:
- All of the references mentioned above for the colon cancer dataset also use the leukemia data.
- Feature selection for high-dimensional genomic microarray data (2001) by E. P. Xing, M. L. Jordan and R. M. Karp: They describe a three-phase feature selection methods to identify the most predictive genes. They use the division into 38 training and 34 test samples. They find that feature selection works better than regularization.
Task: The task is the same as for the colon cancer data. First perform exploratory data analysis and prepare the data for mining. Then compare at least two different classifiers to identify the kind of leukemia of the sample. Again you will have to deal with problems of high feature dimensionality. You can choose to use the training-test set division the data are presented in, or you can use techniques like cross-validation, as described in the tissue classification paper. Also here, as an extra you can perform clustering in the two different data spaces.
Challenges: The same comments as for the colon cancer dataset can be made: the data are noisy, and the most important challenge is the unusual shape of the data matrix.

The Human Splice Site Data

Description: This dataset contains sequences of human DNA material around splice sites. Gene DNA sequence data contain coding (exons) and non-coding regions (introns). Splice site is the general term for the point at the beginning (donor site) or at the end (acceptor site) of an intron. Donor and acceptor sites typically correspond to certain patterns, but the same patterns can also be found in other places in the DNA sequence. So it is important to learn better classifiers to identify real splice sites. In the past, people have used probability matrices which encode the probability of having a certain nucleotide in a certain position. A disadvantage of this method is that dependencies between positions are not taken into account. Other methods have tried to solve this by building a conditional probability matrix for example, or by using neural networks. To get the best results, many methods don't only use base positions, but also other features, like the presence of certain combinations of nucleotides. Most recently, people have turned to probabilistic models to model the whole gene structure at once. Prediction of splice sites is then helped by the detection of coding and non-coding areas around it (see for example Prediction of complete gene structures in human genomic dna (1997) by burge and carlin). Some information on the problem of genefinding can be found on-line. Information about existing methods can be found in Ficket's overview paper. The dataset presented here contains windows of fixed size around true and false donor sites and true and false acceptor sites.
Size: This dataset is divided along three binary dimensions: acceptor (a) versus donor (d) sites, training (t) versus test (e) data, and true (t) versus false (f) examples.
- 13123 cases, divided as follows: a-e-f: 881 / d-e-f: 782 / a-e-t: 208 / d-e-t: 208 / a-t-f: 4672 / d-t-f: 4140 / a-t-t: 1116 / d-t-t: 1116
- Window length:
  donor data: 15 base positions
  acceptor data: 90 base positions
- 198 KB, divided as follows: a-e-f: 16k / d-e-f: 7k / a-e-t: 7k / d-e-t: 2k / a-t-f: 82k / d-t-f: 36k / a-t-t: 37k / d-t-t: 11k
References:
- The splice site dataset was used to develop the splice site detection part of the the GENIE gene finding program (described in full in A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA (1996) by D. Kulp, D. Haussler, M. G. Reese and F. H. Eeckman). Two different neural networks were used (one for donor and one for acceptor sites).
Task: Perform exploratory data analysis and prepare the data for mining. Develop a classifier for donor sites and one for acceptor sites. Compare at least 2 different classifiers for each. As an extra, you can try to run your classifiers on the Burset and Guigo DNA sequence dataset. This dataset contains full gene dna sequences, together with indications of where coding regions start and stop.
Challenges: The data are well prepared, so building a predictor should be quite straightforward. The best existing predictors use other features than just nucleotide positions. Maybe it is possible to detect and use some of these features to improve the classifier. When testing the classifiers on Burset and Guigo's dna datasets, you will need to make some adaptations.

Volcanoes on Venus

Description: This dataset contains images collected by the Magellan expedition to Venus. Venus is the most similar planet to the Earth in size, and therefore researchers want to learn about its geology. Venus' surface is scattered with volcanos, and the aim of this dataset was to develop a program that can automatically identify volcanoes (from training data that have been labeled by human experts between 1-with 98% probability a volcano- and 4-with 50% probability). A tool called JARtool was developed to do this. The makers of this tool made the data publicly available to allow more research and establish a benchmark in this field. They provide in total 134 images of 1024*1024 8-bit pixels (out of the 30000 images of the original project). The dataset you will use is a preprocessed version of these images: possibly interesting 15*15 pixel frames ('chips') were taken from the images by the image recognition program of JARtool, and each was labeled between 0 (not labeled by the human experts, so definitely not a volcano), 1 (98% certain a volcano) and 4 (50% certainty according to the human experts). More information can be found in the data documentation.
Size: The image chips are spread over groups, according to experiments carried out for the JARtool software. The training and test sets for experiments C1 and D4 together cover all chips (see the experiments table):
- Records: 37280 image chips, divided as follows: C1_trn: 12018 / C1_tst: 16608 / D4_trn: 6398 / D4_tst: 2256
- Features: 15 * 15 pixels
- 8.4 MB
References:
- These data were used for the development of JARtool, a software system that learns to recognize volcanoes in images from Venus. The technical details about this tool are described in the paper Learning to Recognize Volcanoes on Venus (1998) by M. C. Burl, L. Asker, P. Smyth, U. Fayyad, P. Perona, L. Crumpler and J. Aubele. This paper should give you a good example of how data mining can be performed on this dataset (you can ignore the part about Focus of Attention, because that has already been done for you).
Task: Perform Exploratory data analysis. Prepare the data for data mining. Feature space reduction will be necessary, because the number of features is very high compared to the number of positive volcano examples. Then build at least two classifiers to detect volcanoes: implement the basic classifier from Burl et Al.'s paper, and at least one other. You can follow Burl et Al.'s paper, where classes 1 up to 4 are considered positive examples. As an extra, you can try to perform clustering to find the different types of volcanoes as mentioned in Burl et Al.'s paper.
Challenges: It will be necessary to normalise the pixel frames, as there is a difference in brightness between the different images and even between different parts of the same image. Also, feature extraction will be necessary, because there are quite a lot of pixels per frame. This is especially a problem because the dataset is highly unbalanced: the number of positive examples is very low. Finally, there is the fact that the volcanos are of different kinds, and it is difficult to build one classifier for all of them together.

Network Intrusion Data

Description: These data were used for the 1999 kdd cup. They were gathered by Lincoln Labs: nine weeks of raw TCP dump data were collected from a typical US air force LAN. During the use of the LAN, several attacks were performed on it. The raw packet data were then aggregated into connection data. Per record, extra features were derived, based on domain knowledge about network attacks. There are 38 different attack types, belonging to 4 main categories. Some attack types appear only in the test data, and the frequency of attack types in test and training data is not the same (to make it more realistic). More information about the data can be found in the task file, and in the overview of the KDDcup results. On that page, it is also indicated that there is a cost matrix associated with misclassifications. The winner of the KDDcup99 competition used C5 decision trees in combination with boosting and bagging.
Size:
- 8,050,290 records, divided as follows: 4,940,000 training records and 3,110,290 test records. A 10% sample is available for both.
- 41 attributes and 1 label
- 1,173 MB: 743 MB training data and 430 MB test data
References:
- PNrule: A New Framework for Learning Classifier Models in Data Mining (A Case-Study in Network Intrusion Detection) (2000) by R. Agarwal and M. V. Joshi. This paper proposes a new, very simple rule learning algorithm, and tests it on the network intrusion dataset. In the first stage, rules are learned to identify the target class, and then in the second stage, rules are learned to identify cases that were incorrectly classified as positive according to the first rules.
Task: Perform exploratory data analysis and prepare the data for mining. The data mining task is to classify connections as legitimate or belonging to one of the 4 fraud categories. The misclassification costs should be taken into account. Compare at least two different classification algorithms.
Challenges: The amount of data preprocessing needed is quite limited. You will need data reduction to deal with the sheer size of the dataset. The major difficulty, however, is probably the class distribution: while the DoS attack type appears in 79% of the connections, the u2r attack type only appears in 0.01 percent of the records. And this least frequent attack type is in the same time the most difficult to predict and the most costly to miss.

The SuperCOSMOS Sky Survey Objects Catalog

Description: The SuperCOSMOS Sky Survey programme is carried out at the University of Edinburgh.The project used the SuperCOSMOS machine, a high-precision plate scanning facility, to scan in the Schmidt photographic atlas material. This has produced a digitised survey of the entire sky in three colours (B, R and I), with one colour (R) at two epochs. From these digital images, objects have been extracted, and an objects catalogue has been composed. For each object, useful astronomical characteristics have been registered, such as the size, the brightness, the position, etc. A project was then caried out to classify the objects as stars or galaxies. External labeling to evaluate the classification algorithm was obtained from the more precise data of the Sloan Digital Sky Survey.
Size:
- There are 4 object sets, one for B and I, and two for R (one set from pictures taken in the 50's and one set more recent). Each of these is divided in a set of paired objects (for which a corresponding SDSS object was found) and a set of unpaired ones:
  - B-paired: 34663 / B-unpaired: 68987
  - R-paired (recent): 26791 / R-unpaired: 54920
  - I-paired: 15645 / I-unpaired: 41596
  - R-paired (50's): 15834 / R-unpaired: 34426
- Paired datasets have 40 attributes (including some from SDSS), unpaired 34.
- The size of the datasets is as follows:
  - B-paired: 16.4MB / B-unpaired: 23.5MB
  - R-paired (recent): 12.6MB / R-unpaired: 18.7MB
  - I-paired: 14MB / I-unpaired: 7.3MB
  - R-paired (50's): 7.4MB / R-unpaired: 11.7MB
References:
- The SuperCOSMOS Sky Survey - I. Introduction and description (2001) by N. Hambly, H. MacGillivray, M. Read, S. Tritton, E. Thomson, D. Kelly, D. Morgan, R. Smith, S. Driver, J. Williamson, Q. Parker, M. Hawkins, P. Williams and A. Lawrence: This paper is an introduction to the SSS project.
- The SuperCOSMOS Sky Survey. Paper II: Image detection, parameterisation, classification and photometry (2001) by N. Hambly, M. Irwin and H. MacGillivray: A description of the methods for image detection, parameterisation, classification and photometry. A useful paper for you to read, as it gives explanations about how the data were obtained and what they mean, and about the object classification efforts by the SSS people.
- The SuperCOSMOS Sky Survey. Paper III: Astrometry (2001) by N. Hambly, A. Davenhall, M. Irwin and H. MacGillivray: An overview of how the astrometric parameters of the data were derived. Probably less interesting for you.
- Automated Star/Galaxy Classification for Digitized POSS-II (1995) by N. Weir, U. M. Fayyad and S. Djorgovski: This paper uses a similar astronomical dataset. It is quite interesting, as it is much more understandable than paper II above. It uses a similar two-step classification method and should therefore give you some insight in what is happening in paper II.
Task: First read the information in the README file, and in paper II (and the paper by Weir) referenced above. Then perform exploratory data analysis and prepare the data for data mining. You can concentrate on one of the paired datasets. Classify sky objects as stars or galaxies (use the SDSS classification as label). Compare at least two different classification algorithms. Try the effect of excluding/including fields 19 and 31, the classification efforts of the SSS team. Also, do a performance evaluation with respect to the magnitude as was done in paper II.
Challenges: These are astronomical data, and all the documentation is written in 'astronomical language', so it is quite difficult to understand what the data are all about and how the previous research has been caried out. Furthermore, the dataset is quite big, so case reduction might be necessary.

Brain Imaging Data (fMRI)

Description: This data set contains a time series of images of brain activation, measured using fMRI, with one image every 500 msec. During this time, human subjects performed 40 trials of a sentence-picture comparison task (reading a sentence, observing a picture, and determining whether the sentence correctly described the picture). Each of the 40 trials lasts approximately 30 seconds. Each image contains approximately 5,000 voxels (3D pixels), across a large portion of the brain. Data is available for 12 different human subjects.
Idea 1: Bayes network classifiers for fMRI:
Gaussian Naïve Bayes classifiers and SVMs have been used with this data to predict when the subject was reading a sentence versus perceiving a picture. Both of these classify 8-second windows of data into these two classes, achieving around 85% classification accuracy [Mitchell et al, 2004]. This project will explore going beyond the Gaussian Naïve Bayes classifier (which assumes voxel activities are conditionally independent), by training a Bayes network in particular a TAN tree [Friedman, et al., 1997]. Issues youll need to confront include which features to include (5000 voxels times 8 seconds of images is a lot of features) for classifier input, whether to train brain-specific or brain-independent classifiers, and a number of issues about efficient computation with this fairly large data set. Midpoint milestone: By April 12 you should have run at least one classification algorithm on this data and measured its accuracy using a cross validation test. This will put you in a good position to explore refinements of the algorithm, alternative feature encodings for the data, or competing algorithms, by the end of the semester. Project: Reducing dimensionality and classification accuracy.
Papers to read: "Learning to Decode Cognitive States from Brain Images," Mitchell et al., 2004, "Bayesian Network Classifiers" Friedman et al., 1997.
Project Idea 2: Dimensionality reduction for fMRI data
Explore the use of dimensionality-reduction methods to improve classification accuracy with this data. Given the extremely high dimension of the input (5000 voxels times 8 images) to the classifier, it is sensible to explore methods for reducing this to a small number of dimension. For example, consider PCA, hidden layers of neural nets, or other relevant dimensionality reducing methods. PCA is an example of a method that finds lower dimension representations that minimize error in reconstructing the data. In contract, neural network hidden layes are lower dimensional representations of the inputs that minimize classification error (but only find a local minimum). Does one of these work better? Does it depend on parameters such as the number of training examples?
Papers to read: "Learning to Decode Cognitive States from Brain Images," Mitchell et al., 2004, papers and textbook on PCA, neural nets, or whatever you propose to try.
Project Idea 3: Feature selection/feature invention for fMRI classification.
Project idea: As in many high dimensional data sets, automatic selection of a subset of features can have a strong positive impact on classifier accuracy. We have found that selecting features by the difference in their activity when the subject performs the task, relative to their activity while the subject is resting, is one useful strategy [Mitchell et al., 2004]. In this project you could suggest, implement, and test alternative feature selection strategies (eg., consider the incremental value of adding a new feature to the current feature set, instead of scoring each feature independent of other features that are being selected), and see whether you can obtain higher classification accuracies. Alternatively, you could consider methods for synthesizing new features (e.g., define the 'smoothed value' of a voxel in terms of a spatial Gaussian kernel function applied to it and its neighbors, or define features by averaging voxels whose time series are highly correlated).
Papers to read: "Learning to Decode Cognitive States from Brain Images," Mitchell et al., 2004, papers on feature selection

Image Segmentation Dataset

The goal is to segment images in a meaningful way. Berkeley collected three hundred images and paid students to hand-segment each one (usually each image has multiple hand-segmentations). Two-hundred of these images are training images, and the remaining 100 are test images. The dataset includes code for reading the images and ground-truth labels, computing the benchmark scores, and some other utility functions. It also includes code for a segmentation example.

Project ideas:
Project B1: Region-Based Segmentation
Most segmentation algorithms have focused on segmentation based on edges or based on discontinuity of color and texture. The ground-truth in this dataset, however, allows supervised learning algorithms to segment the images based on statistics calculated over regions. One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the superpixels into larger segments. Come up with a set of features to represent the superpixels (probably based on color and texture), a classifier/regression algorithm (suggestion: boosted decision trees) that allows you to estimate the likelihood that two superpixels are in the same segment, and an algorithm for segmentation based on those pairwise likelihoods. Since this project idea is fairly time-consuming focusing on a specific part of the project may also be acceptable.
Milestone: By April 12, you should be able to estimate the likelihood that two superpixels are in the same segment and have a quantitative measure of how good your estimator is. You should also have an outline of how to use the likelihood estimates to form the final segmentation. The rest of the project will involve improving your likelihood estimation and your grouping algorithm, and in generating final results.
Papers to read: Some segmentation papers from Berkeley are available here

Project B2: Supervised vs. Unsupervised Segmentation Methods
Write two segmentation algorithms (these may be simpler than the one above): a supervised method (such as logistic regression) and an unsupervised method (such as K-means). Compare the results of the two algorithms. For your write-up, describe the two classification methods that you plan to use.

References: Some segmentation papers from Berkeley are available here

Object Recognition

The Caltech 256 dataset contains images of 256 object categories taken at varying orientations, varying lighting conditions, and with different backgrounds.
http://www.vision.caltech.edu/Image_Datasets/Caltech256/

Project ideas:

You can try to create an object recognition system which can identify which object category is the best match for a given test image.
Apply clustering to learn object categories without supervision, similar to what you did for programming project 3!

Face Recognition Data

There are two data sets for this problem. The first dataset contains 640 images of faces. The faces themselves are images of 20 former Machine Learning students and instructors, with about 32 images of each person. Images vary by the pose (direction the person is looking), expression (happy/sad), face jewelry (sun glasses or not), etc. This gives you a chance to consider a variety of classification problems ranging from person identification to sunglass detection. The data, documentation, and associated code are available here:

Available Software: The same website provides an implementation of a neural network classifier for this image data. The code is quite robust, and pretty well documented in an associated homework assignment.

The second data set consists of 2253 female and 1745 male rectified frontal face images scraped from the hotornot.com website by Ryan White along with user ratings of attractiveness. The data set can be found here:

Facial Attractiveness Images.

Project ideas:

Try SVM's on this data, and compare their performance to that of the provided neural networks
Apply a clustering algorithm to find "similar" faces
Learn a facial attractiveness classifier. A recent NIPS paper on the topic of predicting facial attractiveness can be found here.

Precipitation Data

This dataset has includes 45 years of daily precipitation data from the Northwest of the US.

Project ideas:

Weather prediction: Learn a probabilistic model to predict rain levels
Sensor selection: Where should you place sensor to best predict rain

Something of your own: You have the option of working on a project of your own chosing. You do not have to pick one from this list.