CS276B / SYMBSYS 239J / LING 239J
Text Information Retrieval, Mining, and Exploitation
Winter 2003
Christopher Manning, Prabhakar Raghavan, and Hinrich
Schütze
Grading Information
General Grading Policy:
Midterm Exam: 20%
Final Exam: 40%
Project: 40%, divided as follows:
- Part 1A: 8%
- Part 1B: 8%
- Part 2: 24%
Midterm Grading Details
The midterm is a one-hour exam held during class time on Thursday, February 6, 2003. It is open book and open notes; only networked devices are disallowed. It will cover the following topics:
Document Clustering
- Agglomerative clustering
- Hierarchical clustering
- k-means
- Selecting the number of clusters
- Term vs. document space
- Feature selection
- Clustering to speed up scoring
- Labeling clusters
- Clustering as dimensionality reduction
- Evaluation of text clustering
- Link-based clustering
- Enumerative clustering/trawling
Text Classification
- Methods
- Generative models
- Maximum likelihood
- Naive Bayes
- Multivariate binomial vs. multinomial
- Feature selection via mutual information
- Conditional independence assumption
- Relation to information extraction
- Feature selection
- Evaluating categorization methods
Information Extraction
- Hand coded wrappers
- Wrapper induction: LR, HLRT, BWI wrappers
- Named entity recognition
- FSA-based methods: FASTUS
- Learning information extractors
- HMMs for information extraction
- Web wrappers and agents
Project 1A Grading Details
Project part 1A submissions are due by Monday, January 27, 2003 at
11:59 pm. Your submission should include:
- A subpackage of the citeunseen package which contains all
of your source code (*.java files). This code should compile
using ant, and if we need special libraries, please
make sure they are posted in the
/afs/ir/class/cs276b/lib/ directory, so that they are
automatically added to the classpath.
- You should have
a class in your package called Main that implements the
java.lang.Runnable interface and has a
main(String[]) method allowing command-line
execution. If arguments are required, please validate the
arguments and print appropriate usage information.
- Appropriate documentation in javadoc format. It
is especially important that you include a
package.html file in your top-level package to
provide an overview of the usage, the design, and the
rationale behind design decisions.
- A proposed name
for our collective citation indexing system (have fun with
this!).
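As a minimal sketch of the Main requirement above, the class below implements java.lang.Runnable, validates its command-line arguments, and prints usage information on error. The single input-directory argument, the checkArgs and usage helpers, and the omission of a package declaration are assumptions made for illustration; they are not part of the course specification.

```java
import java.io.File;

/**
 * Hypothetical entry point for a project subpackage (illustrative only).
 * A real submission would declare its own subpackage of citeunseen.
 */
final class Main implements Runnable {
    /** Directory to process; the single-argument design is an assumption. */
    private final File input;

    Main(File input) {
        this.input = input;
    }

    /** Returns the usage message printed when the arguments are invalid. */
    static String usage() {
        return "Usage: java Main <input-directory>";
    }

    /**
     * Validates the command-line arguments.
     *
     * @return null if the arguments are acceptable, otherwise an error message
     */
    static String checkArgs(String[] args) {
        if (args == null || args.length != 1) {
            return "Expected exactly one argument.";
        }
        if (!new File(args[0]).isDirectory()) {
            return "Not a directory: " + args[0];
        }
        return null;
    }

    /** Placeholder for the real processing step. */
    @Override
    public void run() {
        System.out.println("Processing " + input.getPath());
    }

    public static void main(String[] args) {
        String error = checkArgs(args);
        if (error != null) {
            // Report the problem and the expected usage, then return quietly.
            System.err.println(error);
            System.err.println(usage());
            return;
        }
        new Main(new File(args[0])).run();
    }
}
```

Keeping validation in a separate static method makes the argument checks easy to exercise without launching the whole program.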
We will grade your project submissions using the following criteria:
- Performance: The performance of your program on real data. We will consider the
portion of the data (or cases, as appropriate) that it is
able to process reasonably, the quality of the processing that it
performs, and the gracefulness with which
it handles errors (e.g., your code should not exit or
corrupt data structures).
- Design: The thoughtfulness and elegance with which your
code was designed. This includes the algorithms and data
structures that you chose to use and
the internal structure of the code (good use of object-oriented
programming principles, for example).
- Writeup: We will read your package-level
documentation in the package.html
file to understand the techniques that you tried and the
reasoning behind your design decisions.
- Intelligibility: We will consider the overall
intelligibility of your code. Your score will suffer if we
have difficulty understanding the functioning of your code
by reading the documentation and the code itself. Make sure
you do extensive commenting at the package, class, method,
and field level, in javadoc format. Also, some inline comments may be necessary.
- Initiative: We will assess the degree to which you
demonstrated initiative in researching and implementing
your procedure. It will help if you find a paper, algorithm,
or dataset that we are unaware of.
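One way to read the error-handling part of the Performance criterion is that a single bad input should neither stop the run nor leave a half-built entry behind. The sketch below illustrates that idea on a toy task (parsing year strings); the BatchProcessor class, its Result type, and the task itself are hypothetical and stand in for whatever processing your component performs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical batch processor that isolates per-record failures. */
final class BatchProcessor {

    /** Outcome of a batch: the parsed values plus the inputs that were skipped. */
    static final class Result {
        final List<Integer> parsed = new ArrayList<>();
        final List<String> skipped = new ArrayList<>();
    }

    /**
     * Parses each record as a year. A malformed or null record is recorded
     * as skipped and processing continues with the next record.
     */
    static Result process(List<String> records) {
        Result result = new Result();
        for (String record : records) {
            if (record == null) {
                result.skipped.add(null);
                continue;
            }
            try {
                // The value is added only after parsing succeeds, so a failure
                // never leaves a partial entry in the result.
                result.parsed.add(Integer.parseInt(record.trim()));
            } catch (NumberFormatException e) {
                // Do not exit: note the bad record and move on.
                result.skipped.add(record);
            }
        }
        return result;
    }
}
```

Returning the skipped records alongside the parsed ones also gives you a simple way to report what portion of the data was handled.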
Project 1B Grading Details
Project part 1B submissions are due by Tuesday, February 11, 2003 at
11:59 pm. Your submission should include:
- A subpackage of the citeunseen package which contains all
of your source code (*.java files). This code should compile
using ant, and if we need special libraries, please
make sure they are posted in the
/afs/ir/class/cs276b/lib/ directory, so that they are
automatically added to the classpath.
- You should have
a class in your package called Main that implements the
java.lang.Runnable interface and has a
main(String[]) method allowing command-line
execution. It should be able to run with no arguments.
- All parameters should be stored in the properties.dat file
located in the class data directory, and accessed using the Properties
and PropertiesManager classes.
- Your program should be hooked up to the correct database,
tables, and file system directories.
- Appropriate documentation in javadoc format. It
is especially important that you include a
package.html file in your top-level package to
provide an overview of the usage, the design, and the
rationale behind design decisions.
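To illustrate running with no arguments, the sketch below reads settings from key=value text with a fallback for missing or malformed values. It uses the standard java.util.Properties class as a stand-in, because the interface of the course's Properties and PropertiesManager classes is not described here; your submission should use the course classes. The ConfigSketch class and the keys db.table and batch.size are hypothetical.

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical configuration helper; a stand-in for the course classes. */
final class ConfigSketch {

    /** Parses key=value text into a Properties object. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Loads a properties file; the path is supplied by the caller. */
    static Properties loadFile(String path) throws IOException {
        Properties props = new Properties();
        try (Reader reader = new FileReader(path)) {
            props.load(reader);
        }
        return props;
    }

    /** Returns an integer setting, or the fallback if it is missing or malformed. */
    static int intSetting(Properties props, String key, int fallback) {
        String raw = props.getProperty(key);
        if (raw == null) {
            return fallback;
        }
        try {
            return Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }
}
```

With every parameter read this way, main(String[]) has nothing left to take from the command line.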
We will grade your project submissions using the following criteria:
- Performance: The performance of your program on real data. We will consider the
portion of the data (or cases, as appropriate) that it is
able to process reasonably, the quality of the processing that it
performs, and the gracefulness with which
it handles errors (e.g., your code should not exit or
corrupt data structures).
- Design: The thoughtfulness and elegance with which your
code was designed. This includes the algorithms and data
structures that you chose to use and
the internal structure of the code (good use of object-oriented
programming principles, for example).
- Writeup: We will read your package-level
documentation in the package.html
file to understand the techniques that you tried and the
reasoning behind your design decisions.
- Intelligibility: We will consider the overall
intelligibility of your code. Your score will suffer if we
have difficulty understanding the functioning of your code
by reading the documentation and the code itself. Make sure
you do extensive commenting at the package, class, method,
and field level, in javadoc format. Also, some inline
comments may be necessary.
- Initiative: We will assess the degree to which you
demonstrated initiative in researching and implementing
your procedure. It will help if you find a paper, algorithm,
or dataset that we are unaware of.
Last modified: Tue Feb 11 14:38:19 PST 2003