CS276B / SYMBSYS 239J / LING 239J
Text Information Retrieval, Mining, and Exploitation
Winter 2003

Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze

Project Tools Tutorial

These materials were developed by Teg Grenager. The section on java.sql was adapted from the CS145 web tutorial by Nathan Folkert and Mayank Bawa. Other parts were adapted from Sun Microsystems' Java language tutorial.

Java

Teams will develop their project components using Java version 2, standard edition. We do this because Java has several nice properties that will make it easier for all of our projects to work together. These properties include:

It is object oriented
Java is easily portable
Its package structure nicely supports modular development
Its networking package (java.net) makes networking easy and transparent on most platforms
Its database package (java.sql) makes database connectivity easy and transparent on most platforms
There are several other preexisting Java resources that we can make use of (we will discuss several below)
Its exception syntax enforces building robust code
Its automatic documentation system (javadoc, discussed below) makes code more readable and easier to use and reuse
Most students know it?

In particular, for each part of the project, each group will develop their code in a single new package (with possible subpackages) that is contained within the package citeunseen. The groups will then use javadoc (see below) to document their package and the classes contained in it. This will allow the code to live in a single repository, and be interoperable, while remaining nicely organized. With this structure it will be clear who is responsible for each class.

While you may do much of the initial development and testing of your module using a command-line interface, we eventually will want to build a scheduler to coordinate the activity of the various modules. Thus we also ask that each package have a class called ?? which implements the interface Runnable, so that it may be called automatically and run in a separate thread by the scheduler.
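As a rough sketch of this requirement, a module entry point might look like the class below. The class name ExampleModule and the helper runAndWait are hypothetical (the required class name is not given above), and the scheduler is assumed to do nothing more than run the object in its own thread.

```java
// Hypothetical module entry point; the real class name is not specified in this tutorial.
class ExampleModule implements Runnable {
    // volatile so that other threads see the updated value
    private volatile boolean finished = false;

    // Called by whoever runs this module, for example a scheduler thread.
    public void run() {
        // ... the module's real work would go here ...
        finished = true;
    }

    public boolean isFinished() {
        return finished;
    }

    // Assumed scheduler behaviour: start the module in its own thread and wait for it.
    static void runAndWait(Runnable module) {
        Thread worker = new Thread(module);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            // restore the interrupt flag so callers can react to it
            Thread.currentThread().interrupt();
        }
    }
}
```

Because the scheduler only needs the Runnable interface, it can run every group's module the same way without knowing anything about its internals.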

We have developed some initial sample code which demonstrates use of many of the tools described in this tutorial. We will describe in the CVS section below how to obtain it. The basic packages and classes we have developed are:

citeunseen.util.FileRetriever: Can retrieve resources from URLs and store them in files, with associated records in the database. Multi-threaded.
citeunseen.util.DownloaderRunnable: Downloads an InputStream to an OutputStream in a separate thread, and calls a specified method when finished.
citeunseen.util.ConnectionFactory: Has a method that returns a connection to the citationindex database. This allows us to change the location of the database in a manner that is transparent to the user, if necessary.
citeunseen.util.Semaphore: Implements a semaphore for synchronization of multiple threads.
citeunseen.util.PrettyInteger: A cute helper class to convert integers to Strings of subranges of their decimal representation.
citeunseen.hubprocessor.URLFinder: Uses the Google API to get URLs of likely hub pages (pages that contain links to papers).
citeunseen.hubprocessor.HubRetriever: Extends the FileRetriever class to download hubs.
citeunseen.hubprocessor.HubProcessor: Extracts the relevant links to papers from hub HTML pages.
citeunseen.hubprocessor.LinkExtractor: Extracts the links from a particular hub HTML page.
citeunseen.paperprocessor.PaperRetriever: Extends the FileRetriever class to download papers.
citeunseen.paperprocessor.PaperConverter: Uses pstotext to convert PS files to text.
citeunseen.paperprocessor.PaperProcessor: Extracts relevant fields from text versions of papers.
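For readers who have not met semaphores before, here is a minimal sketch of a counting semaphore built on Java's wait and notify. It is an illustration only and is not the course's citeunseen.util.Semaphore; the class name CountingSemaphore and its methods are assumptions made for this example.

```java
// Illustrative counting semaphore; not the course's citeunseen.util.Semaphore.
class CountingSemaphore {
    private int permits;

    CountingSemaphore(int initialPermits) {
        if (initialPermits < 0) {
            throw new IllegalArgumentException("permits must be >= 0");
        }
        this.permits = initialPermits;
    }

    // Blocks until a permit is available, then takes it.
    public synchronized void acquire() throws InterruptedException {
        while (permits == 0) {
            wait();
        }
        permits--;
    }

    // Takes a permit only if one is free right now; never blocks.
    public synchronized boolean tryAcquire() {
        if (permits == 0) {
            return false;
        }
        permits--;
        return true;
    }

    // Returns a permit and wakes one waiting thread, if any.
    public synchronized void release() {
        permits++;
        notify();
    }

    public synchronized int availablePermits() {
        return permits;
    }
}
```

A multi-threaded downloader could, for instance, create the semaphore with as many permits as it wants simultaneous downloads, acquire one before each download, and release it when the download ends.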

For those of you unfamiliar with Java, Sun offers a great web-based tutorial that should bring you up to speed: http://java.sun.com/docs/books/tutorial

You will also find it useful to be comfortable with the Java 2 Standard Edition API, located at http://java.sun.com/j2se/1.4/docs/api. Pay particular attention to the classes in the java.lang, java.util, java.io, java.sql, and java.net packages.

CVS

We will all use CVS to manage concurrent access to the project codebase. It is extremely important that you learn how to use CVS properly, and that you use it religiously.

CVS stands for Concurrent Versions System, and it is a widely used version control system on UNIX platforms. It is a command-line system, but it is also integrated into Emacs (and hence XEmacs) very well, so you can diff the files and directories from within Emacs.

Why Use CVS?

Used properly, CVS can greatly reduce your integration time, and it keeps a record of all of your changes as you go. Every group member works on their own separate copy of the code, and after a feature is complete it is integrated into the master repository.

CVS Setup

For CVS to work properly, you need to set the environment variable CVSROOT to point to the code repository for the class, which is located at /afs/ir/class/cs276b/cvsDir. See the Unix Tricks section below for instructions on how to do this.
CVS assumes that the default editor is vi. Most people at Stanford feel more comfortable using Emacs. If you want to have Emacs be your default editor, you need to add the following line to your .cshrc file:
setenv EDITOR "emacs"
You need to completely logout from the machine and log back in so that the new environment variables are used.
There is a directory in the CVS repository called project. Everybody in the group needs to checkout the project from CVS.
CD to the directory where you want it to be:
mythXX:~> cd cs276b
Checkout the project:
mythXX:~/cs276b> cvs checkout project
This should create a copy of the codebase in your account.

Using CVS

Type cvs update at the root of the project in the beginning of every programming session to synchronize the project with the current master copy.
To add new files and directories, type
cvs add <filenames>
After finishing a feature or fixing a bug, type
cvs commit
to commit all the changes in the current directory or
cvs commit <filename>
to commit individual filename(s).

Integration

Integration happens for free during the commit process. CVS diffs the newly committed file with the previous versions, looking for changes and incorporating those changes into the repository. Your partners will get the new changes with the cvs update command, which will change the local checked out copy to reflect the repository.

However, at times CVS cannot merge things for you when you're trying to update. This typically happens when your partner has committed a new version of the file since the last time you updated, and you've also edited the file (so that CVS sees two sets of changes since the last version in the repository, and doesn't know which one supersedes the other). In such cases, it will flag the conflicting changes with <<<<<<< and >>>>>>>. You will have to merge these sections by hand. This is the trickiest part, so make sure you have both people present when you are resolving a merge conflict.

Please note that in order to prevent such conflicts from appearing in the repository, CVS will sometimes force you to update before you commit.

Where to get Info?

The best place to look is at the tutorials and documentation on the CVS homepage, located at http://www.cvshome.org or in the CVS manual, which you can get to in UNIX by typing

mythXX:~> info cvs

Conclusion

I would highly recommend reading the CVS website, located at http://www.cvshome.org/ (or info cvs) for basic information, looking at the following commands:

checkout, update, commit, add, remove, diff.

It is also a good idea to create a dummy project, add it to the repository, check it out, make some changes and commit them. Then a partner should make sure that they can update and that the changes show up in their directory.

I would also recommend playing with CVS from inside XEmacs. You can get to it from the Tools/VC menu, and you have options like Diff buffers, Diff Directories, Visit Other Version, etc.

I personally have found Visit Other Version along with Diff buffers extremely useful. There is even a nice color merging tool within XEmacs as well, very similar to Visual SourceSafe.

Try it out!

Ant

Instead of make we will be using ant as our build tool. ant is an XML-based build tool which makes it easy to compile large projects. It was developed by Apache and it is open source. We find its build files easier to read than makefiles. We have already installed ant on the leland machines (where we advise that you do your development). For ant to run properly, you first need to add the path /afs/ir/class/cs276b/software/jakarta-ant-1.5.1/bin to your PATH environment variable definition. You also need to set the environment variable ANT_HOME to the directory /afs/ir/class/cs276b/software/jakarta-ant-1.5.1 and the variable JAVA_HOME to the directory /usr/pubsw/package/Languages/jdk-1.4.0/sun4x_58/apps/jdk-1.4.0 (assuming that you're using a Sun machine; if not, adjust that directory appropriately). For more information on how to do this, see the Environmental Variables section below.

Once you have made the changes above, if your project directory is ~/project you can compile the entire codebase for the project by running ant as follows:

elaine42:~> cd ~/project
elaine42:~/project> ant
Buildfile: build.xml

prepare:

compile:

BUILD SUCCESSFUL
Total time: 1 second

ant uses the file build.xml, located in the project directory, to decide which files to compile, and in what order. I don't think you will need to change the file build.xml but if you do, please let me know. Note that build.xml tells ant to put all of the class files in the directory classes and in the right subdirectories.

Because we have specified the location of the Java libraries (including special ones) in the build.xml file, you shouldn't have to reset your CLASSPATH variable manually. More information and documentation of ant is available at http://jakarta.apache.org/ant/.

Javadoc

javadoc is another wonderful tool that we will be using. It is easy to use, and will save you time in documenting and writing up your project results. To use Javadoc, simply write your package, class, method, and field comments appropriately, and then run javadoc (using ant, see below) to generate a set of HTML pages that bring together all of your comments in a sensible manner. We ask that you turn in javadocs with your assignments, following the conventions described below. This will be the only writeup that we ask you to submit. Of particular importance are the package.html files which describe the functionality of the package as a whole. In these files we would like you to submit additional results about what other methods you tried and why they were or were not used.

We review next how to write comments so that javadoc will understand them.

Format of a Doc Comment

A doc comment is written in HTML and must precede a class, field, constructor or method declaration. It is made up of two parts -- a description followed by block tags. In this example, the block tags are @param, @return, and @see.

Example

   /**
    * Returns an Image object that can then be painted on the screen. 
    * The url argument must specify an absolute {@link URL}. The name
    * argument is a specifier that is relative to the url argument. 
    * <p>
    * This method always returns immediately, whether or not the 
    * image exists. When this applet attempts to draw the image on
    * the screen, the data will be loaded. The graphics primitives 
    * that draw the image will incrementally paint on the screen. 
    *
    * @param  url  an absolute URL giving the base location of the image
    * @param  name the location of the image, relative to the url argument
    * @return      the image at the specified URL
    * @see         Image
    */
    public Image getImage(URL url, String name) {
	try {
	    return getImage(new URL(url, name));
	} catch (MalformedURLException e) {
	    return null;
	}
    }

Notes:

The resulting HTML from running Javadoc is shown below
Each line above is indented to align with the code below the comment.
The first line contains the begin-comment delimiter (/**).
Starting with Javadoc 1.4, the leading asterisks are optional.
Write the first sentence as a short summary of the method, as Javadoc automatically places it in the method summary table (and index).
Notice the inline tag {@link URL}, which converts to an HTML hyperlink pointing to the documentation for the URL class. This inline tag can be used anywhere that a comment can be written, such as in the text following block tags.
If you have more than one paragraph in the doc comment, separate the paragraphs with a <p> paragraph tag, as shown.
Insert a blank comment line between the description and the list of tags, as shown.
The first line that begins with an "@" character ends the description. There is only one description block per doc comment; you cannot continue the description following block tags.
The last line contains the end-comment delimiter (*/). Note that unlike the begin-comment delimiter, the end-comment contains only a single asterisk.

For more examples, see Simple Examples.

So lines won't wrap, limit any doc-comment lines to 80 characters.

Here is what the previous example would look like after running the Javadoc tool:

getImage

public Image getImage(URL url,
                      String name)

Returns an Image object that can then be painted on the screen. The url argument must specify an absolute URL. The name argument is a specifier that is relative to the url argument.

This method always returns immediately, whether or not the image exists. When this applet attempts to draw the image on the screen, the data will be loaded. The graphics primitives that draw the image will incrementally paint on the screen.

Parameters: url - an absolute URL giving the base location of the image; name - the location of the image, relative to the url argument
Returns: the image at the specified URL
See Also: Image

Also see Troubleshooting Curly Quotes (Microsoft Word) at the end of this document.

Package-Level Comments

With Javadoc 1.2, package-level doc comments are available. Each package can have its own package-level doc comment source file that the Javadoc tool will merge into the documentation that it produces. This file is named package.html (the same name is used for all packages). This file is kept in the source directory along with all the *.java files. (Do not put the package.html file in the new doc-files source directory, because those files are only copied to the destination and are not processed.)

Here's an example of a package-level source file for java.text and the file that the Javadoc tool generates:

      package.html --------------> package-summary.html
      (source file)    javadoc     (destination file)

The Javadoc tool processes package.html by doing three things:

Copies its contents (everything between <body> and </body>) below the summary tables in the destination file package-summary.html.
Processes any @see, @since or {@link} Javadoc tags that are present.
Copies the first sentence to the right-hand column of the Overview Summary.

Template for package.html source file

At Sun Microsystems, we use the following template when creating a new package doc comment file. This contains a copyright statement. Obviously, if you are from a different company, you would supply your own copyright statement. An engineer would copy this whole file, rename it to package.html, and delete the lines set off with hash marks: #####. One such file should go into each package directory of the source tree.

Empty Template for Package-Level Doc Comment File

Contents of package.html source file

The package doc comment should provide (directly or via links) everything necessary to allow programmers to use the package. It is a very important piece of documentation: for many facilities (those that reside in a single package but not in a single class) it is the first place where programmers will go for documentation. It should contain a short, readable description of the facilities provided by the package (in the introduction, below) followed by pointers to detailed documentation, or the detailed documentation itself, whichever is appropriate. Which is appropriate will depend on the package: a pointer is appropriate if it's part of a larger system (such as, one of the 37 packages in Corba), or if a Framemaker document already exists for the package; the detailed documentation should be contained in the package doc comment file itself if the package is self-contained and doesn't require extensive documentation (such as java.math).

To sum up, the primary purpose of the package doc comment is to describe the purpose of the package, the conceptual framework necessary to understand and to use it, and the relationships among the classes that comprise it. For large, complex packages (and those that are part of large, complex APIs) a pointer to an external architecture document is warranted.

The following are the sections and headings you should use when writing a package-level comment file. There should be no heading before the first sentence, because the Javadoc tool picks up the first text as the summary statement.

Make the first sentence a summary of the package. For example: "Provides classes and interfaces for handling text, dates, numbers and messages in a manner independent of natural languages."
Describe what the package contains and state its purpose.

Package Specification

Include a description of or links to any package-wide specifications for this package that are not included in the rest of the javadoc-generated documentation. For example, the java.awt package might describe how the general behavior in that package is allowed to vary from one operating system to another (Windows, Solaris, Mac).
Include links to any specifications written outside of doc comments (such as in FrameMaker or whatever) if they contain assertions not present in the javadoc-generated files.
An assertion is a statement a conforming implementor would have to know in order to implement the Java platform.
On that basis, at Sun, references in this section are critical to the Java Compatibility Kit (JCK). The Java Compatibility Kit includes a test to verify each assertion, to determine what passes as Java Compatible^TM. The statement "Returns an int" is an assertion. An example is not an assertion.
Some "specifications" that engineers have written contain no assertions not already stated in the API specs (javadoc) -- they just elaborate on the API specs. In this respect, such a document should not be referred to in this section, but rather should be referred to in the next section.
Include specific references. If only a section of a referenced document should be considered part of the API spec, then you should link or refer to only that section and refer to the rest of the document in the next section. The idea is to clearly delineate what is part of the API spec and what is not, so the JCK team can write tests with the proper breadth. This might even encourage some writers to break documents apart so specs are separate.

Related Documentation

Include references to any documents that do not contain specification assertions, such as overviews, tutorials, examples, demos, and guides.

Class and Interface Summary
[Omit this section until we implement @category tag]

Describe logical groupings of classes and interfaces
@see other packages, classes and interfaces
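To make the sections above concrete, here is a minimal sketch of what a package.html for one of the project packages might contain. The wording of the summary sentence and the choice of the citeunseen.hubprocessor package are illustrative assumptions; this is not the Sun template referred to above.

```html
<html>
<head>
<title>citeunseen.hubprocessor</title>
</head>
<body>
Provides classes for finding hub pages and extracting links to papers from them.

<h2>Package Specification</h2>
<p>(Describe or link to any package-wide specifications that are not covered
by the class-level comments.)</p>

<h2>Related Documentation</h2>
<p>(Overviews, tutorials, and notes on which other methods were tried and why
they were or were not used.)</p>

<!-- Put @see and @since tags here, just before the closing body tag. -->
@see citeunseen.util
</body>
</html>
```

Note that no heading precedes the first sentence, so the Javadoc tool can pick it up as the package summary.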

Running javadoc

Note that ant can also be used to build javadocs. It is already configured to put everything in the right place.

elaine42:~/project> ant javadoc
Buildfile: build.xml

prepare:

compile:

javadoc:
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package citeunseen.hubprocessor...
  [javadoc] Loading source files for package citeunseen.paperprocessor...
  [javadoc] Loading source files for package citeunseen.util...
  [javadoc] Constructing Javadoc information...
  [javadoc] Building tree for all the packages and classes...
  [javadoc] Building index for all the packages and classes...
  [javadoc] Building index for all classes...

BUILD SUCCESSFUL
Total time: 2 minutes 48 seconds

More javadoc information is located at http://java.sun.com/j2se/javadoc/.

MySQL

We'll be using MySQL as our database for this project. Your programs will be connecting to it using JDBC (see below). For development and testing purposes, however, you will sometimes want to interact directly with the database using the mysql client. In order to connect to the database, you will need an account. Please email the TA to ask for an account, and you will receive an email with a username and password. Once you have these you can use the mysql client by typing the following at the command line:

elaine42:~> mysql -u username -p -h tree1 --socket=/tmp/cs276b.sock citationindex
Enter password: xxxxxx

Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3 to server version: 4.0.7-gamma-standard

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql>

Since your username is the same as your login name on the elaine systems, you can omit the "-u username" part.

The mysql> prompt shows that the mysql client is running. On the command line above, the term citationindex tells the mysql client to connect to the database called "citationindex", which is the one that you should be using for this class. The -u username option tells mysql what username to log in as; if omitted, it tries to use your leland account username. The -p option signifies that you would like to be prompted for a password; without it, you will not be able to connect to the database. The -h elaine29 option tells the mysql client to connect to the database server that is running on the host elaine29 (as opposed to the localhost). Currently we are running a MySQL server on elaine29 in Sweet Hall for development purposes. (We may change this if it interferes with Sweet Hall users; if we change it for any reason, we will alert you.) For the production system, we will be installing a MySQL server on a dedicated machine that is outside of the Leland file system. You should still be able to connect to it just by specifying a different host, and we will give instructions for this when the time comes.

Once you are in the mysql client, you interact with it using SQL statements. For example, typing show tables causes mysql to print out information about all of the tables:

mysql> show tables;
+-------------------------+
| Tables_in_citationindex |
+-------------------------+
| Author                  |
| AuthorNames             |
| Authorship              |
| Citation                |
| CitationInstance        |
| HubInstance             |
| HubPaperInstance        |
| Journal                 |
| JournalName             |
| PageInstance            |
| Paper                   |
| PaperInstance           |
+-------------------------+
12 rows in set (0.00 sec)

mysql>

You can also do queries such as:

mysql> select count(*) from HubInstance;
+----------+
| count(*) |
+----------+
|      870 |
+----------+
1 row in set (0.35 sec)

mysql>

We have tried to design a schema that will allow every part of the project to store the data they need. However, you will probably find that you need fields that we didn't anticipate. Please email the TA with your request and he will change the schema if it makes sense to do so.

Full documentation for the MySQL database is located at http://www.mysql.com/documentation/mysql/bychapter/. You can also read the textbook for cs145, called Database Systems, The Complete Book by Garcia-Molina, Ullman, and Widom. Also, a good tutorial on the SQL language syntax (for those of you who need brushing up) is located at http://eveander.com/arsdigita/books/sql/.

java.sql Package (JDBC)

Call-level interfaces such as JDBC are programming interfaces allowing external access to SQL database manipulation and update commands. They allow the integration of SQL calls into a general programming environment by providing library routines which interface with the database. In particular, Java based JDBC has a rich collection of routines which make such an interface extremely simple and intuitive.

Here is an easy way of visualizing what happens in a call level interface: You are writing a normal Java program. Somewhere in the program, you need to interact with a database. Using standard library routines, you open a connection to the database. You then use JDBC to send your SQL code to the database, and process the results that are returned. When you are done, you close the connection.

Establishing A Connection

As we said earlier, before a database can be accessed, a connection must be opened between our program (the client) and the database (the server). We want to make this part transparent to you (so that we can change it if we need to), so we have created a class called citeunseen.util.ConnectionFactory which has a method called getTestConnection(String user, String password). This method returns a java.sql.Connection object that can be used for future interactions with the citationindex database (wherever it may live). You don't need it yet, but there is another method called getProductionConnection(String user, String password) which will return a Connection to the production database. A code snippet shows this in action:

    Connection con = null;
    con = ConnectionFactory.getTestConnection("myuser","mypassword");

That's it! The connection returned is an open connection which we will use to pass SQL statements to the database. In this code snippet, con is an open connection, and we will use it below.
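To give a feel for what such a factory hides, here is a minimal sketch written against the standard java.sql.DriverManager class. It is not the course's ConnectionFactory: the class name, the placeholder host name, and the URL layout are assumptions for illustration, and a MySQL JDBC driver would have to be on the classpath for the connection to open.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Illustrative only: one way a factory such as ConnectionFactory could be written.
// The host name, URL layout and class name are assumptions, not the course's code.
class ExampleConnectionFactory {
    private static final String TEST_HOST = "dbhost.example.edu"; // placeholder host
    private static final String DATABASE = "citationindex";

    // Builds a MySQL JDBC URL of the form jdbc:mysql://host/database.
    static String jdbcUrl(String host, String database) {
        return "jdbc:mysql://" + host + "/" + database;
    }

    // Opens a connection; requires a MySQL JDBC driver on the classpath.
    static Connection getTestConnection(String user, String password)
            throws SQLException {
        return DriverManager.getConnection(jdbcUrl(TEST_HOST, DATABASE), user, password);
    }
}
```

Because callers only ever see getTestConnection, moving the database to another machine would mean changing one constant rather than every module.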

Creating JDBC Statements

A JDBC Statement object is used to send your SQL statements to the DBMS, and should not be confused with an SQL statement. A JDBC Statement object is associated with an open connection, and not with any single SQL statement. You can think of a JDBC Statement object as a channel sitting on a connection, and passing one or more of your SQL statements (which you ask it to execute) to the DBMS.

An active connection is needed to create a Statement object. The following code snippet, using our Connection object con, does it for you:

    Statement stmt = con.createStatement() ;

At this point, a Statement object exists, but it does not have an SQL statement to pass on to the DBMS. We learn how to do that in a following section.

Creating JDBC PreparedStatement

Sometimes, it is more convenient or more efficient to use a PreparedStatement object for sending SQL statements to the DBMS. The main feature which distinguishes it from its superinterface Statement is that, unlike Statement, it is given an SQL statement right when it is created. This SQL statement is then sent to the DBMS right away, where it is compiled. Thus, in effect, a PreparedStatement is associated as a channel with a connection and a compiled SQL statement.

The advantage offered is that if you need to use the same, or similar query with different parameters multiple times, the statement can be compiled and optimized by the DBMS just once. Contrast this with a use of a normal Statement where each use of the same SQL statement requires a compilation all over again.

PreparedStatements are also created with a Connection method. The following snippet shows how to create a parameterized SQL statement with three input parameters:

   PreparedStatement prepareUpdatePrice = con.prepareStatement( 
      "UPDATE Sells SET price = ? WHERE bar = ? AND beer = ?");

Before we can execute a PreparedStatement, we need to supply values for the parameters. This can be done by calling one of the setXXX methods defined in the class PreparedStatement. Most often used methods are setInt, setFloat, setDouble, setString etc. You can set these values before each execution of the prepared statement.

Continuing the above example, we would write:

   prepareUpdatePrice.setInt(1, 3);
   prepareUpdatePrice.setString(2, "Bar Of Foo");
   prepareUpdatePrice.setString(3, "BudLite");

Executing CREATE/INSERT/UPDATE Statements

Executing SQL statements in JDBC varies depending on the "intention" of the SQL statement. DDL (data definition language) statements such as table creation and table alteration statements, as well as statements to update the table contents, are all executed using the method executeUpdate. Notice that these commands change the state of the database, hence the name of the method contains "Update".

The following snippet has examples of executeUpdate statements.

   Statement stmt = con.createStatement();

   // VARCHAR is used here; Oracle's VARCHAR2 type is not accepted by MySQL
   stmt.executeUpdate("CREATE TABLE Sells " +
      "(bar VARCHAR(40), beer VARCHAR(40), price REAL)" );
   stmt.executeUpdate("INSERT INTO Sells " +
      "VALUES ('Bar Of Foo', 'BudLite', 2.00)" );

   String sqlString = "CREATE TABLE Bars " +
      "(name VARCHAR(40), address VARCHAR(80), license INT)" ;
   stmt.executeUpdate(sqlString);

Since each SQL statement will not quite fit on one line on the page, we have split it into two strings concatenated by a plus sign (+) so that it will compile. Pay special attention to the space following "INSERT INTO Sells" to separate it in the resulting string from "VALUES". Note also that we are reusing the same Statement object rather than having to create a new one.

When executeUpdate is used to call DDL statements, the return value is always zero, while data modification statement executions will return a value greater than or equal to zero, which is the number of tuples affected in the relation.

While working with a PreparedStatement, we would execute such a statement by first plugging in the values of the parameters (as seen above), and then invoking the executeUpdate on it.

      int n = prepareUpdatePrice.executeUpdate() ;
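The benefit of compiling once and executing many times can be sketched as follows. The class and method names (PriceUpdater, updatePrices) are hypothetical; the SQL is the parameterized UPDATE from above, and the Sells table is assumed to exist.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical helper that reuses one PreparedStatement for several updates.
class PriceUpdater {
    static final String UPDATE_SQL =
        "UPDATE Sells SET price = ? WHERE bar = ? AND beer = ?";

    // Applies one price to several beers at the same bar and returns
    // the total number of rows the DBMS reports as changed.
    static int updatePrices(Connection con, String bar, String[] beers, float price)
            throws SQLException {
        PreparedStatement ps = con.prepareStatement(UPDATE_SQL);
        try {
            int total = 0;
            for (int i = 0; i < beers.length; i++) {
                ps.setFloat(1, price);      // parameter 1: price
                ps.setString(2, bar);       // parameter 2: bar
                ps.setString(3, beers[i]);  // parameter 3: beer
                total += ps.executeUpdate();
            }
            return total;
        } finally {
            ps.close(); // release the statement even if an update fails
        }
    }
}
```

The statement is prepared once outside the loop; only the parameter values change between executions.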

Executing SELECT Statements

As opposed to the previous section statements, a query is expected to return a set of tuples as the result, and not change the state of the database. Not surprisingly, there is a corresponding method called executeQuery, which returns its results as a ResultSet object:

   String bar, beer ;
   float price ;

   ResultSet rs = stmt.executeQuery("SELECT * FROM Sells");
   while ( rs.next() ) {
      bar = rs.getString("bar");
      beer = rs.getString("beer");
      price = rs.getFloat("price");
      System.out.println(bar + " sells " + beer + " for " + price + " Dollars.");
   }

The bag of tuples resulting from the query is contained in the variable rs, which is an instance of ResultSet. A set is not of much use to us unless we can access each row and the attributes in each row. The ResultSet provides a cursor, which can be used to access each row in turn. The cursor is initially set just before the first row. Each invocation of the method next moves it to the next row and returns true if one exists, or returns false if there is no remaining row.

We can use the getXXX method of the appropriate type to retrieve the attributes of a row. In the previous example, we used the getString and getFloat methods to access the column values. Notice that we provided the name of the column whose value is desired as a parameter to the method. Also note that the character-string columns bar and beer have been converted to Java String, and the REAL column price to Java float.

Equivalently, we could have specified the column number instead of the column name, with the same result. Thus the relevant statements would be:

      bar = rs.getString(1);
      price = rs.getFloat(3);
      beer = rs.getString(2);

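While working with a PreparedStatement, we would execute a query by first plugging in the values of the parameters, and then invoking executeQuery on it. This only works if the statement was prepared from a SELECT; a statement prepared from an UPDATE, INSERT, or DELETE (such as prepareUpdatePrice) must be run with executeUpdate instead. Below, preparedQuery stands for a PreparedStatement created from a SELECT statement:

      ResultSet rs = preparedQuery.executeQuery() ;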
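
The cursor loop and the getXXX calls described in this section can be packaged into a small helper. The following is only a sketch, assuming the Sells(bar, beer, price) table used in the examples and a JDK recent enough for generics; the class name SellsRows and the method readAll are hypothetical names introduced here for illustration, not part of JDBC or of the project code.

```java
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class SellsRows {

    // Walks the cursor from its current position to the end of the result
    // set and returns one formatted line per row.
    static List<String> readAll(ResultSet rs) throws SQLException {
        List<String> lines = new ArrayList<String>();
        while (rs.next()) {                       // false once no row remains
            String bar = rs.getString("bar");     // access by column name
            String beer = rs.getString("beer");
            float price = rs.getFloat("price");   // yields 0 for SQL NULL
            if (rs.wasNull()) {                   // was the last column read NULL?
                lines.add(bar + " sells " + beer + " at an unknown price");
            } else {
                lines.add(bar + " sells " + beer + " for " + price);
            }
        }
        return lines;
    }
}
```

A caller would pass in the result of stmt.executeQuery("SELECT * FROM Sells") and remains responsible for closing the ResultSet and the Statement afterwards.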

Notes on Accessing ResultSet

JDBC also offers a number of methods for finding out where you are in the result set: getRow, isFirst, isBeforeFirst, isLast, and isAfterLast.

A scrollable cursor allows free access to any row in the result set. By default, cursors scroll forward only and are read-only. When creating a Statement for a Connection, you can change the type of ResultSet to a more flexible scrolling or updatable model:

      Statement stmt = con.createStatement(
         ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY);
      ResultSet rs = stmt.executeQuery("SELECT * FROM Sells");

The different options for types are TYPE_FORWARD_ONLY, TYPE_SCROLL_INSENSITIVE, and TYPE_SCROLL_SENSITIVE. You can choose whether the cursor is read-only or updatable using the options CONCUR_READ_ONLY and CONCUR_UPDATABLE. With the default cursor, you can only scroll forward, using rs.next(). With scrollable cursors you have more options:

      rs.absolute(3);          // moves to the third tuple
      rs.previous();           // moves back one tuple (tuple 2)
      rs.relative(2);          // moves forward two tuples (tuple 4)
      rs.relative(-3);         // moves back three tuples (tuple 1)

There are a great many more details to the scrollable cursor feature. Scrollable cursors, though useful for certain applications, carry considerable overhead and should be used with restraint. More information, including a more detailed tutorial on cursor manipulation techniques, can be found in the New Features in the JDBC 2.0 API documentation.

Transactions

JDBC allows SQL statements to be grouped together into a single transaction. Thus, we can ensure the ACID (Atomicity, Consistency, Isolation, Durability) properties using JDBC transactional features.

Transaction control is performed by the Connection object. When a connection is created, it is in auto-commit mode by default. This means that each individual SQL statement is treated as a transaction by itself, and is committed as soon as its execution has finished. (This is not exactly precise, but we can gloss over this subtlety for most purposes.)

We can turn off auto-commit mode for an active connection with:

      con.setAutoCommit(false) ;

and turn it on again with:

      con.setAutoCommit(true) ;

Once auto-commit is off, no SQL statements will be committed (that is, the database will not be permanently updated) until you have explicitly told it to commit by invoking the commit() method:

      con.commit() ;

At any point before commit, we may invoke rollback() to roll back the transaction and restore values to the last commit point (before the attempted updates).

Here is an example which ties these ideas together:

      con.setAutoCommit(false);
      Statement stmt = con.createStatement();
      stmt.executeUpdate("INSERT INTO Sells VALUES('Bar Of Foo', 'BudLite', 1.00)" );
      con.rollback();
      stmt.executeUpdate("INSERT INTO Sells VALUES('Bar Of Joe', 'Miller', 2.00)" );
      con.commit();
      con.setAutoCommit(true);

Let's walk through the example to understand the effects of the various methods. We first set auto-commit off, indicating that the following statements need to be considered as a unit. We attempt to insert the ('Bar Of Foo', 'BudLite', 1.00) tuple into the Sells table. However, this change has not been made final (committed) yet. When we invoke rollback, we cancel our insert and in effect remove any intention of inserting the above tuple. Note that Sells is still as it was before we attempted the insert. We then attempt another insert, and this time we commit the transaction. It is only now that Sells is permanently affected and has the new tuple in it. Finally, we reset the connection to auto-commit again.

We can also set transaction isolation levels as desired. For example, we can set the transaction isolation level to TRANSACTION_READ_COMMITTED, which forbids dirty reads: a value written by another transaction cannot be read until it has been committed. The Connection interface defines five such constants (TRANSACTION_NONE, TRANSACTION_READ_UNCOMMITTED, TRANSACTION_READ_COMMITTED, TRANSACTION_REPEATABLE_READ, and TRANSACTION_SERIALIZABLE). The default level depends on the driver and the underlying DBMS, so do not assume that it is serializable. JDBC allows us to find out the transaction isolation level the connection is set to (using the Connection method getTransactionIsolation) and to set the appropriate level (using the Connection method setTransactionIsolation).
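
Because getTransactionIsolation returns a bare int, it can be handy to translate it back into a name when checking what a connection is actually using. The following is a small sketch; the class IsolationLevels and its methods are hypothetical names introduced for illustration, while the constants themselves come from java.sql.Connection.

```java
import java.sql.Connection;

// Hypothetical helper, for illustration only.
class IsolationLevels {

    // Translates one of the five int constants from java.sql.Connection
    // into a readable name.
    static String name(int level) {
        switch (level) {
            case Connection.TRANSACTION_NONE:             return "NONE";
            case Connection.TRANSACTION_READ_UNCOMMITTED: return "READ_UNCOMMITTED";
            case Connection.TRANSACTION_READ_COMMITTED:   return "READ_COMMITTED";
            case Connection.TRANSACTION_REPEATABLE_READ:  return "REPEATABLE_READ";
            case Connection.TRANSACTION_SERIALIZABLE:     return "SERIALIZABLE";
            default:                                      return "UNKNOWN(" + level + ")";
        }
    }

    // True for the levels that rule out dirty reads.
    static boolean preventsDirtyReads(int level) {
        return level == Connection.TRANSACTION_READ_COMMITTED
            || level == Connection.TRANSACTION_REPEATABLE_READ
            || level == Connection.TRANSACTION_SERIALIZABLE;
    }

    public static void main(String[] args) {
        int[] levels = {
            Connection.TRANSACTION_READ_UNCOMMITTED,
            Connection.TRANSACTION_READ_COMMITTED,
            Connection.TRANSACTION_SERIALIZABLE
        };
        for (int level : levels) {
            System.out.println(name(level) + " prevents dirty reads: "
                               + preventsDirtyReads(level));
        }
    }
}
```

With a live connection, one would call IsolationLevels.name(con.getTransactionIsolation()) to see which level the driver has chosen.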

Usually rollback will be used in combination with Java's exception handling ability to recover from (un)predictable errors. Such a combination provides an excellent and easy mechanism for handling data integrity. We study error handling using JDBC in the next section.
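
As a preview of that combination, here is a minimal sketch of a method that runs several updates as one unit, assuming a JDK with try-with-resources (Java 7 or later). The class TransactionSketch and the method runAtomically are hypothetical names introduced for illustration; only the Connection and Statement calls are part of JDBC.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper, for illustration only.
class TransactionSketch {

    // Runs all updates as a single transaction: either every statement is
    // committed, or the connection is rolled back and the error is rethrown.
    static void runAtomically(Connection con, String... updates) throws SQLException {
        boolean previous = con.getAutoCommit();   // remember the caller's mode
        con.setAutoCommit(false);
        try (Statement stmt = con.createStatement()) {
            for (String sql : updates) {
                stmt.executeUpdate(sql);
            }
            con.commit();                         // make all changes permanent
        } catch (SQLException ex) {
            con.rollback();                       // undo everything since the last commit
            throw ex;
        } finally {
            con.setAutoCommit(previous);          // restore the caller's mode
        }
    }
}
```

The finally block restores the auto-commit mode on both the success and the failure path, so the caller does not need to repeat that call in every handler.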

Handling Errors with Exceptions

The truth is that errors always occur in software programs. Often, database programs are critical applications, and it is imperative that errors be caught and handled gracefully. Programs should recover and leave the database in a consistent state. Rollbacks used in conjunction with Java exception handlers are a clean way of meeting this requirement.

The client (program) accessing a server (database) needs to be aware of any errors returned from the server. JDBC gives access to such information by providing two levels of error conditions: SQLException and SQLWarning. SQLExceptions are Java exceptions which, if not handled, will terminate the application. SQLWarning is a subclass of SQLException, but warnings represent nonfatal errors or unexpected conditions and, as such, can be ignored.

In Java, statements which may throw an exception are enclosed in a try block. If a statement in the try block throws an exception, it can be caught in one of the corresponding catch statements. Each catch statement specifies which exceptions it is ready to catch. Warnings are generally not thrown; they are attached to the object that raised them and must be asked for explicitly, as described at the end of this section.

Here is an example of catching an SQLException, and using the error condition to rollback the transaction:

      try {
         con.setAutoCommit(false) ;
         stmt.executeUpdate("CREATE TABLE Sells (bar VARCHAR2(40), " +
                            "beer VARHAR2(40), price REAL)") ;
         stmt.executeUpdate("INSERT INTO Sells VALUES " +
                            "('Bar Of Foo', 'BudLite', 2.00)") ;
         con.commit() ;
         con.setAutoCommit(true) ;

      }catch(SQLException ex) {
         System.err.println("SQLException: " + ex.getMessage()) ;
         con.rollback() ;
         con.setAutoCommit(true) ;
      }

In this case, an exception is thrown because beer is declared as VARHAR2, which is a misspelling of VARCHAR2. Since there is no such data type in our DBMS, an SQLException is thrown. The output in this case would be:

      SQLException: ORA-00902: invalid datatype
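
Besides the message, an SQLException carries an SQLState string and a vendor-specific error code, and several exceptions may be chained together. The following sketch collects all three for every exception in a chain. The class SqlErrorReport and the method describe are hypothetical names introduced for illustration, and the SQLState value used in main is made up for the demonstration rather than taken from a real Oracle session.

```java
import java.sql.SQLException;

// Hypothetical helper, for illustration only.
class SqlErrorReport {

    // Builds one line per exception in the chain:
    // message, SQLState, and vendor error code.
    static String describe(SQLException first) {
        StringBuilder sb = new StringBuilder();
        for (SQLException ex = first; ex != null; ex = ex.getNextException()) {
            sb.append("SQLException: ").append(ex.getMessage())
              .append(" [SQLState=").append(ex.getSQLState())
              .append(", code=").append(ex.getErrorCode()).append("]\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Illustrative values only; a real driver fills these in.
        SQLException ex = new SQLException("ORA-00902: invalid datatype", "42000", 902);
        System.out.print(describe(ex));
    }
}
```

In a catch block such as the one above, System.err.print(SqlErrorReport.describe(ex)) would replace the single getMessage call.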

Alternatively, if your datatypes were correct, an exception might still be thrown, for example if your database goes over its space quota and is unable to construct a new table.

SQLWarnings can be retrieved from Connection objects, Statement objects, and ResultSet objects. Warnings on the same object are chained together and reached through getNextWarning. A Statement's chain is cleared each time the statement is executed again, so if you execute another statement through your Statement object, any earlier warnings will be discarded. Here is a code snippet which illustrates the use of SQLWarnings:

      ResultSet rs = stmt.executeQuery("SELECT bar FROM Sells") ;
      // First warning raised while executing the statement, if any
      SQLWarning warn = stmt.getWarnings() ;
      if (warn != null)
         System.out.println("Message: " + warn.getMessage()) ;
      // Second warning in the result set's chain, if any
      SQLWarning warning = rs.getWarnings() ;
      if (warning != null)
         warning = warning.getNextWarning() ;
      if (warning != null)
         System.out.println("Message: " + warning.getMessage()) ;

SQLWarnings (as opposed to SQLExceptions) are actually rather rare -- the most common is a DataTruncation warning. The latter indicates that there was a problem while reading or writing data from the database.
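
Since warnings form a chain, a loop over getNextWarning is usually more convenient than testing them one at a time. Here is a minimal sketch; the class WarningChain and the method messages are hypothetical names introduced for illustration, and the warnings in main are constructed by hand only so that the example runs without a database.

```java
import java.sql.SQLWarning;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class WarningChain {

    // Collects the message of every warning in the chain, in order.
    // A null argument (no warnings) yields an empty list.
    static List<String> messages(SQLWarning first) {
        List<String> out = new ArrayList<String>();
        for (SQLWarning w = first; w != null; w = w.getNextWarning()) {
            out.add(w.getMessage());
        }
        return out;
    }

    public static void main(String[] args) {
        // Hand-built chain standing in for stmt.getWarnings().
        SQLWarning first = new SQLWarning("first warning");
        first.setNextWarning(new SQLWarning("second warning"));
        for (String message : messages(first)) {
            System.out.println("Message: " + message);
        }
    }
}
```

With a live statement, the call would be WarningChain.messages(stmt.getWarnings()), made before the statement is executed again.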

More information

More information about the java.sql package is available in the Java 2 API documentation, located at http://java.sun.com/j2se/1.4/docs/api.

java.net Package

Java has made it very easy to access resources over the network with the java.net package. The classes range from very low-level (DatagramPacket and Socket) to high-level (URL and URLConnection) and we should mostly need the high level classes for this project.

Creating a URL

The easiest way to create a URL object is from a String that represents the human-readable form of the URL address. This is typically the form that another person will use for a URL. For example, the URL for the Gamelan site, which is a directory of Java resources, takes the following form:

http://www.gamelan.com/

In your Java program, you can use a String containing this text to create a URL object:

URL gamelan = new URL("http://www.gamelan.com/");

The URL object created above represents an absolute URL. An absolute URL contains all of the information necessary to reach the resource in question. You can also create URL objects from a relative URL address.

A relative URL contains only enough information to reach the resource relative to (or in the context of) another URL.

Relative URL specifications are often used within HTML files. For example, suppose you write an HTML file called JoesHomePage.html. Within this page are links to other pages, PicturesOfMe.html and MyKids.html, that are on the same machine and in the same directory as JoesHomePage.html. The links to PicturesOfMe.html and MyKids.html from JoesHomePage.html could be specified just as filenames, like this:

<a href="PicturesOfMe.html">Pictures of Me</a>
<a href="MyKids.html">Pictures of My Kids</a>

These URL addresses are relative URLs. That is, the URLs are specified relative to the file in which they are contained--JoesHomePage.html.

In your Java programs, you can create a URL object from a relative URL specification. For example, suppose you know two URLs at the Gamelan site:

http://www.gamelan.com/pages/Gamelan.game.html
http://www.gamelan.com/pages/Gamelan.net.html

You can create URL objects for these pages relative to their common base URL: http://www.gamelan.com/pages/ like this:

URL gamelan = new URL("http://www.gamelan.com/pages/");
URL gamelanGames = new URL(gamelan, "Gamelan.game.html");
URL gamelanNetwork = new URL(gamelan, "Gamelan.net.html");

This code snippet uses the URL constructor that lets you create a URL object from another URL object (the base) and a relative URL specification. The general form of this constructor is:

URL(URL baseURL, String relativeURL)

The first argument is a URL object that specifies the base of the new URL. The second argument is a String that specifies the rest of the resource name relative to the base. If baseURL is null, then this constructor treats relativeURL like an absolute URL specification. Conversely, if relativeURL is an absolute URL specification, then the constructor ignores baseURL.

This constructor is also useful for creating URL objects for named anchors (also called references) within a file. For example, suppose the Gamelan.net.html file has a named anchor called BOTTOM at the bottom of the file. You can use the relative URL constructor to create a URL object for it like this:

URL gamelanNetworkBottom = new URL(gamelanNetwork, "#BOTTOM");
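
The behaviour of the two-argument constructor is easiest to see by printing a few resolved URLs. The following sketch does that with the Gamelan addresses used above; the class RelativeUrls and the method resolve are hypothetical names introduced for illustration. Note that on recent JDKs the URL constructors are deprecated and compile with a warning, but they still work as described here.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, for illustration only.
class RelativeUrls {

    // Resolves spec against base with the URL(URL, String) constructor
    // and returns the resulting URL as a String.
    static String resolve(String base, String spec) {
        try {
            return new URL(new URL(base), spec).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Base ends in a slash: the file name is appended to the directory.
        System.out.println(resolve("http://www.gamelan.com/pages/", "Gamelan.net.html"));
        // prints http://www.gamelan.com/pages/Gamelan.net.html

        // A bare reference keeps the base file and adds the anchor.
        System.out.println(resolve("http://www.gamelan.com/pages/Gamelan.net.html", "#BOTTOM"));
        // prints http://www.gamelan.com/pages/Gamelan.net.html#BOTTOM

        // Base without a trailing slash: its last segment is replaced.
        System.out.println(resolve("http://www.gamelan.com/pages", "Gamelan.net.html"));
        // prints http://www.gamelan.com/Gamelan.net.html
    }
}
```

The last case is a common source of broken links when crawling: whether the base ends in a slash changes the result.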

The URL class provides two additional constructors for creating a URL object. They are useful when you do not have a String containing the complete URL specification but do know its components, such as the host name, filename, port number, and reference of an HTTP URL.

For example, suppose you design a network browsing panel similar to a file browsing panel that allows users to choose the protocol, host name, port number, and filename. You can construct a URL from the panel's components. The first constructor creates a URL object from a protocol, host name, and filename. The following code snippet creates a URL to the Gamelan.net.html file at the Gamelan site:

new URL("http", "www.gamelan.com", "/pages/Gamelan.net.html");

This is equivalent to

new URL("http://www.gamelan.com/pages/Gamelan.net.html");

The first argument is the protocol, the second is the host name, and the last is the pathname of the file. Note that the filename contains a forward slash at the beginning. This indicates that the filename is specified from the root of the host.

The final URL constructor adds the port number to the list of arguments used in the previous constructor:

URL gamelan = new URL("http", "www.gamelan.com", 80,
                       "/pages/Gamelan.net.html");

This creates a URL object for the following URL:

http://www.gamelan.com:80/pages/Gamelan.net.html

If you construct a URL object using one of these constructors, you can get a String containing the complete URL address by using the URL object's toString method or the equivalent toExternalForm method.
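
To check what a URL assembled from components looks like, print its external form. A minimal sketch follows; the class UrlFromParts and the method build are hypothetical names introduced for illustration. Passing -1 as the port means "use the protocol's default", which is what the three-argument constructor does internally.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, for illustration only.
class UrlFromParts {

    // Assembles a URL from its components and returns its external form.
    static String build(String protocol, String host, int port, String file) {
        try {
            return new URL(protocol, host, port, file).toExternalForm();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Explicit port: it appears in the external form.
        System.out.println(build("http", "www.gamelan.com", 80, "/pages/Gamelan.net.html"));
        // prints http://www.gamelan.com:80/pages/Gamelan.net.html

        // Port -1: no port is written out.
        System.out.println(build("http", "www.gamelan.com", -1, "/pages/Gamelan.net.html"));
        // prints http://www.gamelan.com/pages/Gamelan.net.html
    }
}
```

In both calls the file argument starts with a forward slash, as recommended above for filenames specified from the root of the host.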

Parsing a URL

The URL class provides several methods that let you query URL objects. You can get the protocol, host name, port number, and filename from a URL using these accessor methods:

getProtocol: Returns the protocol identifier component of the URL.
getHost: Returns the host name component of the URL.
getPort: Returns the port number component of the URL. The getPort method returns an integer that is the port number. If the port is not set, getPort returns -1.
getFile: Returns the filename component of the URL.
getRef: Returns the reference component of the URL.

Note: Remember that not all URL addresses contain these components. The URL class provides these methods because HTTP URLs do contain these components and are perhaps the most commonly used URLs. The URL class is somewhat HTTP-centric.

You can use these getXXX methods to get information about the URL regardless of the constructor that you used to create the URL object.

The URL class, along with these accessor methods, frees you from ever having to parse URLs again! Given any string specification of a URL, just create a new URL object and call any of the accessor methods for the information you need. This small example program creates a URL from a string specification and then uses the URL object's accessor methods to parse the URL:

import java.net.*;
import java.io.*;

public class ParseURL {
    public static void main(String[] args) throws Exception {
        URL aURL = new URL("http://java.sun.com:80/docs/books/"
                           + "tutorial/index.html#DOWNLOADING");
        System.out.println("protocol = " + aURL.getProtocol());
        System.out.println("host = " + aURL.getHost());
        System.out.println("filename = " + aURL.getFile());
        System.out.println("port = " + aURL.getPort());
        System.out.println("ref = " + aURL.getRef());
    }
}

Here's the output displayed by the program:

protocol = http
host = java.sun.com
filename = /docs/books/tutorial/index.html
port = 80
ref = DOWNLOADING
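
Because getPort returns -1 when the URL names no port, code that needs an actual port number (a crawler grouping pages by server, say) has to fall back to the protocol's default. The sketch below does this with getDefaultPort, which is available from Java 1.4 on; the class UrlPorts and the method effectivePort are hypothetical names introduced for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, for illustration only.
class UrlPorts {

    // Returns the explicit port if the URL has one, otherwise the default
    // port of its protocol (for example 80 for http).
    static int effectivePort(String spec) {
        try {
            URL url = new URL(spec);
            int port = url.getPort();
            return port != -1 ? port : url.getDefaultPort();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(effectivePort("http://java.sun.com/docs/books/tutorial/index.html"));
        // prints 80
        System.out.println(effectivePort("http://java.sun.com:8080/docs/"));
        // prints 8080
    }
}
```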

Reading Directly from a URL

After you've successfully created a URL, you can call the URL's openStream() method to get a stream from which you can read the contents of the URL. The openStream() method returns a java.io.InputStream object, so reading from a URL is as easy as reading from an input stream.

The following small Java program uses openStream() to get an input stream on the URL http://www.yahoo.com/. It then opens a BufferedReader on the input stream and reads from the BufferedReader thereby reading from the URL. Everything read is copied to the standard output stream:

import java.net.*;
import java.io.*;

public class URLReader {
    public static void main(String[] args) throws Exception {
        URL yahoo = new URL("http://www.yahoo.com/");
        BufferedReader in = new BufferedReader(
                                new InputStreamReader(
                                yahoo.openStream()));

        String inputLine;

        while ((inputLine = in.readLine()) != null)
            System.out.println(inputLine);

        in.close();
    }
}

When you run the program, you should see, scrolling by in your command window, the HTML commands and textual content from the HTML file located at http://www.yahoo.com/. Alternatively, the program might hang or you might see an exception stack trace. If either of the latter two events occurs, you may have to set the proxy host so that the program can find the Yahoo server.
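
For project code it is often more convenient to get the page back as a String than to print it. The following is a minimal sketch of such a helper, assuming a JDK with try-with-resources (Java 7 or later) so that the stream is closed even if reading fails. The class UrlText and the method readAll are hypothetical names introduced for illustration, and the caller has to supply the character encoding, which this sketch does not try to detect.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

// Hypothetical helper, for illustration only.
class UrlText {

    // Reads the whole resource behind url, decoding it with the given
    // charset, and returns it with '\n' after every line.
    static String readAll(URL url, String charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(url.openStream(), charset))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Defaults to the URL used above; any URL may be given instead,
        // including a file: URL, which needs no network access.
        URL url = new URL(args.length > 0 ? args[0] : "http://www.yahoo.com/");
        System.out.print(readAll(url, "UTF-8"));
    }
}
```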

Connecting to a URL

After you've successfully created a URL object, you can call the URL object's openConnection method to connect to it. When you connect to a URL, you are initializing a communication link between your Java program and the URL over the network. For example, you can open a connection to the Yahoo site with the following code:

try {
    URL yahoo = new URL("http://www.yahoo.com/");
    URLConnection yahooConnection = yahoo.openConnection();

} catch (MalformedURLException e) {     // new URL() failed
    . . .
} catch (IOException e) {               // openConnection() failed
    . . .
}

If possible, the openConnection method creates a new URLConnection, initializes it, and returns the URLConnection object. Note that the network link itself is typically established only later, when connect() is called explicitly or implicitly by a method such as getInputStream(). If something goes wrong (for example, the Yahoo server is down), an IOException is thrown, either by openConnection or by the later call that actually connects.

Now that you've successfully connected to your URL, you can use the URLConnection object to perform actions such as reading from or writing to the connection. The next section shows you how.

Reading from and Writing to a URLConnection

If you've successfully used openConnection to initiate communications with a URL, then you have a reference to a URLConnection object. The URLConnection class contains many methods that let you communicate with the URL over the network. URLConnection is an HTTP-centric class; that is, many of its methods are useful only when you are working with HTTP URLs. However, most URL protocols allow you to read from and write to the connection. This section describes reading.

Reading from a URLConnection

The following program performs the same function as the URLReader program shown in Reading Directly from a URL.

However, rather than getting an input stream directly from the URL, this program explicitly opens a connection to a URL and gets an input stream from the connection. Then, like URLReader, this program creates a BufferedReader on the input stream and reads from it. The differences from the previous example are the call to openConnection and the use of yc.getInputStream() in place of yahoo.openStream().

import java.net.*;
import java.io.*;

public class URLConnectionReader {
    public static void main(String[] args) throws Exception {
        URL yahoo = new URL("http://www.yahoo.com/");
        URLConnection yc = yahoo.openConnection();
        BufferedReader in = new BufferedReader(
                                new InputStreamReader(
                                yc.getInputStream()));
        String inputLine;

        while ((inputLine = in.readLine()) != null) 
            System.out.println(inputLine);
        in.close();
    }
}

The output from this program is identical to the output from the program that opens a stream directly from the URL. You can use either way to read from a URL. However, reading from a URLConnection instead of reading directly from a URL might be more useful. This is because you can use the URLConnection object for other tasks (like writing to the URL) at the same time.

Again, if the program hangs or you see an error message, you may have to set the proxy host so that the program can find the Yahoo server.

Lucene

Jakarta Lucene is a high-performance, full-featured text search engine written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. We are planning to extend Lucene for use in building the inverted index for the information retrieval part of this project. The following demo should get you started.

Let's build an index! Assuming ant and the classpath can find Lucene correctly, just type "java org.apache.lucene.demo.IndexFiles {full-path-to-lucene}/src". This will produce a subdirectory called "index" which will contain an index of all of the Lucene source code.

To search the index type "java org.apache.lucene.demo.SearchFiles". You'll be prompted for a query. Type in a swear word and press the enter key. You'll see that the Lucene developers are very well mannered and get no results. Now try entering the word "vector". That should return a whole bunch of documents. The results will page at every tenth result and ask you whether you want more results.

Of course, this demo is just a starting point, and all of the source code and documentation is available. Lucene is already installed in the class directory at /afs/ir/class/cs276b/lib/.

Servlets, JSP, and Tomcat

Servlets and JSP are two technologies of the Java 2 Enterprise Edition which make developing web applications in Java very easy. We will use these technologies to build our web front-end to the citation indexing system. A great tutorial for servlets and JSP is located at http://www.apl.jhu.edu/~hall/java/Servlet-Tutorial/. We have installed in the class directory at /afs/ir/class/cs276b/lib/

Tomcat is an open-source web server developed by Apache Jakarta that can also serve up web applications. We will use Tomcat to serve up our Java-based web front-end. A good tutorial is located at http://www.moreservlets.com/Using-Tomcat-4.html. We have installed Tomcat in the class directory at /afs/ir/class/cs276b/software/jakarta-tomcat-4.1.18.

Google API

With the Google Web APIs service, software developers can query more than 3 billion web documents directly from their own computer programs. Google uses the SOAP and WSDL standards so a developer can program in his or her favorite environment - such as Java, Perl, or Visual Studio .NET.

To access the Google Web APIs service, you must create a Google Account and obtain a license key. Your Google Account and license key entitle you to 1,000 automated queries per day.

Your program must include your license key with each query you submit to the Google Web APIs service. See Google's Getting Help page or FAQs for more information.

The following code snippet illustrates how to perform a search and retrieve results:

  public static void main(String[] args) {

    String clientKey = args[0];
    String directive = args[1];
    String directiveArg = args[2];

    // Create a Google Search object, set our authorization key
    GoogleSearch s = new GoogleSearch();
    s.setKey(clientKey);

    // Depending on user input, do search or cache query, then print out result
    try {
      if (directive.equalsIgnoreCase("search")) {
        s.setQueryString(directiveArg);
        GoogleSearchResult r = s.doSearch();
        System.out.println("Google Search Results:");
        System.out.println("======================");
        System.out.println(r.toString());
      } else if (directive.equalsIgnoreCase("cached")) {
        System.out.println("Cached page:");
        System.out.println("============");
        byte [] cachedBytes = s.doGetCachedPage(directiveArg);
        // Note - this conversion to String should be done with reference
        // to the encoding of the cached page, but we don't do that here.
        String cachedString = new String(cachedBytes);
        System.out.println(cachedString);
      } else if (directive.equalsIgnoreCase("spell")) {
        System.out.println("Spelling suggestion:");
        String suggestion = s.doSpellingSuggestion(directiveArg);
        System.out.println(suggestion);
      } else {
        printUsageAndExit();
      }
    } catch (GoogleSearchFault f) {
      System.out.println("The call to the Google Web APIs failed:");
      System.out.println(f.toString());
    }
  }

The Google API is installed in the class directory at /afs/ir/class/cs276b/lib/googleapi.jar. More information is available in the local documentation and at http://www.google.com/apis/.

Computing Environment

While we do recommend that you do your project development on the Leland machines, the platform independence of Java means that it is in principle possible to do your development on a remote system (yes, even Windows). As we mentioned above, JDBC can connect to a remote database; all you need is the URL or IP address of the machine (which, as you will see, is already built into the citeunseen.util.ConnectionFactory class). Also, CVS can access code repositories from remote machines on the network, provided that you have properly set up ssh. More information about running CVS on Windows is located at http://www.cvshome.org/cyclic/cvs/windows.html. It is also possible to access the data directories on the Leland file system from remote machines using AFS. This is even possible from a Windows platform, but configuring AFS to run on Windows can be tricky. More information about running AFS on Windows can be found at http://www.stanford.edu/group/itss/pcleland/help/afs.htm.

In summary, unless you are a systems wizard and enjoy installing, configuring, and fixing lots of third-party software, you probably should do your development on the Leland machines (elaine, myth, saga, etc.).

This doesn't mean that you have to hole up in Sweet Hall, however. There are a couple of good tools that let you work on the Leland machines from a remote machine:

SSH: If you know all the emacs key commands by heart, and don't need a mouse interface, you may be happy to use a dumb terminal interface, and so can just use PC-Samson or download an SSH client for your computer.
XWin32: This is an X-Win client for your computer, available at http://www.starnet.com/products. However, you have to buy a license.
VNC: This freeware gives you the look and feel of your X-session on the remote machine, available at http://www.uk.research.att.com/vnc/

Other Resources:

HTML Parsers

Groups working with crawling and hub pages may find it convenient to make use of an HTML parser, which automatically pulls out tags and their attributes. Java does provide an HTML parser in its javax.swing.text.html.parser package, which was used to implement the citeunseen.hubprocessor.LinkExtractor class. However, it is clumsy, because it is DTD-based (Document Type Definition) and there is no easy way to create your own DTD object from a text file containing a DTD (if you find a way, let me know). Thus you may want to use another open-source HTML parser from a third-party developer. We have collected some promising-looking links below:

Other PS/PDF to Text Utilities

Converting from postscript or PDF format to some sort of marked-up text format is a very important first step in extracting useful information from the academic papers. We have found several tools that claim to do this, but none have delivered what we need. We would like to preserve some text formatting (bold, italic, size, horizontal alignment, etc.) but not so much that each word or character is given its own markup. We present several packages below.

prescript: Developed by the New Zealand Digital Library (NZDL) project. This is what CiteSeer used originally. It has some flaws, and so CiteSeer no longer uses it. We have installed it in the cs276b/software directory. Information available at http://www.nzdl.org/html/prescript.html.
pstotext: Developed by the Compaq Virtual Paper project. This is what CiteSeer uses now. We have installed it in the cs276b/software directory. Information available at http://research.compaq.com/SRC/virtualpaper/pstotext.html.
my_ps2ascii: This is Andrew Ng's homegrown PS to text program that he built to remedy some of the problems of pstotext. The source files are installed in the cs276b/software directory.

Java Web Crawlers

Linguistic Tools

Porter's stemmer - a Java library for stemming terms. Might be useful in indexing. Located at http://www.tartarus.org/~martin/PorterStemmer/.

UNIX Tricks

Setting Environment Variables

Environment variables are set in the .cshrc file in your home directory. The PATH variable is set in the following lines (yours may look somewhat different):

# You may add additional path customizations here.
set path=( $site_path ~/bin ~ \
    )

To add the directory ~/mydir to the path, you add it to the declaration as follows:

# You may add additional path customizations here.
set path=( $site_path ~/bin ~ \
           ~/mydir \
    )

To set another variable, say the CLASSPATH variable, you look for the environment variables section of the .cshrc file:

#-------------------------#
# Environmental Variables #
#-------------------------#

# Environmental variables are used by both shell and the programs
# the shell runs.

# EDITOR sets the default editor
setenv EDITOR "emacs"
setenv VISUAL "$EDITOR"

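and if you want to set the variable CLASSPATH to include the current directory . and the directory ~/mydir you add the following line. The home directory is written as ${HOME} because the shell does not expand ~ inside double quotes, and Java does not expand it either:

setenv CLASSPATH ".:${HOME}/mydir"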

Note the colon between the different directories. Sometimes you need to add a JAR file, such as /afs/ir/class/cs276b/lib/myjarfile.jar, to the Java classpath. You need to list the full name of the file, as follows (again writing the home directory as ${HOME}, since ~ is not expanded inside double quotes):

setenv CLASSPATH ".:${HOME}/mydir:/afs/ir/class/cs276b/lib/myjarfile.jar"
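
If classes still cannot be found, it helps to look at the classpath exactly as the Java virtual machine sees it. The following sketch prints each entry of the java.class.path system property and flags entries that do not exist on disk, which is how an unexpanded ~ or a mistyped JAR name would show up. The class ClasspathCheck and the method entries are hypothetical names introduced for illustration.

```java
import java.io.File;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class ClasspathCheck {

    // Splits a classpath string at the given separator
    // (":" on UNIX, ";" on Windows).
    static String[] entries(String classpath, String separator) {
        return classpath.split(Pattern.quote(separator));
    }

    public static void main(String[] args) {
        String classpath = System.getProperty("java.class.path");
        for (String entry : entries(classpath, File.pathSeparator)) {
            boolean found = new File(entry).exists();
            System.out.println(entry + (found ? "" : "   <-- not found"));
        }
    }
}
```

Running java ClasspathCheck after editing .cshrc (and starting a new shell) shows whether each directory and JAR file was picked up.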

Last modified: Fri Jun 27 11:30:43 PDT 2003
