CS276B / SYMBSYS 239J / LING 239J
These materials were developed by Teg Grenager. The section on java.sql was adapted from the CS145 web tutorial by Nathan Folkert and Mayank Bawa. Other parts were adapted from the Sun Microsystems' JAVA language tutorial.
Teams will develop their project components using Java version 2, standard edition. We do this because Java has several nice properties that will make it easier for all of our projects to work together. These properties include:
In particular, for each part of the project, each group will develop their code in a single new package (with possible subpackages) that is contained within the package citeunseen. The groups will then use javadoc (see below) to document their package and the classes contained in it. This will allow the code to live in a single repository, and be interoperable, while remaining nicely organized. With this structure it will be clear who is responsible for each class.
While you may do much of the initial development and testing of your module using a command-line interface, we will eventually want to build a scheduler to coordinate the activity of the various modules. We therefore also ask that each package have a class called ?? which implements the interface Runnable, so that the scheduler can call it automatically and run it in a separate thread.
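A minimal sketch of what such an entry-point class might look like. The class name SampleModule and its methods isFinished and runInThread are hypothetical (the required class name is not fixed above), and the sketch assumes the scheduler starts each module on its own java.lang.Thread; the actual scheduler is not described here.

```java
/**
 * Hypothetical entry point for one project module.
 * The name SampleModule is a placeholder, not the required class name.
 */
class SampleModule implements Runnable {

    // Set to true once run() has completed; volatile so other threads see it.
    private volatile boolean finished = false;

    /** Does the module's work; a scheduler would call this through a Thread. */
    public void run() {
        // Real module logic would go here.
        finished = true;
    }

    /** Reports whether run() has completed. */
    public boolean isFinished() {
        return finished;
    }

    /** Starts the module on a new thread and waits for it to end. */
    static void runInThread(SampleModule module) {
        Thread worker = new Thread(module, "sample-module");
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can react to it.
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        SampleModule module = new SampleModule();
        runInThread(module);
        System.out.println("finished = " + module.isFinished());
    }
}
```

Because the class only depends on Runnable, the same object can be tested from the command line (as in main) and later handed to a scheduler unchanged.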
We have developed some initial sample code which demonstrates use of many of the tools described in this tutorial. We will describe in the CVS section below how to obtain it. The basic packages and classes we have developed are:
For those of you unfamiliar with Java, Sun offers a great web-based tutorial that should bring you up to speed: http://java.sun.com/docs/books/tutorial
You will also find it useful to be comfortable with the Java 2.0 Standard Edition API, located at http://java.sun.com/j2se/1.4/docs/api. Pay particular attention to the classes in the java.lang, java.util, java.io, java.sql, and java.net packages.
We will all use CVS to manage concurrent access to the project codebase. It is extremely important that you learn how to use CVS properly, and that you use it religiously.
CVS stands for Concurrent Versions System, and it is a widely used version control system on UNIX platforms. It is a command-line system, but it is also well integrated into Emacs (and hence XEmacs), so you can diff files and directories from within Emacs.
If you use it right, it can greatly reduce your integration time, and it backs up all of your changes as you go. Every group member works on their own separate copy of the code, and after a feature is complete it is integrated into the master repository.
Integration happens for free during the commit process. CVS diffs the newly committed file with the previous versions, looking for changes and incorporating those changes into the repository. Your partners will get the new changes with the cvs update command, which will change the local checked out copy to reflect the repository.
However, at times CVS cannot merge things for you when you're trying to update. This typically happens when your partner has committed a new version of the file since the last time you updated, and you've also edited the file (so that CVS sees two sets of changes since the last version in the repository, and doesn't know which one supersedes the other). In such cases, it will flag the conflicting changes with <<<<<<< and >>>>>>>. You will have to merge these sections by hand. This is the trickiest part, so make sure you have both people present when you are resolving a merge conflict.
Please note that in order to prevent such conflicts from appearing in the repository, CVS will sometimes force you to update before you commit.
The best place to look is at the tutorials and documentation on the CVS homepage, located at http://www.cvshome.org or in the CVS manual, which you can get to in UNIX by typing
mythXX:~> info cvs
I would highly recommend reading the CVS website, located at http://www.cvshome.org/ (or info cvs) for basic information, looking at the following commands:
checkout, update, commit, add, remove, diff.
It is also a good idea to create a dummy project, add it to the repository, check it out, make some changes and commit them. Then a partner should make sure that they can update and that the changes show up in their directory.
I would also recommend playing with CVS from inside XEmacs. You can get to it from the Tools/VC menu, which has options like Diff buffers, Diff Directories, Visit Other Version, etc.
I personally have found Visit Other Version along with Diff buffers extremely useful. There is even a nice color merging tool within XEmacs as well, very similar to Visual SourceSafe.
Try it out!
Instead of make we will be using ant as our build tool. ant is an XML-based build tool which makes it easy to compile large projects. It was developed by Apache and it is open source. It is much easier to understand than make. We have already installed ant on the leland machines (where we advise that you do your development). For ant to run properly, you first need to add the path /afs/ir/class/cs276b/software/jakarta-ant-1.5.1/bin to your PATH environment variable definition. You also need to set the environmental variable ANT_HOME to the directory /afs/ir/class/cs276b/software/jakarta-ant-1.5.1 and the variable JAVA_HOME to the directory /usr/pubsw/package/Languages/jdk-1.4.0/sun4x_58/apps/jdk-1.4.0 (assuming that you're using a Sun machine -- if not adjust that directory appropriately). For more information on how to do this, see the Environmental Variables section below.
Once you have made the changes above, if your project directory is ~/project you can compile the entire codebase for the project by running ant as follows:
elaine42:~> cd ~/project
elaine42:~/project> ant
Buildfile: build.xml

prepare:

compile:

BUILD SUCCESSFUL
Total time: 1 second
ant uses the file build.xml, located in the project directory, to decide which files to compile, and in what order. I don't think you will need to change the file build.xml but if you do, please let me know. Note that build.xml tells ant to put all of the class files in the directory classes and in the right subdirectories.
Because we have specified the location of the Java libraries (including special ones) in the build.xml file, you shouldn't have to reset your CLASSPATH variable manually. More information and documentation on ant is available at http://jakarta.apache.org/ant/.
javadoc is another wonderful tool that we will be using. It is easy to use, and will save you time in documenting and writing up your project results. To use Javadoc, simply write your package, class, method, and field comments appropriately, and then run javadoc (using ant, see below) to generate a set of HTML pages that bring together all of your comments in a sensible manner. We ask that you turn in javadocs with your assignments, following the conventions described below. This will be the only writeup that we ask you to submit. Of particular importance are the package.html files which describe the functionality of the package as a whole. In these files we would like you to submit additional results about what other methods you tried and why they were or were not used.
We review next how to write comments so that javadoc will understand them.
A doc comment is written in HTML and must precede a class, field, constructor or method declaration. It is made up of two parts: a description followed by block tags. In this example, the block tags are @param, @return, and @see.
Example
/**
 * Returns an Image object that can then be painted on the screen.
 * The url argument must specify an absolute {@link URL}. The name
 * argument is a specifier that is relative to the url argument.
 * <p>
 * This method always returns immediately, whether or not the
 * image exists. When this applet attempts to draw the image on
 * the screen, the data will be loaded. The graphics primitives
 * that draw the image will incrementally paint on the screen.
 *
 * @param  url  an absolute URL giving the base location of the image
 * @param  name the location of the image, relative to the url argument
 * @return      the image at the specified URL
 * @see         Image
 */
public Image getImage(URL url, String name) {
    try {
        return getImage(new URL(url, name));
    } catch (MalformedURLException e) {
        return null;
    }
}

Notes:
For more examples, see Simple Examples.
So lines won't wrap, limit any doc-comment lines to 80 characters.
Here is what the previous example would look like after running the Javadoc tool:
getImage

public Image getImage(URL url, String name)
Also see Troubleshooting Curly Quotes (Microsoft Word) at the end of this document.
With Javadoc 1.2, package-level doc comments are available. Each package can have its own package-level doc comment source file that the Javadoc tool will merge into the documentation that it produces. This file is named package.html (and has the same name for all packages). This file is kept in the source directory along with all the *.java files. (Do not put the package.html file in the new doc-files source directory, because those files are only copied to the destination and are not processed.)
Here's an example of a package-level source file for java.text and the file that the Javadoc tool generates:
package.html  -------------->  package-summary.html
(source file)     javadoc      (destination file)
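The Javadoc tool processes package.html by doing three things:
1. Copies its contents (everything between <body> and </body>) below the summary tables in the destination file package-summary.html.
2. Processes any @see, @since or {@link} Javadoc tags that are present.
3. Copies the first sentence to the right-hand column of the Overview Summary.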
At Sun Microsystems, we use the following template when creating a new package doc comment file. This contains a copyright statement. Obviously, if you are from a different company, you would supply your own copyright statement. An engineer would copy this whole file, rename it to package.html, and delete the lines set off with hash marks: #####. One such file should go into each package directory of the source tree.
The package doc comment should provide (directly or via links) everything necessary to allow programmers to use the package. It is a very important piece of documentation: for many facilities (those that reside in a single package but not in a single class) it is the first place where programmers will go for documentation. It should contain a short, readable description of the facilities provided by the package (in the introduction, below) followed by pointers to detailed documentation, or the detailed documentation itself, whichever is appropriate. Which is appropriate will depend on the package: a pointer is appropriate if it's part of a larger system (such as one of the 37 packages in CORBA), or if a FrameMaker document already exists for the package; the detailed documentation should be contained in the package doc comment file itself if the package is self-contained and doesn't require extensive documentation (such as java.math).
To sum up, the primary purpose of the package doc comment is to describe the purpose of the package, the conceptual framework necessary to understand and to use it, and the relationships among the classes that comprise it. For large, complex packages (and those that are part of large, complex APIs) a pointer to an external architecture document is warranted.
The following are the sections and headings you should use when writing a package-level comment file. There should be no heading before the first sentence, because the Javadoc tool picks up the first text as the summary statement.
[Omit this section until we implement @category tag]
Note that ant can also be used to build javadocs. It is already configured to put everything in the right place.
elaine42:~/project> ant javadoc
Buildfile: build.xml

prepare:

compile:

javadoc:
  [javadoc] Generating Javadoc
  [javadoc] Javadoc execution
  [javadoc] Loading source files for package citeunseen.hubprocessor...
  [javadoc] Loading source files for package citeunseen.paperprocessor...
  [javadoc] Loading source files for package citeunseen.util...
  [javadoc] Constructing Javadoc information...
  [javadoc] Building tree for all the packages and classes...
  [javadoc] Building index for all the packages and classes...
  [javadoc] Building index for all classes...

BUILD SUCCESSFUL
Total time: 2 minutes 48 seconds
More javadoc information is located at http://java.sun.com/j2se/javadoc/.
We'll be using MySQL as our database for this project. Your programs will be connecting to it using JDBC (see below). For development and testing purposes, however, you will sometimes want to interact directly with the database using the mysql client. In order to connect to the database, you will need an account. Please email the TA to ask for an account, and you will receive an email with a username and password. Once you have these you can use the mysql client by typing the following at the command line:
elaine42:~> mysql -u username -p -h tree1 --socket=/tmp/cs276b.sock citationindex
Enter password: xxxxxx
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 3 to server version: 4.0.7-gamma-standard

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql>
Since your username is the same as your login name on the elaine systems, you can omit the "-u username" part.
You now can see that the mysql client is running because of the mysql> prompt. On the command line above, the term citationindex tells the mysql client to connect to the database called "citationindex", which is the one that you should be using for this class. The -u username option tells mysql what username to log in as; if omitted, it tries to use your leland account username. The -p option signifies that you would like to be prompted for a password. If you do not do this, you will not be able to connect to the database. The -h elaine29 option tells the mysql client to connect to the database server that is running on the host elaine29 (as opposed to the localhost). Currently we are running a MySQL server on elaine29 in Sweet Hall for development purposes. (However we may change this if it interferes with Sweet Hall users; if we change this for any reason, we will alert you.) For the production system, however, we will be installing a MySQL server on a dedicated machine that is outside of the Leland file system. However, you should still be able to connect to it just by specifying a different host, and we will give instructions for this when the time comes.
Once you are in the mysql client, you interact with it using SQL statements. For example, typing show tables causes mysql to print out information about all of the tables:
mysql> show tables;
+-------------------------+
| Tables_in_citationindex |
+-------------------------+
| Author                  |
| AuthorNames             |
| Authorship              |
| Citation                |
| CitationInstance        |
| HubInstance             |
| HubPaperInstance        |
| Journal                 |
| JournalName             |
| PageInstance            |
| Paper                   |
| PaperInstance           |
+-------------------------+
12 rows in set (0.00 sec)

mysql>
You can also do queries such as:
mysql> select count(*) from HubInstance;
+----------+
| count(*) |
+----------+
|      870 |
+----------+
1 row in set (0.35 sec)

mysql>
We have tried to design a schema that will allow every part of the project to store the data they need. However, you will probably find that you need fields that we didn't anticipate. Please email the TA with your request and he will change the schema if it makes sense to do so.
Full documentation for the MySQL database is located at http://www.mysql.com/documentation/mysql/bychapter/. You can also read the textbook for cs145, called Database Systems, The Complete Book by Garcia-Molina, Ullman, and Widom. Also, a good tutorial on the SQL language syntax (for those of you who need brushing up) is located at http://eveander.com/arsdigita/books/sql/.
Call-level interfaces such as JDBC are programming interfaces allowing external access to SQL database manipulation and update commands. They allow the integration of SQL calls into a general programming environment by providing library routines which interface with the database. In particular, Java based JDBC has a rich collection of routines which make such an interface extremely simple and intuitive.
Here is an easy way of visualizing what happens in a call level interface: You are writing a normal Java program. Somewhere in the program, you need to interact with a database. Using standard library routines, you open a connection to the database. You then use JDBC to send your SQL code to the database, and process the results that are returned. When you are done, you close the connection.
As we said earlier, before a database can be accessed, a connection must be opened between our program (the client) and the database (the server). We want to make this part transparent to you (so that we can change it if we need to), so we have created a class called citeunseen.util.ConnectionFactory. Its method getTestConnection(String user, String password) returns a java.sql.Connection object that can be used for future interactions with the citationindex database (wherever it may live). You don't need it yet, but there is another method called getProductionConnection(String user, String password) which will return a Connection to the production database. A code snippet shows this in action:
Connection con = null;
con = ConnectionFactory.getTestConnection("myuser", "mypassword");
That's it! The connection returned is an open connection which we will use to pass SQL statements to the database. In this code snippet, con is an open connection, and we will use it below.
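For readers curious about what a factory of this kind might do internally, here is a rough sketch built on the standard java.sql.DriverManager. It is not the actual citeunseen.util.ConnectionFactory: the class name ConnectionFactorySketch, the helper jdbcUrl, and the host name dbhost.example.edu are all assumptions made for illustration. It also assumes a MySQL JDBC driver is on the classpath; older drivers may additionally need to be loaded explicitly with Class.forName before the first connection.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/** Hypothetical stand-in for a factory such as citeunseen.util.ConnectionFactory. */
class ConnectionFactorySketch {

    /** Builds a MySQL JDBC URL of the form jdbc:mysql://host/database. */
    static String jdbcUrl(String host, String database) {
        return "jdbc:mysql://" + host + "/" + database;
    }

    /**
     * Opens a connection to the given database. Needs a MySQL JDBC driver
     * on the classpath; the host passed in is up to the caller.
     */
    static Connection open(String host, String database,
                           String user, String password) throws SQLException {
        return DriverManager.getConnection(jdbcUrl(host, database), user, password);
    }

    public static void main(String[] args) {
        // Prints the URL only; no database is contacted here.
        // dbhost.example.edu is a placeholder host name.
        System.out.println(jdbcUrl("dbhost.example.edu", "citationindex"));
    }
}
```

Hiding the URL and driver details behind one method is what lets the database move to another host without any change to calling code.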
A JDBC Statement object is used to send your SQL statements to the DBMS, and should not be confused with an SQL statement. A JDBC Statement object is associated with an open connection, and not with any single SQL statement. You can think of a JDBC Statement object as a channel sitting on a connection, passing one or more of your SQL statements (which you ask it to execute) to the DBMS.
An active connection is needed to create a Statement object. The following code snippet, using our Connection object con, does it for you:
Statement stmt = con.createStatement() ;
At this point, a Statement object exists, but it does not have an SQL statement to pass on to the DBMS. We learn how to do that in a following section.
Sometimes, it is more convenient or more efficient to use a PreparedStatement object for sending SQL statements to the DBMS. The main feature which distinguishes it from its superinterface Statement is that, unlike a Statement, it is given an SQL statement right when it is created. This SQL statement is then sent to the DBMS right away, where it is compiled. Thus, in effect, a PreparedStatement is associated as a channel with a connection and a compiled SQL statement.
The advantage offered is that if you need to use the same, or similar query with different parameters multiple times, the statement can be compiled and optimized by the DBMS just once. Contrast this with a use of a normal Statement where each use of the same SQL statement requires a compilation all over again.
PreparedStatements are also created with a Connection method. The following snippet shows how to create a parameterized SQL statement with three input parameters:
PreparedStatement prepareUpdatePrice = con.prepareStatement(
    "UPDATE Sells SET price = ? WHERE bar = ? AND beer = ?");
Before we can execute a PreparedStatement, we need to supply values for the parameters. This can be done by calling one of the setXXX methods defined in the class PreparedStatement. Most often used methods are setInt, setFloat, setDouble, setString etc. You can set these values before each execution of the prepared statement.
Continuing the above example, we would write:
prepareUpdatePrice.setInt(1, 3);
prepareUpdatePrice.setString(2, "Bar Of Foo");
prepareUpdatePrice.setString(3, "BudLite");
Executing SQL statements in JDBC varies depending on the "intention" of the SQL statement. DDL (data definition language) statements such as table creation and table alteration statements, as well as statements to update the table contents, are all executed using the method executeUpdate. Notice that these commands change the state of the database, hence the name of the method contains "Update".
The following snippet has examples of executeUpdate statements.
Statement stmt = con.createStatement();

stmt.executeUpdate("CREATE TABLE Sells " +
    "(bar VARCHAR2(40), beer VARCHAR2(40), price REAL)" );
stmt.executeUpdate("INSERT INTO Sells " +
    "VALUES ('Bar Of Foo', 'BudLite', 2.00)" );

String sqlString = "CREATE TABLE Bars " +
    "(name VARCHAR2(40), address VARCHAR2(80), license INT)" ;
stmt.executeUpdate(sqlString);
Since the SQL statement will not quite fit on one line on the page, we have split it into two strings concatenated by a plus sign(+) so that it will compile. Pay special attention to the space following "INSERT INTO Sells" to separate it in the resulting string from "VALUES". Note also that we are reusing the same Statement object rather than having to create a new one.
When executeUpdate is used to call DDL statements, the return value is always zero, while data modification statement executions will return a value greater than or equal to zero, which is the number of tuples affected in the relation.
While working with a PreparedStatement, we would execute such a statement by first plugging in the values of the parameters (as seen above), and then invoking the executeUpdate on it.
int n = prepareUpdatePrice.executeUpdate() ;
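The compile-once benefit of a PreparedStatement shows most clearly when one statement is reused for several rows. The sketch below illustrates that pattern against the Sells table used in this section; the class name PriceUpdater and the method updatePrices are hypothetical, and the caller is assumed to supply an open Connection.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class PriceUpdater {

    /**
     * Sets a new price for each listed beer at one bar, reusing a single
     * PreparedStatement. Returns the total number of rows updated.
     */
    static int updatePrices(Connection con, String bar,
                            String[] beers, double[] prices) throws SQLException {
        PreparedStatement update = con.prepareStatement(
            "UPDATE Sells SET price = ? WHERE bar = ? AND beer = ?");
        try {
            int total = 0;
            for (int i = 0; i < beers.length; i++) {
                update.setDouble(1, prices[i]);  // first ?  : price
                update.setString(2, bar);        // second ? : bar
                update.setString(3, beers[i]);   // third ?  : beer
                total += update.executeUpdate(); // rows changed by this call
            }
            return total;
        } finally {
            update.close(); // release the statement even if an update fails
        }
    }
}
```

Only the parameter values change between iterations; the SQL text is sent to the DBMS once, when prepareStatement is called.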
As opposed to the previous section statements, a query is expected to return a set of tuples as the result, and not change the state of the database. Not surprisingly, there is a corresponding method called executeQuery, which returns its results as a ResultSet object:
String bar, beer ;
float price ;

ResultSet rs = stmt.executeQuery("SELECT * FROM Sells");
while ( rs.next() ) {
    bar = rs.getString("bar");
    beer = rs.getString("beer");
    price = rs.getFloat("price");
    System.out.println(bar + " sells " + beer + " for " + price + " Dollars.");
}
The bag of tuples resulting from the query is contained in the variable rs, which is an instance of ResultSet. A set is not of much use to us unless we can access each row and the attributes in each row. The ResultSet provides a cursor, which can be used to access each row in turn. The cursor is initially positioned just before the first row. Each invocation of the method next moves it to the next row and returns true if such a row exists, or returns false if there is no remaining row.
We can use the getXXX method of the appropriate type to retrieve the attributes of a row. In the previous example, we used getString and getFloat methods to access the column values. Notice that we provided the name of the column whose value is desired as a parameter to the method. Also note that the VARCHAR2 type bar, beer have been converted to Java String, and the REAL to Java float.
Equivalently, we could have specified the column number instead of the column name, with the same result. Thus the relevant statements would be:
bar = rs.getString(1);
price = rs.getFloat(3);
beer = rs.getString(2);
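While working with a PreparedStatement, we would execute a query by first plugging in the values of the parameters, and then invoking executeQuery on it. This applies only to a PreparedStatement created from a query such as a SELECT; the UPDATE statement prepared earlier must be run with executeUpdate instead. Here prepareQuery stands for a PreparedStatement created from a SELECT statement:
ResultSet rs = prepareQuery.executeQuery() ;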
JDBC also offers you a number of methods to find out where you are in the result set using getRow, isFirst, isBeforeFirst, isLast, isAfterLast.
Scroll-able cursors allow free access to any row in the result set. By default, cursors scroll forward only and are read-only. When creating a Statement for a Connection, you can change the type of ResultSet to a more flexible scrolling or updatable model by passing a result set type and a concurrency mode. The snippet below spells out the default values; substitute other constants to obtain a scroll-able or updatable cursor:
Statement stmt = con.createStatement(
    ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
ResultSet rs = stmt.executeQuery("SELECT * FROM Sells");
The different options for types are TYPE_FORWARD_ONLY, TYPE_SCROLL_INSENSITIVE, and TYPE_SCROLL_SENSITIVE. You can choose whether the cursor is read-only or updatable using the options CONCUR_READ_ONLY, and CONCUR_UPDATABLE. With the default cursor, you can scroll forward using rs.next(). With scroll-able cursors you have more options:
rs.absolute(3);   // moves to the third tuple
rs.previous();    // moves back one tuple (tuple 2)
rs.relative(2);   // moves forward two tuples (tuple 4)
rs.relative(-3);  // moves back three tuples (tuple 1)
There are a great many more details to the scroll-able cursor feature. Scroll-able cursors, though useful for certain applications, are extremely high-overhead, and should be used with restraint and caution. More information can be found at the New Features in the JDBC 2.0 API, where you can find a more detailed tutorial on the cursor manipulation techniques.
JDBC allows SQL statements to be grouped together into a single transaction. Thus, we can ensure the ACID (Atomicity, Consistency, Isolation, Durability) properties using JDBC transactional features.
Transaction control is performed by the Connection object. When a connection is created, it is in auto-commit mode by default. This means that each individual SQL statement is treated as a transaction by itself, and will be committed as soon as its execution finishes. (This is not exactly precise, but we can gloss over this subtlety for most purposes.)
We can turn off auto-commit mode for an active connection with:
con.setAutoCommit(false) ;
and turn it on again with:
con.setAutoCommit(true) ;
Once auto-commit is off, no SQL statements will be committed (that is, the database will not be permanently updated) until you have explicitly told it to commit by invoking the commit() method:
con.commit() ;
At any point before commit, we may invoke rollback() to roll back the transaction and restore values to the last commit point (before the attempted updates).
Here is an example which ties these ideas together:
con.setAutoCommit(false);
Statement stmt = con.createStatement();
stmt.executeUpdate("INSERT INTO Sells VALUES('Bar Of Foo', 'BudLite', 1.00)" );
con.rollback();
stmt.executeUpdate("INSERT INTO Sells VALUES('Bar Of Joe', 'Miller', 2.00)" );
con.commit();
con.setAutoCommit(true);
Let's walk through the example to understand the effects of the various methods. We first set auto-commit off, indicating that the following statements need to be considered as a unit. We attempt to insert the ('Bar Of Foo', 'BudLite', 1.00) tuple into the Sells table. However, this change has not been made final (committed) yet. When we invoke rollback, we cancel our insert and in effect remove any intention of inserting the above tuple. Note that Sells is still as it was before we attempted the insert. We then attempt another insert, and this time we commit the transaction. Only now is Sells permanently affected, with the new tuple in it. Finally, we reset the connection to auto-commit again.
We can also set transaction isolation levels as desired. For example, we can set the transaction isolation level to TRANSACTION_READ_COMMITTED, which forbids dirty reads: a value cannot be read until after it has been committed. The Connection interface provides five such isolation-level constants. The default isolation level depends on the DBMS and the JDBC driver. JDBC allows us to find out the isolation level currently in effect (using the Connection method getTransactionIsolation) and to set the appropriate level (using the Connection method setTransactionIsolation).
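As a small illustration of the two Connection methods just mentioned, the sketch below reads the current level, switches to TRANSACTION_READ_COMMITTED, and maps the integer constants to readable names. The class name IsolationLevels and its methods name and useReadCommitted are hypothetical helpers introduced only for this example; the constants themselves are the ones defined in java.sql.Connection.

```java
import java.sql.Connection;
import java.sql.SQLException;

class IsolationLevels {

    /** Maps a java.sql.Connection isolation constant to a readable name. */
    static String name(int level) {
        switch (level) {
            case Connection.TRANSACTION_NONE:             return "NONE";
            case Connection.TRANSACTION_READ_UNCOMMITTED: return "READ_UNCOMMITTED";
            case Connection.TRANSACTION_READ_COMMITTED:   return "READ_COMMITTED";
            case Connection.TRANSACTION_REPEATABLE_READ:  return "REPEATABLE_READ";
            case Connection.TRANSACTION_SERIALIZABLE:     return "SERIALIZABLE";
            default:                                      return "UNKNOWN(" + level + ")";
        }
    }

    /**
     * Switches the connection to READ_COMMITTED and returns the name of
     * the level that was in effect before the change.
     */
    static String useReadCommitted(Connection con) throws SQLException {
        String previous = name(con.getTransactionIsolation());
        con.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
        return previous;
    }
}
```

Printing the value returned by useReadCommitted is a quick way to see which level a particular DBMS and driver start with.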
Usually rollback will be used in combination with Java's exception handling ability to recover from (un)predictable errors. Such a combination provides an excellent and easy mechanism for handling data integrity. We study error handling using JDBC in the next section.
The truth is that errors always occur in software programs. Often, database programs are critical applications, and it is imperative that errors be caught and handled gracefully. Programs should recover and leave the database in a consistent state. Rollbacks used in conjunction with Java exception handlers are a clean way of meeting such a requirement.
The client (program) accessing a server (database) needs to be aware of any errors returned from the server. JDBC gives access to such information by providing two levels of error conditions: SQLException and SQLWarning. SQLExceptions are Java exceptions which, if not handled, will terminate the application. SQLWarnings are subclasses of SQLException, but they represent nonfatal errors or unexpected conditions, and as such, can be ignored.
In Java, statements which are expected to "throw" an exception or a warning are enclosed in a try block. If a statement in the try block throws an exception or a warning, it can be "caught" in one of the corresponding catch statements. Each catch statement specifies which exceptions it is ready to "catch".
Here is an example of catching an SQLException, and using the error condition to rollback the transaction:
try {
    con.setAutoCommit(false) ;
    stmt.executeUpdate("CREATE TABLE Sells (bar VARCHAR2(40), " +
        "beer VARHAR2(40), price REAL)") ;
    stmt.executeUpdate("INSERT INTO Sells VALUES " +
        "('Bar Of Foo', 'BudLite', 2.00)") ;
    con.commit() ;
    con.setAutoCommit(true) ;
} catch (SQLException ex) {
    System.err.println("SQLException: " + ex.getMessage()) ;
    con.rollback() ;
    con.setAutoCommit(true) ;
}
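In this case, an exception is thrown because beer is declared as VARHAR2, which is a misspelling. Since there is no such data type in the DBMS, an SQLException is thrown. The output in this case would be the following (the ORA- prefix marks an Oracle error code; another DBMS such as MySQL reports a different message):
SQLException: ORA-00902: invalid datatype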
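The same rollback-on-error idea can also be written with a finally block, so that auto-commit is restored on both the success and the failure path. The following is a sketch under stated assumptions rather than the course's prescribed pattern: the class name TransactionSketch and the method insertBoth are hypothetical, and the caller is assumed to pass an open Connection. It uses two INSERT statements rather than CREATE TABLE, since many DBMSs commit DDL statements implicitly.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class TransactionSketch {

    /**
     * Runs both inserts as one transaction. Returns true if they were
     * committed and false if an SQLException forced a rollback.
     */
    static boolean insertBoth(Connection con) throws SQLException {
        con.setAutoCommit(false);
        Statement stmt = null;
        try {
            stmt = con.createStatement();
            stmt.executeUpdate(
                "INSERT INTO Sells VALUES ('Bar Of Foo', 'BudLite', 2.00)");
            stmt.executeUpdate(
                "INSERT INTO Sells VALUES ('Bar Of Joe', 'Miller', 2.50)");
            con.commit();
            return true;
        } catch (SQLException ex) {
            System.err.println("SQLException: " + ex.getMessage());
            con.rollback(); // undo whatever part of the unit succeeded
            return false;
        } finally {
            if (stmt != null) {
                stmt.close();
            }
            con.setAutoCommit(true); // runs on both the success and failure paths
        }
    }
}
```

If the second insert fails, the first one is rolled back as well, so the table never ends up holding only half of the unit.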
Alternatively, if your datatypes were correct, an exception might be thrown in case your database size goes over space quota and is unable to construct a new table. SQLWarnings can be retrieved from Connection objects, Statement objects, and ResultSet objects. Each only stores the most recent SQLWarning. So if you execute another statement through your Statement object, for instance, any earlier warnings will be discarded. Here is a code snippet which illustrates the use of SQLWarnings:
ResultSet rs = stmt.executeQuery("SELECT bar FROM Sells") ;
SQLWarning warn = stmt.getWarnings() ;
if (warn != null)
    System.out.println("Message: " + warn.getMessage()) ;
SQLWarning warning = rs.getWarnings() ;
if (warning != null)
    warning = warning.getNextWarning() ;
if (warning != null)
    System.out.println("Message: " + warning.getMessage()) ;
SQLWarnings (as opposed to SQLExceptions) are actually rather rare -- the most common is a DataTruncation warning. The latter indicates that there was a problem while reading or writing data from the database.
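Since getNextWarning links one warning to the next, a short loop can report every warning attached to an object rather than only the first one or two. The sketch below shows one way to do that; the class name WarningPrinter and the method printAll are hypothetical, and the warnings in main are constructed by hand purely so the example runs without a database.

```java
import java.sql.SQLWarning;

class WarningPrinter {

    /** Prints every warning in the chain starting at first; returns how many. */
    static int printAll(SQLWarning first) {
        int count = 0;
        for (SQLWarning w = first; w != null; w = w.getNextWarning()) {
            System.out.println("Message: " + w.getMessage());
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Build a two-element chain by hand; a real chain would come from
        // stmt.getWarnings(), rs.getWarnings(), or con.getWarnings().
        SQLWarning first = new SQLWarning("first warning");
        first.setNextWarning(new SQLWarning("second warning"));
        System.out.println(printAll(first) + " warning(s)");
    }
}
```

Passing null is safe and simply reports zero warnings, which matches what getWarnings returns when nothing was raised.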
More information about the java.sql package is available in the Java 2.0 API, located at http://java.sun.com/j2se/1.4/docs/api.
Java has made it very easy to access resources over the network with the java.net package. The classes range from very low-level (DatagramPacket and Socket) to high-level (URL and URLConnection) and we should mostly need the high level classes for this project.
The easiest way to create a URL object is from a String that represents the human-readable form of the URL address. This is typically the form that another person will use for a URL. For example, the URL for the Gamelan site, which is a directory of Java resources, takes the following form:
http://www.gamelan.com/
In your Java program, you can use a String containing this text to create a URL object:
URL gamelan = new URL("http://www.gamelan.com/");
The URL object created above represents an absolute URL. An absolute URL contains all of the information necessary to reach the resource in question. You can also create URL objects from a relative URL address.
A relative URL contains only enough information to reach the resource relative to (or in the context of) another URL.
Relative URL specifications are often used within HTML files. For example,
suppose you write an HTML file called JoesHomePage.html
.
Within this page, are links to other pages, PicturesOfMe.html
and MyKids.html
, that are on the same machine and
in the same directory as JoesHomePage.html
. The links to
PicturesOfMe.html
and MyKids.html
from
JoesHomePage.html
could be specified just as filenames,
like this:
<a href="PicturesOfMe.html">Pictures of Me</a>
<a href="MyKids.html">Pictures of My Kids</a>
These URL addresses are relative URLs. That is, the URLs are specified relative to the file in which they are contained: JoesHomePage.html.
In your Java programs, you can create a URL object from a relative URL specification. For example, suppose you know two URLs at the Gamelan site:
http://www.gamelan.com/pages/Gamelan.game.html
http://www.gamelan.com/pages/Gamelan.net.html
You can create URL objects for these pages relative to their common base URL:
http://www.gamelan.com/pages/
like this:
URL gamelan = new URL("http://www.gamelan.com/pages/");
URL gamelanGames = new URL(gamelan, "Gamelan.game.html");
URL gamelanNetwork = new URL(gamelan, "Gamelan.net.html");
This code snippet uses the URL constructor that lets you create a URL object from another URL object (the base) and a relative URL specification. The general form of this constructor is:
URL(URL baseURL, String relativeURL)
The first argument is a URL object that specifies the base of the new URL. The second argument is a String that specifies the rest of the resource name relative to the base. If baseURL is null, then this constructor treats relativeURL like an absolute URL specification. Conversely, if relativeURL is an absolute URL specification, then the constructor ignores baseURL.
This constructor is also useful for creating URL objects for named anchors (also called references) within a file. For example, suppose the Gamelan.net.html file has a named anchor called BOTTOM at the bottom of the file. You can use the relative URL constructor to create a URL object for it like this:
URL gamelanNetworkBottom = new URL(gamelanNetwork, "#BOTTOM");
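The rules for the relative constructor can be checked without any network access, since constructing a URL only parses text. Below is a small sketch along those lines; the class name RelativeUrls and the helper resolve are made up for this example, and the java.sun.com address is only a sample absolute URL. Recent JDKs mark the URL constructors as deprecated, so expect a compiler warning there.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class RelativeUrls {

    /** Resolves spec against base using the URL(URL, String) constructor. */
    static String resolve(String base, String spec) {
        try {
            URL baseUrl = (base == null) ? null : new URL(base);
            return new URL(baseUrl, spec).toExternalForm();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad URL: " + base + " + " + spec, e);
        }
    }

    public static void main(String[] args) {
        String pages = "http://www.gamelan.com/pages/";
        // A filename relative to the base directory.
        System.out.println(resolve(pages, "Gamelan.game.html"));
        // A named anchor relative to a page.
        System.out.println(resolve(pages + "Gamelan.net.html", "#BOTTOM"));
        // An absolute specification: the base is ignored.
        System.out.println(resolve(pages, "http://java.sun.com/index.html"));
        // A null base: the specification must be absolute on its own.
        System.out.println(resolve(null, "http://www.gamelan.com/"));
    }
}
```

Wrapping MalformedURLException in an unchecked exception is a convenience for the sketch only; project code may prefer to let the checked exception propagate.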
The URL class provides two additional constructors for creating a URL object. These constructors are useful when you are working with URLs, such as HTTP URLs, that have host name, filename, port number, and reference components in the resource name portion of the URL. These two constructors are useful when you do not have a String containing the complete URL specification, but you do know various components of the URL.
For example, suppose you design a network browsing panel similar to a file browsing panel that allows users to choose the protocol, host name, port number, and filename. You can construct a URL from the panel's components. The first constructor creates a URL object from a protocol, host name, and filename. The following code snippet creates a URL to the Gamelan.net.html file at the Gamelan site:
new URL("http", "www.gamelan.com", "/pages/Gamelan.net.html");
This is equivalent to
new URL("http://www.gamelan.com/pages/Gamelan.net.html");
The first argument is the protocol, the second is the host name, and the last is the pathname of the file. Note that the filename contains a forward slash at the beginning. This indicates that the filename is specified from the root of the host.
The final URL constructor adds the port number to the list of arguments used in the previous constructor:
URL gamelan = new URL("http", "www.gamelan.com", 80, "/pages/Gamelan.network.html");
This creates a URL object for the following URL:
http://www.gamelan.com:80/pages/Gamelan.network.html
If you construct a URL object using one of these constructors, you can get a String containing the complete URL address by using the URL object's toString method or the equivalent toExternalForm method.
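As a quick illustration of building a URL from its parts and turning it back into text, here is a minimal sketch. The class name UrlFromParts and the helper build are invented for the example; the four-argument constructor and toExternalForm are the ones described above. Passing -1 as the port asks for the protocol's default port, so no port appears in the text form.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlFromParts {

    /** Builds a URL from components and returns its complete text form. */
    static String build(String protocol, String host, int port, String file) {
        try {
            // A port of -1 means "use the default port for this protocol".
            return new URL(protocol, host, port, file).toExternalForm();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot build URL for host " + host, e);
        }
    }

    public static void main(String[] args) {
        // Note the leading slash on the filename in both calls.
        System.out.println(build("http", "www.gamelan.com", -1, "/pages/Gamelan.net.html"));
        System.out.println(build("http", "www.gamelan.com", 80, "/pages/Gamelan.net.html"));
    }
}
```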
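The URL class provides several methods that let you query URL objects. You can get the protocol, host name, port number, and filename from a URL using these accessor methods:
getProtocol: returns the protocol identifier component of the URL.
getHost: returns the host name component of the URL.
getPort: returns the port number component of the URL as an integer. If the port is not set, getPort returns -1.
getFile: returns the filename component of the URL.
getRef: returns the reference (anchor) component of the URL.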
Note: Remember that not all URL addresses contain these components. The URL class provides these methods because HTTP URLs do contain these components and are perhaps the most commonly used URLs. The URL class is somewhat HTTP-centric.
You can use these getXXX methods to get information about the URL regardless of the constructor that you used to create the URL object.
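Because getPort returns -1 when no port is written in the URL, code that needs an actual port number has to fall back to the protocol's default. The sketch below shows one way to do that with getDefaultPort; the class name PortCheck and the helper effectivePort are hypothetical names chosen for this example.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class PortCheck {

    /** Returns the explicit port if one is set, otherwise the protocol default. */
    static int effectivePort(String spec) {
        try {
            URL url = new URL(spec);
            int port = url.getPort();          // -1 when the URL has no explicit port
            return (port != -1) ? port : url.getDefaultPort();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad URL: " + spec, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(effectivePort("http://www.gamelan.com/pages/"));
        System.out.println(effectivePort("http://java.sun.com:8080/docs/"));
    }
}
```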
The URL class, along with these accessor methods, frees you from ever having to parse URLs again! Given any string specification of a URL, just create a new URL object and call any of the accessor methods for the information you need. This small example program creates a URL from a string specification and then uses the URL object's accessor methods to parse the URL:
import java.net.*;
import java.io.*;

public class ParseURL {
    public static void main(String[] args) throws Exception {
        URL aURL = new URL("http://java.sun.com:80/docs/books/"
                           + "tutorial/index.html#DOWNLOADING");
        System.out.println("protocol = " + aURL.getProtocol());
        System.out.println("host = " + aURL.getHost());
        System.out.println("filename = " + aURL.getFile());
        System.out.println("port = " + aURL.getPort());
        System.out.println("ref = " + aURL.getRef());
    }
}
Here's the output displayed by the program:
protocol = http
host = java.sun.com
filename = /docs/books/tutorial/index.html
port = 80
ref = DOWNLOADING
After you've successfully created a URL, you can call the URL's openStream() method to get a stream from which you can read the contents of the URL. The openStream() method returns a java.io.InputStream object, so reading from a URL is as easy as reading from an input stream.
The following small Java program uses openStream() to get an input stream on the URL http://www.yahoo.com/. It then opens a BufferedReader on the input stream and reads from the BufferedReader, thereby reading from the URL. Everything read is copied to the standard output stream:
import java.net.*;
import java.io.*;

public class URLReader {
    public static void main(String[] args) throws Exception {
        URL yahoo = new URL("http://www.yahoo.com/");
        BufferedReader in = new BufferedReader(
                                new InputStreamReader(
                                yahoo.openStream()));

        String inputLine;
        while ((inputLine = in.readLine()) != null)
            System.out.println(inputLine);

        in.close();
    }
}
When you run the program, you should see, scrolling by in your command window, the HTML commands and textual content from the HTML file located at http://www.yahoo.com/. Alternatively, the program might hang or you might see an exception stack trace. If either of the latter two events occurs, you may have to set the proxy host so that the program can find the Yahoo server.
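One common way to set the proxy host is through the standard http.proxyHost and http.proxyPort system properties. The sketch below sets them from code; the class name ProxySetup is invented, and the host proxy.example.edu and port 8080 are placeholders, not the address of any real proxy. Whether you need a proxy at all depends on your network.

```java
public class ProxySetup {

    /** Sets the JVM-wide HTTP proxy used by java.net for later connections. */
    static void useHttpProxy(String host, int port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", Integer.toString(port));
    }

    public static void main(String[] args) {
        // Placeholder values; substitute the proxy for your own network.
        useHttpProxy("proxy.example.edu", 8080);
        System.out.println(System.getProperty("http.proxyHost")
                + ":" + System.getProperty("http.proxyPort"));
    }
}
```

The same properties can be given on the command line instead, for example java -Dhttp.proxyHost=proxy.example.edu -Dhttp.proxyPort=8080 URLReader, which avoids changing the program. Either way, they should be set before the first connection is opened.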
After you've successfully created a URL object, you can call the URL object's openConnection method to connect to it. When you connect to a URL, you are initializing a communication link between your Java program and the URL over the network. For example, you can open a connection to the Yahoo site with the following code:
try {
    URL yahoo = new URL("http://www.yahoo.com/");
    URLConnection yahooConnection = yahoo.openConnection();
} catch (MalformedURLException e) {
    // new URL() failed
    . . .
} catch (IOException e) {
    // openConnection() failed
    . . .
}
If possible, the openConnection method creates a new URLConnection (if an appropriate one does not already exist), initializes it, connects to the URL, and returns the URLConnection object. If something goes wrong (for example, the Yahoo server is down), then the openConnection method throws an IOException.
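The two catch clauses above correspond to two different kinds of failure: a MalformedURLException comes from parsing the text of the URL, before any network traffic, while an IOException comes from the connection itself. A small sketch of checking only the first kind is shown here; the class name UrlCheck and the helper isWellFormed are made up for this example.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlCheck {

    /** Returns true if the text parses as a URL with a protocol Java knows about. */
    static boolean isWellFormed(String spec) {
        try {
            new URL(spec);   // parsing only; no connection is opened
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("http://www.yahoo.com/"));  // well formed
        System.out.println(isWellFormed("www.yahoo.com"));          // no protocol
        System.out.println(isWellFormed("htp://www.yahoo.com/"));   // unknown protocol
    }
}
```

A check like this may be handy for a crawler that wants to discard bad links before spending time on a connection attempt. It says nothing about whether the server exists or responds.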
Now that you've successfully connected to your URL, you can use the URLConnection object to perform actions such as reading from or writing to the connection. The next section shows you how.
If you've successfully used openConnection to initiate communications with a URL, then you have a reference to a URLConnection object. The URLConnection class contains many methods that let you communicate with the URL over the network. URLConnection is an HTTP-centric class; that is, many of its methods are useful only when you are working with HTTP URLs. However, most URL protocols allow you to read from and write to the connection. This section describes both functions.
The following program performs the same function as the URLReader program shown earlier (Reading Directly from a URL). However, rather than getting an input stream directly from the URL, this program explicitly opens a connection to a URL and gets an input stream from the connection. Then, like URLReader, this program creates a BufferedReader on the input stream and reads from it. The differences between this example and the previous one are the call to openConnection() and the use of yc.getInputStream() in place of yahoo.openStream().
import java.net.*;
import java.io.*;

public class URLConnectionReader {
    public static void main(String[] args) throws Exception {
        URL yahoo = new URL("http://www.yahoo.com/");
        URLConnection yc = yahoo.openConnection();
        BufferedReader in = new BufferedReader(
                                new InputStreamReader(
                                yc.getInputStream()));

        String inputLine;
        while ((inputLine = in.readLine()) != null)
            System.out.println(inputLine);

        in.close();
    }
}
The output from this program is identical to the output from the program that opens a stream directly from the URL. You can use either way to read from a URL. However, reading from a URLConnection instead of reading directly from a URL might be more useful. This is because you can use the URLConnection object for other tasks (like writing to the URL) at the same time.
Again, if the program hangs or you see an error message, you may have to set the proxy host so that the program can find the Yahoo server.
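If the network or proxy is getting in the way, the same URLConnection code can be exercised entirely offline against a file: URL. The following sketch does that by writing a temporary file and reading it back through a connection. The class name ConnectionSketch, the helpers readAll and roundTrip, the UTF-8 encoding, and the 5000 millisecond timeouts are all assumptions made for this example, not part of the course code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ConnectionSketch {

    /** Reads everything behind a URL through an explicit URLConnection. */
    static String readAll(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        connection.setConnectTimeout(5000);  // milliseconds; an arbitrary choice
        connection.setReadTimeout(5000);
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    /** Offline check: writes text to a temporary file and reads it back via a file: URL. */
    static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("urlconnection", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return readAll(file.toUri().toURL());
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(roundTrip("line one\nline two\n"));
    }
}
```

The timeouts have no visible effect on a local file, but with an HTTP URL they may keep a program from hanging indefinitely on an unresponsive server.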
Jakarta Lucene is a high-performance, full-featured text search engine written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. We are planning to extend Lucene for use in building the inverted index for the information retrieval part of this project. The following demo should get you started.
Let's build an index! Assuming ant and the classpath can find Lucene correctly, just type "java org.apache.lucene.demo.IndexFiles {full-path-to-lucene}/src". This will produce a subdirectory called "index" which will contain an index of all of the Lucene source code.
To search the index type "java org.apache.lucene.demo.SearchFiles". You'll be prompted for a query. Type in a swear word and press the enter key. You'll see that the Lucene developers are very well mannered and get no results. Now try entering the word "vector". That should return a whole bunch of documents. The results will page at every tenth result and ask you whether you want more results.
Of course, this demo is just a starting point, and all of the source code and documentation is available. Lucene is already installed in the class directory at /afs/ir/class/cs276b/lib/.
Servlets and JSP are two technologies of the Java 2 Enterprise Edition which make developing web applications in Java very easy. We will use these technologies to build our web front-end to the citation indexing system. A great tutorial for servlets and JSP is located at http://www.apl.jhu.edu/~hall/java/Servlet-Tutorial/. We have installed in the class directory at /afs/ir/class/cs276b/lib/
Tomcat is an open-source web server developed by Apache Jakarta that can also serve up web applications. We will use Tomcat to serve up our Java-based web front-end. A good tutorial is located at http://www.moreservlets.com/Using-Tomcat-4.html. We have installed Tomcat in the class directory at /afs/ir/class/cs276b/software/jakarta-tomcat-4.1.18.
With the Google Web APIs service, software developers can query more than 3 billion web documents directly from their own computer programs. Google uses the SOAP and WSDL standards so a developer can program in his or her favorite environment - such as Java, Perl, or Visual Studio .NET.
To access the Google Web APIs service, you must create a Google Account and obtain a license key. Your Google Account and license key entitle you to 1,000 automated queries per day.
Your program must include your license key with each query you submit to the Google Web APIs service. Check out our Getting Help page or read the FAQs for more information.
The following code snippet illustrates how to perform a search and retrieve results:
public static void main(String[] args) {
    String clientKey = args[0];
    String directive = args[1];
    String directiveArg = args[2];

    // Create a Google Search object, set our authorization key
    GoogleSearch s = new GoogleSearch();
    s.setKey(clientKey);

    // Depending on user input, do search or cache query, then print out result
    try {
        if (directive.equalsIgnoreCase("search")) {
            s.setQueryString(directiveArg);
            GoogleSearchResult r = s.doSearch();
            System.out.println("Google Search Results:");
            System.out.println("======================");
            System.out.println(r.toString());
        } else if (directive.equalsIgnoreCase("cached")) {
            System.out.println("Cached page:");
            System.out.println("============");
            byte[] cachedBytes = s.doGetCachedPage(directiveArg);
            // Note - this conversion to String should be done with reference
            // to the encoding of the cached page, but we don't do that here.
            String cachedString = new String(cachedBytes);
            System.out.println(cachedString);
        } else if (directive.equalsIgnoreCase("spell")) {
            System.out.println("Spelling suggestion:");
            String suggestion = s.doSpellingSuggestion(directiveArg);
            System.out.println(suggestion);
        } else {
            printUsageAndExit();
        }
    } catch (GoogleSearchFault f) {
        System.out.println("The call to the Google Web APIs failed:");
        System.out.println(f.toString());
    }
}
The Google API is installed in the class directory at /afs/ir/class/cs276b/lib/googleapi.jar. More information is available in the local documentation and at http://www.google.com/apis/.
While we do recommend that you do your project development on the Leland machines, the platform independence of Java means that it is in principle possible to do your development on a remote system (yes, even Windows). As we mentioned above, JDBC can connect to a remote database; all you need is the URL or IP address of the machine (which, as you will see, is already built into the citeunseen.util.ConnectionFactory class). Also, CVS can access code repositories from remote machines on the network, provided that you have properly set up ssh. More information about running CVS on Windows is located at http://www.cvshome.org/cyclic/cvs/windows.html. It is also possible to access the data directories on the Leland file system from remote machines using AFS. This is even possible from a Windows platform, but configuring AFS to run on Windows can be tricky. More information about running AFS on Windows can be found at http://www.stanford.edu/group/itss/pcleland/help/afs.htm.
In summary, unless you are a systems wizard and enjoy installing, configuring, and fixing lots of third-party software, you probably should do your development on the Leland machines (elaine, myth, saga, etc.).
This doesn't mean that you have to hole up in Sweet Hall, however. There are a couple of good tools that let you work on the Leland machines from a remote machine:
Groups working with crawling and hub pages may find it convenient to make use of an HTML parser, which automatically pulls out tags and their attributes. Java does provide an HTML parser in its javax.swing.text.html.parser package, which was used to implement the citeunseen.hubprocessor.LinkExtractor class. However, it is clumsy, because it is DTD-based (Document Type Definition) and there is no easy way to create your own DTD object from a text file containing a DTD (if you find a way, let me know). Thus you may want to use another open-source HTML parser from a third-party developer. We have collected some hopeful looking links below:
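For reference, here is a rough sketch of how the built-in parser can pull link targets out of a page using its default DTD. This is not the citeunseen.hubprocessor.LinkExtractor implementation, only an illustration of the same javax.swing.text.html classes; the class name LinkSketch and the helper extractLinks are invented, and the sample HTML in main is made up.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkSketch {

    /** Returns the href value of every <a> start tag, in document order. */
    static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        try (Reader reader = new StringReader(html)) {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<a href=\"PicturesOfMe.html\">Pictures of Me</a> "
                + "<a href=\"MyKids.html\">Pictures of My Kids</a>"
                + "</body></html>";
        for (String link : extractLinks(page)) {
            System.out.println(link);
        }
    }
}
```

The extracted values are often relative, so a crawler would typically resolve each one against the URL of the page it came from before fetching it.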
Converting from postscript or PDF format to some sort of marked-up text format is a very important first step in extracting useful information from the academic papers. We have found several tools that claim to do this, but none have delivered what we need. We would like to preserve some text formatting (bold, italic, size, horizontal alignment, etc.) but not so much that each word or character is given its own markup. We present several packages below.
Environment variables are set in the .cshrc file in your home directory. The PATH variable is set in the following lines (yours may look somewhat different):
# You may add additional path customizations here.
set path=( $site_path ~/bin ~ \
         )
To add the directory ~/mydir to the path, you add it to the declaration as follows:
# You may add additional path customizations here.
set path=( $site_path ~/bin ~ \
           ~/mydir \
         )
To set another variable, say the CLASSPATH variable, you look for the environment variables section of the .cshrc file:
#-------------------------#
# Environmental Variables #
#-------------------------#
# Environmental variables are used by both shell and the programs
# the shell runs.

# EDITOR sets the default editor
setenv EDITOR "emacs"
setenv VISUAL "$EDITOR"
and if you want to set the variable CLASSPATH to include the current directory . and the directory ~/mydir you add the following line:
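setenv CLASSPATH ".:${HOME}/mydir"
Note the colon between the different directories. Note also that the home directory is written as ${HOME} rather than ~, because csh does not expand ~ inside double quotes and Java will not expand it either. Sometimes you need to add a JAR file, such as /afs/ir/class/cs276b/lib/myjarfile.jar, to the Java classpath. You need to list the full name of the file, as follows:
setenv CLASSPATH ".:${HOME}/mydir:/afs/ir/class/cs276b/lib/myjarfile.jar"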
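When a class cannot be found at run time, it can help to see exactly which classpath entries the JVM received. The sketch below prints each entry of the java.class.path system property and flags entries that do not exist on disk, which is how a mistyped path or an unexpanded ~ would show up. The class name ClasspathCheck and the helper entries are hypothetical names for this example.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ClasspathCheck {

    /** Splits a classpath string on the given separator, skipping empty entries. */
    static List<String> entries(String classpath, String separator) {
        List<String> result = new ArrayList<>();
        for (String entry : classpath.split(Pattern.quote(separator))) {
            if (!entry.isEmpty()) {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // File.pathSeparator is ":" on Unix systems and ";" on Windows.
        String classpath = System.getProperty("java.class.path");
        for (String entry : entries(classpath, File.pathSeparator)) {
            boolean exists = new File(entry).exists();
            System.out.println(entry + (exists ? "" : "   <-- not found"));
        }
    }
}
```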