Jessica, Sonha, and Tara are all undergraduates at Stanford University, majoring in English, Economics, and Sociology, respectively. This project is the culmination of our involvement in Matthew Jockers's Digital Humanities class, offered Fall Quarter 2007.
Using Oxygen XML Editor, our group coded different aspects of our text analysis project in XML, XHTML, XSLT, and PHP. Group members added metadata and inserted page numbers, page breaks, paragraph tags, and character-quote attribution into a digital transcription of Peter B. Kyne's "Valley of the Giants." Our final product includes a reader-friendly digitized version of the novel along with several text analysis functions.
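A simplified fragment gives a sense of the kind of markup involved; the element and attribute names here are illustrative stand-ins rather than the exact tags in our file, and the paragraph text is placeholder wording:

    <chapter n="1">
      <pb n="5"/>
      <p>First paragraph of the chapter ...</p>
      <p>Second paragraph ...</p>
      <pb n="6"/>
      <p>A paragraph beginning on the next page ...</p>
    </chapter>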
We used PHP primarily for writing scripts to power specific text analysis functions, such as a script that computes the frequency of every word in the text, sorted both by frequency and alphabetically. We extended this feature to compare word frequencies in the novel against data from the Brown corpus. We also wrote, per the class assignment instructions, a keyword search function that lets users look up the frequency and relative frequency of a specific word in the text. This tool also shows the keyword in context, with the number of surrounding context words specified by the user. Both of these features treat words as case sensitive ("The" vs. "the"). Our original special feature is a PHP script that extracts all quotes from the novel and then lets us either search for a particular character's quotes or view them by chapter. We also added a feature that charts the most frequently used phrases via an n-gram search.
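A stripped-down sketch of the word-frequency idea looks something like the following; the file name is hypothetical, and the real scripts work from the tagged XML rather than a plain-text dump:

    <?php
    // Hypothetical plain-text dump of the novel.
    $text = file_get_contents('valley_of_the_giants.txt');

    // Naive tokenization: keep letters and apostrophes together.
    // (Getting contractions right was trickier than it looks; see the
    // regular-expression discussion at the end of this write-up.)
    $words = preg_split("/[^A-Za-z']+/", $text, -1, PREG_SPLIT_NO_EMPTY);

    // Tally each word. Case is preserved, so "The" and "the" count separately.
    $freq = array();
    foreach ($words as $w) {
        if (!isset($freq[$w])) {
            $freq[$w] = 0;
        }
        $freq[$w]++;
    }

    // Sort by raw frequency, highest first; ksort($freq) would instead
    // give the alphabetical listing.
    arsort($freq);

    // Print word, count, and relative frequency (count over total tokens).
    $total = count($words);
    foreach ($freq as $w => $n) {
        printf("%s\t%d\t%.5f\n", $w, $n, $n / $total);
    }
    ?>

The Brown corpus comparison then simply lines these relative frequencies up against the corresponding figures from the corpus.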
Our primary issue was tagging character quotes, which presented several problems. The first problem was with the quotes themselves, particularly the multi-paragraph monologues that extended over multiple pages. In order to maintain valid XHTML, we were forced to close the quote tags inside the paragraph tags even though the quotes actually continued beyond that paragraph. The same problem applied to page breaks, where we were again forced to end the quote at the end of a page in order to maintain validity.
We also had to decide where to start and end quotes that contained non-quote context, such as "he said." We decided not to include this text within the quote tags, because we felt it was more important to preserve only the actual words spoken within the quote tags, even though it meant breaking what would otherwise be one block quote into two or more quote tags. We wanted to ensure quote extraction accuracy, particularly when using our character-quote function.
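In practice, then, a tagged line of dialogue ends up looking something like the sketch below, with the attribution left outside the quote tags; the wording is placeholder text and the tag names are again illustrative:

    <p>
      <quote speaker="Bryce Cardigan">First half of the spoken sentence,</quote>
      he said,
      <quote speaker="Bryce Cardigan">and the remainder of the spoken sentence.</quote>
    </p>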
We also encountered an issue with missing text. While inserting page breaks and page numbers into our XML file, we noticed that it was missing several blocks of text found in the actual hard copy of the book. We remedied this simply by typing up the missing text and adding it to our XML.
As a group, we made consistency our top priority. We therefore assigned each group member one specific task to perform across the entire text instead of splitting the text into sections. For example, one person was responsible for tagging all of the character quotes in order to ensure that the character names were consistent ("Bryce" vs. "Bryce Cardigan").
We envision this metadata allowing researchers to investigate questions specific to the characters. For example, did Shirley Sumner speak more as she became more empowered over the course of the novel? We could also use this quote analysis tool to look into questions of gender (does one gender speak more than the other?) or to map the relative amount of dialogue across the novel (is there more dialogue in the middle of the book than at the beginning or end?). Using the quotation function, we could also analyze a specific character's use of language based on his or her spoken words alone, the length of the character's sentences, and so on. It provides an easier avenue for accessing very character-specific information.
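A minimal sketch of such a character query, assuming quote elements with a speaker attribute as in the fragment above (the file name is again hypothetical):

    <?php
    // Load the tagged novel with SimpleXML.
    $xml = simplexml_load_file('valley_of_the_giants.xml');

    // Pull every quote attributed to one character.
    $name = 'Shirley Sumner';
    $quotes = $xml->xpath("//quote[@speaker='$name']");

    // Simple per-character figures: number of quotes and total words spoken.
    $spoken = 0;
    foreach ($quotes as $q) {
        $spoken += str_word_count((string) $q);
    }
    printf("%s: %d quotes, %d words spoken\n", $name, count($quotes), $spoken);
    ?>

Run chapter by chapter, the same kind of query would support the dialogue-distribution and gender comparisons mentioned above.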
Before this class, no one in our group had worked with XSL or PHP. This project was a learning experience for us, and many pages required learning new PHP and XSL functions.
For example, using an XSL stylesheet to transform our HTML document into XML was difficult when we could not figure out how to avoid breaking some of the rules of well-formedness. Another technical challenge was finding a regular expression that separated words yet still took contractions into account; one such pattern is sketched below. While the technical challenges were numerous, most problems were the result of small bugs rather than overarching issues.
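One workable pattern, sketched here rather than quoted from our actual script, treats an apostrophe as part of a word only when it sits between letters:

    <?php
    $text = "He said he couldn't sell Cardigan's redwoods.";

    // Letters, optionally joined by internal apostrophes, so "couldn't" and
    // "Cardigan's" each count as a single word, while stray apostrophes used
    // as quotation marks are left out.
    preg_match_all("/[A-Za-z]+(?:'[A-Za-z]+)*/", $text, $matches);
    $words = $matches[0];

    print_r($words);  // He, said, he, couldn't, sell, Cardigan's, redwoods
    ?>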