2 May 1997

New Frontiers in Punctuation Research

Geoff Nunberg

Xerox PARC and Stanford University

For a long time punctuation seemed a subject unworthy of serious linguistic attention, a victim of the general neglect of the structures of written language, which was regarded on grounds of general principle as merely a degenerate representation of the spoken language. Serious linguistic studies of punctuation began to appear a few years ago, as part of a reconsideration of standard linguistic dogma about written language that was motivated to some degree by the increased importance of written texts in work on parsing, generation, and other NLP problems. These studies -- for example Nunberg (1990), Dale (1991), Jones (1993), Briscoe (1994, 1995), Lee (1995), White (1995), Say (1996), and Akman and Say (1997) -- have revealed a rich and by no means obvious grammatical subsystem, which shows interesting formal differences from the rules of the spoken language.

There are roughly three possible approaches to punctuation. The first is the approach of traditional grammar, which associates each mark with a number of distinct syntactic and semantic functions, whose relationship is generally not specified. The second is the approach developed in Nunberg (1990) and most subsequent studies, where punctuation marks are treated as the surface indicators of various syntactic categories. From this point of view it is the underlying categories, rather than the marks themselves, that bear most of the theoretical interest. One difficulty with this way of thinking about things, though, is that it fails to capture certain generalizations about specific mark types. For example, the semicolon can appear both as a separator of items in a series (i.e., the "promotion" semicolon) and as a kind of sentential conjunction. But while these two functions are syntactically distinct, all semicolons have the property of insisting on widest scope within their clause. This property is best thought of "punctocentrically," as a single constraint associated with a particular mark, rather than as a set of constraints affecting the various categories that it is used to separate. This punctocentric view is the third approach. By way of demonstrating its advantages, I'll show how it allows us to explain a number of changes in the uses of the semicolon over the past hundred years or so by positing the disappearance of a single rule of the grammar of the written language.
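
The wide-scope property of the semicolon can be illustrated with a toy sketch in Python. This is only an illustration under simplifying assumptions, not the analysis of any of the works cited: the function name `group_by_scope` and the example string are hypothetical, and flat string splitting ignores real syntactic structure (quotations, parentheses, and so on). The point it shows is that a single constraint stated on the mark itself, "split at semicolons before anything else," yields the right grouping whether the semicolon separates series items or conjoined clauses.

```python
def group_by_scope(clause: str) -> list[list[str]]:
    """Group a clause by its marks, giving the semicolon widest scope.

    The clause is split at semicolons first; commas are only considered
    inside each semicolon-delimited part. Empty pieces are discarded.
    """
    groups = []
    for part in clause.split(";"):
        # Commas take scope only within the current semicolon-delimited part.
        items = [item.strip() for item in part.split(",") if item.strip()]
        if items:
            groups.append(items)
    return groups


# A made-up series in which the items themselves contain commas.
example = "Paris, France; Rome, Italy; Madrid, Spain"
print(group_by_scope(example))
```

Reversing the order of the two splits (commas first, then semicolons) would group "France; Rome" together, which is the reading the wide-scope constraint rules out.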