even if our expectations are based on stereotypes rather
than authentic experience (McGowan, in press).
In Sumner, Kim, King, and McGowan (2014) we propose a model (above) of how the linguistic and the social aspects of speech interact to support perception. We propose that listeners process both phonetically cued social information and phonetically cued linguistic information prior to word recognition and that these dual routes can interact.
So does all this knowledge and sensitivity only apply to social variation?
First, some quick background on how sounds like [p], [t], and [k] differ from sounds like [b], [d], and [g] at the beginning of English words like pit and bit. What word is this native American English speaker saying?
The image to the left is a spectrogram (frequency analysis over time) of the word pit. Hear the puff of air at the beginning? It is highlighted in blue in the spectrogram.
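"Frequency analysis over time" can be made concrete with a small sketch. This is not the software used to make the image above. The idea is to cut the signal into short overlapping frames, taper each frame with a Hann window, and use a discrete Fourier transform to measure how much energy falls in each frequency band. The sketch assumes a mono recording already loaded as a list of floats. The frame and hop sizes are illustrative only, and real tools such as Praat use fast FFTs and more carefully chosen settings.

```python
import cmath
import math

def spectrogram(samples, frame_len=256, hop=128):
    """Return one list of magnitudes per frame, covering DFT bins
    0..frame_len//2. Uses a naive O(N^2) DFT, which is fine for illustration."""
    # Hann window: tapers the edges of each frame to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = [samples[start + n] * window[n] for n in range(frame_len)]
        mags = []
        for k in range(frame_len // 2 + 1):
            # Correlate the windowed frame with a complex sinusoid at bin k.
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames

def bin_to_hz(k, frame_len, sample_rate):
    """Center frequency (Hz) of DFT bin k."""
    return k * sample_rate / frame_len
```

In a spectrogram of pit, the aspiration appears as a stretch of weak, noisy energy spread across many frequency bins. The periodic energy of voicing follows once the vowel begins.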
Pit and bit both start with the lips completely closed. One of the main differences between them is how long that puff of air lasts before the vocal folds begin vibrating for the vowel. This interval is called VOT (voice onset time).
At least in American English, that puff of air is so important that cutting it out of pit (that first sound you played) results in a word that sounds a lot like bit, though probably with a funny [b]. That funniness is every bit as interesting and important as the change from [p] to [b]!
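Once the release burst and the start of voicing are marked, VOT can be measured directly. The sketch below assumes both times have already been labeled by hand (for example, in Praat). The `StopLabels` type, the helper names, and the 25 ms category boundary are all illustrative assumptions, not values from this research. The hypothetical `cut_aspiration` helper mimics the edit described above by splicing out most of the interval between the burst and voicing onset.

```python
from dataclasses import dataclass

# Assumed, illustrative boundary (ms) between short-lag [b] and long-lag [p]
# for English word-initial stops; real boundaries vary by place, talker, and rate.
VOT_BOUNDARY_MS = 25.0

@dataclass
class StopLabels:
    burst_s: float          # time of the closure release (burst), in seconds
    voicing_onset_s: float  # time the vocal folds start vibrating, in seconds

def vot_ms(labels: StopLabels) -> float:
    """Voice onset time: voicing onset minus burst, in milliseconds."""
    return (labels.voicing_onset_s - labels.burst_s) * 1000.0

def classify_labial(labels: StopLabels) -> str:
    """Toy decision from VOT alone: long lag -> 'p', short lag -> 'b'."""
    return "p" if vot_ms(labels) >= VOT_BOUNDARY_MS else "b"

def cut_aspiration(samples, sample_rate, labels, keep_ms=10.0):
    """Hypothetical helper: keep only the first keep_ms after the burst and
    splice out the rest of the aspiration. Returns (edited samples, new labels)."""
    start = int(round((labels.burst_s + keep_ms / 1000.0) * sample_rate))
    end = int(round(labels.voicing_onset_s * sample_rate))
    if end <= start:
        # Aspiration is already shorter than keep_ms; nothing to remove.
        return list(samples), labels
    edited = list(samples[:start]) + list(samples[end:])
    new_labels = StopLabels(labels.burst_s, labels.burst_s + keep_ms / 1000.0)
    return edited, new_labels
```

A real stimulus edit would also place the cut points at zero crossings to avoid audible clicks. This sketch ignores that step.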
When we listen to speech, we are phenomenally sensitive to covarying patterns
of phonetic detail. One such covarying pattern is the tendency for VOT
to be shorter in a fast speech style than in slower speech...
In fact, removing most of the VOT from [p], [t], and [k] words makes them
less useful to listeners (shortest green bar) in slow (Citation) speech, but
if the rest of the word is spoken quickly, the short VOT sounds fine (Fast speech, on the right) (abstract).
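One simple way to express this sensitivity to speaking rate is to judge VOT relative to the duration of the rest of the syllable, rather than in absolute milliseconds. In the hypothetical sketch below, vowel duration stands in for speaking rate. The 0.2 ratio threshold is an arbitrary illustrative value, not a parameter estimated in this work.

```python
# Hypothetical ratio (VOT / vowel duration) above which a stop counts as a
# good voiceless token; chosen for illustration only.
RATIO_THRESHOLD = 0.2

def rate_relative_vot(vot_ms: float, vowel_ms: float) -> float:
    """VOT expressed as a proportion of the following vowel's duration."""
    if vowel_ms <= 0:
        raise ValueError("vowel duration must be positive")
    return vot_ms / vowel_ms

def sounds_voiceless(vot_ms: float, vowel_ms: float,
                     threshold: float = RATIO_THRESHOLD) -> bool:
    """True if the VOT is long enough *relative to the vowel* to pass as [p]."""
    return rate_relative_vot(vot_ms, vowel_ms) >= threshold
```

Under these assumptions, a 20 ms VOT followed by a 200 ms vowel falls below the threshold (a poor [p]). The same 20 ms VOT before an 80 ms vowel passes it, which qualitatively mirrors the pattern described above.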
Okay, but slower, more careful speech must be easier to understand than faster, more casual speech?
Well... no! When we tested how well careful and casual speech styles
activate meanings for listeners in a sentence like "Elephants are
big animals," we found that casual speech was actually more
helpful than careful speech (CUNY 2014 presentation).
We also found this with eye tracking...
When hearing sentences with predictable final words, listeners were able to look at the intended picture more quickly with casual speech than with careful speech. This and other evidence supports the hypotheses of Lindblom (1990) and Sumner (2013) that casual speech is processed using more world and contextual knowledge than careful speech.
Another covarying feature is the way vowels before nasal consonants in English tend to be nasalized. Listeners can use this cue as soon as it becomes available, not only for a large distinction like bend/bed...
but also a much more subtle distinction like the difference in nasalization between these two sound files. Can you hear a difference?
This first recording has late nasalization, starting 100 milliseconds after the [b].
This second recording has early nasalization, starting 33 milliseconds after the [b].
In an eye-tracking task, we found that listeners can use nasalization as soon as it is present. Looks to the heavily nasalized word were, on average, 60 ms faster. This matches the average difference in nasalization onset between the early and late recordings (Beddor, McGowan, Boland, Coetzee, and Brasher, 2013).
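Comparisons like this one come down to measuring how long listeners take to first look at the target picture on each trial, then comparing the averages across conditions. The sketch below assumes each trial has already been reduced to a (condition, latency) pair. The condition labels and function names are hypothetical. Published analyses typically rely on more careful statistics (for example, mixed-effects models) than a raw difference of means.

```python
from collections import defaultdict
from statistics import mean

def mean_latency_by_condition(trials):
    """trials: iterable of (condition, first_look_ms) pairs. first_look_ms is
    the time from an assumed reference point (e.g. vowel onset) to the first
    fixation on the target picture, or None if the target was never fixated."""
    by_cond = defaultdict(list)
    for condition, latency in trials:
        if latency is not None:  # drop trials with no look to the target
            by_cond[condition].append(latency)
    return {cond: mean(vals) for cond, vals in by_cond.items()}

def latency_advantage(trials, slow_cond, fast_cond):
    """On average, how many ms earlier target looks arrive in fast_cond
    than in slow_cond. Raises KeyError if a condition has no usable trials."""
    means = mean_latency_by_condition(trials)
    return means[slow_cond] - means[fast_cond]
```

The same helpers would apply to the casual versus careful comparison, with those speech styles as the condition labels.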
Whether the information is social, contextual, articulatory, or
idiosyncratic, we humans have an astonishing ability to attend
to it, remember it, and activate it during perception. This ability, my
research suggests, is not irrelevant to linguistic competence or
even peripheral to it; it is fundamentally and centrally part of
what it means to know and speak a human language.
Thank you for reading! If you have any questions, please contact me via e-mail or Twitter.
And many, many thanks to my friend Markus Nee for turning me into this cartoon.