Synthesis Chapter Four: Speech Synthesis and the Channel Vocoder 4

The Vocoder Resynthesis Stage

The previous page's analysis stage diagram, given more channels, would be passable for producing formant (vowel) sounds in a robotic fashion when resynthesized. But it would fall short in producing consonants, or totally or partially unvoiced sounds (plosives, sibilants, fricatives and affricates, to mention a few), which either stop the vibration of the vocal cords entirely (unvoiced, as in /s/) or combine vocal-cord vibration with turbulent noise (as in the voiced /z/ at the end of a word like buzz). In addition, it is not an accurate model of how our bodies produce speech, which was Dudley’s original aim.

The circuit diagrammed below addresses these issues and is a close analog to Dudley’s original vocoder. The analysis-stage filter bank and amplitude followers are similar in the circuit below and the one on the previous page. However, a detector has been added to determine whether the speech input is voiced or unvoiced. In modern digital signal processing this can be done by setting a threshold on how rapidly the waveform crosses zero (Max users can use the zerox~ object): unvoiced or consonant sounds have a much higher rate of zero crossings, though this is by no means a perfect test. Dudley likely made the determination with a fundamental-frequency tracker, treating a much higher apparent fundamental (isolated via a highpass filter) as a sign of sibilance or a consonant. With either method, more modern models usually let the user adjust the voiced/unvoiced threshold to get the desired result. How this data is used is covered below.
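As a rough illustration of the zero-crossing approach, the Python sketch below counts sign changes in one frame of samples and flags the frame as unvoiced when the rate exceeds a threshold. The function names and the 3,000 crossings-per-second default are hypothetical choices for illustration. They are not values taken from Max's zerox~ object or from any particular vocoder.

```python
import math
import random

def zero_crossing_rate(frame, sample_rate):
    """Return the number of zero crossings per second in one frame of samples."""
    crossings = 0
    for prev, cur in zip(frame, frame[1:]):
        # A crossing is any pair of consecutive samples on opposite sides of zero.
        if (prev >= 0.0) != (cur >= 0.0):
            crossings += 1
    return crossings / (len(frame) / sample_rate)

def is_unvoiced(frame, sample_rate, threshold_hz=3000.0):
    """Classify a frame as unvoiced when its zero-crossing rate exceeds threshold_hz.

    threshold_hz is a hypothetical, user-adjustable default, not a documented value.
    """
    return zero_crossing_rate(frame, sample_rate) > threshold_hz

sr = 44100
vowel_like = [math.sin(2 * math.pi * 150 * n / sr) for n in range(1024)]  # ~23 ms of a 150 Hz tone
rng = random.Random(1)
hiss_like = [rng.uniform(-1.0, 1.0) for _ in range(1024)]                 # white noise
print(is_unvoiced(vowel_like, sr), is_unvoiced(hiss_like, sr))
```

The threshold corresponds to the adjustable control mentioned above. Lowering it causes more frames to be treated as unvoiced and switched to noise.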

basic vocoder

The crux of Dudley’s design in mimicking the human vocal tract is his use of a tunable, buzzy VCO "carrier" rich in strong partials throughout the frequency spectrum. Today this would most likely be a band-limited pulse train waveform, but Dudley used a spaced pulse/sawtooth-type wave from a relaxation oscillator. This carrier, as he called it, served as an analog to our vocal cords/glottis (larynx). The carrier VCO is routed in parallel to an output filter and VCA bank (which Dudley called the "modulator" section). The channel voltage-controlled amplifiers (VCAs) are controlled by the amplitude of the signals arriving from the analysis-stage amplitude detectors. As each input channel's control signal varies, its VCA lets more or less of the filtered carrier through in that band. In Dudley's mind, the circuit acted as an analog to the way we shape our vocal cavity (throat, mouth, nasal sinuses) to produce formants. The filters, either passive bandpass filters or active resonant filters that ring (resonators), can match the center frequencies and bandwidths of the input filter bank for a more realistic feel. Alternatively, they can differ, for a completely different effect such as formant shifting.
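The resynthesis path described above can be sketched in a few lines of Python. The signal runs through a carrier, then a bank of bandpass filters, then one VCA per band driven by that channel's control signal, and finally a summed output. This is a minimal sketch under stated assumptions:

- The carrier is a naive, non-band-limited impulse train.
- Each channel uses a two-pole bandpass in the form given by the RBJ Audio EQ Cookbook, rather than the 4-pole or passive filters of real hardware.
- The per-band envelopes are passed in directly instead of being computed by an analysis stage.

All function names are hypothetical.

```python
import math

def pulse_train(freq_hz, n_samples, sample_rate):
    """Naive (not band-limited) impulse train: a single 1.0 sample per period."""
    period = sample_rate / freq_hz
    out = [0.0] * n_samples
    t = 0.0
    while t < n_samples:
        out[int(t)] = 1.0
        t += period
    return out

def bandpass(signal, center_hz, q, sample_rate):
    """Two-pole resonant bandpass, RBJ Audio EQ Cookbook form (0 dB peak gain)."""
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0          # b1 is zero for this filter type
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def resynthesize(carrier, band_centers, band_envelopes, q, sample_rate):
    """Filter the carrier into each band, scale it by that band's envelope (the VCA), and sum."""
    out = [0.0] * len(carrier)
    for center, envelope in zip(band_centers, band_envelopes):
        band = bandpass(carrier, center, q, sample_rate)
        for n, (sample, gain) in enumerate(zip(band, envelope)):
            out[n] += sample * gain
    return out

sr = 8000
carrier = pulse_train(100, 800, sr)                  # 100 Hz "buzz" for 0.1 s
envs = [[1.0] * 800, [0.0] * 800]                    # first band open, second band closed
out = resynthesize(carrier, [300, 1200], envs, 8.0, sr)
```

With real speech, each envelope would come from the matching analysis channel's amplitude follower. Normally, band_centers and q would mirror the analysis filter bank unless formant shifting is wanted.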

But what about the hissy, turbulent unvoiced sounds? A control voltage from the voiced/unvoiced detector is sent to the resynthesis stage to switch the excitation source between the buzzy carrier VCO and highpass-filtered noise (what Dudley called “hiss,” most likely produced by a gas-discharge tube filled with neon or mercury). Some vocoder models instead adjust the mix between buzz and hiss dynamically, with a threshold control to bias the balance. Most designs passed this noise source through the resynthesis filter bank. Pioneer Harold Bode took a unique approach instead: upon detecting an unvoiced sound, he routed the actual input signal directly to the final output, bypassing the encoding and decoding filter banks. He claimed this gave a much more realistic emulation.
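Both the buzz/hiss switching and the dynamic crossfade can be sketched as follows. The sketch assumes that a per-frame zero-crossing rate, in crossings per second, is already available from the voiced/unvoiced detector. The hiss source is white noise passed through a one-pole highpass filter, standing in for Dudley's gas-tube noise. The threshold and width parameters and all function names are hypothetical.

```python
import math
import random

def highpassed_noise(n_samples, cutoff_hz, sample_rate, seed=0):
    """White noise through a one-pole highpass filter, standing in for Dudley's "hiss"."""
    rng = random.Random(seed)
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    a = rc / (rc + dt)
    out, prev_x, prev_y = [], 0.0, 0.0
    for _ in range(n_samples):
        x = rng.uniform(-1.0, 1.0)
        y = a * (prev_y + x - prev_x)   # standard one-pole highpass difference equation
        out.append(y)
        prev_x, prev_y = x, y
    return out

def hiss_weight(zcr_hz, threshold_hz, width_hz=0.0):
    """Map a frame's zero-crossing rate to a hiss amount from 0 (all buzz) to 1 (all hiss)."""
    if width_hz <= 0.0:
        return 1.0 if zcr_hz > threshold_hz else 0.0   # hard voiced/unvoiced switch
    ramp = (zcr_hz - (threshold_hz - width_hz / 2.0)) / width_hz
    return min(1.0, max(0.0, ramp))                      # gradual crossfade around the threshold

def excitation(buzz, hiss, weight):
    """Blend the buzz and hiss sources for one frame using the given hiss weight."""
    return [(1.0 - weight) * b + weight * h for b, h in zip(buzz, hiss)]
```

A width of zero reproduces simple switching. A positive width gives the gradual mix some models use, and moving threshold_hz up or down biases the balance toward buzz or hiss.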

Finally, with pitch detection of the fundamental, it is possible to modulate the buzzy carrier VCO's frequency automatically to mimic the way our vocal cords change pitch. This of course defeats any robotic effect, which is why it is labeled “optional” on the diagram. As mentioned, Dudley's ultimate vocoder sampled the input fundamental only 50 times per second (compared to the 44,100 samples per second of CD-quality audio). Its pitch tracking was therefore described as dicey, jittery and very limited (a deviation range of about 25 Hz). Modern polyphonic applications may use multiple carrier VCOs to create harmonies, which may or may not track the speech input's fundamental in parallel. The carrier frequencies are usually controlled by a MIDI keyboard. The 1979 Moog Vocoder allowed any audio input to serve as the carrier, so synths, guitars, drums, or any audio recording or live input could be modulated through the filter bank. By supplying one’s own carrier pitches, then, the vocoder can create singing from spoken text. Paul Lansky used this technique extensively with the digital linear predictive coding (LPC) technology discussed on the next page, somewhat like the idiomatic use of the otherwise different Auto-Tune technology.
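The 50-updates-per-second figure can be made concrete with some simple arithmetic. At a 44.1 kHz sample rate, one pitch estimate every 1/50 s means a hop of 44,100 / 50 = 882 samples between estimates, so the carrier pitch can only change in steps that far apart. The sketch below uses a simple autocorrelation pitch estimator. This is a common modern technique swapped in for illustration, not a reconstruction of Dudley's circuit. The function names, the 1024-sample frame, and the 60 to 400 Hz search range are assumptions.

```python
import math

def estimate_f0(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Crude autocorrelation pitch estimate: pick the lag where the frame best matches itself."""
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

def track_f0(signal, sample_rate, updates_per_second=50, frame_len=1024):
    """Estimate f0 once per hop, so the pitch control updates only updates_per_second times a second."""
    hop = int(sample_rate / updates_per_second)   # e.g. 882 samples at 44.1 kHz and 50 updates/s
    return [estimate_f0(signal[start:start + frame_len], sample_rate)
            for start in range(0, len(signal) - frame_len + 1, hop)]

sr = 8000
voice_like = [math.sin(2 * math.pi * 200 * n / sr) for n in range(4000)]  # 0.5 s at 200 Hz
f0_track = track_f0(voice_like, sr)
```

Each value in f0_track would be held on the carrier until the next update. This stepwise behavior is one reason such coarse tracking sounds jittery.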

Audio Channel Vocoder Examples

Except for the last two, the following examples were made with Cycling '74's Max application, using a modified version of Marcel Wierckx's classic vocoder example patch, which is included with Max. As of this writing, the patch, which is based on the Moog 16-channel vocoder architecture, can be found in Extras/Example Overview/MSP/classic_vocoder. The input can be live or an audio file. The controllable parameters are:

- a noise detection threshold (for voiced/unvoiced switching)
- the pulse period (which sets the carrier excitation frequency)
- the pulse width of the voiced pulse-train excitation
- the pulse amplitude
- the filter Q of the 4-pole resynthesis filter channels

The patch also accepts MIDI input to change the frequency of the voiced pulse-train excitation. If you wish to make your own examples and don't have access to Max but own a DAW, you can download the free TAL vocoder plug-in, which allows for polyphonic resynthesis.
Example Description	Audio Example
Multiple settings for different effects
100% voiced (i.e. no switching to noise/hiss)
100% unvoiced (i.e. all noise, no switching to voiced pulse train)
The Dudley Voder with pedalled pitch expression, from AT&T's 1939 World's Fair demonstration video
From Kraftwerk's “The Robots” (1978, remastered 2009)
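The patch parameters mentioned above relate in a simple way. A MIDI note number maps to an equal-tempered frequency, and that frequency gives the pulse-train period in samples. The sketch below is hypothetical and does not reproduce the Max patch's actual parameter units or internals. It assumes that A4 (note 69) = 440 Hz and that the pulse is rectangular, with its width expressed as a fraction of the period.

```python
def midi_to_hz(note):
    """Equal-tempered MIDI note number to frequency, assuming A4 (note 69) = 440 Hz."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

def pulse_period_samples(freq_hz, sample_rate):
    """Length of one pulse-train period in samples (may be fractional)."""
    return sample_rate / freq_hz

def variable_width_pulses(period_samples, width, n_samples):
    """Rectangular pulse train: 1.0 for the first `width` fraction (0..1) of each period, else 0.0."""
    out = []
    for n in range(n_samples):
        phase = (n % period_samples) / period_samples   # position within the current period, 0..1
        out.append(1.0 if phase < width else 0.0)
    return out

period = pulse_period_samples(midi_to_hz(69), 44100)
carrier = variable_width_pulses(period, 0.25, 1024)
```

For example, MIDI note 69 at 44.1 kHz gives a period of about 100.2 samples. Narrowing the width makes the pulse train brighter and buzzier.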