Chapter Four: Synthesis

11. Convolution: a form of cross-synthesis

Cross-synthesis is a technique whereby one signal confers one or more of its characteristics onto another. Convolution is a method of cross-synthesis that combines two audio sources in such a manner that, in the frequency domain, the frequencies they have in common are emphasized proportionately and those they do not share are minimized. In the time domain, the way in which those frequencies hang around, get smeared, and die away is another part of the convolution process, so convolution modifies sound in both the frequency domain and the time domain. The sources may both be digital audio files of finite length, or one may be a finite file and the other real-time input (potentially infinite). Convolution is now commonly used for higher-quality reverbs (called convolution reverb; the MOTU Digital Performer convolution reverb plug-in is called ProVerb), for filtering, and for giving a particular sound file certain characteristics of another (talking crash cymbals, for example). For two fixed files, the convolution output duration (using orthodox computation methods) will be the total duration of one added to the total duration of the other, minus one sample. For convolution reverb, the output sound will ring past the end of the input signal for the length of the impulse file minus one sample.

When discussing convolution theory, one of the signal sources is normally called the input signal (IS) and the other the unit impulse or impulse response (IR) file. The input signal (IS) may be a set of finite audio samples (i.e. pre-recorded digital audio) or a potentially infinite audio signal from a DAW track or a microphone. The impulse response (IR) file is almost always of fixed length, and altering it during convolution creates unwanted glitching with most processing techniques (to confirm, try altering the impulse response length parameter while playing a file through it in Digital Performer). An impulse file may simply be another audio file of equal import to the IS, in which case we would consider the process cross-synthesis, or it may be a true impulse response file, created via the method described below.

True impulse response (IR) files represent the characteristics of an acoustic space minus the actual sound used to determine those characteristics. Typically, a very short sine sweep file called a blip, a short and sharp reference sound called the impulse (which could come from a starter’s pistol or a balloon popping), or both are set off in a room, concert hall, flower pot, beer can or whatever space one wishes to model. The resulting sound, including the frequency reflection and reverberation (spectral response), is recorded by a special reference microphone or multiple microphones. Then the original impulse of the gun or balloon, etc., is erased from the resulting sound file, and what is left is the impulse response (IR) of the space. Play the middle IR file below to hear what a typical room IR sounds like. Some reverb packages, such as Altiverb or Logic Pro, have their own built-in IR creation capability. Watch an interesting video on the creation of an IR for Altiverb here, with links to downloading sweeps and impulses.

Any dry sound convolved with the IR will sound as if it were performed in the IR space. So, for example, if you had the IR of the Amsterdam Concertgebouw hall (I do), you could convolve it with your dog's bowwow and get the result of your dog barking on this famous stage (further puns are up to you). Note that the 'dry' dog bark file is 2 seconds long, the IR file is 4 seconds long, and the resultant convolved file, by the time the reverberation dies away, is about 6 seconds long, which is what one might expect from the math below.

The chart below demonstrates these relationships.

Dog Bark Dry | Concert Hall IR | Resultant Convolved File (Concertgebowwow)
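The durations quoted for the dog bark example follow from the length rule for two fixed files. Here is a minimal sketch in Python, assuming a 44.1 kHz sample rate; the sample rate and the function name convolved_length are illustrative assumptions and are not taken from the audio examples themselves.

```python
def convolved_length(n_input, n_impulse):
    """Output length in samples when two finite files are convolved."""
    return n_input + n_impulse - 1

SAMPLE_RATE = 44100  # assumed sample rate in Hz (not stated in the text)

bark_samples = 2 * SAMPLE_RATE  # 2-second dry dog bark
ir_samples = 4 * SAMPLE_RATE    # 4-second concert hall IR

out_samples = convolved_length(bark_samples, ir_samples)
out_seconds = out_samples / SAMPLE_RATE
print(out_samples, round(out_seconds, 3))
```

At this assumed rate the result is one sample short of 6 seconds, which is why the convolved file sounds "about 6 seconds long."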

However, if we take the same dog bark input and convolve it with the IR of a flower pot, we get a far different output result. The result is also much shorter, as the flower pot has virtually no reverberance and hence a short IR.

The chart below demonstrates these relationships.

Dog Bark Dry | Flower Pot IR | Resultant Convolved File

The Internet is full of free impulse response files, though many of them are of questionable quality. Many DAWs either come with quality convolution reverbs and IRs, or one can purchase convolution reverb plug-ins that come with quality IRs. In addition, while you may get a very good IR, remember that it represents only one spot in the hall, either onstage or in the audience, that it usually depends on the measuring equipment used, that the hall and stage were empty rather than filled with people, and so forth. A great collection of IRs from a single hall in Finland, with well-documented measurement details, can be found here. High-quality convolution reverb developers and certain halls like Carnegie Hall guard their IRs very carefully.

Theory and Practice

Theory

Discussions of digital signal processing math frequently use the following conventions:

To reference a specific sample in the audio file F, a sample may be referred to directly by its position with a bracketed [n]. F[n] indicates the nth sample value of the file F. F[0] indicates time 0 and/or the first sample. Additionally, a capital N is used to indicate the total number of samples in the file; because numbering starts with sample 0, the last sample is F[N-1].

The actual mathematical basis for convolution, called direct convolution, is rarely used in practical audio implementations. The calculations required are too time-consuming for all but the shortest files, and there are more efficient methods, described below. Convolution is actually a modified mathematical summation procedure, using matrix math. If we take a 4-sample input signal (IS) file and a 1-sample impulse response (IR), the formula would be:

Output[n] = IS[n] * IR[0] for n = 0, 1, 2, 3 (in general, n = 0 ... N-1)

If the hypothetical sample values of the input signal (IS) were 3 4 1 2 and the single value of IR was 2, the output file would be 3*2, 4*2, 1*2, 2*2, or 6 8 2 4. In this case, the impulse response file becomes simply an amplitude scaling factor and we've only doubled the amplitude of the samples, not changed any spectral or time domain characteristics.
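The one-sample case can be checked in a couple of lines. This is only a sketch using the sample values from the text; the variable names are illustrative.

```python
input_signal = [3, 4, 1, 2]  # IS sample values from the text
impulse_response = [2]       # single-sample IR

# With a one-sample IR, every output sample is just IS[n] * IR[0].
output = [sample * impulse_response[0] for sample in input_signal]
print(output)  # [6, 8, 2, 4]
```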

However, when there is more than one sample in the IR file, the mathematical procedure becomes slightly more involved. There are many sources (I highly recommend Curtis Roads, The Computer Music Tutorial) where you will see the specific formula for direct convolution of two finite files. However, if sigmas (summation signs) or integrals scare you, here is a verbal step-by-step description of the direct convolution process for two files IS and IR of different lengths.

IS = 2 4 3 6     IR = 1 5 2 3 4

Step 1: multiply every value of IR by IS[0] (or 2) and write the result out in a line. You should get:

2 10 4 6 8

Step 2: on top of that line, skip in one place and multiply every value of IR by IS[1] (or 4), which should give you:

       4  20   8  12  16
   2  10   4   6   8

Step 3: continue this procedure until all values of IS have been exhausted, which should give you:

               6  30  12  18  24
           3  15   6   9  12
       4  20   8  12  16
   2  10   4   6   8

Step 4: finally, sum the vertical columns to get the final output samples.

2 14 27 35 56 37 30 24
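The four steps above can be sketched in a few lines of Python. In the bracket notation used earlier, each output sample is the sum over k of IS[k] * IR[n - k]. The function name direct_convolve is an illustrative choice, and this is a teaching sketch of direct convolution, not how audio software normally computes it.

```python
def direct_convolve(input_signal, impulse_response):
    """Direct convolution by the shift, multiply, and sum steps above."""
    out_len = len(input_signal) + len(impulse_response) - 1
    output = [0] * out_len
    for i, is_value in enumerate(input_signal):
        # One row of the worked example: IR scaled by IS[i], shifted i places.
        for k, ir_value in enumerate(impulse_response):
            # Adding into output[i + k] performs the column sum of Step 4.
            output[i + k] += is_value * ir_value
    return output

IS = [2, 4, 3, 6]
IR = [1, 5, 2, 3, 4]
print(direct_convolve(IS, IR))  # [2, 14, 27, 35, 56, 37, 30, 24]
```

The nested loop makes the cost visible: every IS sample touches every IR sample, which is why this approach is too slow for all but the shortest files.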

Note that the amplitude range of the output file values is MUCH higher than either the input signal or IR values, an expected result of the multiplication and summing. It is a standard practice for convolution processes to end with what is called post-normalization, where the output file is multiplied by a reducing scaling factor to keep the sample values in range. Convolution is also best done with floating point values, where many more values are available for the pre-normalized output before being reduced to 16- or 24-bit numbers through post-normalization. For real-time convolution, without knowing the maximum amplitude that may be generated, controlling the input and output levels of a convolution reverb becomes very important, as they are easily overloaded and clip.
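Post-normalization can be sketched as follows. This assumes simple peak normalization to full scale followed by conversion to 16-bit integers; the function names and the choice of peak scaling are illustrative assumptions, and real tools may use different scaling rules and add dither.

```python
def post_normalize(samples, peak=1.0):
    """Scale floating-point samples so the largest magnitude equals `peak`."""
    largest = max(abs(s) for s in samples)
    if largest == 0:
        return list(samples)  # silent file: nothing to scale
    return [s * peak / largest for s in samples]

def to_int16(samples):
    """Convert floats in the range -1.0..1.0 to 16-bit integer values."""
    return [int(round(s * 32767)) for s in samples]

# Pre-normalized output of the worked example, held as floating point.
raw_output = [2.0, 14.0, 27.0, 35.0, 56.0, 37.0, 30.0, 24.0]
normalized = post_normalize(raw_output)
print(max(normalized))
print(to_int16(normalized))
```

This only works when the whole output is available, so the maximum can be found first; a real-time convolution reverb cannot look ahead in this way, which is why its input and output levels need to be watched.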

An Excel spreadsheet for calculating the output of an IS and IR (up to 8 values each) is provided here. For shorter files, use 0's or leave cells blank.

Practice

In practice, convolution is most often done by computing FFTs, or spectral analyses, of both the input and IR files and multiplying their spectra together. This is called fast convolution, and it saves orders of magnitude of computer operations thanks to the efficiency of the FFT as a way of computing the DFT. Curtis Roads in The Computer Music Tutorial (p. 424) states that the convolution of two waveforms is equal to the multiplication of their spectra. The converse is also true: multiplying two waveforms together sample by sample (which is not the same as the shift-and-sum procedure of direct convolution above) is equal to the convolution of their spectra. The beauty of FFTs, as noted in my previous chapter, is that each file is windowed into frames of 512, 1024, 2048 samples and so forth, so larger blocks can be efficiently analyzed by an FFT. The process is relatively simple. First, both files, the IS and the IR, are 'zero-padded,' with zeros added to their ends until each reaches at least the length of IS + IR - 1 samples. Then, each IS frame's spectrum is multiplied by the corresponding IR frame's spectrum. The output is created by means of an IFFT, or inverse FFT, similar to phase vocoding.
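The zero-pad, multiply, and inverse-transform steps can be sketched with NumPy's FFT routines. This is a single-frame sketch for two short finite files, assuming NumPy is installed; the function name fast_convolve is illustrative, and production convolution reverbs partition long signals into blocks instead of using one large transform.

```python
import numpy as np

def fast_convolve(input_signal, impulse_response):
    """Convolve two finite signals by multiplying their spectra."""
    out_len = len(input_signal) + len(impulse_response) - 1
    # Zero-pad both signals to at least out_len; a power of two suits the FFT.
    fft_size = 1 << (out_len - 1).bit_length()
    is_spectrum = np.fft.rfft(input_signal, n=fft_size)
    ir_spectrum = np.fft.rfft(impulse_response, n=fft_size)
    # Multiplying spectra is equivalent to convolving the waveforms.
    output = np.fft.irfft(is_spectrum * ir_spectrum, n=fft_size)
    return output[:out_len]  # drop any extra padding

IS = [2, 4, 3, 6]
IR = [1, 5, 2, 3, 4]
result = fast_convolve(IS, IR)
print(np.round(result).astype(int))
```

Up to floating-point rounding, the result matches the column sums from the direct convolution example, which is the convolution theorem at work.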

Technically, this is the tip of the convolution iceberg, as there are many variants of this method depending on the relative length of the signals to each other and the needed speed of processing. One may investigate terms such as circular convolution (what multiplying FFT spectra actually computes, and the reason for the zero-padding), the overlap-save method, which overlaps blocks of the IS, and the overlap-add method, which overlaps blocks of the convolution output. In both, the amount of overlap is set by the length of the IR (or IR segment) minus one sample, so for a 1024-sample FFT block and a 201-sample IR segment, 200 samples, or about 1/5 of the block, would be overlapped by the next frame. Please note that while these are very interesting, particularly if your math is strong, they are not essential for making good creative use of convolution, as the software tools we employ have already incorporated one of these methods, though a few, like SoundHack, give you some choices in how they are implemented.
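For the curious, the overlap-add idea can be sketched as follows. This is a simplified illustration assuming NumPy and an IR short enough to transform in one piece; the function name and the block_size parameter are illustrative, and it is not the implementation used by any particular plug-in.

```python
import numpy as np

def overlap_add_convolve(input_signal, impulse_response, block_size=4):
    """Block-wise fast convolution: convolve each block, overlap-add the tails."""
    x = np.asarray(input_signal, dtype=float)
    h = np.asarray(impulse_response, dtype=float)
    fft_size = block_size + len(h) - 1        # room for one block plus its tail
    ir_spectrum = np.fft.rfft(h, n=fft_size)  # IR spectrum is computed once
    output = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        block_spectrum = np.fft.rfft(block, n=fft_size)
        block_out = np.fft.irfft(block_spectrum * ir_spectrum, n=fft_size)
        # Each block's output runs len(h) - 1 samples into the next block's
        # region; adding rather than overwriting sums those overlapping tails.
        end = min(start + fft_size, len(output))
        output[start:end] += block_out[:end - start]
    return output

rng = np.random.default_rng(0)
signal = rng.standard_normal(50)  # stand-in for an input signal
ir = rng.standard_normal(7)       # stand-in for a short IR
blocked = overlap_add_convolve(signal, ir, block_size=16)
print(np.allclose(blocked, np.convolve(signal, ir)))
```

Because each block is processed independently, the input can keep arriving indefinitely, which is what makes this family of methods suitable for real-time input.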

At least as of the time of this article, it is not possible to take a convolved output file and deconvolve it into its constituent input files without knowing the structure of at least one of the component files or being able to estimate it. If it were possible, many of us would be thrilled that we could remove the excess reverb from our overly ambient hall recordings. Some signals, such as noise or hum, can be roughly estimated, and deconvolution procedures for removing them are available through commercial software. Some blind deconvolution work using statistical analysis is being done, particularly as it relates to the cocktail-party effect.
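When one of the component files is known exactly, the idea can be sketched by dividing spectra instead of multiplying them. This is an idealized, noise-free illustration assuming NumPy; the function name and the floor parameter are illustrative, and it is not the method used by commercial software. Real recordings contain noise, and IR spectra often have near-zero bins, so practical tools need some form of regularization or estimation.

```python
import numpy as np

def deconvolve_known_ir(convolved, impulse_response, floor=1e-12):
    """Recover the input signal from a convolved file when the IR is known."""
    n = len(convolved)
    out_spectrum = np.fft.rfft(convolved, n=n)
    ir_spectrum = np.fft.rfft(impulse_response, n=n)
    # Division undoes the spectral multiplication; it fails wherever the IR
    # spectrum is (near) zero, so refuse rather than divide by zero.
    if np.min(np.abs(ir_spectrum)) < floor:
        raise ValueError("IR spectrum has a zero; exact deconvolution fails")
    recovered = np.fft.irfft(out_spectrum / ir_spectrum, n=n)
    return recovered[:n - len(impulse_response) + 1]

convolved = [2, 14, 27, 35, 56, 37, 30, 24]  # output of the worked example
IR = [1, 5, 2, 3, 4]                         # the known component
print(np.round(deconvolve_known_ir(convolved, IR)).astype(int))
```

With the worked example's output and its IR, this recovers the original four IS values; without the IR, there is nothing to divide by, which is the heart of the blind deconvolution problem.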