Filters an input power spectrum through a bank of mel-filters. The output is an array of filtered values, typically called the mel-spectrum, each corresponding to the result of filtering the input spectrum through an individual filter. Therefore, the length of the output array is equal to the number of filters created.
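As a rough illustration of the filtering step, the sketch below applies a set of triangular filters to a power spectrum and returns one value per filter. It is not the implementation documented here: the class name TriangularFilterSketch, the representation of each filter as a {left, center, right} triple in Hz, the unit peak height, and the assumption that bin k sits at k * binWidthHz are all hypothetical choices made for the example.

```java
/** Hypothetical sketch; not the class documented here. */
final class TriangularFilterSketch {
    private TriangularFilterSketch() {}

    /** Weight of a triangle at frequency f, assuming a peak height of 1 at the center. */
    static double weight(double f, double left, double center, double right) {
        if (f <= left || f >= right) {
            return 0.0; // outside the triangle's base
        }
        return f <= center
                ? (f - left) / (center - left)    // rising slope
                : (right - f) / (right - center); // falling slope
    }

    /**
     * Filters a power spectrum through each triangle.
     *
     * @param power      power spectrum; bin k is assumed to sit at k * binWidthHz
     * @param binWidthHz spacing between adjacent bins in Hz
     * @param triangles  one {left, center, right} triple in Hz per filter
     * @return one filtered value per triangle
     */
    static double[] apply(double[] power, double binWidthHz, double[][] triangles) {
        double[] out = new double[triangles.length];
        for (int i = 0; i < triangles.length; i++) {
            double[] t = triangles[i];
            for (int k = 0; k < power.length; k++) {
                out[i] += weight(k * binWidthHz, t[0], t[1], t[2]) * power[k];
            }
        }
        return out;
    }
}
```

The returned array has exactly one entry per filter, which is the property described above. A real implementation may normalize the filter heights differently.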
The triangular mel-filters in the filter bank are placed on the frequency axis so that each filter's center frequency follows the mel scale. In this way the filter bank mimics the critical bands, which represent the different perceptual effects at different frequency bands. Additionally, the edges of each filter are placed so that they coincide with the center frequencies of the adjacent filters. Pictorially, the filter bank looks like:
Figure 1: A Mel-filter bank.
As you might notice in the above figure, the distance along the base from the center to the left edge differs from the distance from the center to the right edge. Since the center frequencies follow the mel-frequency scale, which is a non-linear scale that models the non-linear behavior of human hearing, the mel filter bank corresponds to a warping of the frequency axis. As can be inferred from the figure, filtering with the mel scale emphasizes the lower frequencies. A common model for the relation between frequencies in mel and linear scales is as follows:
melFrequency = 2595 * log10(1 + linearFrequency / 700)
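The mapping above and the filter placement it implies can be sketched in a few lines of Java. This is a minimal example, not the implementation documented here. It assumes the logarithm is base 10 (the usual convention with the constant 2595), that frequencies are in Hz, and that filters are spaced evenly on the mel scale. The class MelScale and its methods are hypothetical names.

```java
/** Hypothetical helper; not part of the documented API. */
final class MelScale {
    private MelScale() {}

    /** Converts a frequency in Hz to mels, assuming a base-10 logarithm. */
    static double linearToMel(double hz) {
        return 2595.0 * Math.log10(1.0 + hz / 700.0);
    }

    /** Inverse of linearToMel: converts mels back to Hz. */
    static double melToLinear(double mel) {
        return 700.0 * (Math.pow(10.0, mel / 2595.0) - 1.0);
    }

    /**
     * Returns numberFilters + 2 frequencies in Hz, equally spaced on the mel
     * scale from minFreq to maxFreq. Under the assumed layout, filter i has
     * its left edge at points[i], its center at points[i + 1], and its right
     * edge at points[i + 2], so edges coincide with neighboring centers.
     */
    static double[] filterPoints(int numberFilters, double minFreq, double maxFreq) {
        double minMel = linearToMel(minFreq);
        double maxMel = linearToMel(maxFreq);
        double step = (maxMel - minMel) / (numberFilters + 1);
        double[] points = new double[numberFilters + 2];
        for (int i = 0; i < points.length; i++) {
            points[i] = melToLinear(minMel + i * step);
        }
        return points;
    }
}
```

Because the points are evenly spaced in mels but not in Hz, each filter's right half is wider than its left half, which matches the asymmetry visible in the figure.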
The constants that define the filter bank are the number of filters, the minimum frequency, and the maximum frequency. The minimum and maximum frequencies determine the frequency range spanned by the filter bank. These frequencies depend on the channel and the sampling frequency that you are using. For telephone speech, since the telephone channel corresponds to a bandpass filter with cutoff frequencies of around 300 Hz and 3700 Hz, using limits wider than these would waste bandwidth. For clean speech, the minimum frequency should be higher than about 100 Hz, since there is no speech information below it. Furthermore, setting the minimum frequency above 50/60 Hz removes the hum resulting from the AC power, if present.
The maximum frequency has to be lower than the Nyquist frequency, that is, half the sampling rate. Furthermore, there is not much information above 6800 Hz that can be used for improving separation between models. Particularly for very noisy channels, a maximum frequency of around 5000 Hz may help cut off the noise.
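The constraints on the frequency range can be checked before a filter bank is built. The following sketch assumes frequencies in Hz and treats the Nyquist limit as strict, following the wording above. FilterBankRange and check are hypothetical names, not part of the documented API.

```java
/** Hypothetical validation helper; not part of the documented API. */
final class FilterBankRange {
    private FilterBankRange() {}

    /**
     * Throws IllegalArgumentException unless
     * 0 <= minFreq < maxFreq < sampleRate / 2 (the Nyquist frequency).
     */
    static void check(double sampleRate, double minFreq, double maxFreq) {
        double nyquist = sampleRate / 2.0;
        if (minFreq < 0.0 || minFreq >= maxFreq) {
            throw new IllegalArgumentException(
                    "minimum frequency must be non-negative and below the maximum frequency");
        }
        if (maxFreq >= nyquist) {
            throw new IllegalArgumentException(
                    "maximum frequency " + maxFreq + " Hz is not below the Nyquist frequency "
                            + nyquist + " Hz");
        }
    }
}
```

For example, check(16000.0, 130.0, 6800.0) passes, while a maximum frequency of 4500 Hz at an 8000 Hz sample rate is rejected because the Nyquist frequency is 4000 Hz.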
Typical values for the constants defining the filter bank are:
Sample rate (Hz) | 16000 | 11025 | 8000 |
{@link #PROP_NUMBER_FILTERS numberFilters} | 40 | 36 | 31 |
{@link #PROP_MIN_FREQ minimumFrequency} (Hz) | 130 | 130 | 200 |
{@link #PROP_MAX_FREQ maximumFrequency} (Hz) | 6800 | 5400 | 3500 |
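The typical values in the table can be captured as a small lookup keyed by sample rate. This is only an illustrative sketch: TypicalFilterBankSettings and forSampleRate are hypothetical names, and the documented class is configured through the properties linked in the table, not through this helper.

```java
/** Hypothetical holder for the typical values listed in the table. */
final class TypicalFilterBankSettings {
    final int numberFilters;
    final double minimumFrequency; // Hz
    final double maximumFrequency; // Hz

    private TypicalFilterBankSettings(int numberFilters, double minHz, double maxHz) {
        this.numberFilters = numberFilters;
        this.minimumFrequency = minHz;
        this.maximumFrequency = maxHz;
    }

    /** Returns the table's values for a sample rate of 16000, 11025, or 8000 Hz. */
    static TypicalFilterBankSettings forSampleRate(int sampleRate) {
        switch (sampleRate) {
            case 16000:
                return new TypicalFilterBankSettings(40, 130.0, 6800.0);
            case 11025:
                return new TypicalFilterBankSettings(36, 130.0, 5400.0);
            case 8000:
                return new TypicalFilterBankSettings(31, 200.0, 3500.0);
            default:
                throw new IllegalArgumentException(
                        "no typical values listed for sample rate " + sampleRate);
        }
    }
}
```

In every row the maximum frequency stays below half the sample rate, consistent with the Nyquist constraint.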
Davis and Mermelstein showed that mel-frequency cepstral coefficients have robust characteristics that make them well suited to speech recognition. For details, see Davis and Mermelstein,
Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences, IEEE Transactions on Acoustics, Speech, and Signal Processing, 1980.
@see MelFilter2