Computes the Discrete Fourier Transform (DFT) of an input sequence, using the Fast Fourier Transform (FFT). The Fourier Transform decomposes a signal into its frequency components. In speech, rather than analyzing the signal over its entire duration, we analyze one window of audio data at a time. This window is obtained by applying a sliding Hamming window to the signal. Moreover, since the amplitude is much more important than the phase for speech recognition, this class returns the power spectrum of that window of data instead of the complex spectrum. Each value in the returned spectrum represents the strength of a particular frequency within that window of data.
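The relation between the complex spectrum and the power spectrum can be sketched as follows. This is only an illustration, not this class's implementation: it uses a direct O(N^2) DFT instead of an FFT, and the names PowerSpectrumSketch and powerSpectrum are hypothetical. It assumes the input has already been Hamming-windowed, zero-pads it to the requested number of points, and keeps only the first N/2 + 1 bins, because for real input the remaining bins mirror them.

```java
final class PowerSpectrumSketch {

    /**
     * Returns |X[k]|^2 for k = 0 .. fftPoints/2, where X is the DFT of the
     * window zero-padded (or truncated) to fftPoints samples.
     * Hypothetical helper: a direct DFT, not the FFT used by the real class.
     */
    static double[] powerSpectrum(double[] window, int fftPoints) {
        int bins = fftPoints / 2 + 1;
        double[] power = new double[bins];
        int usable = Math.min(window.length, fftPoints);
        for (int k = 0; k < bins; k++) {
            double re = 0.0;
            double im = 0.0;
            for (int n = 0; n < usable; n++) {
                double angle = -2.0 * Math.PI * k * n / fftPoints;
                re += window[n] * Math.cos(angle);
                im += window[n] * Math.sin(angle);
            }
            // The phase is discarded; only the squared magnitude is kept.
            power[k] = re * re + im * im;
        }
        return power;
    }
}
```

For an 8-sample window holding exactly one cycle of a cosine, all the energy lands in bin 1 and the other bins are zero up to rounding error.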
By default, the number of FFT points is the smallest power of 2 that is equal to or larger than the number of samples in the incoming window of data. The number of FFT points can also be set by the user with the property defined by {@link #PROP_NUMBER_FFT_POINTS}. The length of the returned power spectrum is the number of FFT points, divided by 2, plus 1. Since the input signal is real, its FFT is conjugate-symmetric, so the second half of the vector carries no information that is not already present in the first half.
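The sizing rule above can be worked through in a few lines. The class and method names here (FftSizeSketch, defaultFftPoints, powerSpectrumLength) are hypothetical and serve only to illustrate the arithmetic; the sketch assumes a window of at least one sample.

```java
final class FftSizeSketch {

    /** Smallest power of two that is >= numberOfSamples (assumes numberOfSamples >= 1). */
    static int defaultFftPoints(int numberOfSamples) {
        int points = 1;
        while (points < numberOfSamples) {
            points <<= 1;
        }
        return points;
    }

    /** Length of the returned power spectrum: FFT points / 2 + 1. */
    static int powerSpectrumLength(int fftPoints) {
        return fftPoints / 2 + 1;
    }
}
```

For instance, a window of 410 samples gives 512 FFT points and a power spectrum of 257 values, while a window of exactly 512 samples also gives 512 points.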
Note that each call to {@link #getData() getData} only returns the spectrum of one window of data. To display the spectrogram of the entire original audio, one has to collect the spectra from all the windows generated from the original data. A spectrogram is a two-dimensional representation of three-dimensional information. The horizontal axis represents time. The vertical axis represents frequency. If we slice the spectrogram at a given time, we get the spectrum computed as the short-time Fourier transform of the signal windowed around that time stamp. The intensity of the spectrum at each frequency and time frame is given by the color in the graph, or by the darkness in a grayscale plot. The spectrogram can be thought of as a view from the top of a surface generated by concatenating the spectral vectors obtained from the windowed signal.
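Collecting the per-window spectra into a spectrogram might look like the sketch below. SpectrogramSketch and its methods are hypothetical and are not part of this class; the caller is assumed to pass in, one by one, the power spectrum values obtained from successive calls to getData. The resulting matrix is indexed by frequency bin first and time frame second, matching the vertical and horizontal axes described above.

```java
import java.util.ArrayList;
import java.util.List;

final class SpectrogramSketch {

    // One entry per window of audio, in time order.
    private final List<double[]> frames = new ArrayList<>();

    /** Appends the spectrum of the next window; all spectra must have equal length. */
    void addSpectrum(double[] spectrum) {
        if (!frames.isEmpty() && frames.get(0).length != spectrum.length) {
            throw new IllegalArgumentException("spectrum length differs from earlier frames");
        }
        frames.add(spectrum.clone());
    }

    /** Returns the collected spectra as a matrix indexed [frequencyBin][timeFrame]. */
    double[][] toMatrix() {
        if (frames.isEmpty()) {
            return new double[0][0];
        }
        int bins = frames.get(0).length;
        double[][] matrix = new double[bins][frames.size()];
        for (int t = 0; t < frames.size(); t++) {
            double[] spectrum = frames.get(t);
            for (int f = 0; f < bins; f++) {
                matrix[f][t] = spectrum[f];
            }
        }
        return matrix;
    }
}
```

Each column of the matrix is one time slice of the spectrogram; mapping the values to colors or gray levels is left to the plotting code.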
For example, Figure 1 below shows the audio signal of the utterance "one three nine oh", and Figure 2 shows its spectrogram, produced by putting together all the spectra returned by this FFT. Frequency is on the vertical axis, and time is on the horizontal axis. The darkness of the shade represents the strength of that frequency at that point in time:
Figure 1: The audio signal of the utterance "one three nine oh".
Figure 2: The spectrogram of the utterance "one three nine oh" in Figure 1.