fourier transforms

Started by January 27, 2003 12:58 PM
20 comments, last by walkingcarcass 22 years ago
Apparently "fourier transforms" can be used to analyse sound. There''s a comprehensive explanation of FFTs here. I understand the math. I understand that by extracting and removing certain frequencies, the sound is easier to examine. What I don''t get is how to apply all this math to a stream of audio data. The bytes themselves represent the volume at any particular time, right? Given audio data, how are frequencies found, measures, extracted and filtered? When you extract a frequency, does it include nearby frequencies in lesser amounts, like the response of a radio tuner? I''d much appreciate any links on how to use this processed sound in sound recognition etc. Thanks. ******** A Problem Worthy of Attack Proves It''s Worth by Fighting Back
spraff.net: don't laugh, I'm still just starting...
I don't really know the answers to your questions. I'm responding simply because the link you provided explaining Fourier analysis was something I've been looking for for a while. Thanks for putting it in your post! I, unlike you, do not understand the math. I've personally been going ahead of the AP Calculus curriculum, but not that far ahead!

Incidentally, I may be able to help you with your technical questions a little, though. I just like to think of samples as the y values of the pressure/time graph of the sound wave. They represent voltages, which themselves represent pressures. You probably already knew that. I don't think you can call the samples "instantaneous volume," since, although two microphones in different pressure chambers may record different pressures, there is no sound in either. Maybe volume has to do with the derivative?

In terms of sound recognition, although I have to warn you it's not something I have much experience with, I assume it's just a matter of comparing frequency spectra between the "unknown" sound and a "known," classified sound. Something as simple as a Pythagorean distance (between the two Fourier-transformed position vectors) would probably work; if the sound recorded by the microphone is closest to that of the "chainsaw" clip in your catalogue, chances are it's a chainsaw.
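A minimal numpy sketch of that idea, assuming both clips are already arrays of samples at the same rate; the fixed 2048-point length and the label/dictionary layout are just made up for illustration:

```python
import numpy as np

def spectrum(samples, n=2048):
    # Magnitude spectrum of the first n samples (zero-padded if shorter).
    buf = np.zeros(n)
    m = min(len(samples), n)
    buf[:m] = samples[:m]
    return np.abs(np.fft.rfft(buf))

def closest_match(unknown, catalogue):
    # catalogue: dict mapping a label ("chainsaw", ...) to its sample array.
    # Returns the label whose spectrum has the smallest Pythagorean
    # (Euclidean) distance to the unknown sound's spectrum.
    target = spectrum(unknown)
    return min(catalogue,
               key=lambda label: np.linalg.norm(spectrum(catalogue[label]) - target))
```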

Just some thoughts while you wait for someone more qualified to respond.

Fourier transforms are used, AFAIK, to convert from the amplitude domain to the frequency domain. They use a sliding window of data samples that are pushed through the transform. This is how they deal with streams: the stream is marched through a window of, say, 2048 samples. Once transformed, the data is available as a graph with frequency on one axis and intensity on the other. This lets you do things like identify whether a stream has a lot of high-frequency content, or low, or how often a low bass sound is played, for things like electronic beat matching.
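A minimal sketch of that sliding window, assuming the stream is already a numpy array of samples; the 2048-sample window and 512-sample hop are arbitrary choices:

```python
import numpy as np

def sliding_spectra(samples, window=2048, hop=512):
    # March a window through the stream and transform each chunk,
    # yielding (start_index, magnitude_spectrum) pairs.
    for start in range(0, len(samples) - window + 1, hop):
        chunk = samples[start:start + window]
        yield start, np.abs(np.fft.rfft(chunk))
```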
Why do I always forget to include the obvious things in my answers? (Good call, Sphet)
okay, thanks, but i still don't understand the implementation

********
A Problem Worthy of Attack
Proves Its Worth by Fighting Back
spraff.net: don't laugh, I'm still just starting...
Well, in pure mathematics you would need an infinite stream of audio data, and then, after infinite time had passed, apply the Fourier transform to the whole set of data to get a perfect transformation from the amplitude domain to the frequency domain.

Obviously this is not possible in the real world. What engineers, physicists, etc. do is take only a finite portion of the audio data, as large as possible, and apply the transform to that.

If you use recorded data you can transform all the data at once. If you use live audio you could use a small quantity of data (as Sphet has said) with a little delay (remember that even one second is a huge quantity of data).
Fourier analysis in the real world works like this:

Take a snapshot of data; it will contain a number of samples representing the amplitude at each interval in time. You then compare this with sine waves of increasing frequency.

Imagine your data looks similar to a 30 Hz sine wave. Multiplying the individual samples by a 30 Hz wave's samples and then adding them all up would give a high value. If you multiplied all the samples by a 15 Hz sine wave's samples, then only some of the peaks and troughs would coincide and the summed total would be less. Repeat this for a number of frequencies and you get the spectrum of the signal, with the peak at 30 Hz.

This is a very simplistic way of imagining the Fourier transform, as it actually uses complex numbers to describe signals properly, taking the phase into account.

Essentially it's best to think of the Fourier transform as "matching" your signal with a bunch of sine waves, with the best match giving the highest value, and hence a peak in the spectrum.
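A literal (and deliberately slow) numpy rendering of that matching picture; correlating against both a sine and a cosine is the stand-in for the complex numbers mentioned above, so phase doesn't spoil the match:

```python
import numpy as np

def match_strength(samples, freq_hz, sample_rate):
    # Multiply the signal by a sinusoid of the given frequency and sum.
    # A large total means the signal "matches" that frequency well.
    t = np.arange(len(samples)) / sample_rate
    s = np.sum(samples * np.sin(2 * np.pi * freq_hz * t))
    c = np.sum(samples * np.cos(2 * np.pi * freq_hz * t))
    return np.hypot(s, c)  # magnitude, regardless of phase

# Sweep a range of frequencies to draw the spectrum; a 30 Hz input
# would peak at freq_hz = 30:
# spectrum = [match_strength(x, f, 44100) for f in range(1, 500)]
```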

PS. For any mathematicians: I know I have missed out many important factors, such as the requirement for orthogonal bases, Nyquist limits and aliasing, but I like the idea of being able to get the gist of a function, instead of it just being a black-box "transform".
quote:
Original post by walkingcarcass
Apparently "fourier transforms" can be used to analyse sound.
What I don''t get is how to apply all this math to a stream of audio data. The bytes themselves represent the volume at any particular time, right?


Yes, although we like to use the term "amplitude". If you're doing this yourself, and looking at .WAV data, be aware:
8-bit WAV data uses an unsigned char (128 = zero amplitude).
16-bit WAV data uses a signed short (0 = zero amplitude).
(I just mention it because it seems like everyone hits that land-mine.)
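For example, a small numpy sketch of that conversion, assuming the raw sample bytes are already in memory (128 as the 8-bit midpoint is the usual convention):

```python
import numpy as np

def to_float(raw_bytes, bits):
    # Convert raw WAV sample bytes to floats in roughly [-1.0, 1.0).
    if bits == 8:
        # 8-bit WAV is unsigned; silence sits at the midpoint (~128).
        data = np.frombuffer(raw_bytes, dtype=np.uint8).astype(np.float64)
        return (data - 128.0) / 128.0
    if bits == 16:
        # 16-bit WAV is signed; silence is 0.
        data = np.frombuffer(raw_bytes, dtype=np.int16).astype(np.float64)
        return data / 32768.0
    raise ValueError("unsupported bit depth")
```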

quote:

Given audio data, how are frequencies found, measured, extracted and filtered?


Heh. Well, if you understand the math, the FT measures how much each sinusoid contributes to the signal. So (in the nomenclature of your article) each summation V(k) gives the magnitude and phase of the sinusoid e^(-2*pi*i*k*n/N), i.e. the component at frequency k/N times the sample rate. Therefore the magnitude of each V(k) is the strength of the sinusoid at that frequency.

So, frequencies are found and extracted by applying the FT. As far as measuring goes, the units are pretty tricky and depend on how many data samples you've taken. In audio applications the FT is usually normalized to dB, which is actually trickier because it needs a 0 dB reference (a decibel expresses a ratio between two values, so you need a starting point). You can try googling for this, but usually you just want to look at the signals relative to each other in the beginning, so it's not necessary up front.
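A sketch of that relative view in numpy; referencing everything to the spectrum's own peak is one common choice when you have no absolute 0 dB reference:

```python
import numpy as np

def spectrum_db(samples):
    # Magnitude of each V(k), converted to dB relative to the peak bin.
    mag = np.abs(np.fft.rfft(samples))
    mag = np.maximum(mag, 1e-12)             # avoid log(0)
    return 20.0 * np.log10(mag / mag.max())  # peak bin reads 0 dB
```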

As far as filtering, that's a whole other bag of worms. Filtering is a hu-u-u-u-u-u-u-ge subject. What kind of filtering did you want to do?

quote:

When you extract a frequency, does it include nearby frequencies in lesser amounts, like the response of a radio tuner?


Sometimes. The FT is a lossless transform: all of the information will be there. However, the amount of data you use determines how fine your frequencies will be. Think of it this way: each V(k) represents a frequency "bin", and the entire signal has to go into those bins. Now, if you only have a single sinusoid in your signal, and it happens to be at a frequency directly in the center of one of your bins, your FT will have a magnitude for one V(k) and zero for all the rest. However, if it's off-center, it will leak into lots of other bins. This is called "spectral leakage". It won't go only into the bins right next to it, either; it will go all over the place. In that way, it's not like a radio tuner. To fix this, we usually "window" the input data before applying the FFT, which has the effect of smearing the data in the frequency domain but cuts down on leakage far from the original signal. It's really easy to apply windows; look on the web for the "Hann" or "Hamming" window.
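Applying a window really is just a per-sample multiply before the transform; a minimal numpy sketch using a Hann window:

```python
import numpy as np

def windowed_spectrum(samples):
    # Taper the chunk to zero at both ends before transforming.
    # This smears each peak slightly, but greatly reduces the leakage
    # far away from it in the spectrum.
    return np.abs(np.fft.rfft(samples * np.hanning(len(samples))))
```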
quote:

I'd much appreciate any links on how to use this processed sound in sound recognition etc. Thanks.


What kind of sound would you like to recognize? If you're looking for speech recognition, the FFT is inadequate. Most speech synthesis/compression/recognition engines use an ARMA (auto-regressive moving-average) model, which is a high-precision frequency-analysis tool and a bit harder to understand than the FFT.

Think Liberally..
I often have the impression that people who know the FT very well (as used in mathematics: to solve differential equations etc.) have problems communicating with people who only deal with the DFT or FFT.

Very important is the usage of windows (Hamming, Hanning(!), etc.), since it's the only way to make something useful out of the Fourier transform. The original transform needs a function defined from t = -infinity to t = +infinity. Not very easy to implement.
If you decide to take a 10-second sample of the sound starting from now, then you have already applied a rectangular window, which will have impacts on the frequency domain.

SO WHAT CAN YOU DO?
Well, let's say you've got the sound samples s(t) for the period mentioned above; then you can generate S(f) = FFT(s(t)). Voila!
Now you can do what you want to do (let's say take away everything below 200 Hz).
(This filtering operation is a multiplication of S(f) with F(f) in the frequency domain, F being the filter.)

Finally you just generate new_s(t) = INVERSE_FFT(S(f)) and send that to your sound card.
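A minimal numpy sketch of exactly that recipe, with a crude brick-wall F(f) that just zeroes the bins below 200 Hz (fine for illustration, though such a sharp cutoff rings in practice):

```python
import numpy as np

def highpass_fft(s, sample_rate, cutoff_hz=200.0):
    S = np.fft.rfft(s)                                # S(f) = FFT(s(t))
    freqs = np.fft.rfftfreq(len(s), d=1.0 / sample_rate)
    S[freqs < cutoff_hz] = 0.0                        # multiply by F(f)
    return np.fft.irfft(S, n=len(s))                  # new_s(t) = INVERSE_FFT(S(f))
```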

But naturally this is not what you should do in practice:
you do
new_s(t) = CONVOLVE(s(t), f(t)), where f(t) = INVERSE_FFT(F(f)),
which is faster in most cases (specifically when f(t) is "short").

The convolution is just a Finite Impulse Response (FIR) filter.
You can save CPU cycles by implementing an IIR (Infinite Impulse Response) filter, but there you are leaving the safe trails a little.
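The convolution route looks like this, assuming you already have a (short) impulse response f(t), e.g. from INVERSE_FFT(F(f)) as above:

```python
import numpy as np

def fir_filter(s, f_t):
    # new_s(t) = CONVOLVE(s(t), f(t)): a FIR filter, usually cheaper
    # than transforming the whole signal when f(t) is short.
    return np.convolve(s, f_t, mode="same")
```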

And when you master the sliding-window FFT, you are ready to take off for wavelets! It becomes quite CPU-intensive, though.

For a cheap (= free) and good MATLAB lookalike, look at
http://www-rocq.inria.fr/scilab/scilab.html
It will make you a better human... ...having all those fancy filters at hand.

/Ola

oh dear, my brain still lacks the eureka moment in this...

let's say i have a buffer of A amplitude samples, with a rate of B samples per second, and the buffer represents A/B seconds of sound

at any point buffer[X], how much of the signal is made up of a C hertz sine wave?

********
A Problem Worthy of Attack
Proves Its Worth by Fighting Back
spraff.net: don't laugh, I'm still just starting...

