Detecting gunshots in audio with AI?

Started by January 11, 2006 04:47 PM
19 comments, last by Timkin 18 years, 10 months ago
This isn't related to game programming, but I don't know of any other real AI forums to post in (if you have any suggestions, that would be awesome). Anyhow, I want to build a program that listens to audio and detects if any guns are firing. I want to use a training set to "teach" the program what is and isn't a shot using data from the frequencies and levels of the audio. What data structures would be appropriate for doing this? What should I start looking into? Are there any easy introductions to this type of artificial intelligence?
Here is one way. Basically, you sample a window of audio data and extract features of the signal that help distinguish a gunshot from other noises. Features can be as simple as the mean or variance of the data in a window, or more complicated, like taking an FFT and finding the peak frequencies. Then train a neural network on features from example windows of audio data (some with gunshots, some without). Once the neural network is sufficiently trained, put it online together with the feature extractor and window grabber. Anyway, there are other ways...
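
A minimal sketch of the feature-extraction step described above, using only the Python standard library. The function names, the 8 kHz sample rate, and the synthetic 200 Hz test tone are assumptions made for illustration; a real program would use the FFT from its audio library rather than this slow naive DFT.

```python
import cmath
import math
from statistics import fmean, pvariance

def dft_magnitudes(window):
    """Naive O(n^2) DFT; returns magnitudes for bins 0..n//2."""
    n = len(window)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(window))
        mags.append(abs(acc))
    return mags

def extract_features(window, sample_rate):
    """Return (mean, variance, peak frequency in Hz) for one window."""
    mags = dft_magnitudes(window)
    # Skip bin 0 (the DC offset) when looking for the dominant frequency.
    peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])
    peak_hz = peak_bin * sample_rate / len(window)
    return fmean(window), pvariance(window), peak_hz

# Hypothetical input: a 200 Hz tone sampled at 8 kHz, 400 samples (50 ms).
sample_rate = 8000
window = [math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(400)]
mean, var, peak_hz = extract_features(window, sample_rate)
```

Each window then becomes a short feature tuple, and those tuples (not the raw samples) are what the classifier is trained on.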

Another thing you should do is normalize the audio data once you grab a window. You might try z-score normalization.

Ok, thanks so much for the reply. Now, I'm really new at AI but not at CS and math, so I need to start reading about these topics. A few questions:

Quote: Original post by NickGeorgia
Here is one way. Basically, you sample a window of audio data and extract features of the signal that help distinguish a gunshot from other noises. Features can be as simple as the mean or variance of the data in a window, or more complicated, like taking an FFT and finding the peak frequencies.


Ok, that I can definitely do. I have an audio library on hand that will give FFT data, levels, etc.

Quote:
Then train a neural network on features from example windows of audio data (some with gunshots, some without). Once the neural network is sufficiently trained, put it online together with the feature extractor and window grabber. Anyway, there are other ways...


Ok, I know nothing about neural networks. What are some good resources to start learning about them? What are advantages of this vs. hidden Markov models?

Then, how do I create and train this network?

Quote:
Another thing you should do is normalize the audio data once you grab a window. You might try z-score normalization.


Sorry, what does this mean? =)

I'm really new at this but I really want to learn. Thanks again for everything!
Quote:
Quote:
Then train a neural network on features from example windows of audio data (some with gunshots, some without). Once the neural network is sufficiently trained, put it online together with the feature extractor and window grabber. Anyway, there are other ways...


Ok, I know nothing about neural networks. What are some good resources to start learning about them? What are advantages of this vs. hidden Markov models?

Then, how do I create and train this network?


You can go to my journal. There are a few tidbits here and there about neural networks and a small tutorial way in the back. I think there are some links to other websites. If not, ask me again or try a Google search. You don't have to use neural networks. You could use a fuzzy logic expert system, decision trees, etc., or even simple thresholding if that works sufficiently. As for Markov models, I would say the main advantages of using an NN are:

1. NNs usually require less data (estimating probability density functions (PDFs) needs lots and lots of data)
2. An NN can easily be expanded (adding nodes, etc.)
3. I'm sure there's more...

Quote:
Quote:
Another thing you should do is normalize the audio data once you grab a window. You might try z-score normalization.


Sorry, what does this mean? =)

I'm really new at this but I really want to learn. Thanks again for everything!


You usually want to normalize the data since volumes (etc.) might not all be uniform. This is a pre-processing step. You may also have to filter the signal if the noise levels are especially high.
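
One common way to do this is z-score normalization, which is assumed here to be the normalization recommended earlier in the thread: subtract the window's mean and divide by its standard deviation. The sketch below is illustrative only; the function name and the sample values are made up.

```python
from statistics import fmean, pstdev

def z_score_normalize(window):
    """Rescale a window to zero mean and unit standard deviation."""
    mu = fmean(window)
    sigma = pstdev(window)
    if sigma == 0:
        # A perfectly flat window carries no shape information.
        return [0.0] * len(window)
    return [(x - mu) / sigma for x in window]

# The same hypothetical sound recorded quietly and 50 times louder.
quiet = [0.01, -0.02, 0.03, -0.01]
loud = [x * 50 for x in quiet]

normalized_quiet = z_score_normalize(quiet)
normalized_loud = z_score_normalize(loud)
```

After normalization the quiet and loud recordings produce the same values, so the classifier sees the shape of the sound rather than its volume.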

Edit: Sorry if I'm not detailed right now. I'm a little shot after working all day.
I don't think that neural networks would be the best solution in a case like this ... Frankly, you just need to find some features of a "gunshot" sound which are different than the features of any other sound. Since a gunshot is pretty well-defined, find some sound samples of gunshots and open them in your favorite sound wave editor. Look at their patterns and qualitatively write down some patterns you notice. You may have to apply filters within your sound program to see apparent patterns, or you may even have to run some diagnostics like FFTs which your program can provide. Here, it's really just guess and test.

Now, once you've figured out the general defining characteristic of a "gunshot" (or perhaps there are several), you need to first write a program that applies all the filters you used to see the difference visually. Next, use your verbal description to write code that normalizes a segment of the unknown wave and tests whether it matches that description.

In my opinion, for a beginner to AI, especially in somewhat vague situations like this, neural networks may be a bit confusing to implement and may not achieve great results. Any method you use may make some incorrect classifications, but using the above method you should be able to achieve an excellent hit rate. Also, I should note that since a gunshot will likely be drastically different from "normal" sounds which enter your microphone, the above technique will statistically work out well. If the sounds are harder to pick out by a human, you'd probably have to use some more advanced algorithms.
h20, member of WFG 0 A.D.
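
As a rough illustration of the hand-written approach described in the post above, here is a sketch of a rule that peak-normalizes a segment and then tests it against a verbal description ("a sharp peak near the start that towers over the average level"). The rule itself, the function name, and both thresholds are hypothetical placeholders to be tuned by guess and test, not values taken from real gunshot recordings.

```python
import math

def looks_like_gunshot(segment, crest_threshold=4.0, max_attack_fraction=0.2):
    """Hand-written rule: an early peak that is much louder than the RMS level."""
    peak = max(abs(x) for x in segment)
    if peak == 0:
        return False
    normalized = [x / peak for x in segment]      # peak-normalize to [-1, 1]
    rms = math.sqrt(sum(x * x for x in normalized) / len(normalized))
    crest = 1.0 / rms                             # peak-to-RMS ratio
    peak_index = max(range(len(normalized)), key=lambda i: abs(normalized[i]))
    early_peak = peak_index < max_attack_fraction * len(normalized)
    return crest >= crest_threshold and early_peak

# Synthetic examples: a decaying impulse versus a steady tone.
impulse = [math.exp(-i / 10.0) * (1 if i % 2 == 0 else -1) for i in range(400)]
tone = [math.sin(2 * math.pi * i / 40.0) for i in range(400)]

impulse_detected = looks_like_gunshot(impulse)
tone_detected = looks_like_gunshot(tone)
```

The decaying impulse passes the rule and the steady tone does not. Real recordings will be messier, which is where the false positives discussed later in the thread come from.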
I agree, you can do this numerous ways. Finding "the filters" you mention can be difficult, though, especially if you cannot find good features. Using a classifier such as a neural network can ease the burden a bit (if you're familiar with it) since it can try to do the classification for you. Granted, it may not do a good job, but it's better than racking your brain over a large amount of feature data, IMHO. There are of course other ways, and if you can find some good distinguishing characteristics, use them; you may of course choose not to use a neural network.

Another technique one could use is a clustering method, for example, fuzzy c-means clustering.

I just mentioned the neural network since he/she wanted an AI technique, and I thought neural networks might be useful especially in the case where good features are elusive. It's all about the features. Anyway, have fun.
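
To make "let the classifier do the classification for you" concrete, here is a sketch of the smallest possible neural network: a single sigmoid neuron trained by gradient descent on labelled feature vectors. The two features, the training data, and the learning rate are invented for illustration; a real detector would need many more examples and probably a hidden layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(samples, labels, epochs=2000, lr=0.5):
    """Train one sigmoid neuron with stochastic gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            out = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = out - y        # gradient of the cross-entropy loss w.r.t. z
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def predict(weights, bias, x):
    """Return a score between 0 (not a gunshot) and 1 (gunshot)."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Hypothetical, already-normalized features per window: (peak level, sharpness).
samples = [(0.9, 0.8), (0.8, 0.9), (0.95, 0.7), (0.2, 0.3), (0.1, 0.2), (0.3, 0.1)]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_neuron(samples, labels)
shot_score = predict(weights, bias, (0.85, 0.85))
noise_score = predict(weights, bias, (0.15, 0.15))
```

The point is that nobody writes the decision rule by hand; the weights are fitted from examples. It still only works as well as the features allow.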
Quote: Original post by mnansgar
I don't think that neural networks would be the best solution in a case like this ... Frankly, you just need to find some features of a "gunshot" sound which are different than the features of any other sound. Since a gunshot is pretty well-defined, find some sound samples of gunshots and open them in your favorite sound wave editor. Look at their patterns and qualitatively write down some patterns you notice. You may have to apply filters within your sound program to see apparent patterns, or you may even have to run some diagnostics like FFTs which your program can provide. Here, it's really just guess and test.

Now, once you've figured out the general defining characteristic of a "gunshot" (or perhaps there are several), you need to first write a program that applies all the filters you used to see the difference visually. Next, use your verbal description to write code that normalizes a segment of the unknown wave and tests whether it matches that description.

In my opinion, for a beginner to AI, especially in somewhat vague situations like this, neural networks may be a bit confusing to implement and may not achieve great results. Any method you use may make some incorrect classifications, but using the above method you should be able to achieve an excellent hit rate. Also, I should note that since a gunshot will likely be drastically different from "normal" sounds which enter your microphone, the above technique will statistically work out well. If the sounds are harder to pick out by a human, you'd probably have to use some more advanced algorithms.


I tried this, and I got a fairly accurate algorithm out of it. The problem is that I need a really accurate algorithm. There were just too many false positives and negatives, even when I looked at different frequencies and characteristics of the shot. For example, that low shuffling noise when someone runs with a microphone would trigger it (in like a 3 second clip it might trigger once, for example), and it's nearly impossible to get all the characteristics down IMHO.

I would like to experiment with the neural nets, and I will start reading about them. If there is anything I'm overlooking with respect to hardcoding the spectrum and level values that will work well, I can check that out too.

I figure if a neural network can separate and process speech, it can do a fairly distinct type of sound pretty well too.
Trying to reduce false positives and negatives can be a really difficult problem. One way is to use a technique to optimize the features you use (feature selection, etc.). Another is to handle uncertainty in some fashion, such as using probabilities, possibilities (fuzzy), or evidence (Dempster-Shafer). That way you can say it's a gunshot with a certain probability (possibility, degree of certainty, etc.). Try fuzzy c-means clustering for this method; it's not that difficult. Another is to attempt to suppress the interfering signals through pre-processing (filtering, etc.). I'm just throwing some ideas your way... there are lots more, and that's what makes it fun.
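
A minimal sketch of fuzzy c-means clustering as mentioned above, written with the standard library only. Instead of a hard yes/no label, each point gets a degree of membership in every cluster. The feature values, the two starting centers, and the fuzzifier m = 2 are assumptions chosen for illustration.

```python
import math

def memberships_for(points, centers, m):
    """Degree of membership of every point in every cluster (rows sum to 1)."""
    exponent = 2.0 / (m - 1.0)
    rows = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # Point sits exactly on a center: share full membership there.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            rows.append([h / sum(hits) for h in hits])
        else:
            rows.append([1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
                         for d_i in dists])
    return rows

def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Alternate membership and center updates for a fixed number of rounds."""
    dims = len(points[0])
    for _ in range(iterations):
        u = memberships_for(points, centers, m)
        new_centers = []
        for i in range(len(centers)):
            weights = [row[i] ** m for row in u]
            total = sum(weights)
            # Each center is the membership-weighted mean of all points.
            new_centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dims)))
        centers = new_centers
    return centers, memberships_for(points, centers, m)

# Hypothetical 2-D features: three quiet windows, three loud ones, one in between.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.2), (8.3, 7.9), (7.8, 8.1),
          (4.5, 4.5)]
centers, memberships = fuzzy_c_means(points, [(0.0, 0.0), (10.0, 10.0)])
```

The in-between point ends up with a membership close to one half in each cluster, which is exactly the "gunshot with a certain degree of certainty" output described above.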

And once again, remember if you have crummy features, you can only do so much. So you should try to handle the uncertainty in some fashion if this is the case. Think what the people who try to detect seizures from EEG's before they happen have to go through. Egads! Put my nickname down in the patent will ya? hehe J/K

Edit: also look into Particle Filters (it's all the rage)

[Edited by - NickGeorgia on January 11, 2006 11:35:08 PM]
Quote: Original post by NickGeorgia
Trying to reduce false positives and negatives can be a really difficult problem. One way is to use a technique to optimize the features you use (feature selection, etc.). Another is to handle uncertainty in some fashion, such as using probabilities, possibilities (fuzzy), or evidence (Dempster-Shafer). That way you can say it's a gunshot with a certain probability (possibility, degree of certainty, etc.). Try fuzzy c-means clustering for this method; it's not that difficult. Another is to attempt to suppress the interfering signals through pre-processing (filtering, etc.). I'm just throwing some ideas your way... there are lots more, and that's what makes it fun.

And once again, remember if you have crummy features, you can only do so much. So you should try to handle the uncertainty in some fashion if this is the case. Think what the people who try to detect seizures from EEG's before they happen have to go through. Egads! Put my nickname down in the patent will ya? hehe J/K

Edit: also look into Particle Filters (it's all the rage)


Ok, I'm wondering if I can just pick up a neural network library, throw training data at it and let it do its magic?

I found a library for .NET at http://www.cdrnet.net/projects/neuro/. Is there a better one to use that you know of?

Anyhow, I'm just completely confused by how to set this thing up and run it. I'm also confused about something: if a gunshot is played over let's say 50ms, and I have 5 sets of sample data taken at 10ms intervals, how do I feed it to the network such that it realizes it's all part of one sound? Do I just make a giant vector with all 50ms of data?

This is really confusing me =(.
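
One common way to handle this, and essentially the "giant vector" idea from the question, is to slide over the per-frame features and concatenate each run of consecutive frames into one flat input vector. The sketch below assumes five 10 ms frames per input and two made-up features per frame; the function name is hypothetical.

```python
def stack_frames(frame_features, frames_per_input=5):
    """Concatenate each run of consecutive frames into one flat input vector."""
    inputs = []
    for start in range(len(frame_features) - frames_per_input + 1):
        flat = []
        for frame in frame_features[start:start + frames_per_input]:
            flat.extend(frame)
        inputs.append(flat)
    return inputs

# Seven hypothetical 10 ms frames, each with two features (level, peak kHz).
frames = [[0.1, 0.3], [0.9, 2.1], [0.7, 1.8], [0.4, 1.2],
          [0.2, 0.9], [0.1, 0.4], [0.1, 0.3]]
vectors = stack_frames(frames)
```

Each resulting vector covers 50 ms of audio and has a fixed length (here 10 numbers), so the network always sees a whole candidate sound at once, and the window slides forward one frame at a time.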
It looks like the AI research community now uses support vector machines (try SVM in Google or Wikipedia) in place of neural networks. I am not a specialist in NNs, but from what I know, SVMs take a more mathematical and pragmatic approach to the problem. An SVM is an algorithm that approximates a multidimensional continuous function from examples (which may be partly erroneous). It looks a bit harder to use, but if you are not afraid of maths, you should have better control over the learning process.

One word on Markov models: their goal is to recognize a sequence of inputs, whereas an NN or SVM takes all its inputs at once without regard to their order. Markov models are heavily used in speech recognition.

Is it better to consider a gunshot as a sequence of samples or as a single event? It is shorter than a word, but maybe it can be learnt as something like
(saturation - fast decay - one or more echoes), which would justify a Markov approach. Or maybe it is more like:
(a window of 1 second, with a peak in the 180-220 Hz range, and a very high mean level at 0.05 seconds from t0)

You decide, you are the specialist :-)
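
For the sequence view, here is a sketch of how a hidden Markov model would score a window: the forward algorithm computes how likely an observed sequence of quantized levels is under a (saturation, decay, echo) model. Every probability below is invented for illustration; in practice they would be estimated from labelled recordings.

```python
def forward_likelihood(observations, start, trans, emit):
    """Forward algorithm: probability of the observation sequence under the HMM."""
    states = list(start)
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit[s][obs] * sum(alpha[prev] * trans[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical left-to-right model: saturation -> decay -> echo.
start = {"saturation": 1.0, "decay": 0.0, "echo": 0.0}
trans = {
    "saturation": {"saturation": 0.5, "decay": 0.5, "echo": 0.0},
    "decay": {"saturation": 0.0, "decay": 0.6, "echo": 0.4},
    "echo": {"saturation": 0.0, "decay": 0.0, "echo": 1.0},
}
emit = {
    "saturation": {"loud": 0.9, "medium": 0.1, "quiet": 0.0},
    "decay": {"loud": 0.2, "medium": 0.6, "quiet": 0.2},
    "echo": {"loud": 0.0, "medium": 0.3, "quiet": 0.7},
}

shot_like = ["loud", "loud", "medium", "quiet", "quiet"]
steady = ["medium", "medium", "medium", "medium", "medium"]
shot_score = forward_likelihood(shot_like, start, trans, emit)
steady_score = forward_likelihood(steady, start, trans, emit)
```

The loud-then-fading sequence scores much higher than the steady one because the order of the observations matters to the model, which is the property a plain feature vector fed to an NN or SVM does not have by default.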
