Voice recognition and text to sound conversions--a long shot?

Started by April 15, 2001 08:03 PM
2 comments, last by BobInTown 23 years, 9 months ago
OK, here goes, and it's probably a long shot. I'm designing a game to be playable over the internet, and here's a problem that we may all have come across at one time or another.

First, it's annoying to have to type in everything you say, especially in the heat of battle, when taking time to type something could mean being killed by another player. Second, an alternative to typing is some kind of voice technology like Roger Wilco. The problem with this comes when you have younger players or people with (sorry, but there's no other way to say it) geeky or high voices. Obviously this takes away from the seriousness and realism of the game if your character is some kind of marine grunt.

So while I was brainstorming I thought of possibly using voice recognition technology to overcome this. Basically, someone would say something, that would be converted to text, and then the text would be output in a new voice by the computer. I know converting voice into text can be done, as I've seen it in many programs. I also know text to voice is possible, from applications like SimpleText for the Macintosh. However, it is output in a very monotone voice. If the monotone output can't be overcome, I can still work around it.

So I have two basic questions.

First: are there any SDKs out there, or can I use a combination of SDKs, to help me with this? If there are, it would also be nice if any of them supported accents, yelling, lower or higher tones, male or female voices, etc.

Second: if there are no SDKs, what kind of books are out there on this? I would prefer SDKs, however.

P.S. If it makes any difference, I am programming in C++.
Current Projects - Tactical Assault: A Half-Life Mod: tactical-assault.tripod.com
I've given some thought to this problem as well. Needing to type out messages is tedious, breaks immersion, and often gets people killed while they are typing. One solution, as you have stated, would be Roger Wilco-like voice chat. However, sending sound to seven other clients would take up a lot of bandwidth, and, as stated earlier, this solution would involve squeaky voices from squeaky people and bad mics.

The solution you came up with would work in theory. However, unless certain voice characteristics are captured from the user at the time of recording, it would be impossible for the computer to correctly determine how the phrases should be spoken. I do not know of any voice-to-text API that takes these characteristics into consideration, nor how one would go about writing one.

My suggestion:
Convert the voice into text, send it to all of the clients, and have the clients visually display the text.

Inigmas

Edited by - Inigmas on April 15, 2001 9:35:57 PM
[[ THEORY ALERT ]]
I also thought about this a year ago, when some of the gang wanted to make a 3D chat. The project folded before the engine was ready, but here is what I came up with while thinking it all through:

Theoretically, the easier direction is going from phonetic characters (more on those below) to a voice signal. The best you could do is record a set of phonetic characters in your own voice, separate them, and then adjust the pitch, the speed, and the spacing between them to "simulate" different voices. The biggest problem is the transition from one character to the next: you have to smooth the end of one recording into the start of the next so that they sound as if they were pronounced together. Since the units are phonetic sounds and not letters, this is easier.
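A rough sketch of the smoothing step described above, in C++ since that is what the original poster is using. It joins two recorded units with a plain linear cross-fade, which is only one possible way to smooth the transition. The function name `crossfadeJoin`, the mono float sample format, and the `overlap` parameter are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Join two recorded sound units, blending the last `overlap` samples of `a`
// with the first `overlap` samples of `b` using a linear cross-fade.
// Assumption: both units are mono float samples at the same sample rate.
std::vector<float> crossfadeJoin(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 std::size_t overlap) {
    // Clamp so the blended region never exceeds either unit.
    overlap = std::min({overlap, a.size(), b.size()});
    const std::ptrdiff_t ov = static_cast<std::ptrdiff_t>(overlap);

    std::vector<float> out;
    out.reserve(a.size() + b.size() - overlap);

    // Untouched head of the first unit.
    out.insert(out.end(), a.begin(), a.end() - ov);

    // Blended region: the weight slides from mostly `a` to mostly `b`.
    for (std::size_t i = 0; i < overlap; ++i) {
        const float t = static_cast<float>(i + 1) /
                        static_cast<float>(overlap + 1);
        const float fromA = a[a.size() - overlap + i];
        const float fromB = b[i];
        out.push_back((1.0f - t) * fromA + t * fromB);
    }

    // Untouched tail of the second unit.
    out.insert(out.end(), b.begin() + ov, b.end());
    return out;
}
```

Calling `crossfadeJoin(unitA, unitB, 64)` on two recorded units would blend 64 samples at the seam. A longer overlap gives a softer transition at the cost of blurring both sounds, so the right value has to be found by ear.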
The best I can suggest is to send the voice as phonetic symbols rather than text, because then you won't have to check for mistakes, word endings, and so on. All you have to send is the pronunciation (phonetic value), the length, and the blanks. If you do this well, no one will ever notice that it isn't text, or that what is sent is really just one big "word". Pronunciation is also easier to simulate when it comes to "speaking". Better still, if you ever give your program to Japanese or Russian speakers, it will still work, since it deals only in pronunciation, not meaning. The more symbols you put in the bank (I think the phonetic alphabet has about 50 symbols), the more compatible you will be with other languages such as French, Spanish, Russian, Croatian, etc.
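To make the "pronunciation, length, and blanks" idea concrete, here is a minimal sketch of what such a message could look like on the wire. Everything in it is an assumption for illustration: the `PhonemeEvent` struct, the two-byte layout, reserving symbol 0 for a blank, and measuring length in 10 ms steps. It is not taken from any existing SDK.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One unit of speech on the wire (hypothetical layout).
// Symbol 0 is reserved here for a blank (silence); ids 1..255 index into a
// phoneme bank that every client is assumed to share.
struct PhonemeEvent {
    std::uint8_t symbol;      // index into the phoneme bank, 0 = blank
    std::uint8_t durationCs;  // length in centiseconds (10 ms steps)
};

constexpr std::uint8_t kBlank = 0;

// Pack events as two bytes each: [symbol][duration].
std::vector<std::uint8_t> encodeUtterance(
        const std::vector<PhonemeEvent>& events) {
    std::vector<std::uint8_t> bytes;
    bytes.reserve(events.size() * 2);
    for (const PhonemeEvent& e : events) {
        bytes.push_back(e.symbol);
        bytes.push_back(e.durationCs);
    }
    return bytes;
}

// Rebuild the events on the receiving client.
// A trailing odd byte (truncated packet) is ignored.
std::vector<PhonemeEvent> decodeUtterance(
        const std::vector<std::uint8_t>& bytes) {
    std::vector<PhonemeEvent> events;
    events.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        events.push_back(PhonemeEvent{bytes[i], bytes[i + 1]});
    }
    return events;
}
```

With this layout an utterance costs two bytes per phoneme or pause, so even a long sentence stays far smaller than the raw audio it stands for. The receiving client would look each symbol up in its local bank and play it for the given length.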

For voice to text, you are on your own, because I don't have many ideas. I never really thought about it, and voice recognition is not my field of study...

I'm sorry, but that's all I can do. I don't know of any voice recognition packages or text-to-speech examples (except that stupid old DOS program that was very, very monotone and used the PC speaker). Maybe you should look into AI studies and books; I think voice recognition is part of that field.

Hope I've pointed you in a useful direction.
Now I know what I'm made of, and I'm afraid of it...
I work in the field of voice recognition, though I'm mostly in support right now (building VR test simulations, not algorithm development). However, I'm studying for a master's in signal processing with the intention of going into VR apps.

Voice-to-text conversion is a difficult process. Usually, you're talking about breaking down the phonemes, then searching through a dictionary. This is very CPU-intensive. Likewise, exciting a voice synthesizer also requires a decent amount of CPU time. Both of these problems require many thousands of "multiply and accumulate" (MAC) instructions per second. This is why Digital Signal Processors (DSPs) are used in speech processing so much--they're a special kind of chip that can (nowadays) do several pipelined MAC instructions in very few clock cycles.
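For anyone who hasn't met the term, a multiply-accumulate is just `acc += a * b` repeated in a tight loop. The sketch below uses a direct-form FIR filter as a stand-in workload, because it is about the simplest signal-processing routine built entirely from MACs. It is an illustration of the operation, not a piece of an actual recognizer or synthesizer, and the names `firFilter` and `macsPerSecond` are made up for this example.

```cpp
#include <cstddef>
#include <vector>

// Direct-form FIR filter: y[n] = sum over k of h[k] * x[n - k].
// Samples before the start of the input are treated as zero.
std::vector<float> firFilter(const std::vector<float>& x,
                             const std::vector<float>& h) {
    std::vector<float> y(x.size(), 0.0f);
    for (std::size_t n = 0; n < x.size(); ++n) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < h.size() && k <= n; ++k) {
            acc += h[k] * x[n - k];  // one multiply-accumulate (MAC)
        }
        y[n] = acc;
    }
    return y;
}

// MACs needed per second to run one filter with `taps` coefficients
// on audio sampled at `sampleRate` Hz.
constexpr unsigned long macsPerSecond(unsigned long sampleRate,
                                      unsigned long taps) {
    return sampleRate * taps;
}
```

As a sense of scale, `macsPerSecond(8000, 64)` gives 512,000 for a single 64-tap filter on 8 kHz audio, and a speech pipeline would typically chain many such stages. That inner loop is exactly what a DSP is built to run quickly.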

Also, using this method you'll be completely losing any inflection in the voice at all, not to mention having problems with translating slang or exclamations ("woohoo" or "booyah").

If the problem is an immature-sounding voice, there is a field of speech processing that deals with mutating a voice. Basically, the voice is filtered to sound like Darth Vader or a chipmunk or whatever. This might be a better path to explore. Of course, if you're going to transmit this over a network, you'll have to look at vocoding (voice encoding/decoding). But maybe this is a more realistic path.
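The crudest form of voice mutation is plain resampling, sketched below as a starting point. It shifts pitch by reading the recording faster or slower with linear interpolation, so it also changes the clip's length; a proper voice changer would keep the timing intact, which this sketch does not attempt. The function name `resamplePitch` and the `factor` parameter are assumptions for illustration.

```cpp
#include <cstddef>
#include <vector>

// Naive voice mutation by resampling with linear interpolation.
// factor > 1 reads the input faster: higher pitch, shorter clip ("chipmunk").
// factor < 1 reads it slower: lower pitch, longer clip.
std::vector<float> resamplePitch(const std::vector<float>& in, double factor) {
    std::vector<float> out;
    if (in.size() < 2 || factor <= 0.0) {
        return out;  // nothing sensible to interpolate
    }
    const double last = static_cast<double>(in.size() - 1);
    for (double pos = 0.0; pos <= last; pos += factor) {
        const std::size_t i = static_cast<std::size_t>(pos);
        const double frac = pos - static_cast<double>(i);
        // At the final sample there is no right-hand neighbour to blend with.
        const float next = (i + 1 < in.size()) ? in[i + 1] : in[i];
        out.push_back(static_cast<float>((1.0 - frac) * in[i] + frac * next));
    }
    return out;
}
```

Something like `resamplePitch(voice, 0.8)` would deepen a voice and `resamplePitch(voice, 1.5)` would squeak it up. Because the length changes too, this only suits short clips unless it is paired with some form of time stretching.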
