It's 2009, why can't we do real-sounding text to speech?
I think it would be really nice for indie game developers if there were a realistic-sounding text-to-speech API. Why has text to speech not been perfected yet? Is it even possible to make text to speech that sounds real?
Most languages suck from a logical parsing standpoint, would be my guess. Couple that with the extreme difficulty of implementation compared to using canned speech: there are so many rules, and exceptions to the rules, around intonation and such that there hasn't been enough justification to codify them all into an engine.
I think it would be quicker to program a self-learning AI, teach it to program, and command it to write a text-to-speech program for you. I really think this is how it will eventually end up being perfected, and programmers all across the world will ePenis--; in shame because a computer, their former slave, is smarter than they are.
I think that the problem lies in the subtleties of accents. You can easily program all grammatical variations, phonetic intonations and word combinations into a computer and expect clear and understandable speech to come out, but it will always sound like a computer because it has no accent.
Consider the word "butter" (buht-er): a computer will say it correctly as specified by phonetic stresses, etc, but a human will mess it up and say "buttah" or "buttr" or something, which gives it that human touch that machines have a problem mimicking. You could try to program in a degree of slurring, but then you'd have to do it on a per-word basis, since the way people say "butter" is probably different than the way they say "buttress".
[edit]
I forgot to add the concept of context-specific variations: people usually say the same words differently depending on the context. For example:
"Pass the butter" vs "I hate butter"
I think the simplest "solution" would be to require that the input text not be plain English, but rather that it be in IPA or something like that. That way, you could handle all of the variations as well as different accents and so on.
For an example of a pretty decent text-to-speech, see here.
That service actually lets you enter IPA as well, though I struggled to enter anything more complex than the IPA for "cat":
<phoneme alphabet="ipa" ph="kæt"> </phoneme>
Yes, I had a look at synthesized speech recently when one of the Natal demos was shown, to see if that, like practically everything else in the videos, was faked, and you guessed it, it was!
I think one big problem is that in order to correctly inflect syllables you must understand what is being said. I don't think we'll have good text-to-speech until we have good natural language parsers.
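A contrived example of what I mean, in Python (the "tense guesser" below is obviously fake; the point is that picking the right pronunciation of "read" needs at least some understanding of the sentence):

# Sketch of why parsing matters for pronunciation. The tense guesser is
# deliberately naive; a real system would need genuine language understanding.
PRONUNCIATIONS = {
    ("read", "present"): "reed",
    ("read", "past"): "red",
}

def guess_tense(sentence):
    # Hopelessly crude stand-in for a natural language parser.
    return "past" if "yesterday" in sentence.lower() else "present"

def pronounce_read(sentence):
    return PRONUNCIATIONS[("read", guess_tense(sentence))]

print(pronounce_read("I read books every day"))       # reed
print(pronounce_read("I read that book yesterday"))   # red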
Quote: Original post by zedz
Yes, I had a look at synthesized speech recently when one of the Natal demos was shown, to see if that, like practically everything else in the videos, was faked, and you guessed it, it was!
Who ever said the Natal demo was doing speech synthesis? They likely just recorded some lines (much like every video game) and had the computer play the pre-recorded audio.
Quote: Original post by smr
I think one big problem is that in order to correctly inflect syllables you must understand what is being said. I don't think we'll have good text-to-speech until we have good natural language parsers.
I'd have to agree here. Think about all the subtly different ways a simple sentence can be said, depending on the context. There are far more variations in inflection, depending on mood and meaning, than one might think, and we parse them subconsciously when understanding speech.
And last I checked, natural language parsing is a difficult problem in AI. Possibly among the most difficult problems in AI, actually.
Quote: Original post by NickGravelyn
Quote: Original post by zedz
Yes, I had a look at synthesized speech recently when one of the Natal demos was shown, to see if that, like practically everything else in the videos, was faked, and you guessed it, it was!
Who ever said the Natal demo was doing speech synthesis? They likely just recorded some lines (much like every video game) and had the computer play the pre-recorded audio.
tsk! Never let facts and logical thought get in the way of bashing a new piece of technology!
I mean, come on, we are on a gamedev site; god forbid we embrace new technology, or at least wait until it's had a decent airing that isn't purely publicity driven, before jumping to conclusions!
This post brought to you with a HUGE amount of sarcasm and a general 'meh'-ness towards the bashing of every new idea that ever comes along; god bless game development and its NIH syndrome. May 1001 wheels be reinvented daily!