It's 2009, why can't we do real-sounding text to speech?
I think it would be really nice for indie game developers if there were a realistic-sounding text-to-speech API. Why has text to speech not been perfected yet? Is it even possible to make text to speech that sounds real?
Most languages suck from a logical parsing standpoint, would be my guess. Couple that with the extreme difficulty of implementation compared to using canned speech: there are so many rules, and exceptions to the rules, around intonation and such that there hasn't been enough justification to codify them all into an engine.
I think it would be quicker to program a self-learning AI, teach it to program, and command it to write a text-to-speech program for you. I really think this is how it will eventually end up being perfected, and programmers all across the world will ePenis--; in shame because a computer, their former slave, is smarter than they are.
I think that the problem lies in the subtleties of accents. You can easily program all grammatical variations, phonetic intonations and word combinations into a computer and expect clear and understandable speech to come out, but it will always sound like a computer because it has no accent.
Consider the word "butter" (buht-er): a computer will say it correctly as specified by phonetic stresses, etc, but a human will mess it up and say "buttah" or "buttr" or something, which gives it that human touch that machines have a problem mimicking. You could try to program in a degree of slurring, but then you'd have to do it on a per-word basis, since the way people say "butter" is probably different than the way they say "buttress".
[edit]
I forgot to add the concept of context-specific variations: people usually say the same words differently depending on the context. For example:
"Pass the butter" vs "I hate butter"
I think the simplest "solution" would be to require that the input text not be plain English, but rather that it be in IPA or something like that. That way, you could handle all of the variations as well as different accents and so on.
For an example of a pretty decent text-to-speech, see here.
That service actually lets you enter IPA as well, though I struggled to enter anything more complex than the IPA for "cat":
<phoneme alphabet="ipa" ph="kæt"> </phoneme>
Yes, I had a look at synthesized speech recently when one of the Natal demos was shown, to see if that, like practically everything else in the videos, was faked, and you guessed it, it was!
I think one big problem is that in order to correctly inflect syllables you must understand what is being said. I don't think we'll have good text-to-speech until we have good natural language parsers.
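A contrived example of what I mean, in Python (the "tense guesser" below is obviously fake; the point is that picking the right pronunciation of "read" needs at least some understanding of the sentence):

# Sketch of why parsing matters for pronunciation. The tense guesser is
# deliberately naive; a real system would need genuine language understanding.
PRONUNCIATIONS = {
    ("read", "present"): "reed",
    ("read", "past"): "red",
}

def guess_tense(sentence):
    # Hopelessly crude stand-in for a natural language parser.
    return "past" if "yesterday" in sentence.lower() else "present"

def pronounce_read(sentence):
    return PRONUNCIATIONS[("read", guess_tense(sentence))]

print(pronounce_read("I read books every day"))       # reed
print(pronounce_read("I read that book yesterday"))   # red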
Quote: Original post by zedz
Yes, I had a look at synthesized speech recently when one of the Natal demos was shown, to see if that, like practically everything else in the videos, was faked, and you guessed it, it was!
Who ever said the Natal demo was doing speech synthesis? They likely just recorded some lines (much like every video game) and had the computer play the pre-recorded audio.
Quote: Original post by smr
I think one big problem is that in order to correctly inflect syllables you must understand what is being said. I don't think we'll have good text-to-speech until we have good natural language parsers.
I'd have to agree here. Think about all the subtly different ways a simple sentence can be said, depending on the context. There are far more variations in inflection, depending on mood and meaning, than one might think, and we parse them subconsciously when understanding speech.
And last I checked, natural language parsing is a difficult problem in AI. Possibly among the most difficult problems in AI, actually.
Quote: Original post by NickGravelyn
Quote: Original post by zedz
Yes, I had a look at synthesized speech recently when one of the Natal demos was shown, to see if that, like practically everything else in the videos, was faked, and you guessed it, it was!
Who ever said the Natal demo was doing speech synthesis? They likely just recorded some lines (much like every video game) and had the computer play the pre-recorded audio.
tsk! Never let facts and logical thought get in the way of bashing a new piece of technology!
I mean, come on, we are on a gamedev site; god forbid we embrace new technology, or at least wait until it's had a decent airing that isn't purely publicity driven, before jumping to conclusions!
This post brought to you with a HUGE amount of sarcasm and a general 'meh'-ness towards the bashing of every new idea that ever comes along; god bless game development and its NIH syndrome. May 1001 wheels be reinvented daily!