It's 2009, why can't we do real-sounding text to speech?

CodaKiller · 2009-07-20T11:15:49

I think it would be really nice for indie game developers if there were a realistic-sounding text-to-speech API. Why has text to speech not been perfected yet? Is it even possible to make text to speech that sounds real?

Extrarius

1,412

July 19, 2009 06:02 PM

Personally, I wonder why so many of the text-to-speech and speech-to-text systems out there (that I've seen, at least) work with natural languages instead of phonetic systems like IPA (with, perhaps, additional information that describes intonation, stress, speed, etc).

If you could develop quality IPA-to-speech and speech-to-IPA (which should be far easier than actual language processing), you could accomplish many of the same goals as text-to-speech and speech-to-text using traditional AI algorithms. It could work much like old-school adventure games did with "natural language" input, except more sophisticated with the additional computing resources and knowledge available today.
You'd also have the ultimate VOIP system - build a pronunciation profile for each user based on them reading some well-chosen text, and aside from that initial transmission, it would only take a few bytes per second to transmit speech.
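
To put a rough number on the "few bytes per second" idea, here is a back-of-the-envelope sketch. The speaking rate and symbol size are assumptions picked for illustration, not measurements, and the estimate ignores intonation data, the one-time pronunciation profile, and packet overhead. G.711 (64 kbit/s telephone audio) is used only as a familiar point of comparison.

```python
def phonetic_bytes_per_second(phonemes_per_second: float, bits_per_phoneme: int) -> float:
    """Payload rate for a stream of phoneme symbols, ignoring packet overhead."""
    return phonemes_per_second * bits_per_phoneme / 8


# Assumed figures for illustration only.
PHONEMES_PER_SECOND = 12   # rough conversational pace (assumption)
BITS_PER_PHONEME = 6       # enough to index 64 distinct symbols (assumption)

# G.711 telephone-quality audio runs at 64 kbit/s.
G711_BYTES_PER_SECOND = 64_000 / 8

phonetic = phonetic_bytes_per_second(PHONEMES_PER_SECOND, BITS_PER_PHONEME)
print(f"phoneme stream: {phonetic:.0f} bytes/s")
print(f"G.711 audio:    {G711_BYTES_PER_SECOND:.0f} bytes/s")
```

Under these assumptions the symbol stream comes out around three orders of magnitude smaller than uncompressed telephone audio, though real intonation and timing data would add to it.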

"Walk not the trodden path, for it has borne it's burden." -John, Flying Monk

BLiTZWiNG

362

July 19, 2009 06:06 PM

You can get decent text to speech, but it's gonna cost you a pretty penny. It needs a lot of setup time, someone with a clear voice, and a lot of love and attention put into getting the text right, which usually means butchering the mother tongue.

We use TTS in our products, and it works reasonably well, granted we're not a game development house.

zedz

291

July 19, 2009 06:18 PM

Quote: Original post by NickGravelyn
Quote: Original post by zedz
Yes, I had a look recently at synth speech when one of the Natal demos was shown,
to see if that, like practically everything else in the videos, was faked, and you guessed it, it was!
Who ever said the Natal demo was doing speech synthesis? They likely just recorded some lines (much like every video game) and had the computer play the pre-recorded audio.

Yes, I know what they in fact did, and that the whole thing was fake,
but if you watched the video you would see they imply(*)
that this technology will

A/ understand what you said
B/ reply back to you in perfect English
C/ conduct an intelligent conversation
D/ recognize objects by sight
..
well, a whole long list

(*) fair enough if you haven't watched the demonstration

Kirl

174

July 19, 2009 06:53 PM

Also Milo calls you by name, which would require some sort of text-to-speech.

I have a fairly uncommon RL name, and I'm interested in how (if at all) it'll handle that. Not to mention all the wild Us3rN@m3z. :)

Promit

13,404

July 19, 2009 07:08 PM

You know, you're not supposed to feed correct English to a TTS system. You're supposed to write it, usually using their hints, so that the text it's speaking actually includes intonation information. So parsing and understanding English aren't really difficulties unless you're trying to use your TTS engine quite naively.
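
As an illustration of what such hints can look like, the W3C's Speech Synthesis Markup Language (SSML) is one standardized way to attach emphasis, pauses, and pitch or rate changes to text. The sketch below only builds the markup string. Whether a particular engine accepts SSML (or uses its own hint syntax instead) is an assumption to check against that engine's documentation, and the helper names are made up for this example. A strict SSML document would also carry version and namespace attributes on the root element, which are omitted here.

```python
from xml.sax.saxutils import escape


def emphasis(text: str, level: str = "strong") -> str:
    """Wrap text in an SSML <emphasis> element."""
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


def pause(milliseconds: int) -> str:
    """Insert a silent gap using an SSML <break> element."""
    return f'<break time="{milliseconds}ms"/>'


def prosody(text: str, rate: str = "medium", pitch: str = "medium") -> str:
    """Adjust speaking rate and pitch with an SSML <prosody> element."""
    return f'<prosody rate="{rate}" pitch="{pitch}">{escape(text)}</prosody>'


def speak(*parts: str) -> str:
    """Join already-marked-up fragments under a (simplified) <speak> root."""
    return "<speak>" + " ".join(parts) + "</speak>"


line = speak(
    emphasis("Halt!"),
    pause(400),
    prosody("Who goes there?", rate="slow", pitch="low"),
)
print(line)
```

The point is that the writer, not the engine, decides where the stress and pauses go, so the engine never has to infer them from plain English.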

SlimDX | Ventspace Blog | Twitter | Diverse teams make better games. I am currently hiring capable C++ engine developers in Baltimore, MD.

BLiTZWiNG

362

July 19, 2009 10:06 PM

Quote: Original post by Promit
You know, you're not supposed to feed correct English to a TTS system. You're supposed to write it, usually using their hints, so that the text it's speaking actually includes intonation information. So parsing and understanding English aren't really difficulties unless you're trying to use your TTS engine quite naively.

I think our intention was that our end users could use it naively. I think we've come to the conclusion now that that is not going to be possible.

Sander

1,332

July 20, 2009 02:08 AM

Quote: Original post by Extrarius
Personally, I wonder why so many of the text-to-speech and speech-to-text systems out there (that I've seen, at least) work with natural languages instead of phonetic systems like IPA (with, perhaps, additional information that describes intonation, stress, speed, etc).

Easy. You would still need to convert the plain text to IPA, which requires -- tadaa -- natural language parsing.
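
A small sketch of why spelling alone is not enough: some English words (heteronyms) have one spelling and several pronunciations, and picking the right one depends on tense or meaning. The table below is a tiny hand-written illustration with approximate transcriptions, not a real pronunciation dictionary.

```python
# Each spelling maps to every pronunciation it can take (illustrative table only).
PRONUNCIATIONS = {
    "read": ["riːd", "rɛd"],   # present tense vs. past tense
    "lead": ["liːd", "lɛd"],   # the verb vs. the metal
    "the": ["ðə"],
    "book": ["bʊk"],
}


def ambiguous_words(sentence: str) -> list[str]:
    """Return the words whose spelling alone does not fix the pronunciation."""
    words = sentence.lower().split()
    return [w for w in words if len(PRONUNCIATIONS.get(w, [])) > 1]


print(ambiguous_words("I read the book"))  # → ['read']
```

Deciding between the two readings of "read" here means working out the tense of the sentence, which is exactly the language-parsing step a plain lookup skips.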

Sander Marechal [Lone Wolves] [Hearts for GNOME] [E-mail] [Forum FAQ]

Extrarius

1,412

July 20, 2009 07:31 AM

Quote: Original post by Sander
Quote: Original post by Extrarius
Personally, I wonder why so many of the text-to-speech and speech-to-text systems out there (that I've seen, at least) work with natural languages instead of phonetic systems like IPA (with, perhaps, additional information that describes intonation, stress, speed, etc).

Easy. You would still need to convert the plain text to IPA, which requires -- tadaa -- natural language parsing.

While technically true, you omit the fact that for many uses of text-to-speech (and speech-to-text) it would be possible to work with IPA either directly or indirectly. For example, to make a game where NPCs speak, it'd be easy enough to make a dictionary of English-to-IPA for developers to use when building dialog text (so they can produce both an English and a phonetic version of the speech). The developers could also input additional data to help it sound more natural. Since they're directly describing the sounds made, they can also easily give different characters different accents, dialects, and manners of speech.
Conversely, text-to-speech that works directly with text is less tunable and must be far more complicated to produce equal results.
I'm a big fan of small, modular pieces rather than monolithic systems, and a natural-text-to-speech system is monolithic by design.
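
A minimal sketch of the dictionary idea described above, assuming a hand-maintained word list with optional per-accent overrides. The table contents, the accent name, and the function name are all hypothetical, and the transcriptions are approximate. Unknown words are reported back so a writer can add them, rather than being guessed at.

```python
# Hypothetical base dictionary: English word -> IPA (approximate transcriptions).
BASE_IPA = {
    "hello": "həˈloʊ",
    "there": "ðɛr",
}

# Hypothetical per-accent overrides; only words that differ need an entry.
ACCENT_OVERRIDES = {
    "non_rhotic": {"there": "ðɛə"},
}


def to_ipa(line, accent=None):
    """Translate a dialog line word by word.

    Returns (phonetic_text, missing_words) so that gaps in the dictionary
    are visible to the writer instead of being silently dropped.
    """
    overrides = ACCENT_OVERRIDES.get(accent, {})
    phonetic, missing = [], []
    for raw in line.lower().split():
        word = raw.strip(".,!?")
        ipa = overrides.get(word) or BASE_IPA.get(word)
        if ipa is None:
            missing.append(word)
        else:
            phonetic.append(ipa)
    return " ".join(phonetic), missing


print(to_ipa("Hello there!"))
print(to_ipa("Hello there!", accent="non_rhotic"))
print(to_ipa("Hello stranger"))
```

Keeping accents as small override tables means one base dictionary can serve every character, which fits the modular approach argued for here.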

"Walk not the trodden path, for it has borne it's burden." -John, Flying Monk

Sirisian

2,263

July 20, 2009 07:57 AM

Quote: Original post by Extrarius
For example, to make a game where NPCs speak, it'd be easy enough to make a dictionary of English-to-IPA for developers to use when building dialog text (so they can produce both an English and a phonetic version of the speech). The developers could also input additional data to help it sound more natural. Since they're directly describing the sounds made, they can also easily give different characters different accents, dialects, and manners of speech.

Heh, BioWare would probably love a system like that. Not sure if you guys have paid much attention to SW:TOR, but they have a lot of voice actors.

I don't mind the Microsoft Sam text to speech. I used to use something in the game America's Army (it was probably Microsoft Sam) since they had it set up to read the text.

OrangyTang

1,298

July 20, 2009 07:58 AM

All this and no mention of Tom Baker Says? It's probably the most impressive voice synthesis I've seen (heard). The delivery is still slightly stilted but it doesn't have the robotic twang that the AT&T stuff has.

[TriangularPixels.com] [Rescue Squad] [Snowman Village] [Growth Spurt]

This topic is closed to new replies.
