I've been considering Text to Speech for a big Next Gen (or Next Next Gen) type of MMORPG, which would require a lot of voice generation (largely on the fly). That system relies on Player Created Assets to lower costs, and precanned voice recordings are cost-prohibitive (and not versatile enough), so autogeneration on the Client machine (driven by combinatoric script logic) is an important element.
Uncanny Valley problem - may just have to train the user to accept the output (it's a problem that will probably never really go away).
The processing resource issue - can the current 'better quality' TtS software do the generation in Real Time (and are they using the GPU yet?) -- procedurally generated content that reacts to the player's responses in a versatile way (not just a limited, pre-expected set of response verbiage) is very much required. (For that application some JIT generation may be allowable to relieve some of the time-critical CPU bottlenecking.)
Fortunately the general commercial use of this technology will keep development moving forward (having some big bucks behind it).
Then comes the management of Voice Profile Assets and whatever markup data the 'Text' needs in order to impart the desired inflection/tone/etc. to the output.
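To make the Voice Profile plus markup idea a bit more concrete, here is a rough sketch only. It assumes an SSML-style markup (SSML is one existing W3C speech markup; nothing above says the game would actually use it), and the VoiceProfile class, its field names and the to_ssml helper are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


@dataclass(frozen=True)
class VoiceProfile:
    """Hypothetical per-NPC voice asset; the field names are illustrative."""
    voice_name: str
    pitch: str = "medium"  # passed through as an SSML prosody pitch value
    rate: str = "medium"   # passed through as an SSML prosody rate value


def to_ssml(profile: VoiceProfile, text: str, emphasis: Optional[str] = None) -> str:
    """Wrap script-generated text in an SSML-style fragment for one voice.

    Assumption: the TtS engine accepts <voice>, <prosody> and <emphasis>;
    a real engine may also require version/namespace attributes on <speak>.
    """
    body = escape(text)  # escape &, < and > so generated text cannot break the markup
    if emphasis is not None:
        body = f"<emphasis level={quoteattr(emphasis)}>{body}</emphasis>"
    return (
        f"<speak><voice name={quoteattr(profile.voice_name)}>"
        f"<prosody pitch={quoteattr(profile.pitch)} rate={quoteattr(profile.rate)}>"
        f"{body}</prosody></voice></speak>"
    )
```

The point of splitting it this way is that the profile is a reusable (possibly player-created) asset, while the per-line markup comes out of the script logic at generation time.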
-
I recall one thing from the past: the Atari 800 game Castle Wolfenstein, with the crude li'l nazis shouting a barely intelligible 'Achtung!' -- which might be a nice analogy for where speech generation is today versus what it will be like in the future (we hope). If you remember how crude the sound interface was on the Atari 800, it makes you wonder how long the programmer took to shape just that one canned sound snippet.
-
Another issue might be: in an immersive environment you need MANY NPCs speaking at the same time, which increases the speech processing load. Far-away TtS conversations can be (cheaply) patched over with mumbles/droning (preferably generated realistically themselves, with rhythmic attributes/whatever), implemented so that as you near the source the transition to High Quality isn't jarring --
yes - Speech LOD now ...
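A minimal sketch of what that Speech LOD selection might look like, under some loud assumptions: the radii, the blend band and the cap on simultaneous full-quality voices are invented numbers, and SpeechLOD, full_quality_weight and assign_lods are hypothetical names, not anything from an existing engine.

```python
from enum import Enum
from typing import Dict


class SpeechLOD(Enum):
    FULL = "full"      # real TtS generation
    MUMBLE = "mumble"  # cheap generated droning
    SILENT = "silent"  # too far away to hear at all


# Assumed tuning values, purely illustrative.
FULL_RADIUS = 10.0     # inside this distance, full quality only
BLEND_WIDTH = 5.0      # band over which full quality fades into mumble
AUDIBLE_RADIUS = 40.0  # beyond this, nothing is generated


def full_quality_weight(distance: float) -> float:
    """Crossfade gain for the full-quality voice: 1.0 up close,
    falling linearly to 0.0 across the blend band (mumble gain is 1 - this)."""
    if distance <= FULL_RADIUS:
        return 1.0
    if distance >= FULL_RADIUS + BLEND_WIDTH:
        return 0.0
    return 1.0 - (distance - FULL_RADIUS) / BLEND_WIDTH


def assign_lods(distances: Dict[str, float], max_full_voices: int) -> Dict[str, SpeechLOD]:
    """Give the nearest speakers full TtS, up to a budget; the rest mumble or go silent."""
    result: Dict[str, SpeechLOD] = {}
    budget = max_full_voices
    # Nearest NPCs first, so the limited full-quality budget goes where it matters.
    for npc_id, dist in sorted(distances.items(), key=lambda kv: kv[1]):
        if dist > AUDIBLE_RADIUS:
            result[npc_id] = SpeechLOD.SILENT
        elif full_quality_weight(dist) > 0.0 and budget > 0:
            result[npc_id] = SpeechLOD.FULL
            budget -= 1
        else:
            result[npc_id] = SpeechLOD.MUMBLE
    return result
```

The blend band is what keeps the switch from being jarring: an NPC inside it would play both the mumble and the real voice, mixed by full_quality_weight, instead of snapping from one to the other.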