Present & Future AI in Games - Voice/Speech

Started by October 12, 2016 12:30 AM
14 comments, last by wodinoneeye 8 years, 2 months ago

I'm trying to get a handle on where the games industry stands on AI speech in terms of allowing NPCs to act more human. In most games now, voice actors are used to keep the story and experience with NPCs somewhat linear. Are any companies looking at expanding their approach to AI? What about using something like Watson in a game environment? Is it possible right now - why or why not?

Text to speech (TTS) isn't an AI problem.

I'm thinking of moving this in the site, but I'm not sure where it fits. It isn't an AI issue, isn't really a programming issue at all. It might fit better in design, but it isn't so much a design issue either. Oh well. :rolleyes:

The big problem is that TTS engines do not have a good range of emotion. Some systems allow the teams to tag them with some vocal effects. You can encode volume going up or down, pitch up and down, and a few rough elements like that.

But how do you leverage a TTS engine -- even one like Siri, since that is popular -- to make an NPC voice call out like they're in a market as a seller? Or like they're injured? Or like they deeply care that their child is lost? Or like their family has been killed?

Basically, how can you make a voice like Siri sound like a demon has killed their family? The simple answer is that you hire a voice actor.

Text to speech (TTS) isn't an AI problem.


Since he mentioned the "experience" and "story" of NPCs, I think he's talking more than just audio.

The big problem is that TTS engines do not have a good range of emotion. Some systems allow the teams to tag them with some vocal effects. You can encode volume going up or down, pitch up and down, and a few rough elements like that.
But how do you leverage a TTS engine -- even one like Siri, since that is popular -- to make an NPC voice call out like they're in a market as a seller? Or like they're injured? Or like they deeply care that their child is lost? Or like their family has been killed?

Basically, how can you make a voice like Siri sound like a demon has killed their family? The simple answer is that you hire a voice actor.


I think it would be possible to give TTS engines more emotions and variation and "life". I remember seeing, six or seven years ago, a program for taking microphone speech and transforming your voice before sending it to whatever game you were playing, to change your voice in real-time for in-game VOIP communication. Things like making your voice sound like a robot or demon or (ostensibly) a member of the opposite gender - really basic audio filter stuff. I'm sure if there was enough market motivation (and game NPCs are a good example), it could be rapidly improved. It might already be improved significantly, since I saw that a half-decade ago.

(Personally, I was thinking that might be useful in a microphone-required game, for hiding or reducing the squeakiness of prepubescent voices, to reduce immersion-breaking, as well as prevent the harassment that I've observed when a kid tries to play a real team-focused FPS. It would possibly also be useful to remove noise and repair sound from cheap microphones.)

You'd likely have to add annotations to your NPC text to indicate what emotions should be used, but that's not much of a problem.
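To make the annotation idea concrete, here is a minimal sketch. It assumes a made-up `[emotion]` prefix on each dialogue line and maps it to the coarse controls (pitch, rate, volume) that SSML's `<prosody>` element exposes. The tag names and the emotion-to-prosody table are purely illustrative guesses, not something any particular engine prescribes, and how convincing the result sounds depends entirely on the TTS engine you feed it to.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical emotion table: the values are standard SSML prosody labels,
# but which labels suit which emotion is an assumption made for illustration.
PROSODY = {
    "neutral": {},
    "angry": {"pitch": "low", "rate": "fast", "volume": "x-loud"},
    "sad": {"pitch": "low", "rate": "slow", "volume": "soft"},
    "excited": {"pitch": "high", "rate": "fast", "volume": "loud"},
}

# Matches an optional leading "[emotion]" annotation (hypothetical format).
TAG = re.compile(r"^\[(\w+)\]\s*(.*)$")


def to_ssml(line):
    """Turn an annotated NPC line into an SSML string."""
    match = TAG.match(line)
    if match:
        emotion, text = match.group(1), match.group(2)
    else:
        emotion, text = "neutral", line
    attrs = PROSODY.get(emotion, {})  # unknown emotions fall back to neutral
    body = escape(text)  # keep &, < and > from breaking the markup
    if attrs:
        attr_text = " ".join('{}="{}"'.format(k, v) for k, v in attrs.items())
        body = "<prosody {}>{}</prosody>".format(attr_text, body)
    return "<speak>{}</speak>".format(body)


print(to_ssml("[sad] My child is lost & gone"))
print(to_ssml("Fresh fish!"))
```

This only covers the rough "volume up, pitch down" knobs mentioned earlier in the thread, so it does not get you anywhere near a grieving parent; it just shows that the annotation side is the easy part.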

Text to speech (TTS) hasn't been an AI problem until recently.

FTFY :D (assuming machine learning == AI)
https://deepmind.com/blog/wavenet-generative-model-raw-audio/

In most games now, voice actors are used to keep the story and experience with NPCs somewhat linear. Are any companies looking at expanding their approach to AI?

The tech is not there yet. It's such a huge amount of work that it's not going to be a game company that creates this kind of speech synthesizer.
e.g. The link above shows a really advanced, ongoing research project in this area, and it's being done by a company backed by google-monies.

The good thing about the above system is that you could train it using several voice actors so that it's able to speak using their voices. This would let you mix generated/TTS speech and actual recorded speech as required.

Now on the other hand... if you decided up front that you want to make a game where the NPCs use synthesized speech, then you could design a game where all the NPCs are robots with bad abilities to process emotion and speak naturally :D Then current TTS systems would be well suited to your game!

I agree fully about the constraint that voice acting has placed on content generation, and look forward to what people come up with when that bottleneck is removed.

Neural architectures are really advancing TTS by leaps and bounds. But I give it about 2-3 years before it's ready for game use.

A lot of voice acting isn't emotionally quite right for a game's scene anyway, in part because the voice actor doesn't always know all the context, might not be recording at the same time as the other actors, is expecting other content that gets left on the cutting room floor... In a few years we might be preferring synthesized voice, because you can tweak it for content and emotion as the game changes without re-hiring the voice actor.

If you want to play around with making a TTS-based game using a state-of-the-art speech system, I recommend trying Amazon's Alexa Skills Kit. It's pretty easy, you can use AWS Lambda as your backend rather than set up your own server, and you could probably make a simple game -- speech in, speech out -- in a few days. Intonation isn't Alexa's strong suit, though -- especially compared to WaveNet, which completely raised the bar -- so a game like Hodgman suggests where she's actually playing as some sort of AI will be more realistic.
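As a rough idea of what the Lambda side of such a skill can look like, here is a bare-bones sketch that works directly with the request and response JSON rather than any SDK. The intent name `LookIntent`, the dialogue lines, and the robot theme are invented for illustration; `AMAZON.StopIntent` is one of Alexa's built-in intents. A real skill would also need an interaction model defined in the developer console, which is not shown here.

```python
def build_response(ssml_body, end_session=False):
    """Wrap spoken text in the JSON envelope an Alexa skill returns."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {
                "type": "SSML",
                "ssml": "<speak>{}</speak>".format(ssml_body),
            },
            "shouldEndSession": end_session,
        },
    }


def lambda_handler(event, context):
    """Entry point invoked by AWS Lambda for each Alexa request."""
    request = event["request"]
    if request["type"] == "LaunchRequest":
        return build_response(
            "Welcome, unit seven. Say look around to scan the room."
        )
    if request["type"] == "IntentRequest":
        name = request["intent"]["name"]
        if name == "LookIntent":  # hypothetical custom intent
            return build_response(
                'I detect one door <break time="300ms"/> '
                "and one very suspicious toaster."
            )
        if name == "AMAZON.StopIntent":
            return build_response("Shutting down.", end_session=True)
    # Anything we do not understand keeps the session open.
    return build_response("Command not recognised.")
```

Since the handler is just a function over dictionaries, you can call it locally with a hand-written event such as `{"request": {"type": "LaunchRequest"}}` before deploying anything.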

(It's also interesting to note that Amazon has been rolling out a lot of gamedev products like Lumberyard and GameLift. It's probably only a matter of time before Alexa speech tech finds its way into Lumberyard.)

I concur with the comments above on artificial speech, and how it doesn't sound enough like real people
to be used for supposedly human characters in a game. It's the "uncanny valley" problem - it can sound
really good, but not to the extent that a listener would be convinced. So until the valley is bridged,
artificial speech is only good for artificial characters (maybe robots masquerading as people, or maybe
robots or machines).

By the way, when artificial speech crosses the bridge and is convincing, it'll be used way beyond games.
And its potential for misuse is scary.

-- Tom Sloper -- sloperama.com

The game 5089 kind of does this. You submit quotes you want robots to say on the Steam forum, the dev runs them through TTS software, and puts them in the game.

So I find this really, really fascinating. I also have a dream of a game that involves proc-gen speech. WaveNet is absolutely brilliant, and I can scarcely understand the underlying techniques, but I'd love to see more of that. I'd love to have access to that as a tool. I'm not sure how that works with stuff that's still being actively researched though. Still, crazy promising. You could possibly train different emotions into the system and have it sample those emotions instead when creating voices, or sample multiple at once to make things really interesting.

I'll look into Alexa. I was honestly considering making a proc-gen dialogue game that was about robots, so that normal TTS systems could be used without breaking immersion into tiny pieces. It'd be crazy, a world that was like a giant menu in some ways. That could be fun, a little robot society.

At the end of the day, there is still something to actual voice actors that can't really be duplicated. I do wonder, though, what would happen if you gave a voice actor a WaveNet tool to craft their own procedural performer? That would be really epic.


And its potential for misuse is scary.

And this too. A convincing human digitized voice, or being able to take someone's voice and easily make it say something else, that's a pretty significant change to life as we know it.


One thing that I have seen played with, but have not yet seen used in a full fledged game, is Text-to-Speech + markup scripting.

It's related to typical general-purpose text-to-speech engines, but rather than just dumping any text into it and having it spit audio back out at you, you treat the input more like text+metadata, and apply additional scripting so that you can fine-tune how the text gets processed into audio.

It is a LOT more work than simply writing dialog, and as it currently stands is probably also a lot more work to get rolling than writing dialog and hiring voice actors and a studio to record it, but it does have potential to eventually allow far more impressive TTS use in games and allow smaller developers to become vastly more flexible in what they can do.

One of its biggest advantages would be that you can conduct major rewrites fairly late in the development cycle, and not have to worry about the expense of re-recording things and handling all that audio processing. It sure wouldn't be free voice-over, given the amount of work involved in editing the markup data alongside the text to get the tone and voice 'just right' for your project, but it would also enable a single designer to effectively produce a wide cast of distinct and clear voices.

The other advantage is, if you control the TTS+markup software, then there is no 'renegotiating' contracts on sequels, and anyone trained on the team can produce new content with an iconic voice of the series.

Old Username: Talroth
If your signature on a web forum takes up more space than your average post, then you are doing things wrong.

One thing that I have seen played with, but have not yet seen used in a full fledged game, is Text-to-Speech + markup scripting.

Half-Life 1 did this for their "army grunt" AI characters, except using a very simple concatenative TTS system (based on many individual recordings of an actor speaking individual words). A script would tell the AI what to say and how to say it - such as pauses, volume and intonation.
Yep, it was great that as a modder you could add new lines of dialogue to an AI character without having access to the original actor.
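For anyone curious what a word-by-word script like that might look like in practice, here is a small sketch of the general idea. The syntax is a made-up mini-format, not the actual Half-Life script format: each token names a pre-recorded word clip, optional `(p110 v80)` parameters set pitch and volume as percentages, and a trailing comma adds a short pause. The 250 ms pause length is an arbitrary assumption.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    word: str           # name of the pre-recorded word clip to play
    pitch: int = 100    # playback pitch, percent of normal
    volume: int = 100   # playback volume, percent of normal
    pause_ms: int = 0   # silence to insert after the clip


# word, optional "(...)" parameter group, optional trailing comma
TOKEN = re.compile(r"([a-z_]+)(?:\(([^)]*)\))?(,?)")

COMMA_PAUSE_MS = 250  # assumed pause length for a comma


def parse_line(line: str) -> List[Clip]:
    """Parse one scripted sentence into an ordered list of clips."""
    clips = []
    for match in TOKEN.finditer(line.lower()):
        word, params, comma = match.groups()
        clip = Clip(word)
        for param in (params or "").split():
            key, value = param[0], int(param[1:])
            if key == "p":
                clip.pitch = value
            elif key == "v":
                clip.volume = value
            else:
                raise ValueError("unknown parameter: " + param)
        if comma:
            clip.pause_ms = COMMA_PAUSE_MS
        clips.append(clip)
    return clips


for clip in parse_line("squad(v120), we got hostiles(p110 v80)"):
    print(clip)
```

The playback side (loading each clip, resampling for pitch, mixing) is left out, but the appeal for modders is visible already: a new line of dialogue is just a new string built from words that have already been recorded.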
