Multi-threaded AI

Started by July 24, 2013 03:37 AM
5 comments, last by wodinoneeye 11 years, 3 months ago

So I am working on various strategy game stuff and I tried to figure out how to use multiple cores to speed up my game. From what I gathered the best way to use threads is to run a single main thread and pass off tasks from the same subsystem rather than splitting off each subsystem into a thread.

Like I run this:

Simulation

AI

Physics

Graphics

Sequentially and get my performance from having each step use as many cores as the system has available. So during the AI step I would assign a main AI thread that would make high level decisions and then spin off low level tasks in a hierarchy to separate threads/cores.

I was given to understand that this method scales better than separate threads for subsystems. For instance, a particle engine has a ton of tasks it can run separately, and you can pretty much split them up arbitrarily. So you look at the number of cores available and split the tasks more or less evenly among them, and if you have more cores to use you just split the tasks so each core does less.

Is that the best way to use multi-threading? Do I have something wrong? I understand that advice can really only be general since you don't have detailed information on my project.

Yeah that's the way that I utilize multiple cores, and it's the same method that the last few console games that I've worked on have used too.

To really simplify how the "job" type model works, a simple API might look something like this:
struct Job { void(*function)(void*); void* argument; }; 
JobHandle PushJob( Job* );//queue up a job for execution
void WaitForJob( JobHandle );//pause calling thread until a specific job has been completed
bool RunJob();//pick a job from the queue and run it, or just return false if the queue is empty
The main thread adds jobs to the queue using PushJob. If the main thread wants to access data that's generated by a job, then it must call WaitForJob (with the handle it got from PushJob) to ensure that the job has been executed before using those results.
Your worker threads simply call RunJob over and over again, trying to perform the work that's being queued up by the main thread.
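
For illustration, here is a minimal sketch of how those three functions could be implemented with a single mutex-protected queue. JobState, the shared_ptr-based JobHandle, the global queue and WorkerLoop are assumptions made up for this example (the post leaves JobHandle undefined); this is not the engine code described above.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <deque>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>

struct Job { void (*function)(void*); void* argument; };

// Hypothetical handle: a shared "done" flag per queued job.
struct JobState { bool done = false; };
using JobHandle = std::shared_ptr<JobState>;

namespace {
std::mutex g_mutex;                              // guards the queue and every done flag
std::condition_variable g_jobFinished;           // signalled whenever a job completes
std::deque<std::pair<Job, JobHandle>> g_queue;
}

// Queue up a job for execution.
JobHandle PushJob(Job* job) {
    assert(job && job->function);
    JobHandle handle = std::make_shared<JobState>();
    std::lock_guard<std::mutex> lock(g_mutex);
    g_queue.emplace_back(*job, handle);
    return handle;
}

// Pick a job from the queue and run it, or return false if the queue is empty.
bool RunJob() {
    std::pair<Job, JobHandle> entry;
    {
        std::lock_guard<std::mutex> lock(g_mutex);
        if (g_queue.empty()) return false;
        entry = std::move(g_queue.front());
        g_queue.pop_front();
    }
    entry.first.function(entry.first.argument);  // run the job outside the lock
    {
        std::lock_guard<std::mutex> lock(g_mutex);
        entry.second->done = true;
    }
    g_jobFinished.notify_all();
    return true;
}

// Pause the calling thread until a specific job has been completed.
void WaitForJob(JobHandle handle) {
    std::unique_lock<std::mutex> lock(g_mutex);
    g_jobFinished.wait(lock, [&] { return handle->done; });
}

// Worker thread body: call RunJob over and over until asked to quit.
void WorkerLoop(const std::atomic<bool>& quit) {
    while (!quit.load()) {
        if (!RunJob()) std::this_thread::yield();
    }
}
```

Note that WaitForJob here simply sleeps, so at least one worker thread must be running or the caller waits forever. A production job system would more likely use lock-free queues and let the waiting thread run jobs itself.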


I also use another similar pattern, where I'll have all of my threads call a single function, but passing in a different thread ID, which is used to select a different range of the data to work on, e.g.

inline void DistributeTask( uint workerIndex, uint numWorkers, uint items, uint* begin, uint* end )
{
	uint perWorker = items / numWorkers;	
	*begin = perWorker * workerIndex;
	*end = (workerIndex==numWorkers-1)
		? items                      //ensure last thread covers whole range
		: *begin + perWorker;
	*begin = perWorker ? *begin : min(workerIndex, items);   //special case,
	*end   = perWorker ? *end   : min(workerIndex+1, items); //fewer items than workers
}

void UpdateWidgets( uint threadIdx, uint numThreads )
{
  uint begin, end;
  DistributeTask( threadIdx, numThreads, m_numWidgets, &begin, &end );
  for( uint i=begin; i!=end; ++i )
  {
    m_widgets[i].Update();
  }
}
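
As a rough, self-contained sketch of how that pattern could be driven, the example below restates the same splitting rule with standard types and launches one std::thread per worker. UpdateRange, UpdateAllWidgets and the use of plain int counters in place of widgets are assumptions for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Same splitting rule as DistributeTask above, written with standard types.
inline void DistributeTask(unsigned workerIndex, unsigned numWorkers, unsigned items,
                           unsigned* begin, unsigned* end) {
    assert(numWorkers > 0 && workerIndex < numWorkers);
    const unsigned perWorker = items / numWorkers;
    if (perWorker == 0) {  // fewer items than workers: at most one item each
        *begin = std::min(workerIndex, items);
        *end = std::min(workerIndex + 1, items);
        return;
    }
    *begin = perWorker * workerIndex;
    // The last worker also takes the remainder so the whole range is covered.
    *end = (workerIndex == numWorkers - 1) ? items : *begin + perWorker;
}

// Hypothetical stand-in for m_widgets[i].Update(): each "widget" is a counter.
void UpdateRange(std::vector<int>& widgets, unsigned threadIdx, unsigned numThreads) {
    unsigned begin = 0, end = 0;
    DistributeTask(threadIdx, numThreads, static_cast<unsigned>(widgets.size()), &begin, &end);
    for (unsigned i = begin; i != end; ++i) widgets[i] += 1;
}

// Every thread calls the same function with a different index, then we join them all.
void UpdateAllWidgets(std::vector<int>& widgets, unsigned numThreads) {
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < numThreads; ++t) {
        threads.emplace_back([&widgets, t, numThreads] { UpdateRange(widgets, t, numThreads); });
    }
    for (std::thread& th : threads) th.join();
}
```

The ranges never overlap, so no locking is needed. With 10 items and 4 workers the last worker gets 4 items while the others get 2, so this split is simple rather than perfectly balanced. Creating threads on every call is only for brevity; a real engine would reuse a pool of workers.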

The problem with using multiple threads for AI is that it is very hard to avoid race conditions.

I will use pathfinding as an example because it is the most common use of AI I have seen in this forum. Say you are running A* on your graph: if one thread has processed a node and another thread changes that same node, the result won't be reliable. Worse things can happen, such as referencing a node that no longer exists, resulting in a segmentation fault.

Using a task system is a good way to have your system execute several independent parts of the code at once (for instance, physics simulation, rendering, sound and particle effects math), but when it comes to AI it is not that simple. Of course, this depends a lot on how your system works; if, for instance, your game doesn't allow path blocking (common in tower defense games), you may run pathfinding algorithms at the same time.

So, this is the best way to use threads in a game, but it is unlikely to be the best one for AI. You must also always keep in mind that you must not introduce race conditions.

Currently working on a scene editor for ORX (http://orx-project.org), using kivy (http://kivy.org).

Okay thanks. Just wanted to make sure I was starting at the right place. It would be horrible to do a lot of work and learn a bunch of stuff only to later realize I picked a poor method.

Now I just have to really dig into implementation.

I do have another question. How much benefit is possible here? Say single-threaded vs multi-threaded with 2/3/4 cores available? Obviously the specific implementation affects this value, but on average how much performance gain is there in an RTS? Assuming you would have some way to know that.

I did google for info about multi-threaded RTS but there are only a couple of relevant results.

I know that most open source RTS games aren't multi-threaded, although I believe the Spring devs recently had a HUGE fight over making it multi-core compatible. But Glest stuff, 0AD, most other people don't seem to be doing it. So there isn't a lot of help I can get there.

Do you know of any good resources to help with this stuff?

I was thinking there might be a problem with AI. But I was thinking maybe I could use it for decision making rather than pathfinding, AI-wise. And I am planning to add more particle stuff, so I was hoping it could add speed-ups there.

I am going to have quite a complicated AI I think decision wise, but I am trying to come up with other places it might improve stuff.

Also I was thinking of adding some less traditional stuff to the game which might help. Basically part of my plans involves more than one map, although probably I'll represent maps you aren't using as abstract rather than running all the code. I wanted to make a game where you manage multiple villages/cities that evolve into a kingdom with trade and technology exchange, but still having the actual 3D map, as opposed to something like Total War or Paradox games. I was thinking that even the abstract representation of a lot of cities might be heavy duty enough cpu wise to benefit from multicore.

The problem of using multi threads with AI is that it is very hard to avoid race conditions.

Whether this is true for any area (AI or otherwise) completely depends on the paradigm that you're programming within. There's plenty of paradigms, like functional, where you can easily write multi-threadable software, yet race conditions simply don't exist (aren't possible).

Yeah, if you take an existing OOP system and try and shoe-horn in parallelism, it will be extremely difficult to avoid race-conditions, or to implement parallelism efficiently... which is why this kind of planning should be done up front.

There's two main categories of multi-threaded designs: shared-state and message passing. The right default choice is message-passing concurrency, however, many people are first taught shared-state concurrency in University and go on using it as their default choice.

I will use pathfinding as an example because it is the most common use of AI I have seen in this forum. Say you are running A* on your graph: if one thread has processed a node and another thread changes that same node, the result won't be reliable. Worse things can happen, such as referencing a node that no longer exists, resulting in a segmentation fault.

Using a task system is a good way to have your system execute several independent parts of the code at once (for instance, physics simulation, rendering, sound and particle effects math), but when it comes to AI it is not that simple. Of course, this depends a lot on how your system works; if, for instance, your game doesn't allow path blocking (common in tower defense games), you may run pathfinding algorithms at the same time.


A job-graph of your example problem could look like the diagram below:
Green rectangles are immutable data buffers - they're output once by a job, and then used as read-only inputs to other dependent jobs.
Blue rounded-rectangles are jobs - the lines above them are data inputs (dependencies), and the lines below them are outputs (buffers that are created/filled by the job).
[Image: job-graph diagram]
As long as the appropriate "wait" commands are inserted in between each job, there's no chance of race conditions here, because all the data is immutable.
Unfortunately, after inserting all the appropriate wait points, this graph becomes completely serial -- the required execution order becomes:
wait for previous frame to be complete
a = launch "Calculate Nav Blockers"
wait for a
b = launch "Check paths still valid"
wait for b
c = launch "Calculate new paths"
wait for c
launch "Move Actors"
This means that within this job graph itself, there's no obvious parallelism going on...

However, the above system is designed so there is no global shared state, and there are no mutable buffers that can cause race conditions, so each of those blue jobs can itself be parallelized internally!
You can run the "Calculate Nav Blockers" job partially on every core, then once all of those sub-jobs have completed, you can run "Check paths still valid" on every core, etc... Now your entire system is multi-threaded, using every core, with a pretty simple synchronisation model, and no chance of deadlocks or race conditions.
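
A minimal sketch of that idea, under made-up assumptions: the "map" is a flat list of cells, a path is a list of cell indices, and ParallelFor, CalculateNavBlockers and CheckPathsStillValid are hypothetical names chosen to mirror the job names above. Each stage reads only immutable inputs and fills a fresh output buffer, and the join at the end of ParallelFor plays the role of the "wait" between stages.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs body(i) for every i in [0, count) across numThreads threads,
// and returns only once all of them have finished (the "wait" step).
void ParallelFor(std::size_t count, unsigned numThreads,
                 const std::function<void(std::size_t)>& body) {
    assert(numThreads > 0);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < numThreads; ++t) {
        threads.emplace_back([&body, count, numThreads, t] {
            for (std::size_t i = t; i < count; i += numThreads) body(i);  // strided split
        });
    }
    for (std::thread& th : threads) th.join();
}

using Path = std::vector<std::size_t>;  // hypothetical: a path is a list of cell indices

// Stage a: read-only input, fresh output buffer. vector<char> rather than
// vector<bool> so that threads writing neighbouring elements do not share bits.
std::vector<char> CalculateNavBlockers(const std::vector<int>& obstacleHeights,
                                       unsigned numThreads) {
    std::vector<char> blocked(obstacleHeights.size(), 0);
    ParallelFor(blocked.size(), numThreads, [&](std::size_t i) {
        blocked[i] = obstacleHeights[i] > 0;  // each index is written by exactly one thread
    });
    return blocked;
}

// Stage b: depends on the finished output of stage a, which is now read-only.
std::vector<char> CheckPathsStillValid(const std::vector<Path>& paths,
                                       const std::vector<char>& blocked,
                                       unsigned numThreads) {
    std::vector<char> valid(paths.size(), 1);
    ParallelFor(paths.size(), numThreads, [&](std::size_t p) {
        for (std::size_t cell : paths[p]) {
            if (blocked[cell]) valid[p] = 0;
        }
    });
    return valid;
}
```

Calling the two stages one after the other gives the serial order from the list above, while the work inside each stage is spread over every core. The remaining stages would follow the same shape.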

Now the only problem is that due to the fact that you probably won't partition your work perfectly evenly (and you won't keep every core in synch perfectly evenly), you'll end up with "stalls" or "pipeline bubbles" in your schedule.
e.g. say you've got 3 cores, and that the "Calculate Nav Blockers" job takes 0.4ms on core#1, 0.3ms on core#2 and 0.5ms on core#3. The next job ("Check paths still valid") can't start until all three cores have finished the previous job. This means that core#1 will sit idle for 0.1ms, and core#2 will sit idle for 0.2ms...
Or visually, rows are CPU cores, horizontal axis is time, blue is "Calculate Nav Blockers", red is "Check paths still valid":
[Image: per-core timeline of the two jobs, showing the idle gaps]
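
The arithmetic behind that example can be written down as a small helper. This is only a sketch: IdleAtBarrierUs and Utilisation are hypothetical names, and times are in whole microseconds to keep the numbers exact.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Given how long each core was busy on a stage (microseconds), returns how long
// each core sits idle waiting for the slowest core before the next stage can start.
std::vector<int> IdleAtBarrierUs(const std::vector<int>& busyUs) {
    assert(!busyUs.empty());
    const int slowest = *std::max_element(busyUs.begin(), busyUs.end());
    std::vector<int> idle;
    for (int b : busyUs) idle.push_back(slowest - b);
    return idle;
}

// Fraction of the available core time that was spent doing useful work.
double Utilisation(const std::vector<int>& busyUs) {
    assert(!busyUs.empty());
    const int slowest = *std::max_element(busyUs.begin(), busyUs.end());
    const int totalBusy = std::accumulate(busyUs.begin(), busyUs.end(), 0);
    return static_cast<double>(totalBusy) / (static_cast<double>(slowest) * busyUs.size());
}
```

For the 0.4ms / 0.3ms / 0.5ms case, IdleAtBarrierUs({400, 300, 500}) gives 100, 200 and 0 microseconds of idle time, and Utilisation comes out at 0.8, i.e. a fifth of the available core time in that stage is lost to waiting.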

You can mitigate this issue by having more than one system running at a time. For example, let's say that the AI system is submitting these red/blue jobs, but the physics system is also submitting a green and an orange job simultaneously. Now, when the worker CPU cores finish the blue job, and are unable to start on the red job, they'll instead grab the next available job they can, which is the physics module's "green" job. By the time they're finished computing this job, all cores have finally finished with the "blue" one, so the "red" one can now be started on, and the total amount of stalling is greatly reduced:
[Image: per-core timeline with the physics jobs filling the idle gaps]

I do have another question. How much benefit is possible here. Say single threaded vs multithread with 2/3/4 cores available?

In theory, 2x/3x/4x total performance.
In practice, even if your software is perfectly parallelizable, there are other bottlenecks, such as CPU caches often being shared between cores (e.g. only two L2 caches between four CPU cores), which prevent you from ever hitting the theoretical gains.

As to how much of a boost you can get in an RTS game... Almost everything that's computationally expensive is parallelizable, so I would say "a good boost".
I'd personally aim for maybe 1.5x on a dual-core, and 3x on a quad core.
People used to say that "games aren't parallelizable", but those people were just in denial. The entire industry has been writing games for 3-8 core CPUs for over half a decade now, and has proved these people wrong. PS3 developers have even managed to learn how to write games for multi-core NUMA CPUs, which has similarities to distributed programming.
To take advantage of parallelism in places we thought impossible, we often just need to re-learn how to write software...
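
One way to put numbers on the 1.5x and 3x targets is Amdahl's law, which the post does not mention; it is brought in here only as a rough model. It ignores cache and memory effects and assumes the parallel part scales perfectly. Both function names below are made up for the example.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: speedup on n cores when a fraction p of the frame's work
// can run in parallel and the remaining (1 - p) stays serial.
double AmdahlSpeedup(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / n);
}

// The same formula solved for p: the parallel fraction needed to reach
// a given speedup on n cores (n > 1, and speedup no greater than n).
double RequiredParallelFraction(double speedup, int n) {
    assert(n > 1 && speedup >= 1.0 && speedup <= n);
    return (1.0 - 1.0 / speedup) * n / (n - 1.0);
}
```

Under this model, 1.5x on a dual-core needs about two thirds of the frame's work to be parallel (p = 2/3), and 3x on a quad core needs about 89% (p = 8/9).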

One thing to consider about parallelization is the data locking needed to keep the data coherent (i.e. state changes because different threads are making smaller changes, and the data isn't independent of what other threads also do).

Lots of data locks for fine-grained processing often carry a lot of overhead, and the resulting inefficiency can be beaten by a single thread doing the entire processing pass (with more cores and smaller individual cores this is becoming less true).

Independent processing which simultaneously acts on different sets of data can be assigned to different 'cores' with little need for any locking mechanisms.

e.g. pathfinding working off a locked-down current state of the map (many searches can run simultaneously and independently because the data is read-only)

and at the same time planners for next/future actions can be working from each object's context, again simultaneously.

The outputs of these are directives for future actions, which can go through a single atomic queuing process, but the locks are used so infrequently that they don't add much overhead (and cause little waiting that would stall other threads).
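
A minimal sketch of that arrangement, with everything in it assumed for illustration (WorldSnapshot, Directive, DirectiveQueue, PlanAll, and the toy "retreat below 30 health" rule): planners read a frozen snapshot without any locks and only take a lock once, to hand over their finished batch of directives.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical types for illustration only.
struct WorldSnapshot { std::vector<int> unitHealth; };  // frozen for the whole AI phase
struct Directive { std::size_t unit; int action; };     // action: 0 = attack, 1 = retreat

// The single locked queue that all planners feed.
class DirectiveQueue {
public:
    void Push(const std::vector<Directive>& batch) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.insert(items_.end(), batch.begin(), batch.end());
    }
    std::vector<Directive> TakeAll() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<Directive> out;
        out.swap(items_);
        return out;
    }
private:
    std::mutex mutex_;
    std::vector<Directive> items_;
};

// Each thread plans for its own share of units against the read-only snapshot.
void PlanAll(const WorldSnapshot& world, unsigned numThreads, DirectiveQueue& queue) {
    assert(numThreads > 0);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < numThreads; ++t) {
        threads.emplace_back([&world, &queue, numThreads, t] {
            std::vector<Directive> local;  // built without any locking
            for (std::size_t u = t; u < world.unitHealth.size(); u += numThreads) {
                local.push_back({u, world.unitHealth[u] < 30 ? 1 : 0});
            }
            queue.Push(local);  // one lock per thread, not one per unit
        });
    }
    for (std::thread& th : threads) th.join();
}
```

The main thread would call TakeAll after the planning phase and apply the directives when it builds the next frozen state. The order of directives in the queue depends on thread timing, so anything order-sensitive would need sorting first.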

The game loop as a whole has phases in which the current data is transformed into the next current data, which is then locked down so the AI can work on it.

Other tasks like graphics/network communications/inputs can simultaneously be working on the AI/object state data created previously, pipelined in big independent lumps of data, with copies of the entire state data set 'buffered' (unchangeable by the other processing).

We are talking specifically about AI here, and if you have multiple objects (or even competing solutions for each object's state) you have parallelization opportunities as long as you can freeze the state being considered (to avoid huge data-lock-related problems).
