Quote:
First, before we can improve performance, we have to know where we stand.
...we then have a full 24,000 intersection checks.
That's a LOT of computation, especially on the 360.
25 tesseracts. Release build. No debugger.
It was time to use... (zoom in on face) THREADS.
Something kept bugging me though.
I wanted to try and make it lockless to further reduce any blocking.
If it didn't work I could always roll back...

And now the conclusion.
Well, to get started, any thread pool is going to need threads, so we need to set some up:
    _Threads[0] = Thread.CurrentThread;
    Thread.SetData(_ThreadIndexStoreSlot, 0);
#if XBOX
    Thread.CurrentThread.SetProcessorAffinity(XBoxCoreMap[0]);
#endif

    if (!forceSingleThread)
    {
        for (int i = 1; i < ThreadCount; i++)
        {
            _Threads[i] = new Thread(delegate()
            {
#if XBOX
                Thread.CurrentThread.SetProcessorAffinity(XBoxCoreMap[i]);
#endif
                Thread.SetData(_ThreadIndexStoreSlot, i);
                _TaskInitWaitHandle.Set();
                ThreadProc();
            });

            _Threads[i].Start();
            _TaskInitWaitHandle.WaitOne();
        }
    }
}
_Threads is an array of all the worker threads held by the manager. At the top, we set the first element to be the current thread, set its thread-local index value to 0, and set its hardware thread affinity to the first entry in the mapping. XBoxCoreMap is an array holding the indices of the hardware thread slots that the worker threads should use. I define it as { 1, 3, 4, 5 }, because slots 0 and 2 are reserved by the XNA framework itself. So the main thread gets slot 1 (which is actually its default, but I set it explicitly just to be thorough).
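For orientation, here is roughly what the fields involved look like. This is a simplified sketch, not a verbatim copy of the real source (which you can grab at the end); in particular the wait handle types and the thread-local data slot are my best guess from the snippets in this post.

    // Sketch of the fields the constructor code above assumes.
    Thread[] _Threads;                                     // sized to ThreadCount in the constructor
    static readonly int[] XBoxCoreMap = { 1, 3, 4, 5 };    // hardware threads 0 and 2 are reserved by XNA
    LocalDataStoreSlot _ThreadIndexStoreSlot = Thread.AllocateDataSlot();
    AutoResetEvent _TaskInitWaitHandle = new AutoResetEvent(false);
    ManualResetEvent _ManagerWaitHandleA = new ManualResetEvent(false);
    ManualResetEvent _ManagerWaitHandleB = new ManualResetEvent(false);
    ManualResetEvent _ManagerCurrentWaitHandle;            // refers to A or B; swapped at the end of DoTasks
    List<Action> _Tasks;                                   // the work items currently being executed
    List<TaskException> _Exceptions = new List<TaskException>();
    object _ExceptionsLock = new object();
    int _CurrentTaskIndex;                                 // shared work cursor, advanced via Interlocked.CompareExchange
    int _WaitingThreadCount;                               // how many workers are parked on a wait handle
    bool _Running;
    volatile bool _Disposing;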
After the main thread setup, we have a loop that spins up the other workers. We new up a thread for each entry, giving it a ThreadStart delegate that sets its processor affinity, its thread local index, then signals a wait handle. This wait handle is crucial. As I mentioned in part 1, it keeps the main thread in sync with the worker threads, allowing each one to read the current value of 'i' before the main thread changes it for the next loop iteration. After the wait handle, it drops into the main worker thread procedure, which we'll see later.
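To see why that handshake matters, here is the capture problem in miniature, completely separate from the task manager: all of the delegates share the same 'i', so without something forcing each thread to read its index before the loop moves on, a worker can easily see the wrong value.

    using System;
    using System.Threading;

    class CaptureHazardDemo
    {
        static void Main()
        {
            // All three delegates close over the same loop variable, so by the
            // time each one actually runs, 'i' may already have advanced
            // (it can even print 3, after the loop has finished).
            for (int i = 0; i < 3; i++)
            {
                new Thread(delegate() { Console.WriteLine("my index is " + i); }).Start();
            }
        }
    }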
Now that we have our workers ready to go, how do we actually go about performing tasks? The DoTasks method is the main entry point for the task manager that other code will call when it wants parallel work done:
public void DoTasks(List<Action> tasks)
{
    if (_Disposing)
    {
        throw new InvalidOperationException("Cannot run after disposal.");
    }

    if (_Running)
    {
        throw new InvalidOperationException("Cannot run while already running.");
    }

    _Running = true;
    _Tasks = tasks;

    try
    {
        if (_ForceSingleThread)
        {
            for (int i = 0; i < tasks.Count; i++)
            {
                tasks[i]();
            }
        }
        else
        {
            _CurrentTaskIndex = 0;
            _WaitingThreadCount = 0;
            _ManagerCurrentWaitHandle.Set();

            TaskPump();

            while (_WaitingThreadCount < ThreadCount - 1)
            {
                Thread.Sleep(0);
            }

            if (_Exceptions.Count > 0)
            {
                DoTasksException e = new DoTasksException(_Exceptions);
                _Exceptions.Clear();
                throw e;
            }
        }
    }
    finally
    {
        _ManagerCurrentWaitHandle.Reset();
        _ManagerCurrentWaitHandle = _ManagerCurrentWaitHandle == _ManagerWaitHandleA ?
            _ManagerWaitHandleB : _ManagerWaitHandleA;
        _Tasks = null;
        _Running = false;
    }
}
The first thing we have is a couple of simple checks to make sure the manager hasn't been disposed of and that it isn't already running (on another thread). Next we mark it as running and set the internal _Tasks list to the list that was passed in. This list will be read in parallel by the worker threads, so I use a concrete List instead of an IList. With an interface, there's no way of knowing what's actually going on inside the accessors, or whether they're thread safe, whereas a List is clearly defined as being an array internally. Here we see the _ForceSingleThread flag again. If this is set by the constructor, the task manager will not create any threads at all and will execute all the tasks serially. This is mainly for diagnostic and comparison purposes.
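To make the calling side concrete, here is a sketch of what usage looks like. The chunking, CheckIntersectionChunk, chunkCount, and the _TaskManager instance are all made up for illustration; only DoTasks itself is part of the manager.

    // Hypothetical caller: split the intersection tests into one task per chunk
    // and hand them all to the manager. DoTasks blocks until every task has run.
    List<Action> tasks = new List<Action>();
    for (int chunk = 0; chunk < chunkCount; chunk++)
    {
        int localChunk = chunk;   // copy so each task captures its own value
        tasks.Add(delegate() { CheckIntersectionChunk(localChunk); });
    }
    _TaskManager.DoTasks(tasks);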
The meat is in the 'else' clause, which prepares to execute the tasks by resetting the starting task index to 0 and the waiting thread count to 0. It then signals a ManualResetEvent, which activates all the worker threads simultaneously and avoids the one-by-one activation of my previous implementation. Next it calls into TaskPump, the method that actually acquires and executes work items, so my other goal of having the main thread do work as well is achieved. When TaskPump returns, the main thread sits in a loop waiting for the other threads to finish.
Once all threads are finished, we check to see if any threads added an exception to the internal _Exceptions list. If they did, we aggregate them all into a single DoTasksException and throw that back to the caller so they can examine it, or allow it to fall through to a debugger.
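For reference, TaskException and DoTasksException are just thin wrappers, something along these lines (a sketch only; the exact shape is in the download). Note that the aggregate exception has to copy the list, since DoTasks clears _Exceptions immediately after constructing it.

    // Sketch of the wrapper types (the real ones ship with the linked source).
    public class TaskException : Exception
    {
        public readonly Action Task;    // the work item that threw

        public TaskException(Action task, Exception inner)
            : base("A task threw an exception.", inner)
        {
            Task = task;
        }
    }

    public class DoTasksException : Exception
    {
        public readonly TaskException[] TaskExceptions;

        public DoTasksException(List<TaskException> exceptions)
            : base("One or more tasks threw exceptions.")
        {
            // Copy, because DoTasks clears the shared list right after this.
            TaskExceptions = exceptions.ToArray();
        }
    }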
Not to be overlooked, the finally block at the bottom of DoTasks plays a critical role. First it resets (closes) the wait handle we signaled earlier to start the workers. Next, it switches the current wait handle to the other of the two defined in the class. This will make more sense when we see ThreadProc:
void ThreadProc()
{
    while (true)
    {
        Interlocked.Increment(ref _WaitingThreadCount);
        _ManagerWaitHandleA.WaitOne();

        if (_Disposing)
        {
            return;
        }
        else
        {
            TaskPump();
        }

        Interlocked.Increment(ref _WaitingThreadCount);
        _ManagerWaitHandleB.WaitOne();

        if (_Disposing)
        {
            return;
        }
        else
        {
            TaskPump();
        }
    }
}
This is the loop that all worker threads spend their time in. The first thing each one does as it enters is safely increment a value that tracks the number of threads currently in a waiting state, then it waits on wait handle A. At first glance the loop seems to repeat the same code twice, and it does: the second portion is identical to the first, except that it waits on handle B. This plays into the "current wait handle" we saw on the main thread. The first time DoTasks is called, it signals handle A while handle B stays closed. That means all workers run TaskPump, then block on handle B. The main thread then closes handle A and uses handle B the next time. Likewise, the workers move when B is signaled, run TaskPump, then block once more on A. This A/B/A/B pattern makes it simple to tell all workers that they can start running, while at the same time being sure that they'll stop where you want them to. With only a single handle, the main thread would have to keep it open until every worker was confirmed to have woken up, but not so long that a fast worker could finish its work and slip past the still-open handle a second time. As a final note, the _Disposing flag is set when the manager is disposed of and tells the workers to return, thereby ending the thread.
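I won't walk through Dispose here, but since _Disposing is only checked right after a wait handle is signaled, disposal roughly has to look like this (a sketch, not the actual code from the download): set the flag, wake every parked worker so it can see the flag, and join.

    // Rough sketch of disposal: both handles are signaled so that every worker,
    // whichever handle it happens to be blocked on, wakes up, sees _Disposing,
    // and returns out of ThreadProc.
    public void Dispose()
    {
        _Disposing = true;
        _ManagerWaitHandleA.Set();
        _ManagerWaitHandleB.Set();

        for (int i = 1; i < ThreadCount; i++)   // index 0 is the main thread
        {
            _Threads[i].Join();
        }

        _ManagerWaitHandleA.Close();
        _ManagerWaitHandleB.Close();
    }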
The final piece of the puzzle is TaskPump itself:
void TaskPump()
{
    List<Action> tasks = _Tasks;
    int taskCount = tasks.Count;

    while (_CurrentTaskIndex < taskCount)
    {
        int taskIndex = _CurrentTaskIndex;

        if (taskIndex == Interlocked.CompareExchange(ref _CurrentTaskIndex, taskIndex + 1, taskIndex) &&
            taskIndex < taskCount)
        {
            try
            {
                tasks[taskIndex]();
            }
            catch (Exception e)
            {
                lock (_ExceptionsLock)
                {
                    _Exceptions.Add(new TaskException(tasks[taskIndex], e));
                }
            }
        }
    }
}
In this method, we enter a loop that executes until _CurrentTaskIndex indicates that all work items have been fetched. The next bit is the dangerous lockless part. First we make a local copy of the current task index. Next, we compare that local copy to the result of an Interlocked.CompareExchange call. This call safely stores the second argument into the location given by the first, but ONLY if the value at that location equals the third argument, all as one atomic operation; it returns whatever value was at that location before the exchange. If that returned value matches our local copy, the exchange went through and we have effectively "claimed" that index for the executing thread. If it doesn't match, another thread altered the value between when we made our local copy and when we attempted the exchange, which means that thread claimed the index and we must try again with the new value. If we pass the check, we use our claimed index to retrieve the corresponding task from the task list and execute it. If it throws an exception, we catch it, lock the exception list, and add it in for the main thread to sort out later. I don't try to avoid locks in this part, because if task code is throwing exceptions on a regular basis, then something's broken and should be fixed.
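If the compare-exchange dance is hard to picture inside the manager, here is the same claim pattern boiled down to a self-contained console program. Nothing in it comes from the task manager; it's just the technique in isolation, complete with the bounds guard whose purpose I'll get to in a moment.

    using System;
    using System.Threading;

    class ClaimPatternDemo
    {
        static int _next;                   // shared work cursor
        const int WorkItemCount = 20;

        static void Main()
        {
            Thread[] workers = new Thread[4];
            for (int t = 0; t < workers.Length; t++)
            {
                workers[t] = new Thread(Consume);
                workers[t].Start();
            }
            foreach (Thread worker in workers)
            {
                worker.Join();
            }
        }

        static void Consume()
        {
            while (_next < WorkItemCount)
            {
                int index = _next;
                // Claim 'index' only if no other thread has advanced the cursor
                // since we read it; otherwise loop and retry with the new value.
                if (index == Interlocked.CompareExchange(ref _next, index + 1, index) &&
                    index < WorkItemCount)
                {
                    Console.WriteLine(Thread.CurrentThread.ManagedThreadId + " claimed item " + index);
                }
            }
        }
    }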
With all the pieces in place, it was time for the moment of truth. I hooked everything up, deployed to 360, and...
150 tesseracts.
AWESOME. I was totally stoked. I now had a full 3x improvement over the single threaded version. Maybe this lockless thing wasn't as hard as they said! Of course, several runs later, I was finally punished for my hubris.
A seemingly random crash out of nowhere. I went cold. Thankfully I happened to be in debug mode at the time and caught the exception, and I knew I had to try to fix it then and there: if it was some kind of arcane synchronization bug, I might not be able to get it to happen again for a long time. As it turned out, one of the workers was trying to read a task just past the end of the list. But how? The check in the while loop should catch that, right? Wrong.
Let's say two threads pass the check in the while loop, but pause before attempting to claim an index. One of them moves through the compare exchange and grabs an index successfully, and immediately afterwards the other does the same; because the second thread re-reads the counter after the first thread's increment, its compare exchange also succeeds. The current index counter has now been incremented twice. This is fine and dandy, but what if there was actually only one work item left in the list? The second thread has just claimed an index one past the end. This is where the "&& taskIndex < taskCount" check that I glossed over in my explanation comes in. After slapping that in, no more crashes (yet).
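To spell that interleaving out with concrete numbers (taskCount == 4, so the last valid index is 3):

    _CurrentTaskIndex == 3
    Thread A: while (3 < 4) passes the loop check
    Thread B: while (3 < 4) passes the loop check
    Thread A: taskIndex = 3; CompareExchange(3 -> 4) succeeds; runs tasks[3]
    Thread B: taskIndex = _CurrentTaskIndex, which is now 4 after A's increment
    Thread B: CompareExchange compares 4 to 4, so it also succeeds and stores 5
    Thread B: without the "taskIndex < taskCount" guard, it calls tasks[4] -> crash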
Is it foolproof now? Honestly, who knows. It's working great so far, and I'll be continuing to rely on it for the foreseeable future, so we'll see how it goes. In part 1 I promised a bonus, so here's the full code for the manager for those interested:
http://members.gamedev.net/JPatrick/TaskComponent.zip
If anyone tries it out, please let me know how it works out for your program and if it actually gets you more performance.
So my tale of threading comes to an end. Perhaps next time I'll talk a bit about trying to wrangle a 4D editor out of WPF.