Multithreaded Task Management + Task Pool Handling

Started by October 15, 2024 10:53 PM
3 comments, last by All8Up 2 months, 1 week ago

I am writing a project where I need to place objects in STL vectors so that threads can request the pending objects. These vectors are grouped so that each one has an ID specifying which thread will handle them. This is all pretty simple, but I want to throttle my execution. I want to determine, based on a priority, how many times to reference the objects in the lists for subsequent execution. This means a pointer may appear multiple times in the distributor's lists, as scheduled based on priority. I also want my threads to sleep when not executing objects, which means they would be idle. Creating more threads might be beneficial with this system.

How would I go about calculating the allocated time for each object, weighted by priority? If anyone could help, that would be great. Also, you can place as many copies of an object in the lists as needed to increase its priority. I plan to use the pipeline model for parallel computing: I will handle the culling of renderables in one thread, the matrix mathematics in a second, the generation of normals in a third, and render the buffer objects from a fourth thread. I only need to know how many threads to create in a dynamic fashion. I do know there is a time trade-off for creating threads and it comes with a penalty.

I recommend storing object IDs instead of pointers. It makes it easier to maintain a robust app when requests that are separated in time from when they will be processed look up those objects at processing time (and fail gracefully if those objects no longer exist).

There are many heuristics that can contribute to how frequently an entity needs to be processed. Intrinsic gameplay importance, proximity to a player, actual physical size and velocity…

It’s not about allocating time per object, it’s about budgeting how many objects you can process per frame. You want to maintain a priority queue of entities to update. Much of the priority will come from the heuristics I mentioned. But ‘time since last update’ should also be a factor.

Weighted and taken together, these various considerations should allow you to figure out which N entities most need a fresh update on any given frame.
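To make that concrete, here is a minimal sketch of the scoring idea. The weights, field names (`importance`, `playerDistance`), and the `UpdateScore`/`PickTopN` functions are all illustrative placeholders, not a definitive scheme; the only real point is that staleness grows over time so nothing starves.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Entity {
    float importance;      // intrinsic gameplay importance, 0..1
    float playerDistance;  // world units to the nearest player
    double lastUpdate;     // time of the previous update, in seconds
};

float UpdateScore(const Entity& e, double now) {
    float staleness = static_cast<float>(now - e.lastUpdate);
    // Closer and more important entities score higher; staleness always
    // grows, so every entity eventually rises to the top.
    return e.importance * 2.0f + 1.0f / (1.0f + e.playerDistance) + staleness;
}

// Pick the N entities with the greatest need for an update this frame.
std::vector<Entity*> PickTopN(std::vector<Entity>& all, double now, std::size_t n) {
    std::vector<Entity*> ranked;
    ranked.reserve(all.size());
    for (Entity& e : all) ranked.push_back(&e);
    n = std::min(n, ranked.size());
    std::partial_sort(ranked.begin(), ranked.begin() + n, ranked.end(),
        [now](const Entity* a, const Entity* b) {
            return UpdateScore(*a, now) > UpdateScore(*b, now);
        });
    ranked.resize(n);
    return ranked;
}
```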

scruthut said:
I also want my threads to sleep when not executing objects, which means they would be idle. Creating more threads might be beneficial on my part with this system.

Usually it's best to create as many threads as you have CPU cores, but not more.
Each thread grabs a task, ideally with atomics to avoid the cost of a mutex, processes it, and goes to sleep when there is no more work.
If you have 1000 work items, that's much faster than creating 1000 threads.
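A minimal sketch of that idea: one worker per core, each grabbing the next work item with an atomic fetch-add instead of a mutex. `ProcessAll` is an illustrative name, not a real API, and the "work" here is just summing integers.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

int ProcessAll(std::size_t itemCount) {
    std::vector<int> work(itemCount, 1);   // each item "costs" 1 unit
    std::atomic<std::size_t> next{0};      // shared cursor into the work list
    std::atomic<int> done{0};

    unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < cores; ++i) {
        pool.emplace_back([&] {
            for (;;) {
                // Claim the next item without locking.
                std::size_t idx = next.fetch_add(1, std::memory_order_relaxed);
                if (idx >= work.size()) return;   // queue drained: worker exits
                done.fetch_add(work[idx], std::memory_order_relaxed);
            }
        });
    }
    for (auto& t : pool) t.join();   // 1000 items, but only ~core-count threads
    return done.load();
}
```

In a real system the workers would sleep on a condition variable instead of exiting, but the claiming mechanism is the same.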

scruthut said:
How would I go about calculating the allocated time for each object that is weighted based on priorities?

What's the expected profit of estimating in advance how much time a job takes?
You can, of course, gather statistics from current work, which you then use to distribute future workloads better, e.g. to keep the application responsive or to match a target framerate.

scruthut said:
Also you can place as many copies of the objects in the lists if needed to be done for increasing priority.

If there are multiple instances of a workload, you likely incur additional synchronization costs to check whether it was already processed from another list / thread. Those costs can be higher than the expected win.

scruthut said:
I do know there is a time trade off for creating threads and it comes with a penalty.

Yes, and those costs are high. So why not a standard job system, which creates one worker thread per core and keeps it alive all the time?

scruthut said:
I plan to use the pipeline model for parallel computing. I will handle the culling of renderables in one thread, then matrice mathematics in the second, and generating normals in the third thread and render the buffer objects from the fourth thread.

Ah, I see. For that a job system isn't needed, and spawning just 4 threads should work fine.

The alternative (which I had in mind until here), using a job system, could look like this:
All threads work on the culling. Each job culls N objects before the thread picks a new job (which requires figuring out a good value of N).
The main thread prepares the matrix workload while the worker threads are busy, then it goes to sleep, waiting for the culling to finish.
Then the worker threads start the matrix work, and so on.

Advantages:
The system scales automatically to any core count.
No need for frequent synchronization, assuming the jobs are lock free.
Because all threads do the same thing, cache utilization might be better. (But can also be a disadvantage since CPU needs to synchronize caches and memory across multiple cores, and multiple cores accessing the exact same memory can cause a bottleneck.)

But that's just a proposal to consider. I have little experience comparing these two approaches to multithreading; I always use the fine-grained job system approach, so I can't tell. It's surely possible to mix both approaches as well.

It is a bit difficult to understand your actual needs based on the description, but here are a few basic thoughts. First off, I'm going to disagree with JoeJ to start with: do not even look at or consider lock-free/atomics when setting things up. Your initial goal is to get correct and safe behavior with reasonable performance from the system, not to micro-optimize it. Having said that, you won't scale horizontally as well without lockless techniques, but the goal up front is all about architecture and getting rid of as much synchronization as possible, long before you dig into minimizing the cost of said synchronization. Proper architecture in threading has a much greater impact than in any other area you typically work in. Tiny changes tend to have exponential impacts, and while lockless/atomics reduce those problems, they leave you with a ticking time bomb as you scale up and those glossed-over issues come back and bite ya directly on the butt. You want to know/fix as many of those up front as possible before you optimize at that lower level.

Anyway, the very first thing I would consider here is to rethink how to approach the problem. The problem statement seems to be a bit messy, so I'm going to start by simplifying things to what I “think” you want to end up with and start there. Obviously if I'm making incorrect assumptions you'll need to come back and correct it, but for the moment, it is best guess.

Basic loop:

  1. Update objects for this time slice. Dealing with ‘priorities’ as to which objects get updated.
  2. Perform pre-render transformations. (I.e. the bit which does your matrices etc.)
  3. Perform culling.
  4. Render.

Now, as JoeJ mentioned, there is generally no need to do this using multiple threads dedicated to each step; in fact, that is typically a very bad idea, as it increases synchronization overhead and latencies. Reusing the same set of threads is a much better solution, as it centralizes all the problems so you only need to solve the issues once. So, start with a 'staged' execution model, which basically means that in the above you do a full fence between each step. (This is also how most pipelined models work under the hood; they don't use separate threads either.) Anyway, making sure stage 1 executes to completion before starting stage 2, then 3, etc. is fairly easy, so I'll just skip that bit. Let me know if you need suggestions there.

The core bit that remains is dealing with priorities and such. The first thing to consider is how you want to perform the update in general: whether it is based on frame rate or on time stepping is a core decision. Personally I prefer time stepping for determinism and other benefits, so I'll just assume that. If wrong, the same approach works for loop counting just as well; it just uses a different concept of 'when' and 'now'. First off, get rid of the "priority"; that is a very difficult thing to actually figure out, which is likely where you are getting stuck in general, because it needs a lot of guesstimated information. Instead, consider each entity as having a 'rate' of update. So, if you want something updated at high priority, it would have a low rate value, i.e. the delta time between one update and the next. This ends up giving you the same control as a "priority" but without any of the complexity of tracking the last update and deciding when to schedule the next step. It also solves a bunch of further problems you haven't mentioned and, of course, gives you a clear way to idle the threads when nothing is supposed to happen.

Now, you only need the entity in the list of things to update once, and you can simply insert-sort each item based on when in time it should next update. When you actually update something, after you are done, re-add the entity at now + rate. This handles everything related to priority, in the sense that some entities might update five times before another updates once, and it is exceptionally simple to deal with.
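A small sketch of that rate-based schedule, using a `std::priority_queue` as the sorted list (names like `CollectDue` are illustrative): each entity stores the absolute time of its next update, everything due at `now` is popped in one batch, and each popped entity is immediately re-inserted at now + rate.

```cpp
#include <queue>
#include <vector>

struct Scheduled {
    double nextUpdate;   // absolute time this entity is next due
    double rate;         // seconds between updates (low rate = high priority)
    int entityId;        // store IDs rather than pointers, as suggested above
};

struct Later {           // min-heap: the earliest nextUpdate sits on top
    bool operator()(const Scheduled& a, const Scheduled& b) const {
        return a.nextUpdate > b.nextUpdate;
    }
};

using Schedule = std::priority_queue<Scheduled, std::vector<Scheduled>, Later>;

// Pop every entity due at 'now', return their IDs, and re-schedule each
// one at now + rate. The returned batch can be split across worker threads.
std::vector<int> CollectDue(Schedule& q, double now) {
    std::vector<int> due;
    while (!q.empty() && q.top().nextUpdate <= now) {
        Scheduled s = q.top();
        q.pop();
        due.push_back(s.entityId);
        s.nextUpdate = now + s.rate;   // re-add with now + rate
        q.push(s);
    }
    return due;
}
```

Because a re-inserted entity's next time is strictly after `now` (for any positive rate), the loop naturally stops once the batch is drained.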

Beyond the ease in which this deals with priorities and time slicing, it also provides the following:

  1. Batching. When you get to the next update you grab everything in the vector with “time ≤ now”. So, you can execute everything in parallel by feeding partitions of that result to the worker threads.
  2. Threads can be put to sleep. You generally want to let the threads sleep to the minimum value of ‘next time to render’ and ‘next object which needs an update’. Of course, OS sleeps are notoriously inaccurate so you want to build in some slop and/or do something a bit better to get accurate wake times, but that is beyond scope here.
  3. When first adding entities to the list, make sure to randomize their update time a little bit. You don't want a bunch of entities trying to update every 10th of a second exactly for instance, so spread them out over several frames.
  4. Easy to optimize. For instance, insert sorting into a vector is going to be a bottleneck if entity counts get too high. Using a timer wheel is often a better solution. See: https://blog.acolyer.org/2015/11/23/hashed-and-hierarchical-timing-wheels

This all forms the basis of most pipelined event processing systems and is a fairly well proven starting point for what I think you are looking to achieve. The only real difference from most implementations is that I assume you want to render even if there are no entities updating, hence the min of update or next render. Beyond that, all standard stuff I've used over and over for different projects.
