JoeJ said:
But you can implement a job system yourself with std::thread of course, likely using atomic counters to push and pop work, and I also use atomics for sync.
Well, `std::thread` itself is mostly a low-level component, along with the atomics, locks, condition variables, etc. So for writing a job system or other low-level threaded code, it's something to consider rather than the Windows or pthreads APIs, compiler intrinsics, etc.
As well as giving you a single cross-platform API, the main advantage is that it handles things like passing parameters, return types, and capturing exceptions, while a platform API might basically just give you a `void*`-sized parameter and an integer exit status for you to build stuff on top of.
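For instance, a minimal sketch of what that buys you: arbitrary typed parameters are forwarded straight into the thread function, with no manual packing behind a `void*` (exception capture comes via `std::async`/`std::future`, shown further down).

```cpp
#include <iostream>
#include <string>
#include <thread>

int main()
{
    // Arguments of any type are copied/moved into the new thread;
    // compare pthread_create() or CreateThread(), where you get one
    // void* and have to heap-allocate a struct to carry everything.
    std::string name = "job 42";
    int count = 3;
    std::thread t([](std::string n, int c) {
        for (int i = 0; i < c; ++i)
            std::cout << n << " iteration " << i << "\n";
    }, std::move(name), count);
    t.join();
}
```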
In theory, the standard C++ async task/job system is `std::async`. An implementation may back it with a thread pool to avoid the thread start-up cost (I believe most do), and may cap it at a fairly sensible maximum thread count (I don't recall noticing any that do, probably because they assume your tasks will block and such). But you have very little control over any of this.
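A quick sketch of what that looks like in practice; `std::launch::async` is about the only knob you do get, forcing a real concurrent launch rather than a deferred call.

```cpp
#include <future>
#include <iostream>
#include <stdexcept>

int compute(int x)
{
    if (x < 0) throw std::invalid_argument("negative input");
    return x * x;
}

int main()
{
    // Without std::launch::async the implementation may defer the call
    // until .get() instead of running it concurrently.
    std::future<int> f = std::async(std::launch::async, compute, 7);

    // The result comes back typed, and any exception thrown inside
    // compute() is rethrown here rather than being lost.
    try {
        std::cout << f.get() << "\n"; // prints 49
    } catch (const std::exception& e) {
        std::cout << "task failed: " << e.what() << "\n";
    }
}
```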
And when making your own thread pool, you can still make use of `std::future` etc. to get basically the same API, but with much more control. Although you can quite possibly gain a bit more performance by skipping those C++ library components entirely.
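As a rough illustration of that idea (not how any particular engine does it), here is a minimal fixed-size pool whose `submit()` hands back a `std::future`. It uses a plain mutex-guarded queue rather than the atomic counters JoeJ described, to keep the sketch short.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal sketch: fixed thread count, single locked queue, no work stealing.
class ThreadPool {
public:
    explicit ThreadPool(unsigned count)
    {
        for (unsigned i = 0; i < count; ++i)
            workers_.emplace_back([this] { run(); });
    }

    ~ThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }

    template <typename F>
    auto submit(F f) -> std::future<decltype(f())>
    {
        using R = decltype(f());
        // packaged_task wires the callable's result (or exception)
        // into the future we hand back, same as std::async would.
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(f));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return result;
    }

private:
    void run()
    {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !jobs_.empty(); });
                if (stop_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();
        }
    }

    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> jobs_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool stop_ = false;
};
```

Usage is then `ThreadPool pool(4); auto f = pool.submit([]{ return 2 + 2; }); int v = f.get();`, with exceptions from the task rethrown by `get()` just like `std::async`.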
JoeJ said:
I have the same question - not sure if it is possible to get information which threads run on the same core or which cores run on the same chiplet etc.
There are APIs to get the NUMA nodes, physical core count, logical core count, etc. The only multi-NUMA system you are likely to come across is 1st- and 2nd-gen Threadripper. For anything more detailed, I think you would have to just code for the specific CPU series if you were interested, unless there are APIs I missed detailing, say, CCX layout, and those will be forward compatible with whatever the next AMD/Intel/etc. design is.
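For what it's worth, the portable baseline is pretty thin; a sketch of what standard C++ gives you, with the platform routes noted in comments:

```cpp
#include <iostream>
#include <thread>

int main()
{
    // Portable but coarse: logical processor count only
    // (may legitimately return 0 if the library can't tell).
    unsigned logical = std::thread::hardware_concurrency();
    std::cout << "Logical processors: " << logical << "\n";

    // For physical core counts, cache sharing, or NUMA layout you need
    // platform APIs, e.g. GetLogicalProcessorInformationEx() on Windows
    // or /sys/devices/system/cpu/cpu*/topology/ on Linux; there is no
    // standard C++ way to query any of that.
}
```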
You can set per-thread affinity, but at that point you are basically trying to outthink the system scheduler. It would certainly need a lot of testing, because you have to consider what the hundreds or thousands of other processes/threads on the system want to do. On something like a console, where you can code to a specific design, it can make a lot more sense.
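If you do want to experiment with it, a minimal cross-platform pinning helper might look like the following. Note that `native_handle()`'s type is implementation-defined, so this assumes the usual mappings (a Win32 `HANDLE` under MSVC, a `pthread_t` under libstdc++/libc++), and `pthread_setaffinity_np` is a GNU extension.

```cpp
#include <thread>
#ifdef _WIN32
#include <windows.h>
#else
#include <pthread.h> // pthread_setaffinity_np: GNU extension, may need _GNU_SOURCE
#include <sched.h>
#endif

// Pin the given thread to one logical core (sketch; error handling omitted).
void pin_to_core(std::thread& t, unsigned core)
{
#ifdef _WIN32
    SetThreadAffinityMask(t.native_handle(), DWORD_PTR(1) << core);
#else
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
#endif
}
```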
e.g. If you restrict one of your threads to a specific logical core, and the CPU is almost idle and your thread wants to run, but Chrome, or Steam, or whatever is currently on that core, then you don't get to run until its time slice is done, and you are almost certainly a lot worse off.
Maybe the OS is clever enough to go “oh, that thread wants to run 99% of the time and it's pinned to core 2; it's not running right now, but maybe I'd better put this other thing somewhere other than core 2”, but honestly that would surprise me.
Now maybe instead you try a middle ground and lock each of your threads to one logical core per physical core. This would still need a lot of testing, though, and you still can't stop other programs from running on the same physical core as you via SMT (at least without some nasty hacking to change other programs' affinity dynamically).
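A rough sketch of that middle-ground pattern, with the big assumption spelled out: it ASSUMES logical processors 2n and 2n+1 are SMT siblings on the same physical core, which is a common numbering but not guaranteed. A real implementation should read the actual topology via the platform APIs mentioned above. `pin_to_core` is the helper from the earlier sketch.

```cpp
#include <thread>
#include <vector>

// Hypothetical: spawn one worker per physical core by taking every second
// logical processor. Assumes the 2n/2n+1 SMT sibling layout; verify against
// the real topology before relying on this.
void spawn_one_per_physical_core(std::vector<std::thread>& workers,
                                 void (*work)())
{
    unsigned logical = std::thread::hardware_concurrency();
    for (unsigned core = 0; core < logical; core += 2) {
        workers.emplace_back(work);
        pin_to_core(workers.back(), core); // helper from the earlier sketch
    }
}
```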