uh… sadly this can't work the way you want.
fido9dido said:
[numthreads(1, 1, 1)]
Note that this means only one thread out of 32 or 64 (depending on the HW) will do any work at all. The others remain idle, and there is no way to put them to use.
Currently, shaders cannot call other shaders at all, so subfunctions of that kind are not possible. You can only implement such things on the API side, using multiple dispatches and synchronization between them, which has a cost.
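To illustrate the multi-dispatch approach, here is a minimal sketch with two kernels in one file. The names (PassA, PassB, gData, gTemp, gCount) and the register slots are made up for this example. The API side would dispatch PassA first, then PassB, with whatever synchronization your API requires in between.

```hlsl
// Hypothetical resources; names and register slots are assumptions.
cbuffer Params : register(b0)
{
    uint gCount; // number of valid elements
};
RWStructuredBuffer<float> gData : register(u0);
RWStructuredBuffer<float> gTemp : register(u1);

// First dispatch: every thread writes one intermediate result.
[numthreads(64, 1, 1)]
void PassA(uint3 dtid : SV_DispatchThreadID)
{
    if (dtid.x >= gCount) return;
    gTemp[dtid.x] = gData[dtid.x] * gData[dtid.x];
}

// Second dispatch: reads a result written by a different thread,
// so all of PassA has to be finished before this may start.
[numthreads(64, 1, 1)]
void PassB(uint3 dtid : SV_DispatchThreadID)
{
    if (dtid.x >= gCount) return;
    uint next = min(dtid.x + 1, gCount - 1);
    gData[dtid.x] = gTemp[dtid.x] + gTemp[next];
}
```

PassB cannot simply be called from inside PassA, because the neighbouring value may belong to another workgroup that has not run yet. That dependency is what forces the second dispatch.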
fido9dido said:
[numthreads(32, 32, 32)]
This would be too many threads. It would lead to a workgroup size of 32 * 32 * 32 = 32,768 threads, which is much more than a single CU (AMD) or SM (NVIDIA) has. APIs define limits here, usually 1024 at most.
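For comparison, a sketch of a 3D workgroup that stays within the limit. The buffer name, the register slot, and the assumption that groups are dispatched along x only are mine, not from your code.

```hlsl
// Hypothetical output buffer; name and register slot are assumptions.
RWStructuredBuffer<uint> gOutput : register(u0);

// 8 * 8 * 4 = 256 threads per workgroup, well under a 1024 limit.
// [numthreads(32, 32, 32)] would ask for 32,768 threads per workgroup.
[numthreads(8, 8, 4)]
void CSMain(uint groupIndex : SV_GroupIndex, uint3 groupId : SV_GroupID)
{
    // SV_GroupIndex is the flattened thread index inside the group (0..255).
    // Assumes workgroups are dispatched along x only.
    gOutput[groupId.x * 256 + groupIndex] = groupIndex;
}
```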
Let's take the older GCN architecture from AMD as an example, because that's the one I know best. Here one CU has 64 threads which execute a program in lockstep. If you make a larger workgroup of size 256, the CU will execute the same code 4 times sequentially, but as a programmer you can think of all 256 threads as running in parallel. Also, all 256 threads can access the same block of LDS memory, which is reserved for the workgroup if you use it.
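As a rough sketch of a 256-thread workgroup sharing one block of LDS (called groupshared in HLSL), here is a simple sum per workgroup. The buffer names and register slots are assumptions, and the input length is assumed to be a multiple of 256.

```hlsl
#define GROUP_SIZE 256

// Hypothetical resources; names and register slots are assumptions.
StructuredBuffer<float>   gInput     : register(t0);
RWStructuredBuffer<float> gGroupSums : register(u0);

// One block of LDS, visible to all 256 threads of the workgroup.
groupshared float sShared[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID,
            uint  gi   : SV_GroupIndex,
            uint3 gid  : SV_GroupID)
{
    // Each thread loads one element into shared memory.
    sShared[gi] = gInput[dtid.x];
    GroupMemoryBarrierWithGroupSync();

    // Tree reduction: halve the number of active threads each step.
    for (uint stride = GROUP_SIZE / 2; stride > 0; stride >>= 1)
    {
        if (gi < stride)
            sShared[gi] += sShared[gi + stride];
        GroupMemoryBarrierWithGroupSync();
    }

    // Thread 0 writes the sum of this workgroup.
    if (gi == 0)
        gGroupSums[gid.x] = sShared[0];
}
```

The barriers are needed precisely because the 256 threads do not really run all at once, even though you can think of them that way.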
What's important to take from all this: a workgroup should have a size of at least 32 (or 64 on AMD GCN) and at most 256. Sizes of 512 and 1024 work as well, but then fewer workgroups are available for the hardware to switch between, which is similar to hyper-threading on a CPU. Such switches between active workgroups are important for hiding memory access latency.
This probably means you have to subdivide your work into smaller chunks, as proposed in the other topic. The chunks will be processed in arbitrary order, so they need to be truly independent of each other.
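A minimal sketch of such a subdivision, assuming the work is a flat array where each element can be processed on its own. The names (gInput, gOutput, gElementCount), the register slots, and the "multiply by 2" placeholder work are made up for illustration.

```hlsl
// Hypothetical resources; names and register slots are assumptions.
cbuffer Params : register(b0)
{
    uint gElementCount; // total number of elements to process
};
StructuredBuffer<float>   gInput  : register(t0);
RWStructuredBuffer<float> gOutput : register(u0);

// Each workgroup handles a chunk of 64 elements, one element per thread.
[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    uint i = dtid.x;

    // The last chunk may be only partially filled, so guard against overrun.
    if (i >= gElementCount) return;

    // Placeholder work; it depends only on element i, never on other chunks.
    gOutput[i] = gInput[i] * 2.0f;
}
```

On the API side you would dispatch (gElementCount + 63) / 64 workgroups along x, so every element is covered exactly once.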
fido9dido said:
void otherfunc()
This works, but it's not implemented as a function call the way you imagine. It's more of a tool to reduce shader code size, e.g. if you need similar math often. All threads execute it in parallel; threads that are masked out due to a branch cannot execute a different function in the meantime.
So you could just inline all your functions and the result would be the same (although in practice, inlining the code is usually faster :/ )
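A small sketch of what I mean, with a made-up helper (Remap) and a made-up buffer (gData). The two branches below give the same result; the second is just the first with the helper pasted in by hand.

```hlsl
// Hypothetical buffer; name and register slot are assumptions.
RWStructuredBuffer<float> gData : register(u0);

// Ordinary helper function, used to avoid repeating the same math.
float Remap(float x)
{
    return saturate(x * 0.5f + 0.5f);
}

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID)
{
    float v = gData[dtid.x];

    if (v < 0.0f)
    {
        // Threads with v >= 0 are masked out while this side runs;
        // they wait, they do not run some other function meanwhile.
        v = Remap(v);
    }
    else
    {
        // Same math as Remap(-v), written inline.
        v = saturate(-v * 0.5f + 0.5f);
    }

    gData[dtid.x] = v;
}
```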
I usually recommend the chapter about compute shaders from the OpenGL SuperBible. It's very good, not least because it quickly explains the building blocks of parallel programming, which are more important than those (confusing) details about HW.
I don't know a good resource for DirectX, but it's basically the same on both sides; only the terminology differs. If you can, read it.
If you describe your problem, we can suggest how to implement it.