23 minutes ago, chlerub said:
"Hmyeah i guess so. But when the API calls alone make such a huge difference... oh well"
It does not make sense to measure DirectX CPU cost that way either. The nVidia driver in particular uses heavy multithreading and has a notoriously fat, do-it-all Present call that is a black box. What you may have observed is some cost migrating between a deferred path and an immediate one. And in every real scenario I have had to observe, AMD drivers always perform worse than nVidia's with regard to their CPU usage across a frame, trust me!
Also, the speed of light of a single draw call, while interesting, usually does not matter: drivers are not optimized to send one draw call, but thousands of them with complex state changes. The difference you see could absolutely vanish in a real-case scenario because the driver can work in parallel, reorder, and so on.
As for the GPU, a cube is again one of the worst unit tests you can have. It does not provide a proper amount of work to the GPU per instance, and you can hit hidden bottlenecks such as partially empty wavefronts and bad vertex cache usage.
I recommend you focus on GPU performance. It is always possible to some extent to improve the CPU side by getting rid of redundant states and reordering draws, but your GPU frame usually quickly becomes a sum of little things that are each as fast as possible yet have to be there! And to measure the GPU, you need to use timestamp queries first. Then make sure that your driver is not throttling frequency/voltage when you run!
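If you have never set them up, the timestamp query dance in D3D11 looks roughly like this (a minimal sketch: device, context and the commented-out DrawScene() are placeholders for your own objects, and a real renderer would poll results from a few frames back instead of spinning):

```cpp
#include <d3d11.h>
#include <cstdio>

void TimeGpuWork(ID3D11Device* device, ID3D11DeviceContext* context)
{
    // The disjoint query gives the GPU clock frequency and tells us if
    // the measurement was invalidated (e.g. by a clock change mid-frame).
    D3D11_QUERY_DESC desc = {};
    ID3D11Query *disjoint = nullptr, *tsBegin = nullptr, *tsEnd = nullptr;
    desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
    device->CreateQuery(&desc, &disjoint);
    desc.Query = D3D11_QUERY_TIMESTAMP;
    device->CreateQuery(&desc, &tsBegin);
    device->CreateQuery(&desc, &tsEnd);

    context->Begin(disjoint);
    context->End(tsBegin);            // timestamp before the work
    // DrawScene(context);            // the GPU work you want to measure
    context->End(tsEnd);              // timestamp after the work
    context->End(disjoint);

    // Spinning here stalls the pipeline; fine for a test, not for production.
    D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj;
    while (context->GetData(disjoint, &dj, sizeof(dj), 0) == S_FALSE) {}
    UINT64 begin = 0, end = 0;
    while (context->GetData(tsBegin, &begin, sizeof(begin), 0) == S_FALSE) {}
    while (context->GetData(tsEnd, &end, sizeof(end), 0) == S_FALSE) {}

    if (!dj.Disjoint)  // throw away the sample if the GPU clock changed
        printf("GPU time: %.3f ms\n",
               double(end - begin) * 1000.0 / double(dj.Frequency));

    disjoint->Release(); tsBegin->Release(); tsEnd->Release();
}
```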
And for your sanity, forget about vertex buffers; they are far too rigid, even when per-instance data is not involved. Even if it costs you a little more CPU, the gain in flexibility, control and maintenance is priceless, and you should decide to afford it!
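Concretely, the usual alternative I mean is pulling vertices yourself from a StructuredBuffer indexed with SV_VertexID, with nothing bound to the input assembler. A rough sketch of the C++ side, assuming a hypothetical Vertex struct and a vertex shader written to fetch from register t0:

```cpp
#include <d3d11.h>

struct Vertex { float pos[3]; float uv[2]; };  // placeholder layout

void DrawWithoutVertexBuffers(ID3D11Device* dev, ID3D11DeviceContext* ctx,
                              const Vertex* vertices, UINT count)
{
    // A structured buffer bound as a shader resource instead of to the IA.
    D3D11_BUFFER_DESC bd = {};
    bd.ByteWidth = sizeof(Vertex) * count;
    bd.Usage = D3D11_USAGE_IMMUTABLE;
    bd.BindFlags = D3D11_BIND_SHADER_RESOURCE;
    bd.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    bd.StructureByteStride = sizeof(Vertex);
    D3D11_SUBRESOURCE_DATA init = { vertices, 0, 0 };
    ID3D11Buffer* buf = nullptr;
    dev->CreateBuffer(&bd, &init, &buf);

    ID3D11ShaderResourceView* srv = nullptr;
    dev->CreateShaderResourceView(buf, nullptr, &srv);  // view of the whole buffer

    // No input layout, no IASetVertexBuffers: the shader pulls vertices
    // itself, e.g. in HLSL:
    //   StructuredBuffer<Vertex> gVertices : register(t0);
    //   ... gVertices[vertexID] ... with SV_VertexID as input.
    ctx->IASetInputLayout(nullptr);
    ctx->VSSetShaderResources(0, 1, &srv);
    ctx->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
    ctx->Draw(count, 0);

    srv->Release();
    buf->Release();
}
```

The nice part: changing the vertex layout becomes a shader-only edit, with no input-layout objects to keep in sync.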