512*512*2 = 524288 triangles
524288 * 220fps = 115,343,360 triangles/second
= not possible on Intel integrated graphics.
No Intel integrated graphics chip supports vertex shaders or hardware transforms, so the absolute maximum you would be looking at is probably ~8 million triangles/sec.
The most you will see with absolutely bare-bones geometry (i.e. no lighting, no textures, nothing) is around 80 million/sec, and that's on an X850 or the like, using VBOs with vertex-cache optimizations, etc.
So I'd suggest your frame rate counter is broken.
Extreme performance differences
I finally found the bottleneck - VTune gave the big hint :D
It was the glDrawElements calls: I was making a whole lot of them, each sending 24k vertices to the GPU. The number of calls and the total number of vertices were both "ok" on their own, but it was simply too many vertices per single call.
So I only changed one little variable: the one telling how big one logical unit is. Initially it was 64, meaning my 512x512 heightmap was split up into several smaller 64x64 units. I know the middle point and the radius of each LU (logical unit), so I used a "sphere in frustum" function to determine whether a whole block is within the frustum. If it is, I render the whole thingy at once (64*64*6 vertices, because LUSize was 64).
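(A rough sketch of the per-block culling and draw described above; sphereInFrustum, LogicalUnit and the loop structure are illustrative assumptions, not the actual code.)

```cpp
#include <GL/gl.h>
#include <vector>

// Assumed helper, implemented elsewhere: true if the bounding sphere
// of a block intersects the current view frustum.
bool sphereInFrustum(const float center[3], float radius);

struct LogicalUnit {
    float center[3];              // middle point of the block
    float radius;                 // bounding-sphere radius
    const unsigned int* indices;  // index data for this block
};

void drawTerrain(const std::vector<LogicalUnit>& units, int LUSize)
{
    for (size_t i = 0; i < units.size(); ++i) {
        const LogicalUnit& lu = units[i];

        // Cull the whole block with one cheap bounding-sphere test.
        if (!sphereInFrustum(lu.center, lu.radius))
            continue;

        // Draw the whole block in one call. With LUSize = 64 that is
        // 64 * 64 * 6 = 24,576 indices per glDrawElements call.
        glDrawElements(GL_TRIANGLES, LUSize * LUSize * 6,
                       GL_UNSIGNED_INT, lu.indices);
    }
}
```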
All I did now was adjust the LUSize value down. First to 32, which gave a speed increase of 100% - then I knew that's the bogey ;)
After testing around a bit, I found out that the optimal value for LUSize is 8, which means each call within a frame now only sends a sequence of 8*8*6 = 384 vertices to the GPU.
On machine b I now also get 190-230 fps.
I expect this (way more performance with fewer vertices per call) to change as soon as I start using (dynamic) VBOs - I will let you know as soon as I've tried it out ;)
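(For anyone reading along: a minimal sketch of what such a dynamic VBO setup could look like, using GLEW for the buffer-object entry points. Buffer names and the per-frame update pattern are assumptions, not Hydrael's actual code.)

```cpp
#include <GL/glew.h>   // provides glGenBuffers & co.
#include <cstddef>

GLuint vbo = 0;

// Create the buffer once; GL_DYNAMIC_DRAW hints that the data
// will be updated frequently.
void createDynamicVbo(const float* vertices, size_t byteSize)
{
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, byteSize, vertices, GL_DYNAMIC_DRAW);
}

// Each frame: re-upload the changed vertices, then draw from the VBO.
void updateAndDraw(const float* vertices, size_t byteSize,
                   const unsigned int* indices, int indexCount)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferSubData(GL_ARRAY_BUFFER, 0, byteSize, vertices);

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, 0);   // vertex data comes from the bound VBO
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, indices);
    glDisableClientState(GL_VERTEX_ARRAY);
}
```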
So in the end it was just changing LUSize=64; to LUSize=8; within ~4000 lines of code :D
The only thing I still don't understand is why machine c was doing so well. It must have to do with the crappy graphics card in some way.
Anyways...thanks a lot everyone - you really helped me out here ;)
Greets
Chris
Edit:
@RipTorn: I'm not sending all the vertices at once - I do a visibility check first. Depending on viewing distance, altogether about 50k triangles were being rendered (at the standard 64.0f viewing distance) - now, after the little change, it should be less.
@markr: Sorry, I kind of overlooked your post :/
...but you were right, VSync was disabled
[Edited by - Hydrael on July 19, 2005 2:36:22 AM]
Quote: Original post by Hydrael
The only thing I still don't understand is why machine c was doing so well. It must have to do with the crappy graphics card in some way.
Probably. Those graphics cards were never designed for OpenGL or anything 3D, by the way,
so the OpenGL drivers kind of ignore stuff - that's why it might run faster, but the result is not the same.
It's possible that the partially software implementation (i.e. Intel graphics) has different bottlenecks than the hardware ones, and thus your code just happened to be playing nice with the Intel onboard chip and not the dedicated hardware. In fact, if you are rendering straight from RAM, my guess is that this is the case (Intel is optimized to always render from RAM... it emulates things like VBOs with host buffers - plus there is no need to send the data to the GPU).
However, if you send the data in well-formed batches, always using hardware buffers, I can guarantee that the GPUs will outperform the onboard solution every time, or something is very wrong ;)
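(To illustrate the kind of "well-formed batch" meant here, a hedged sketch using static, GPU-resident vertex and index buffers; all names are illustrative and the interleaving/format details are assumptions.)

```cpp
#include <GL/glew.h>
#include <cstddef>

GLuint vertexBuffer = 0, indexBuffer = 0;

// Upload the geometry once into static hardware buffers.
void uploadStaticBatch(const float* vertices, size_t vertexBytes,
                       const unsigned int* indices, size_t indexBytes)
{
    glGenBuffers(1, &vertexBuffer);
    glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer);
    glBufferData(GL_ARRAY_BUFFER, vertexBytes, vertices, GL_STATIC_DRAW);

    glGenBuffers(1, &indexBuffer);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indexBuffer);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, indexBytes, indices, GL_STATIC_DRAW);
}

// Per frame: one draw call, no vertex data sent across the bus.
void drawStaticBatch(int indexCount)
{
    glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, indexBuffer);

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, 0);
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0);
    glDisableClientState(GL_VERTEX_ARRAY);
}
```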