
Extreme performance differences

Started by Hydrael, July 17, 2005 02:13 AM
12 comments, last by AndyTX
Hello everyone,

A few days ago I tried running my project on two other computers to see if and how it runs. What I encountered was quite surprising. I tested my project on three computers:

a) Siemens notebook, 1.9 GHz (Centrino), 1 GB DDR, Radeon 9700 Mobility (used for development)
b) Pentium 4, 3.2 GHz with HT, 1 GB DDR, Radeon 9700
c) Pentium 4, 3.0 GHz, 512 MB DDR, some crappy Intel onboard graphics chip

At the moment I'm not using any extensions except for point sprites (which are supported by all PCs except c). I'm rendering a 512x512 heightmap, which makes 786,432 polys, and I'm using a custom-made frustum culling method.

Now for the FPS I get on those three computers:

a) 60-80
b) 20-30
c) 180-220 (!)

It seems that the computer from which I expected the lowest results totally owns the other two, while b, which I assumed to be the strongest machine, goes down the drain. Another thing I noticed is that my program uses 100% CPU on a and b, while c only uses 50%.

I tried debugging my project on all three computers to find out where this insane performance difference comes from, but I don't know where to start. Does anyone know any standard mistakes I could have made?

Thanks in advance

Chris

[Edited by - Hydrael on July 17, 2005 2:32:53 AM]
50% CPU usage would point to a hyper-threaded CPU.

The Pentium HT chips make Windows think there are two CPUs, and a single-threaded application can only saturate one of them (i.e. 50% of the Pentium).

But I can't explain why it would do a better job. The only thing I can think of is that the b computer was having issues (spyware, a virus, whatever). I dunno...
Black Sky - A Star Control 2/Elite-like game
I'm a total newb, so I could be totally wrong :)

Are you using compiled vertex arrays with indices? Display lists?
If not, just sending all those vertices across the AGP port every frame might be the bottleneck, whereas the good old Intel onboard chipset doesn't have this problem?

I dunno, that was just a thought.. hope ya figure it out.. I'm dying to know!
I'm using one big Vertex Buffer with indices, and access it with glDrawElements.
I can't use VBOs at the moment, because this part of the project is the world editor and has to be kept totally dynamic.
But that's a good thought, TerraX. Maybe the Intel chip really doesn't have a problem with massive I/O, whereas the newer cards want to be treated like newer cards (-> VBOs and extensions galore) ;)
Hmm...this really makes me think there actually is something wrong with b (like ViperG mentioned), while a is doing good and c has an advantage until I start using extensions.

Would that make sense?
The program runs the same code paths on all computers, except for one function that differs depending on whether point sprites can be used - and that part hardly affects performance at all.
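The point sprite check itself is just the usual extension-string scan, roughly like this (simplified sketch of what I do; the helper name is made up):

    #include <GL/gl.h>
    #include <string.h>

    // Returns true if the given name appears in the driver's extension string.
    // (Simplified: a robust version should match whole tokens, not substrings.)
    bool hasExtension(const char* name)
    {
        const char* ext = (const char*)glGetString(GL_EXTENSIONS);
        return ext != 0 && strstr(ext, name) != 0;
    }

    // ...
    bool canUsePointSprites = hasExtension("GL_ARB_point_sprite");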
Try VTune:
http://www.intel.com/cd/software/products/asmo-na/eng/219789.htm
There is an evaluation download option. I recently tried it and must say it's an excellent tool. It's also very easy to use; you can run a "quick performance test" in just 4 clicks :)
It will show you the performance of each function/method/class/...
Thanks a lot, I will give that a try
Quote: Original post by Hydrael
I can't use VBOs at the moment, because this part of the project is the world editor and has to be kept totally dynamic.

Yes you can - just use dynamic VBOs :) (i.e. the DYNAMIC_DRAW usage hints, etc.). They are still about 3-5x faster than rendering from host memory.
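Creating one is just a matter of the usage hint - a quick sketch (assumes the VBO entry points have already been fetched via wglGetProcAddress or an extension loader; the struct and names are only for illustration):

    #include <GL/gl.h>
    #include <GL/glext.h>   // ARB_vertex_buffer_object tokens and typedefs

    struct Vertex { float x, y, z; };   // illustrative layout

    // Create a VBO whose contents the editor is allowed to rewrite frequently.
    GLuint createDynamicTerrainVBO(const Vertex* vertices, int vertexCount)
    {
        GLuint vbo = 0;
        glGenBuffersARB(1, &vbo);
        glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);

        // GL_DYNAMIC_DRAW_ARB tells the driver the data changes often, but
        // rendering still pulls from card/AGP memory instead of host memory.
        glBufferDataARB(GL_ARRAY_BUFFER_ARB, vertexCount * sizeof(Vertex),
                        vertices, GL_DYNAMIC_DRAW_ARB);
        return vbo;
    }

When the terrain changes, you just overwrite the modified range with glBufferSubDataARB - no need to rebuild anything.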

Also, depending on your card, and ESPECIALLY if you're using host buffers, you have to be careful not to send TOO many vertices at once. Check out the following page, and maybe just try splitting your glDrawElements calls up into ~10,000 vertices or less just to test.
http://www.sci.utah.edu/~bavoil/opengl/batching/
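Splitting the call up is trivial to test - something along these lines (sketch; assumes GL_TRIANGLES, so the slice size is kept a multiple of 3, and that the vertex arrays are already set up by the caller):

    #include <GL/gl.h>

    // Issue one big indexed draw as several smaller ones so no single call
    // pushes a huge amount of vertex data through in one go.
    void drawInBatches(const unsigned int* indices, int indexCount)
    {
        const int kMaxIndicesPerBatch = 9999;   // < ~10,000 and a multiple of 3

        for (int first = 0; first < indexCount; first += kMaxIndicesPerBatch)
        {
            int count = indexCount - first;
            if (count > kMaxIndicesPerBatch)
                count = kMaxIndicesPerBatch;
            glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_INT, indices + first);
        }
    }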

There are a host of other things that could be causing slowdown related to stalling the graphics card and starving it of data. Try the things that I've mentioned and post back with more details if they don't work.

Good luck!
I have my world split up into smaller logical units. I always test a complete logical unit for visibility (one check for 64*64*6 vertices) and then, if the unit is visible, render all the vertices within it at once (24,576 vertices) - according to the link you posted, that amount of vertices should be OK, shouldn't it?
That's all within one single, big indexed vertex buffer.
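For reference, the render loop basically looks like this (simplified sketch; the unit and frustum names are just placeholders for my own structures):

    #include <GL/gl.h>

    struct Unit
    {
        int firstIndex;     // offset into the shared index array
        int indexCount;     // indices covering the unit's 24,576 vertices
        // ... plus a bounding volume for the visibility test ...
    };

    bool isUnitVisible(const Unit& unit);   // the custom frustum test (placeholder)

    // One shared vertex/index buffer; each logical unit owns a contiguous index range.
    void drawVisibleUnits(const Unit* units, int unitCount, const unsigned int* indices)
    {
        for (int i = 0; i < unitCount; ++i)
        {
            if (!isUnitVisible(units[i]))    // one frustum check per 64x64 unit
                continue;

            glDrawElements(GL_TRIANGLES, units[i].indexCount, GL_UNSIGNED_INT,
                           indices + units[i].firstIndex);
        }
    }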

I guess I will try and see whether splitting the buffer itself up into several smaller buffers results in a speed increase.

Since you mentioned it: Do you know any links to literature about dynamic VBOs by any chance? Sounds interesting ;)

I didn't have much time today, but tomorrow I will report whether either VTune or splitting up the buffer did anything to fix the performance leak ;)

Thanks again

Chris
I take it you've already checked that none of them is doing a vsync on swapbuffers, right?
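On Windows you can rule that out for a test by forcing the swap interval to 0 through WGL_EXT_swap_control - roughly like this (sketch; it only works where the driver actually exposes the extension, and needs a current GL context):

    #include <windows.h>
    #include <GL/gl.h>

    typedef BOOL (APIENTRY *PFNWGLSWAPINTERVALEXTPROC)(int interval);

    // Ask the driver to present immediately instead of waiting for the vertical blank.
    void disableVSync()
    {
        PFNWGLSWAPINTERVALEXTPROC wglSwapIntervalEXT =
            (PFNWGLSWAPINTERVALEXTPROC)wglGetProcAddress("wglSwapIntervalEXT");
        if (wglSwapIntervalEXT)
            wglSwapIntervalEXT(0);   // 0 = no vsync wait on SwapBuffers
    }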

Mark
Quote: Original post by Hydrael
if the unit is visible, render all the vertices within it at once (24,576 vertices) - according to the link you posted, that amount of vertices should be OK, shouldn't it?
That's all within one single, big indexed vertex buffer.

Yeah you're usually fine if it's from a hardware buffer as well. It's only really from host memory that you can get nailed by huge batches.

Quote: Original post by Hydrael
Since you mentioned it: Do you know any links to literature about dynamic VBOs by any chance? Sounds interesting ;)

If you're using OpenGL VBOs, it's as easy as changing the usage hint when you create the buffer to GL_DYNAMIC_DRAW, GL_DYNAMIC_READ, etc. The best place for info is to go to the source: the ARB_vertex_buffer_object spec:
http://oss.sgi.com/projects/ogl-sample/registry/ARB/vertex_buffer_object.txt
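In practice, a per-frame update then looks something like this (sketch; it uses the same ARB entry points as above, the names are placeholders, and re-specifying the storage with a NULL pointer first is just one common trick to avoid stalling on data the GPU is still using):

    #include <GL/gl.h>
    #include <GL/glext.h>

    // Sketch: push the editor's latest vertex data into an existing dynamic VBO.
    void updateTerrainVBO(GLuint terrainVBO, const void* updatedVertices, int bufferSize)
    {
        glBindBufferARB(GL_ARRAY_BUFFER_ARB, terrainVBO);

        // Re-specify the storage with no data first so the driver can hand back
        // fresh memory instead of waiting on whatever the GPU is still reading...
        glBufferDataARB(GL_ARRAY_BUFFER_ARB, bufferSize, 0, GL_DYNAMIC_DRAW_ARB);

        // ...then fill it with the new contents.
        glBufferSubDataARB(GL_ARRAY_BUFFER_ARB, 0, bufferSize, updatedVertices);
    }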

If you're using Direct3D, it's also just a flag when you create the buffer, although they also encourage you to use some special flags when locking and unlocking the buffer. See the DirectX docs for more info (it is well-documented).
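For reference, the D3D9 version is roughly this (sketch; error handling omitted, and the flag choices are just an example - check the docs for what fits your update pattern):

    #include <d3d9.h>
    #include <string.h>

    // Sketch: dynamic vertex buffer, refilled with a discard lock on each update.
    IDirect3DVertexBuffer9* createAndFillDynamicVB(IDirect3DDevice9* device,
                                                   const void* data, UINT size)
    {
        IDirect3DVertexBuffer9* vb = 0;
        device->CreateVertexBuffer(size, D3DUSAGE_DYNAMIC | D3DUSAGE_WRITEONLY,
                                   0, D3DPOOL_DEFAULT, &vb, 0);

        void* dst = 0;
        vb->Lock(0, size, &dst, D3DLOCK_DISCARD);   // discard old contents -> no stall
        memcpy(dst, data, size);
        vb->Unlock();
        return vb;
    }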

Quote: Original post by Hydrael
I didn't have much time today, but tomorrow I will report whether either VTune or splitting up the buffer did anything to fix the performance leak ;)

VTune is useful, no doubt, but don't put too much stock in it for graphics profiling! Since the CPU and GPU are supposed to work fairly asynchronously, and VTune profiles only the CPU, you're not seeing what the GPU is really bottlenecked on. Indeed, simple host functions that cause a CPU/GPU sync (like Flip, lock, etc.) will often show up inordinately high in the profiler when it's really just the GPU still working on something.
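One crude way to see the real cost of a section despite that is to force a CPU/GPU sync around it while measuring (sketch; the timer is coarse and glFinish itself kills the parallelism, so only do this for diagnostics, and the renderSection callback stands in for whatever you want to time):

    #include <windows.h>
    #include <GL/gl.h>

    // Measure a render section *including* the GPU work it queues: glFinish()
    // blocks until the GPU has completed everything submitted so far, so the
    // interval between the two finishes covers the actual GPU cost as well.
    DWORD timeSectionMs(void (*renderSection)())
    {
        glFinish();
        DWORD start = GetTickCount();   // coarse, but enough to spot big stalls
        renderSection();
        glFinish();
        return GetTickCount() - start;
    }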

I think NVIDIA has some tools to help with profiling GPUs, and the DirectX tool "PIX" may also be of help.

Keep us informed!

