
Are vector optimizations possible?

Started by May 20, 2024 11:52 AM
7 comments, last by JoeJ 6 months ago

We've had some reports of weird performance with our vector class (which is basically a thin binding to glm::vec3). See the following example:

vec3 v;
vec3 p(0.1f, 0.1f, 0.1f);
for (int i = 0; i < 100000; i++) {
	// Slowest: (15.0 - 15.4)
	// v += vec3(0.1f, 0.1f, 0.1f);

	// Faster: (7.9 - 8.1)
	// v.Add(0.1f, 0.1f, 0.1f);

	// Faster: (5.7 - 5.9)
	// v += p;

	// Fastest: (4.5 - 4.7)
	v.x += 0.1f;
	v.y += 0.1f;
	v.z += 0.1f;
}

I've commented the operations from slow to fast. It makes sense for the first one to be the slowest, because it has to construct a vec3 object before passing it to opAddAssign.

For testing, I've added an Add method that just takes 3 floats, which is slightly faster.

Passing an existing vec3 object to opAddAssign is a bit faster than that (my guess is because it only has to put 1 value on the stack instead of 3?)

And then finally, manually inlining the code is the fastest. This is somewhat surprising to me, because I would've thought that 1 call into native code would be faster than 3 lines of script.

Do you have any idea what is going on here? Is there anything that could potentially make this faster in our bindings?

If this is expected and intentional, would it make sense to have some kind of inlining optimization? I found a thread from 2014 about script function inlining ( https://gamedev.net/forums/topic/661308-function-method-inlining/ ), but it sounds like a difficult problem to solve.

For some extra info about our bindings, we use the following flags to register vec3: (I know we should probably be using asOBJ_APP_CLASS_ALLFLOATS but this doesn't seem to make a difference on Windows)

asOBJ_VALUE | asOBJ_POD | asOBJ_APP_CLASS_CDK | asGetTypeTraits<glm::vec3>()

And the following binding for opAddAssign: (this is using a helper class to call asIScriptEngine methods but there should be nothing surprising in the implementation of the helper class)

regVec3.Method("vec3 &opAddAssign(const vec3 &in)", asMETHODPR(glm::vec3, operator+=, (const glm::vec3&), glm::vec3&), asCALL_THISCALL);

I would have to investigate this to see what can easily be optimized. There are multiple levels to look at:

  1. Bytecode optimizations: The raw output from the compiler is quite often full of overhead. I have logic for bytecode optimizations but it doesn't catch everything, so this is likely the first place where further optimizations can be done.
  2. VM runtime optimizations: Depending on the case it may be possible to find optimizations in the VM. Perhaps even introduce a new bytecode instruction if it makes sense.
  3. Runtime optimizations in the code for the native calling conventions: It is not so surprising to me that a call to a C++ function can be slower than 3 script statements. Have you seen the amount of logic that goes into setting up the CPU registers and the C++ stack in preparation for the call?
  4. Try the generic calling convention: Depending on the function, the generic calling convention with a wrapper function may actually be faster than directly registering the native function, because it involves fewer runtime checks.
  5. Use of JIT compiler: Even BlindMind's rather old JIT compiler is capable of improving the runtime performance a few times.

The first 3 would be on me. But I suggest you look into 4 and 5.

PS. Indeed asOBJ_APP_CLASS_ALLFLOATS has no effect on Windows with MSVC. It is mostly used with gcc/clang compilers.

PPS. Correct, function inlining is definitely not a small project and not something I will take on any time soon.

AngelCode.com - game development and more - Reference DB - game developer references
AngelScript - free scripting library - BMFont - free bitmap font generator - Tower - free puzzle game


Thanks for the tips! It does appear that using asCALL_GENERIC for opAddAssign brings performance to a level closer to the inlined case, which is great. Doesn't seem to change much for the constructor binding though unfortunately.

Doing it this way makes it a little harder to write vector bindings, but perhaps I can do some macro/template magic to make this easier.

The JIT compiler is an option and does somewhat improve things, but I've had random crashes with it not cleaning stuff up correctly. I don't remember what the problem is exactly but it's been problematic for me, to say the least. (I've been using the BlueCat Audio one as well which has similar problems.) I've considered writing my own JIT compiler as I noticed there's a new v2 interface for it, but I haven't really dived into the details yet.

Miss said:

Doing it this way makes it a little harder to write vector bindings, but perhaps I can do some macro/template magic to make this easier.

The auto wrapper add-on makes it quite easy. Basically you just replace the macro used for taking the function address and change the calling convention type.
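As a sketch, and assuming the auto wrapper add-on (add_on/autowrapper/aswrappedcall.h) is included, the opAddAssign registration from the first post would change roughly like this (the `regVec3` helper is from the original post; `WRAP_MFN_PR` is the add-on's counterpart to `asMETHODPR`). This is a non-runnable registration fragment:

```cpp
// Before: direct native call
// regVec3.Method("vec3 &opAddAssign(const vec3 &in)",
//     asMETHODPR(glm::vec3, operator+=, (const glm::vec3&), glm::vec3&),
//     asCALL_THISCALL);

// After: auto-generated generic wrapper
regVec3.Method("vec3 &opAddAssign(const vec3 &in)",
    WRAP_MFN_PR(glm::vec3, operator+=, (const glm::vec3&), glm::vec3&),
    asCALL_GENERIC);
```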

Miss said:

The JIT compiler is an option and does somewhat improve things, but I've had random crashes with it not cleaning stuff up correctly. I don't remember what the problem is exactly but it's been problematic for me, to say the least. (I've been using the BlueCat Audio one as well which has similar problems.) I've considered writing my own JIT compiler as I noticed there's a new v2 interface for it, but I haven't really dived into the details yet.

Yes, the new JIT compiler interface version will allow better global optimizations in the JIT compiled functions, as the JIT compiler will be able to work on all script functions at once, rather than just the single function fed to it by AngelScript as the script function is compiled.

I know of a new JIT compiler that is in the works using this interface and will use LLVM for producing the native machine code with optimizations. Unfortunately the author of that JIT compiler is doing it for a private company and he may not be able to open source it. However, he will try to get the permission to do so.


That would be really great. Otherwise, I'd love to look into how much work it would be to implement myself. I've experimented with the new API for a little bit but haven't been able to get my JIT functions to actually get called yet. Not sure if I'm missing something there or if the new API is incomplete?

Miss said:
We've had some reports of weird performance with our vector class (which is basically a thin binding to glm::vec3).

Some years ago I tried glm for a fluid simulator.
Later I replaced it with the library I usually use (Sony VectorMath, which is long gone).

Turned out the simulator was 10 times slower using glm. Yes, ten times.

I could not figure out the reason, but it seems related to constructors, which showed up at the top of the profiler report. However, it's not because of SIMD: Sony's lib has scalar and SIMD implementations, and using SIMD gave only a 25% improvement.
The issue applied to both integer and floating point vectors, and there was lots of Mat3x3 usage as well.

Unfortunately I only tried with MSVC, not with Clang.
But in general, as nice as it is, I can not recommend glm for game dev. :(


That's interesting, I don't know why glm would be 10 times slower either. I'd be curious to see an actual performance comparison between multiple different vector math libraries.

For what it's worth, my Add() custom test binding implementation doesn't invoke any glm code at all, it just does v.x += x; etc. (which may be slower, given that's not using intrinsics)

Miss said:
I'd be curious to see an actual performance comparison between multiple different vector math libraries.

Yes, would be nice.

But you can do some tests to see for yourself.

Sony's library was open sourced with the Bullet Physics Engine, but afaict it's no longer part of it.
Somebody did some maintenance here: https://github.com/glampert/vectormath

This topic is closed to new replies.
