
C++ Operator Overloading, and why it blows chunks

Started by May 29, 2000 09:39 AM
60 comments, last by MadKeithV 24 years, 6 months ago
Heyyyy, you'll think, Mad Keith posting a flame about C++? That's right. Until recently the most adamant supporter of C++, I've reached other conclusions.

The problem? Something we've probably all worked on at some point or other: Vector and Matrix functions. It started like this: I downloaded a class-template library for vector/matrix math (among other things). It seemed so damn useful at the time, and it was so easy to program using it.

Vector1 = Vector2 + Size*Vector3;

Looks wonderful, doesn't it? Well, it may LOOK wonderful, but it's slower than a dog. The problem is the copy constructor you'll inevitably need for these operators to work. Inlining helps a bit. Reorganising your code helps a bit. Looking from afar it actually seems efficient enough, within a few percent of using non-operator style functions. Well, it isn't. After a full weekend of optimising my class code, trying out every compiler optimisation and every class trick, and even buying a BOOK on efficient C++, my "typedefs and macros" version takes about (not kidding) 50% of the time of the C++ version. *sigh*

Just thought I'd warn you all before you waste a perfectly good weekend (birthday weekend in my case) trying to figure out why that C++ code just doesn't seem as fast as it should be.

#pragma DWIM // Do What I Mean!
~ Mad Keith ~
It's only funny 'till someone gets hurt. And then it's just hilarious. Unless it's you.
Yep, there are a few speed gotchas in C++. But I'd still rather use it over C. Besides, it's still faster than most other languages out there.

--TheGoop

Edited by - TheGoop on May 29, 2000 12:19:08 PM
What compiler were you using? If you were using MSVC, make sure that you compiled in release mode for your timing tests. It does not inline code when compiling in debug mode, and I've seen speed hits of as much as 400%!

Also, make sure that your C++ is "const correct" and uses references instead of passing parameters by value.

Using the postfix ++ operator is slower with user-defined (class) types because you have to make a copy. You CAN write code like your "typedefs and macros" version in C++ and still have it be a) fast, and b) typesafe -- it just takes a little more knowledge of the language.
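For example (just a sketch with made-up names, assuming a simple vector struct along the lines of what's being discussed): pass the result by reference, so no temporary is constructed and no copy constructor runs, while everything is still type-checked.

struct CVector { float m_x, m_y, m_z; };

inline void VecAdd( CVector& result, const CVector& a, const CVector& b )
{
    result.m_x = a.m_x + b.m_x;
    result.m_y = a.m_y + b.m_y;
    result.m_z = a.m_z + b.m_z;
}

// usage:  VecAdd( v1, v2, v3 );   // instead of  v1 = v2 + v3;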

Good luck (whether you decide to give C++ another try, or not ;-)

-- Pryankster (no better preacher than the converted)
-- Pryankster (check out my game, a work in progress: DigiBot)
If somebody can prove me wrong on this, I'll be terribly happy...

But I tried very weird tricks, used every compiler optimisation, and CERTAINLY used Release Mode ( yep, VC++ ).

Actually I'm compiling as C++ in both cases; I just don't use a single class in the fastest case.
My operators all look like

inline CVector operator+( const CVector& v ) const
{
    return CVector( m_x + v.m_x, m_y + v.m_y, m_z + v.m_z );
}

with that constructor inlined too, and without a body but using only the initialiser list.
That brings the speed difference down to about 60% - still a very significant hit, without apparent cause. The profiler output makes me think that it HAS indeed inlined everything; I can only assume that even the inlined constructor code is too slow for a function of this size.

What I did find, is that when I wrote an inline member function (not an operator) to perform the sequence of operators that I used most often, the difference dropped to about 30%.
The problem I had with that was that I was throwing away the nice, readable syntax of operators, in favor of something not much better than using C-style functions or just plain macros.
So I stuck to macros!
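To make the trade-off concrete, here's roughly what the two alternatives look like (hypothetical names; the actual sequence of operations I use isn't shown here):

struct CVector
{
    float m_x, m_y, m_z;

    // inline member function: still type-safe, but the operator syntax is gone
    void AddScaled( const CVector& v, float s )
    {
        m_x += s * v.m_x;
        m_y += s * v.m_y;
        m_z += s * v.m_z;
    }
};

// parameterised macro: what I ended up using instead
#define VEC_ADD_SCALED( dst, v, s )      \
    ( (dst).m_x += (s) * (v).m_x,        \
      (dst).m_y += (s) * (v).m_y,        \
      (dst).m_z += (s) * (v).m_z )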

However - my memory manager (something that really showed up as the culprit in the timings, I must say) will be a class. Since it will contain mostly static members, the speed loss will not be very significant compared to C code, and it will be MUCH more readable.




#pragma DWIM // Do What I Mean!
~ Mad Keith ~
It's only funny 'till someone gets hurt. And then it's just hilarious. Unless it's you.
What level have I sunk to, replying to my own posts

I wanted to add the following:
I am not saying that C++ is evil, in all circumstances. Just in some. Especially when you are doing short functions, on small amounts of data, the C++ overhead can be prohibitive.
For instance, the code I've been referring to in this post will be used to generate interpolated vertices (see also "complex polygon pipelines"). Using the NEW version, I can easily go up to 256*256 vertices per polygon without a significant hit, something I thought was impossible when using the old C++ code.
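Just to illustrate the kind of thing I mean (hypothetical helper macro, not the actual pipeline code):

typedef float vec_3f[3];

// dst = a + t * ( b - a )
#define VEC_LERP( dst, a, b, t )                        \
    ( (dst)[0] = (a)[0] + (t) * ( (b)[0] - (a)[0] ),    \
      (dst)[1] = (a)[1] + (t) * ( (b)[1] - (a)[1] ),    \
      (dst)[2] = (a)[2] + (t) * ( (b)[2] - (a)[2] ) )

// interpolating a 256*256 grid of vertices this way is just float arithmetic
// on plain arrays - no constructors, no temporaries, no copies.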

For the framework, and the large, not-so-often used data entities, I''ll still be using classes. They are too useful in non-performance-critical parts of the system.



#pragma DWIM // Do What I Mean!
~ Mad Keith ~
It's only funny 'till someone gets hurt. And then it's just hilarious. Unless it's you.
There is a simple reason for why the code that uses operator overloading is slow. Consider what happens, step by step, with an expression such as a = b + c.

If a,b,c are integers, the machine level instructions will work out to be roughly something like:

move b into processor register
add c to register
move register value to a

That's only three simple instructions.

Now consider a simple wrapper class around an integer, CInt, with an overloaded + operator something like:

return CInt(m_iVal + arg.m_iVal);
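To make the example concrete, a minimal sketch of such a wrapper (hypothetical, just for illustration) could be:

class CInt
{
public:
    CInt( int iVal ) : m_iVal( iVal ) {}

    // the overloaded + constructs and returns a brand new CInt
    CInt operator+( const CInt& arg ) const
    {
        return CInt( m_iVal + arg.m_iVal );
    }

private:
    int m_iVal;
};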

The instructions for the expression a = b + c with a,b,c as CInts now would (roughly) look something like:

move b.m_iVal into processor register
add c.m_iVal to register
push register value onto stack
reserve space on stack for temporary CInt (CInt(...))
call CInt constructor (with b+c as parameter)
-- The constructor will at least assign the parameter to
the internal value.
call = operator for "a" (with temp CInt as argument).
-- The = operator will contain at least one assignment
instruction.

As you can see, this is at least eight instructions. The exact number will vary based on compiler optimizations, processor, etc., but you get the idea. Also, constructors that do more complex things (such as memory allocation) will just add more time.

Personally, I prefer the operator overload method even though it's slower, unless it's in performance-critical code (such as the inner loop of a texture mapper). In the case of performance-critical code, I wouldn't use any kind of macros or functions anyway, because I like to hand-optimize heavily in those cases.

If you do a lot of performance-critical mathematical coding, you might want to do a search on the internet for C++ Template Metaprogramming. I heard about this recently but haven't had a chance to fully look into it yet. There's a site that discusses this in depth at http://oonumerics.org/blitz/ -- check out the Papers section for introductions to the technique.
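To give a rough idea of the trick (a heavily simplified sketch, nothing like Blitz++'s actual source): operator+ doesn't compute anything, it just returns a tiny "expression" object, and the real loop only runs when that expression is assigned to a vector - so no temporary vectors are ever constructed.

struct Vec3
{
    float v[3];

    float operator[]( int i ) const { return v[i]; }

    // evaluate any "expression" element by element -- one loop, no temporaries
    template <class Expr>
    Vec3& operator=( const Expr& e )
    {
        for ( int i = 0; i < 3; ++i )
            v[i] = e[i];
        return *this;
    }
};

// lightweight object that just remembers "left + right"
template <class L, class R>
struct AddExpr
{
    const L& l;
    const R& r;
    AddExpr( const L& l_, const R& r_ ) : l( l_ ), r( r_ ) {}
    float operator[]( int i ) const { return l[i] + r[i]; }
};

inline AddExpr<Vec3, Vec3> operator+( const Vec3& a, const Vec3& b )
{
    return AddExpr<Vec3, Vec3>( a, b );
}

// usage:  Vec3 a, b, c;  ...  a = b + c;   // evaluated in a single loop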


Some good comments so far, both in this thread, and by email.

Anonymous poster - your breakdown of the assembly-level code is pretty correct, except that my compiler (Visual C++) does what's called a "return value optimisation".

In the case of a = b + c, it's smart enough to not store b + c in a temporary, but work straight on a. It's a useful optimisation because this situation is going to occur a LOT of the time. Yet, the constructor is still called on a, as far as I can tell; I cannot explain away the difference otherwise.

I had another valid comment.
If you use a function, rather than an overloaded operator, it becomes a lot faster.
I noticed this myself ( I think I mentioned it in one of my earlier messages ).
However, then I'd much rather use C-style functions or macros, because to get the most performance out of my C++ class, I have to use initialiser lists.
In my case, I''m working with vectors, and my preferred definition is something like
typedef float vec_3f[3]; 

I can't do this in my class, because initialiser lists do not work with arrays. Using the C-style functions and the typedef, I can index into my global allocation array for vectors/vertices, another performance optimisation at a higher level.
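Roughly, the setup looks like this (names made up for illustration):

// using the vec_3f typedef above
#define MAX_VERTICES  ( 256 * 256 )
static vec_3f g_VertexPool[ MAX_VERTICES ];   // one big allocation, indexed directly

inline void VecAdd( vec_3f dst, const vec_3f a, const vec_3f b )
{
    dst[0] = a[0] + b[0];
    dst[1] = a[1] + b[1];
    dst[2] = a[2] + b[2];
}

// usage:  VecAdd( g_VertexPool[i], g_VertexPool[j], g_VertexPool[k] );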

So in the end, for my Vector and Vertex functions, I'm now sticking with typedefs and simple structs, with inline functions and macros with parameters.
As soon as I get OUT of the performance-critical inner loop, I'm going back to using classes, as it's still a lot simpler and more intuitive than the strange jumble of code to make the inner loop work fast.


#pragma DWIM // Do What I Mean!
~ Mad Keith ~
It's only funny 'till someone gets hurt. And then it's just hilarious. Unless it's you.
Hi,

Have you tried the op=() operators like

+= *= -= /= etc... and writing a = b; a += c; instead of a = b + c?

In simple cases (like a = b + c) the return-value optimization "should" take care of this - however, your results frighten me.
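For what it's worth, a minimal sketch of what I mean (hypothetical class, just to illustrate): operator+= modifies its left-hand side in place, so the addition itself needs no temporary object and no copy constructor.

struct CVector
{
    float m_x, m_y, m_z;

    CVector& operator+=( const CVector& v )
    {
        m_x += v.m_x;
        m_y += v.m_y;
        m_z += v.m_z;
        return *this;
    }
};

// usage:  a = b;  a += c;   // instead of  a = b + c;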

Another topic... you mentioned the "damned problem" that it is impossible to initialize arrays in the initializer list of a constructor. You could try the following:

#include <cassert>

class Vector4F32
{
    float x, y, z, w;

public:
    Vector4F32() : x(0), y(0), z(0), w(0) {}

    float* castToArray()             { return &x; }
    const float* castToArray() const { return &x; }

    float& operator[]( unsigned int index_ )
    {
        assert( index_ < 4 );
        return *( &x + index_ );
    }

    // ...and the same again as a const operator
};


If you dislike the method above, you could also define a union with an array and x, y, z, w inside - however, this seems overly complicated to me.
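For completeness, the union version would look roughly like this (sketch only; note that an anonymous struct inside a union is a common compiler extension rather than strict standard C++, which is part of why it feels over-complicated):

class Vector4F32Union
{
    union
    {
        struct { float x, y, z, w; };   // x, y, z, w overlap the array below
        float components[4];
    };

public:
    Vector4F32Union()
    {
        components[0] = components[1] = components[2] = components[3] = 0.0f;
    }

    float& operator[]( unsigned int index_ ) { return components[ index_ ]; }
};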

Hm, Meyers writes in "Effective C++" that, for built-in types, skipping the initializer list and setting the values later on in the constructor body wouldn't hamper performance. I am unsure if this is right or not, but it shouldn't take too long to write a little test case - oh, it looks like you have done the "write a little test case" thing excessively this weekend already - oh, and happy belated birthday :-)


Bjoern
Thanks for the happy birthday

I made a little test setup to make sure I was correct about the Initialiser-List optimisation; here are the results:

The classes used to test:

class InitialiserList
{
public:
    inline InitialiserList( float x, float y ) : m_x(x), m_y(y) {};
    float m_x, m_y;
};

class Ctor
{
public:
    inline Ctor( float x, float y ) { m_x = x; m_y = y; };
    float m_x, m_y;
};

Results (50,000 calls to new, with semi-random data each):
Initialiser list: 72k ticks
Standard constructor: 69k ticks


So it looks as if you're right, Bjoern - initialiser lists are not necessarily faster! In fact, in this case they were almost consistently slower (multiple runs show both methods to be very, very close, and I'm too lazy to look at the disassembly).

Now, this makes using an array as a member more viable once more, so I made another quick test setup:
class InitialiserList
{
public:
    inline InitialiserList( float x, float y, float z ) : m_x(x), m_y(y), m_z(z) {};
    float m_x, m_y, m_z;
};

class Ctor
{
public:
    Ctor( float x, float y, float z ) { Components[0] = x; Components[1] = y; Components[2] = z; };
    float Components[3];
};

In this case:
Initialiser list: 72k ticks
Constructor/Array: 75k ticks


A slight performance hit because of the array indexing I'd assume, but not that big - I'm going to work from here, and see if I can speed it up any more.

#pragma DWIM // Do What I Mean!
~ Mad Keith ~

Edited by - MadKeithV on May 30, 2000 5:11:21 AM
It's only funny 'till someone gets hurt. And then it's just hilarious. Unless it's you.
Let's go over the code step-by-step. Here's the statement, right?


int a, b, c; // random data
a = b + c;



This tells the compiler EXPLICITLY to add the values of b and c together, creating a temporary value d, and to assign the value of d to a. The compiler will optimize it when dealing with intrinsic types, but not with user-defined types.


quote: Original post by MadKeithV

In the case of a = b + c, it's smart enough to not store b + c in a temporary, but work straight on a. It's a useful optimisation because this situation is going to occur a LOT of the time. Yet, the constructor is still called on a, as far as I can tell; I cannot explain away the difference otherwise.


That is wrong. It can't be "smart enough" to work straight on a, because it requires a temporary object. An int can fit into a register, but a vector? Even with everything inline, it still requires a temporary object. First, the compiler is calling YOUR code for operator=(), which means that the compiler MUST create a temporary object to pass as the second parameter. Your operator+ definition also mandates that the constructor be called to generate a copy and return it. These two temporary objects may be optimized to be the same temporary object, but there will always be a temporary object. Even if you didn't provide an operator=(), the compiler would generate one for you, and a temporary object would still need to be constructed.

You will have to use member functions if you wish to tell the compiler that you do not want a temporary object. Or you could create a static vector in the vector class and use it as a cache, returning a reference to it from your operators. This would eliminate the constructor calls but put an extra pointer on the stack (the cost of which could probably be optimized away). Note that this is NOT in accordance with the ANSI standard, which dictates that a temporary object must remain valid for the extended expression (basically until it hits a ";"). I don't see where it would cause a problem though.


class vector
{
public:
    // default constructor is needed so the static cache below can be defined
    inline vector() : x(0), y(0), z(0) {}

    inline vector(const vector& other_vector)
        : x(other_vector.x),
          y(other_vector.y),
          z(other_vector.z) {}

    // returns a reference to the shared static cache instead of a temporary
    friend inline vector& operator+(const vector& this_vector, const vector& other_vector)
    {
        cache_vector.x = (this_vector.x + other_vector.x);
        cache_vector.y = (this_vector.y + other_vector.y);
        cache_vector.z = (this_vector.z + other_vector.z);

        return( cache_vector );
    }

    inline vector& operator=(const vector& other_vector)
    {
        x = other_vector.x;
        y = other_vector.y;
        z = other_vector.z;

        return( *this );
    }

protected:
    float x, y, z;

    // cache
    static vector cache_vector;
};

// the static cache has to be defined exactly once, in a .cpp file
vector vector::cache_vector;



Might want to run that through a profiler and some testing to make sure it's faster (probably not).



- null_pointer
Sabre Multimedia

