Yet another performance comparison (AS vs Small)

Started by ziomau November 20, 2014 03:38 PM

22 comments, last by audioboy77 10 years, 1 month ago

ziomau

Author

115

November 20, 2014 03:38 PM

Hi all,

I know that this matter has been covered several times, but all the previous posts show clear interest in it. So, having done my own performance comparison, I am submitting it to the community.

I understand that the scope of this comparison is very specific, but that comes from a particular need I have.

I currently build automation machines with a real time control done with a PC. Machine personalization is done with a series of programs currently written in Small.

These programs are normally compiled into byte code, and the byte code is then injected into the real time environment where a virtual machine does the actual execution. The virtual machine exposes several system function calls to the script so that it can operate on the physical machine hardware (reading/setting inputs and outputs, moving axes, setting variables, etc.).

Due to the limitations of the Small language (lack of structures, no typing, no doubles, ...) I am investigating the possibility to switch to AngelScript.

Having solved most of the interface problems, now I have an AngelScript compiler and a real time virtual machine for execution running, so I am able to make some performance comparisons.

I started with a very simple script.

Here is the AS version:


int TestNum;

void main(void)
{
  int count = 0;
  for(int col = 0; col < 10000; col++)
  {
    count++;
    ExtVar = count;     // Line A: Makes one call to external environment
    TestNum = count;    // Line B: Only set a script variable
  }
}

Here is the Small version:


new TestNum;

main()
{
  new count = 0;
  for(new col = 0; col < 10000; col++)
  {
    count++;
    Set_ExtVar(count);   // Line A: Makes one call to external environment
    TestNum = count;     // Line B: Only set a script variable
  }
}

As you can see they are very similar: each is basically a single loop that increments one local variable and then performs an assignment.

Lines A and B in the code are alternatives. Only one of them was present during each evaluation.

These are the results in ms:


           AS         Small
Line A    4.37         1.85
Line B    3.05         1.07

Looks like AS is two to three times slower than Small.

A few final notes:

The test was done on the same machine, with similar load and the same real time environment.
I didn't investigate the JIT, because I am not sure it can even work in the real time environment.
The external call is the same in both tests and executes the same code, but it adds 0.8 ms in Small and 1.3 ms in AS. Thus it is possible to conclude that the external function call is at least 0.5 ms slower over the whole loop.
The jitter between test runs is on the order of 0.05 in all cases.
The slight syntax difference between the two sources in line A is due to the fact that Small does not allow the definition of an external opAssign method, so an explicit function call is made. I believe that this results in very similar byte code anyway.
I am under the impression that the AS virtual machine does some allocation/deallocation during the test run, even if everything looks static (or at least allocated at the beginning). I say this because if I let the test run repeatedly for one hour, then when I shut down the real time system the call to engine->Release takes as much as 5 minutes to complete. I am not able at the moment to tell exactly why.

Should you need further information or have me make further tests, just ask.

Regards.

Mau.

PS: Unfortunately these results force us to stay with the current solution, but I will keep an eye on further AS development, especially in the performance area.

WitchLord

4,864

November 20, 2014 05:27 PM

Thanks for letting me know about the performance comparisons. I'll definitely look into this and see what can be done to improve the performance in AngelScript.

Did you already follow the instructions in the manual for getting the most performance out of AS?

Also, in some cases the generic calling convention can prove to be faster than the native calling convention (due to fewer dynamic decisions to set up CPU registers, etc). So it may be worth it to implement the external opAssign method with the generic calling convention and see if that helps the performance.

I would also be very interested in understanding what it is that makes the engine take 5 minutes to complete on your machine. That doesn't sound normal at all, and may very well be related to some bottleneck in the execution itself. If you can help me figure out what exactly the engine is doing during these 5 minutes it would be very helpful.

AngelCode.com - game development and more - Reference DB - game developer references
AngelScript - free scripting library - BMFont - free bitmap font generator - Tower - free puzzle game

iraxef

451

November 20, 2014 07:24 PM

I'm assuming you're running a decent compiler-optimization level and that it's the same settings across both tests?

For example, if it's a fully-debug build with optimizations disabled.. that's not a very good test.

theoutfield

291

November 24, 2014 04:05 AM

I use AngelScript as part of a real time control system as well. I've never heard of Small, but I can say that AngelScript has been very stable and works well in a real time environment. I have a simulator for the application on CodeProject that you can download (http://www.codeproject.com/Articles/755009/Open-Dyno-Real-Time-Control-System); maybe you can see how it compares to your current setup. I run an entire test cell (large machine) with a code execution time of about 0.2-0.4 ms. It's easier to measure these values on Linux than Windows though. I'm not sure what the support is like for Small. Andreas has been an excellent help over the years and continues to add new features to the language.

ziomau

Author

115

November 24, 2014 10:07 AM

Hi all,

thanks for all the suggestions. I just want to add a few clarifications.

I wasn't aware of the fine tuning page, but having a look at it now, I can tell that most of the suggestions were already included:

My functions do not use anything inside the virtual machine, so there is no need for caching. I already have threads and the GC disabled, but I did notice the following issue.

I do need to have the line callback, for debugging purposes. The minimum I need is a call to get the line number and module executing, i.e. a call to ctx->GetLineNumber.

But I noticed that during my evaluations I had also included calls to ctx->GetCallStackSize and ctx->GetFunction that were not really needed, so I ran the same test again, first with an empty callback and then with a single call to GetLineNumber.

Here are the results:


           Empty     GetLineNumber
Line A      2.5         4.3
Line B      0.9         3.0

With the GetLineNumber call alone I got results similar to the previous tests (meaning that GetCallStackSize and GetFunction are probably not particularly expensive), but there is a significant change when the GetLineNumber call is removed too. I do not know what this call does internally, but it looks strange to me that it is so expensive. The current line and module should be something at hand in the virtual machine. Probably the fact that it returns a string too causes a linear search and a string copy/allocation, which is probably not needed here. It may be worth investigating.

Just a small suggestion. I don't know how applicable it is here, but I can describe how Small handles this issue.

Small, after loading a script, requires a special scan of the script itself, during which the virtual machine patches the script with Ids in place of strings, function calls and other things. During this scan the virtual machine also calls the line callback function, passing the corresponding Id for each thing substituted, so the application can build a dictionary or a table. During real execution only Ids have to be manipulated. The advantage of this approach is that the real time environment can work with integers only, and a time consuming dictionary access and string manipulation is performed only when there is a need to show something to the user (not in real time, obviously). Hope this helps.
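
The Id substitution described above can be sketched as a simple string-interning table. This is only an illustration of the idea, not Small's actual implementation; the class and method names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical symbol table: filled once during the load-time scan,
// then only read while the script is running.
class SymbolTable {
public:
    // Load-time only: returns the existing Id for a name,
    // or assigns the next free one.
    std::uint32_t intern(const std::string& name)
    {
        auto it = ids_.find(name);
        if (it != ids_.end())
            return it->second;
        const auto id = static_cast<std::uint32_t>(names_.size());
        names_.push_back(name);
        ids_.emplace(name, id);
        return id;
    }

    // Non-real-time path (for example the editor): turn an Id back into text.
    const std::string& name(std::uint32_t id) const { return names_.at(id); }

    std::size_t size() const { return names_.size(); }

private:
    std::unordered_map<std::string, std::uint32_t> ids_;
    std::vector<std::string> names_;
};
```

The real time side would then pass around only the integer Ids, and the lookup back to a string would happen outside the time-critical path.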

Continuing with your comments (somewhat in order):

"Release call"

I do not know what causes the 5 minute delay in the Release call. I cannot debug it since it is running in the real time portion of the code. I can investigate further by placing some printouts. I will be back with details, but I noticed that this behavior looks incremental. I mean, if I run the test for a few seconds, then the Release call returns immediately. If I run the test for 10 minutes or more, then the Release call takes more and more time.

I suspect memory fragmentation or, worse, a memory leak. In the real time environment memory management is done differently in order to guarantee predictable allocation times. This means that some allocation/deallocation made during execution that may go unnoticed in a standard environment may cause problems in real time. I will run some more tests in this area.

"Compiler settings"

I am using the same compiler with the same settings, and the whole application has only the minimum difference in the code base needed to run the two different scripts. I am not trying to denigrate AS here. I am really making my best effort to be unbiased, and I am really willing to switch to AS, so I would really like to see good numbers there ... I may have made mistakes, of course (like the line callback above), but I want to state here that I want to be as fair as I can.

"Performance measurement"

I am measuring the performance using the CPU clock (rdtsc processor instruction), so the results are precise to the nanosecond, and the accuracy is that of a typical crystal (~100 ppm). And yes, speed step and other clock optimizations/variations are switched off. Moreover, the point of measurement is the same in both cases. I am quite confident in the measuring method, also because it is the same timing method we have been using for years to time internal events in the real time environment. Additionally, this method is available in all operating systems, since it is actually a processor instruction. A small fragment of inline asm allows it to be used in C/C++.

"Support"

I had some support in the past from the Small community, but they have since turned in a different direction, more oriented to application scripting, making the language less suitable for real time execution. So I froze my sources at some point (before the switch to the new Pawn naming) and evolved from there with my own code.

On the opposite I appreciate very much the support offered by Andreas at first and by the whole community here.

Good job!

Thanks all,

Mau.

ziomau

Author

115

November 24, 2014 12:11 PM

I made a quick test placing some time reference markers inside the engine destructor in order to track down the reason it takes so long to complete in real time.

I placed a time marker between every major loop or instruction group inside the destructor code.

Here are the results (in ms) together with a small description to identify the code section.

Follow the ~asCScriptEngine source code for further details.


    1	Calling  asDELETE(const_cast<asCScriptEngine*>(this), asCScriptEngine);
    0	Calling ~asCScriptEngine
    0	Call to SetContextCallbacks
 1178	Loop on scriptModules[n]->Discard()
11997	Call to GarbageCollect
    0	if and call to defaultArrayObjectType->Release()
   13	Loop on templateInstanceTypes
  332	Call to GarbageCollect, FreeUnusedGlobalProperties, ClearUnusedTypes
    0	Loop on scriptTypes
    0	Call to GarbageCollect, FreeUnusedGlobalProperties, ClearUnusedTypes
    0	Loop on scriptFunctions
    0	Call to GarbageCollect, ClearUnusedTypes
    0	Test on refCount
    0	Test on ReportAndReleaseUndestroyedObjects
    4	Loop on mapTypeIdToDataType
  305	Call to RemoveConfiguration and loop on configGroups
  117	Iteration on registeredGlobalProps
    0	Loop on templateInstanceTypes
    0	Loop on allRegisteredTypes and on templateSubTypes
    3	Loop on registeredGlobalFuncs and 4 calls to ReleaseAllFunctions
    0	Loop on funcDefs
    2	Loop on stringConstants
    1	Loop on scriptSectionNames
    0	Loop on userData
    0	Loop on nameSpaces
   81	Call to Unprepare and return to main

The largest share of the time (12 seconds!) goes to the first GarbageCollect call, even though I should have disabled it and I have no objects that expose any GC behavior.

I believe I have not enough knowledge to understand this matter or to delve deeper inside the code, but if someone needs further investigation or some specific test, I am available.

Remember only that I cannot place breakpoints or examine variables in real time.

I can only put time markers, printouts, or counters to be dumped at the end or every now and then during operation.

Regards.

Mau.

WitchLord

4,864

November 24, 2014 07:35 PM

This is valuable information. Thanks a lot for sharing.

I've not really designed the LineCallback for high performance. I expected it to be used only when debugging, not when running the application at full speed. Do you really need to use it at all times?

The garbage collection that happens when shutting down the engine is to clear up the internal objects, for example potential circular references between two script class declarations, etc. It is not expected that this processing time increases the longer you execute the application, unless the scripts themselves create objects that require garbage collection (which is not the case with your test script). Are you perhaps building the script over and over again? Are you requesting new modules more than one time in your application during the execution? Can you provide the same output again after letting the application run for a longer period? That way hopefully we'll understand what part of the code is taking up to 5 minutes to execute, and thus is related to the cause of the slowdown. Also, it may be helpful to print the information that asIScriptEngine::GetGCStatistics() provides.

Perhaps you can share the source code for your test application that uses AngelScript, so I can take a look at how you've implemented it? If you don't want to share it publicly you can send it to my e-mail (andreas@angelcode.com).

Based on your feedback, I've already begun looking at what can be improved in the library. On my computer I had to increase the number of iterations to 10 million to get close to 1 second in the execution time. I've identified a few things that add unnecessary overhead with each call to an application registered function (the opAssign method in your case). I'll be working on reducing this for the next release.

The hardware you have is really slow (about 1000 times slower than a standard desktop computer), and I can understand that the garbage collection at the shutdown takes a long time on your machine. I'll definitely need to look into reducing the amount of work needed by the GC (if possible avoid it altogether).

By the way, at what moment do you need to shut down the script engine? If it is when you're also shutting down the application, then perhaps you can skip releasing the engine altogether, since the memory will be automatically reclaimed by the system when the application shuts down anyway.

ziomau

Author

115

November 27, 2014 09:00 AM

Hi, a few more comments before I run more tests.

"I've not really designed the LineCallback for high performance"

Actually, most of the time my LineCallback does nothing. But sometimes, during machine setup, we need to show what the real time code is doing.

Since I cannot stop the execution, what I do is collect the lines that are executed at each code scan, to show them highlighted in the editor. That's why I need to collect the line number and the function executing, nothing more. Sometimes I need to inspect the value of some global variable, but I believe this can be done by asking the engine or the execution context, and I can do that outside the time-critical path.
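
As a rough sketch of the "collect the executed lines" idea described above, a fixed-size bitset can record which lines were hit without allocating inside the callback. The type name, the 4096-line limit and the method names are assumptions for illustration, and the sketch does no synchronization between the real time side and the editor side.

```cpp
#include <bitset>
#include <cstddef>

// Assumed upper bound on the number of source lines in one script.
constexpr std::size_t kMaxLines = 4096;

// Hypothetical recorder of the lines executed during one code scan.
struct LineTrace {
    std::bitset<kMaxLines> hit;

    // Meant to be called from the line callback:
    // constant time, no allocation, out-of-range lines are ignored.
    void mark(int line)
    {
        if (line >= 0 && static_cast<std::size_t>(line) < kMaxLines)
            hit.set(static_cast<std::size_t>(line));
    }

    // Meant to be called between scans by the display side:
    // returns the lines seen so far and starts a fresh scan.
    std::bitset<kMaxLines> takeAndClear()
    {
        const std::bitset<kMaxLines> copy = hit;
        hit.reset();
        return copy;
    }
};
```

The editor would then highlight every line whose bit is set in the returned copy.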

What still looks strange to me, as I said, is that the information I need should be at hand in the execution context, so why does it take so long?

"The garbage collection that happens when shutting down ..."

I will try to make a printout with GetGCStatistics(), and submit it here. I will try to run for longer time too.

"Perhaps you can share the source code ..."

The source will not even compile without the real time extension. I will try to send to you some fragment to show how I initialize and manage the virtual machine, in order to have you check if I made something very wrong.

"I had to increase the number of iterations to 10 million ..."

Probably you missed the fact that my timings were in ms. You increased the loop count 1000 times to get into the range of seconds. That means that my PC is running at the same speed as yours, or at least within the same order of magnitude.

As I described, I measure time using the processor clock. As long as the processor clock does not change, the timing is very precise. Some green technologies allow the processor to reduce its own speed under low load, but this can normally be disabled with a BIOS switch. In our case, since we run in real time, we need the fastest response even when the processor has nothing else to do, so we normally disable all "green" settings in the BIOS.

The TSC is an internal 64-bit counter, driven by the processor clock, that counts up from zero when the processor is powered up. It can be read in C/C++ with this simple code fragment (MSVC):


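// 32-bit MSVC only: inline assembly is not available in x64 builds.
// rdtsc leaves the counter in EDX:EAX, which is also where a 64-bit
// return value is expected, so the function can return immediately.
inline __declspec(naked) __int64 RdPTime(void)
{
  _asm {
    rdtsc
    ret
  }
}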

On Linux (GCC) I ran some tests some time ago. I cannot guarantee it still works, but here is the code:


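#include <stdint.h>
#include <x86intrin.h>   // __rdtsc (GCC/Clang, x86 only)

inline int64_t RdPTime(void)
{
  // rdtscll is a Linux kernel macro and is not available in user space;
  // the compiler intrinsic reads the same counter.
  return (int64_t)__rdtsc();
}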

My measurements are done by reading the counter just before and just after the code I need to time and taking the difference. The minimum granularity is the processor clock. With today's machines running at 2 GHz or more, we are talking about less than a ns (10^-9 s), normally far more resolution than necessary. The only problem is that the processor clock is usually not a known quantity. A mainboard sold as 2.8 GHz nominal may actually generate a clock of 2753 MHz, so if you need to convert ticks to seconds precisely you need to know the exact clock that is running. Normally I compare the TSC counter difference with a known OS timing (like Sleep, or another suspend with timeout) for one second, and that should be precise enough for our needs.

You get the idea ...

"... at what moment do you need to shutdown the script engine?"

Actually you are right. I need to shut down the engine only when the application is closed. Actually, on real machines this will never happen. Machines are just turned off ... no shutdown procedure. So this may be a non problem for me.

Nevertheless I prefer that everything works cleanly before committing to this solution. I don't want to leave a memory leak or a memory fragmentation problem hiding in the code that may bite me in some other situation.

Automatic machines may run for days or even weeks without being shut down, so any problem that tends to grow over time is a no go.

Thanks all.

I will be back with further investigations in the shutdown area.

Regards.

Mau.

ziomau

Author

115

November 27, 2014 02:16 PM

I narrowed my problem down a bit: it happens only if I reload my script several times.
Let me explain what I am doing:
In my environment I have a single asIScriptEngine class instantiated.
This engine is initialized once with the following code:


  engine->SetDefaultAccessMask(0xFFFFFFFF);
  RegisterScriptMath(engine);         // Register math and trig functions add on
  RegisterEngineProperties(engine);   // Register engine properties
  RegisterStdString(engine);          // Register string class add on
  RegisterScriptArray(engine, true);  // Register generic array add on
  RegisterSystemVarsDecl(engine);     // Register 3 system class types
  RegisterSystemEnums(engine);        // Register about 60 system enums with a total of about 1000 values
  RegisterSystemFunctions(engine);    // Register about 100 system functions
  RegisterSystemVariables(engine);    // Register about 400 system variables of the declared types
  RegisterUserVariables(engine);      // Register about 350 user variables of the declared types
  RegisterUserIo(engine);             // Register about 100 IO definitions (one of the declared types)

If needed I can show the classes I am using, but I can say they are very simple, with only few methods and operators declared.
Actually they simply represent a system integer, float and boolean types respectively. Variables of these types live outside
the script and are accessed by the script when needed through the instance declared during the engine initialization.

Then I create a new module, load the byte code (compiled and saved separately), create a context, set the line callback.
Apart from error checking, and some simplification the code is:


  mod = engine->GetModule(modname, asGM_ALWAYS_CREATE);
  mod->LoadByteCode(&bs);
  ctx = engine->CreateContext();
  ctx->SetLineCallback(asFUNCTION(LineCallback), &lplc, asCALL_CDECL);
  mod->ResetGlobalVars(ctx);
  mainEntry = mod->GetFunctionByDecl("void main(void)");
  ctx->Prepare(mainEntry);

After this, the real time loop executes the code repeatedly every real time tick as in:


  while(WaitForNextInterrupt())
  {
    st = ctx->Execute();
    if(st == asEXECUTION_FINISHED)
      ctx->Prepare(plc->mainEntry);
  }

If I leave the above code running for long time this does not show the problem.
If I shutdown the program immediately, or even after one hour the results are the same:
I have, in both cases, a GC statistics printout like:


GetGCStatistics: currentSize:732, totalDestroyed:0, totalDetected:0, newObjects:732, totalNewDestroyed:0

and the full shutdown takes about 600 ms.
This time is surprisingly long, for what I expect should be done, but at least it is not minutes!
And the fact that it does not show an increase with time, makes me feel I am on the safe side here.

Now, regarding the problem ...
Every now and then, while the real time environment is still running, we need to change/fix, our code and,
as fast as possible, replace the running version of the code with the fixed code.
Apart from some synchronization between load phase and real time execution, and the obvious error checking, basically the code does
the same sequence as above from the GetModule up to the Prepare included.
But this time the code creates a new module and uses a different set of mod and ctx, leaving the current ones alone
(they may actually be running in the real time tick).
Then, I wait until the engine is not executing any code (between a real time tick and the next), and atomically swap the two sets of
ctx and mod.
Thus the next real time tick, the engine that was sitting in the while above executes the new code.
After we are sure everything went ok, the old mod and ctx are then destroyed/released.
For completeness I must say that the preparation code and the execution are done on different threads.
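
The swap step described above can be sketched with an atomic pointer exchange. `ScriptSet` and `ActiveScript` are hypothetical stand-ins for the (mod, ctx, entry point) set and its holder; this is not code from the application, and it assumes, as described, that the old set is only released once the real time thread can no longer be using it.

```cpp
#include <atomic>

// Hypothetical bundle standing in for one (module, context, entry) set.
struct ScriptSet {
    int version;
};

// Holds the set that the real time loop should execute.
class ActiveScript {
public:
    explicit ActiveScript(ScriptSet* initial) : current_(initial) {}

    // Real time thread: fetch the set to run for this tick.
    ScriptSet* acquire() const
    {
        return current_.load(std::memory_order_acquire);
    }

    // Loader thread: publish a fully prepared set and get the old one
    // back, so it can be destroyed later, outside the real time tick.
    ScriptSet* publish(ScriptSet* next)
    {
        return current_.exchange(next, std::memory_order_acq_rel);
    }

private:
    std::atomic<ScriptSet*> current_;
};
```

The exchange itself is a single atomic operation, so the real time loop sees either the old set or the new one, never a mix of the two.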

Setting or not the AS_NO_THREADS does not change the behavior as far as this problem is concerned.

When I reload the code again and again, the number of GC objects increases by about 700 each time,
and the time it takes to shut down increases proportionally.

The sequence I tested demonstrates that each time the preparation code is executed, about 700 objects hang out in GC.

Things get worse when the reload is repeated, but even the first time we have those 700 objects around.

So the point is:
Why do I have about 700 objects in the garbage collector when I explicitly disabled it?
Additionally, is the above sequence correct, or should I call some other cleanup function during the load/reload process?

Sorry for these long posts, but I am trying to be as clear and detailed as possible.

Thanks for any help.

Mau.

WitchLord

4,864

November 27, 2014 11:06 PM

Ah, now it is making much more sense. :)

The reason for the increase of objects in the GC for each recompilation is that every time a module is discarded, the script functions, classes, etc. are placed on the garbage collector so it can resolve the circular references between functions etc. Since you've turned off the automatic garbage collection (with SetEngineProperty(asEP_AUTO_GARBAGE_COLLECT, false)) and don't manually call the GarbageCollect() method, these objects are never cleaned up until you shut down the engine.

You just need to add a call to GarbageCollect() after you discard the module in order to avoid having the objects accumulate. Since you already allow reloading the script during execution, this should not be a problem. Though if you feel that GarbageCollect() is too expensive to do in a single go, you can call it incrementally over a longer period, thus avoiding an impact on the performance of the real time application.
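
The "call it incrementally" idea can be sketched as a small budgeted loop. This is a generic illustration, not AngelScript code: `runIncremental` is a hypothetical helper, and `step` stands for whatever performs one small unit of cleanup. With AngelScript, `step` would presumably wrap a single-step call such as engine->GarbageCollect(asGC_ONE_STEP); the exact flags and return value should be checked against the manual.

```cpp
// Runs at most maxSteps increments of some cleanup work.
// step() must return true while work remains and false once finished.
// Returns false if the work completed, true if the budget ran out first.
template <typename Step>
bool runIncremental(Step&& step, int maxSteps)
{
    for (int i = 0; i < maxSteps; ++i) {
        if (!step())
            return false;  // nothing left to do
    }
    return true;  // budget exhausted, call again later
}
```

A low-priority thread or the idle part of each tick could call this with a small budget until it returns false, spreading the cleanup over many ticks.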

I will definitely look into reducing the amount of objects that will be placed on the garbage collector after discarding a module. Though, recompiling (or in your case, reloading) scripts shouldn't really be considered as something you'd do in a real-time environment. ;)

Even if your LineCallback normally doesn't do anything, it is still a function call that will add overhead to the execution time, thus degrade the performance. You should only set the LineCallback when you actually need to inspect the execution, and then remove it when no longer inspecting it. That way you will only have a performance impact when you're inspecting the execution. You shouldn't need to stop the context to set or remove the LineCallback. It ought to be possible to set it from a second thread even while the context is running in the main thread.

The GetLineNumber() call is relatively slow because, contrary to what you think, this information is not readily available to the context. It has to be looked up by doing a binary search mapping the current bytecode position to the line number. Normally the context has no need to know the line number, so it doesn't make much sense to keep it in a faster but more memory consuming structure. You can see what goes on when calling GetLineNumber() in as_scriptfunction.cpp (asCScriptFunction::GetLineNumber).
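
To illustrate the kind of lookup described above, here is a minimal sketch of a binary search from bytecode position to line number. It is not the code from as_scriptfunction.cpp; the `LineEntry` layout and the `lookupLine` name are assumptions made for the example.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical table entry: the line that starts at a bytecode position.
// The table is assumed to be sorted by bytecodePos.
struct LineEntry {
    int bytecodePos;
    int line;
};

// Returns the line of the last entry whose position is <= pos,
// or 0 if pos lies before the first entry.
inline int lookupLine(const std::vector<LineEntry>& table, int pos)
{
    // Find the first entry that starts strictly after pos.
    auto it = std::upper_bound(
        table.begin(), table.end(), pos,
        [](int p, const LineEntry& e) { return p < e.bytecodePos; });
    if (it == table.begin())
        return 0;
    return std::prev(it)->line;
}
```

Each lookup costs O(log n) comparisons in the number of table entries, which is cheap in isolation but adds up when it runs on every executed line.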
