Advertisement

How to call native functions in C++-bytecode

Started by August 13, 2020 04:41 PM
31 comments, last by Juliean 4 years, 2 months ago

Ok, so I'm finally going to replace the current backend for my visual scripting with a proper bytecode-interpreter/VM. While I got far with my current approach of pretty much misusing the c++-template-system as a poor mans compiler is reaching its limit (mostly in terms of performance; which in term also limits my ability to implement certain new features).

While I think I do have a certain idea on how to tackle this transformation, on thing thats still a bit unclear is the topic of calling native functions via my own bytecode. So lets talk about the requirements first:

  • C++20, x86 on Windows (Visual Studio) is all thats needed for now (though I could switch to x64 if it makes things easier; I'll stick with VS 100%)
  • Users must be free to register eigther global functions. lambdas or member functions w/o being required to write wrappers (though template-metaprocessing as well as reflection can obviously be used to generate the necessary code)
  • Only interested in one calling-convention for now. I'm using cdecl, but I could do any one that is the easiest
  • I have a limited know of data-types that could be passed, though there are a few variants (ie. primitive types int/float/uint8_t by value or by ref; objecty by pointer or ref or ** for modifying what object is stored at a certain location; structs by value or ref or const&)
  • I don't really want to limit the amount of parameters to a function; but if it makes it easier for now I could have a limit of like 6 parameters for native functions

Ok, now to the question: How would I go about preparing the stack/arguments for calling a function that I have the address for, as well as all arguments on a stack? Lets maybe break the question down to a few key points:

  1. How do I even start? I looked a bit at the assembly that is involved in c++-code calling functions; but how would I achieve this myself? Do I have to write inline-assembly or something even more complicated?
  2. How would I go about passing arguments? Again I looked a bit at the assembly, and it seems for functions that take primitive types (starting easy), the compiler seems to be pushing arguments onto more and more registers. But like, how would I know which registers to use; what am I doing when I run out of registers and how do I handle more complex cases like struct-by-value etc…? While I could find that out probably by trial of error, I was wondering if there was some documentation/tutorial/articles that handles this in-depth?
  3. A mix of 1-2: Given any number of parameters not know until I run my own compiler, how do I generate the code necessary to handle those parameters in c++?

So, can somebody give me some pointers? I think I can still use my template-based approach for accessing arguments of the stack for the start so this shouldn't be a dealbreaker, but if I'm already going all-in for speed I'd just rather do it right from the start.

The ABI for different platforms varies. Windows x64 and Win32 are different. The x64 version is simpler as there has been less time for cruft to accumulate.

Here is the x64 calling convention: https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2019​

The 32-bit conventions support multiple calling styles, including cdecl, stdcall, fastcall, vectorcall, safecall, thiscall, syscall, and optlink conventions. Each one has variations on which registers are used, their ordering, their preservation, and setup/cleanup.

The overall process is basically:
Prepare your own system by saving registers you need, putting the arguments directly into the corresponding register, or the address of the argument into the proper register, or on the stack.

Juliean said:
How do I even start? I looked a bit at the assembly that is involved in c++-code calling functions; but how would I achieve this myself? Do I have to write inline-assembly or something even more complicated?

The easiest is cdecl format since it is done entirely on the stack, which can be easily done in an __asm block. Push the frame information (which helps with debugging and exception handling but may be turned off with compiler options), push parameters on the stack in the correct order, use the asm call instruction to the function address, then pop results and restore the call frame.

The other formats place variables in registers, potentially using general purpose registers, floating point registers, SIMD registers, and also using the stack as needed. The basic pattern is similar to those already mentioned. Save what you need to save including frame information, put your parameters into the appropriate locations, call the function address, do the appropriate cleanup for the calling convention to capture returned values and restore the frame pointer.

Juliean said:
How would I go about passing arguments? Again I looked a bit at the assembly, and it seems for functions that take primitive types (starting easy), the compiler seems to be pushing arguments onto more and more registers. But like, how would I know which registers to use; what am I doing when I run out of registers and how do I handle more complex cases like struct-by-value etc…? While I could find that out probably by trial of error, I was wondering if there was some documentation/tutorial/articles that handles this in-depth?

Yes, there are both tutorials and documentations. MSDN has rather comprehensive documentation, I linked to the x64 version, but for the 32-bit variety you need to look up each calling convention individually.

Juliean said:
I have a limited know of data-types that could be passed, though there are a few variants (ie. primitive types int/float/uint8_t by value or by ref; objecty by pointer or ref or ** for modifying what object is stored at a certain location; structs by value or ref or const&)

You will need to know exactly what type of data is being passed and expected on the other side. There are some automatic promotion rules which may need to apply, such as storing your uint8_t in a 64 bit RCX register. Structs/unions/classes and certain SIMD types are passed as pointers, either pointing directly to the object, or when it was otherwise optimized away placing the value on the stack and pointing to the stack location, then using that pointer.

Juliean said:
Given any number of parameters not know until I run my own compiler, how do I generate the code necessary to handle those parameters in c++?

That's going to depend on your code and the calling convention you're using.

Again, 32-bit cdecl would be the easiest as values are all placed on the stack. Unless you're getting fancy with return values you can loop with mov, push, mov, push, mov, push… until they're all in place. If you're using non-POD data types you may need a bonus value for a hidden parameter, but it's easy enough to implement.

More complex functions would involve passing a mix of data types like integers, floating point values, SIMD variables, and POD structures where different data types need to be placed in different registers. Even more complex is when there are more parameters than will fit in the number of registers, like the six parameter version you described when the four registers used would be rcx, rdx, r8, and r9, then placed on the stack in the shadow space.

Juliean said:
So, can somebody give me some pointers?

Make sure you're comfortable with at least the rudimentary aspects of assembly programming first.

Learning the basics is not a particularly difficult task, but does take some time and acclimation. The environment is quite constrained with a relatively small number of registers and commonly used operations. You won't need to learn all the obscure opcodes, just enough to understand what is going on. As far as assembly language goes, being able to manipulate data in registers, manipulate the stack, and call functions are straightforward tasks.

I cut my teeth on it self-taught in high school (early 1990s) and two college courses included assembly programming. These days it is rarely taught in school, although a few schools have kept assembly as an optional course. Bulletin boards (imagine dial-up websites if you don't know them) had references like Ralf Brown's famous Interrupt List, which started as a university professor's personal documentation and grew to an enormous online reference for assembly programmers, most of it still applies today. I had a printout of the thing when it was close to 400 pages, all printed on an old dot matrix printer at roughly 6 point font.

Advertisement

At the level of “your own bytecode” you should probably translate arbitrary function calls into generic elementary operations that can be combined to handle the buffers you use as the stack of your function calls:

  • Copy a value of a certain type from or to a representation with a certain format, either at an arbitrary position in memory or in a certain CPU register
  • Alter the SP register (possibly combined with writing to the stack, like in the PUSH x86 instruction)
  • The function call itself, presumably just a CALL instruction or something similar

Omae Wa Mou Shindeiru

@frob Thanks a lot, thats making things a lot clearer already! Even so, I'm still a bit confused about the details of:

frob said:
Again, 32-bit cdecl would be the easiest as values are all placed on the stack. Unless you're getting fancy with return values you can loop with mov, push, mov, push, mov, push… until they're all in place. If you're using non-POD data types you may need a bonus value for a hidden parameter, but it's easy enough to implement.

So with loop, do you mean loop within the asm-blocks or looping eigther within my bytecode or inside my bytecode-instruction before the actual call happens? I'm just unsure because I'm assuming that when I start to push my arguments on the stack, I cannot have my own code modify the stack as well as I need my arguments to be there when I issue the call-asm. Or is it just a matter to make sure that whatever my code does for pushing the argument is off-stack when I put the argument there via asm?

frob said:
You will need to know exactly what type of data is being passed and expected on the other side. There are some automatic promotion rules which may need to apply, such as storing your uint8_t in a 64 bit RCX register. Structs/unions/classes and certain SIMD types are passed as pointers, either pointing directly to the object, or when it was otherwise optimized away placing the value on the stack and pointing to the stack location, then using that pointer.

At least for having that information, its not a problem. And again, everything is pretty much streamlined to a few potential cases (everything else would be a compile-error. I just have primitive (int/float/uint8_t), string, enum class (arbitrary unsigned datatype), object/class-types, structs and thats it. No unions or SIMD-types. Obviously a few different modes by which arguments can be passed, but also limited (int& can be passed to signify a ref-argumentt; but const int& would be considered an error as convention for primitives is by-value). Its pretty much tailored towards the front-end representation of my unreal/blueprint-esque visual scripting; which cannot handle arbitrary values but more like a superset of types.
Well ok, there's also system/user-defined “supplied” values which are not represented in the graph like the current timestep, scene etc… but those also at least have a predefined mode for how they are passed.

frob said:
Make sure you're comfortable with at least the rudimentary aspects of assembly programming first. Learning the basics is not a particularly difficult task, but does take some time and acclimation. The environment is quite constrained with a relatively small number of registers and commonly used operations. You won't need to learn all the obscure opcodes, just enough to understand what is going on. As far as assembly language goes, being able to manipulate data in registers, manipulate the stack, and call functions are straightforward tasks.

I think I do understand the basics. I was looking at unoptimized assembly to get a basic understand of what I can/should do with my own bytecode. Not sure if thats enough though. At least for creating the bytecode format and my own stack myself I think I'm good for now, so I might just keep with my old format of calling functions to get started (that should mean that I won't need to use assembly). I'm still going to do it eventually, but that way I might get faster results. Currently my functions would be called by a wrapper, that via template-magic would expand to something like this:

void nativeFunction(int a, int b, float c);

void call_nativeFunction(CallState& state)
{
	nativeFunction(getArg<int>(state), getArg<int>(state), getArg<float>(state));
}

I think if I use that format, and instead of doing complicated stuff of dynamically looking up arguments and instead just put them on my own stack before the call and have getArg simply pop it, that might be good enough just to get things running.

LorenzoGatti said:
At the level of “your own bytecode” you should probably translate arbitrary function calls into generic elementary operations that can be combined to handle the buffers you use as the stack of your function calls: Copy a value of a certain type from or to a representation with a certain format, either at an arbitrary position in memory or in a certain CPU register Alter the SP register (possibly combined with writing to the stack, like in the PUSH x86 instruction) The function call itself, presumably just a CALL instruction or something similar

So essentially you mean having my own instructions for thuse kind of things, right? Aside from having push/pops onto my own stack I would then have separate instructions that would take values of the stack and put them into the designated registers and so on?

Juliean said:
Users must be free to register eigther global functions. lambdas or member functions w/o being required to write wrappers (though template-metaprocessing as well as reflection can obviously be used to generate the necessary code)

I think you have two options here, one is as discussed, you need to implement the proper ABI to deal with parameters and return types.

The other option is to only support a small number of specific signatures.

CPython for example I believe does this exclusively:

PyObject* print_score(PyObject* self, PyObject* args)
{
    const char *name;
    int score;
    if (!PyArg_ParseTuple(args, "si", &score))
        return NULL;
    printf("%s: %d\n", name, score);
    Py_RETURN_NONE; // Py_INCREF(Py_None); return Py_None;
}

In this way the byte code and implementation doesn't need to worry about different types etc., and is likely very similar to calling script functions, as all the data types are the same, in a way a script function is just a special exec native function that takes a block of bytecode as the first param.

On the C++ side you could potentially do clever stuff to automate some of that, but it is separate from the interpreter API itself, and doesn't have any special byte code representation.

For example a variadic template function can check the size of args, then unpack each.

SyncViews said:
In this way the byte code and implementation doesn't need to worry about different types etc., and is likely very similar to calling script functions, as all the data types are the same, in a way a script function is just a special exec native function that takes a block of bytecode as the first param. On the C++ side you could potentially do clever stuff to automate some of that, but it is separate from the interpreter API itself, and doesn't have any special byte code representation. For example a variadic template function can check the size of args, then unpack each.

Thats already how my current setup works ? At the moment, I invoke each function via a pointer to a global handler like so:

using ScriptFunction = void (*)(CallState&amp; state);

And internally, I ran some very complicated template-process that will essentially generate a function that looks like what I posted above:

void nativeFunction(int a, int b, float c);

void call_nativeFunction(CallState&amp; state)
{
	nativeFunction(getArg<int>(state), getArg<int>(state), getArg<float>(state));
}

(The one thing that I really do not wanna have to do is write wrapper functions myself. I started out like this, but ever since I'm able to do that:

	void SceneGameSystem::OnRegister(sys::StringView strFilter)
	{
		event::registerBindFunction(&SceneGameSystem::ActivateScene, strFilter, L"ActivateSceneNoBlock(scene)");

		event::registerBindFunction(&SceneGameSystem::GetActiveScene, strFilter, L"scene GetActiveScene");
		event::registerBindFunction(&SceneGameSystem::IsSceneLayerActive, strFilter, L"active IsSceneLayerActive(scene, layer)");
	}

There is no way I'm going back to writing those wrappers myself)

The main problem with the execution-approach in my current system is that everything is based upon calling those function, including primitives like conditions, loops, etc… and as its graph-based the entire flow of excection to be evaluated dynamically as well (there is a compiler, but its not really that complete).

So I do think I could keep my template-based binding code (by porting it to read from my new stack) to get started. I was just thinking that for performance, I would want to do native calls. Or is there not even that much difference?

Advertisement

Juliean said:
The main problem with the execution-approach in my current system is that everything is based upon calling those function, including primitives like conditions, loops, etc… and as its graph-based the entire flow of excection to be evaluated dynamically as well (there is a compiler, but its not really that complete).

I think that is fairly common with interpreters, if I recall Python and Ruby work that way for operators even, and in Rubys case most of the common loops (I tend to use `x.each do |y|` etc. a lot more than the language level `while` etc.). The main optimisation I recall is CPython stores the pointers to certain common functions directly in the type-struct objects, rather than a generic method table, and I think Ruby has an inline cache slot at every call site to save the function pointer as long as no types are modified, and in practice a lot of call sites will always be the same object type., so it ends up like

y = x + 5

// CPython effectively does this
x = x->ob_type->tp_as_number->nb_add(x, const_5);

// MRI Ruby does something like
// somewhere at compile/init time
static ID ID_ADD = rb_intern("+");
// call site
static int cache_global_method_state = -1; // global_method_state increments every time a method is defined somewhere
static int cache_type_id; // not sure exactly how Ruby defines this, I couldn't find a public API quickly, RubyVM.stat.class_serial is the counter I think
static const rb_callable_method_entry_t  *cache_func; // The cached method
if (global_method_state != cache_global_method_state || cache_type_id != type_id(x))
{
    cache_global_method_state = global_method_state;
    cache_type_id = type_id(x);
    cache_func = rb_search_method_entry(x, ID_ADD);
}
// y = cache_func(x, const_5);
VALUE y = rb_vm_call_kw(execution_context, x, ID_ADD, 1, &const_5, cache_func, RB_NO_KEYWORDS);

Juliean said:
Or is there not even that much difference?

There is some difference, but an interpreted language is always fairly slow to start with. And consider with dynamic types it can't know at compile time if “x + y” or “x.foo(a, b)” what x is, if it is a native function or a script function, if x even has such a method, if (a, b) is the correct param count, what the param types are, etc.

If you just push the argument count then a series of “PyObject” or “VALUE” style generic stuff onto the stack so the , that is basically what say the Python or Ruby is doing, the only real difference I see is them using a `PyObject *tuple` or `int argc, VALUE *argv` and letting the callee figure out if it is a valid call, while a native C function generally expects the caller to set the registers and stack up right.

Seems more like the sort of thing a JIT optimiser might do (and I haven't really looked at what say JS, Java or .NET do when they encounter either some sort of built in native function, or an extension/dllimport type thing), and JIT's I believe will create something more like statically typed code, and throw it out if something changes one of the type assumptions.

SyncViews said:
There is some difference, but an interpreted language is always fairly slow to start with. And consider with dynamic types it can't know at compile time if “x + y” or “x.foo(a, b)” what x is, if it is a native function or a script function, if x even has such a method, if (a, b) is the correct param count, what the param types are, etc.

The really bad thing in my case was that its not a problem about not knowing enough about the types upfront. I do have a very statically typed language, functions are all known at point of compilation (with the support for virtual calls). There are “wildcards” which are like generic/template-functions that only can have one generic argument. So there is no inherent slowness to how my language is represented, its just a matter of choosing an efficient representation format; and thats what I'm going for ?

I mean, I did go through various iterations of “using the same representation thats used for displaying the node-graph for execution” to “having a basic compiler that will compact from representable format to something more dense and optimized”. Though I'm still nowhere near the performance I want to have. Especially in debug its really bad, slower by like the factor of 75-100x (I'm not kidding). I don't have any direct comparisons to how other, (interpreted) languages would fair with what I'm doing, but I'm assuming you are supposed to get more than mid-complex player-controller script before your frametime goes to 3ms in debug builds ?

SyncViews said:
Seems more like the sort of thing a JIT optimiser might do (and I haven't really looked at what say JS, Java or .NET do when they encounter either some sort of built in native function, or an extension/dllimport type thing), and JIT's I believe will create something more like statically typed code, and throw it out if something changes one of the type assumptions.

Well, I'll be looking at JIT-ing my own bytecode to get even further; but only after everything is done. This transformation to bytecode alone will easily take weeks/months; as by this point my language is pretty far evolved so I have to lots of catch-up to get everything working again :D

Juliean said:
Users must be free to register eigther global functions. lambdas or member functions w/o being required to write wrappers (though template-metaprocessing as well as reflection can obviously be used to generate the necessary code)

When you write ‘users’ who are really? It seems to me you are referring to programmers using C++ (that is, engine programmers).

It happens I've done something similar before. Is your need to make a C++ function available to script?

Previously "Krohm"

Krohm said:
When you write ‘users’ who are really? It seems to me you are referring to programmers using C++ (that is, engine programmers). It happens I've done something similar before. Is your need to make a C++ function available to script?

I mean Engine and Plugin-Developers using C++, but at the moment what I really mean is just “me” :D

And yes, I want to make c++-functions available to script; without having to write a custom wrapper. Been there, done that, didn't like it, being able to write 1 call to register the function is so much sweeter. Also since thats already existing, I don't want to change it even further; but what I now want is optimally being able to replace my template-meta"compiler" with a bytecode-generator that'll setup the necessary state and call the function-pointer directly.

This topic is closed to new replies.

Advertisement