
How to call native functions in C++-bytecode

Started by August 13, 2020 04:41 PM
31 comments, last by Juliean 4 years, 3 months ago

Juliean said:
Hold on, now. That's already what I'm doing (= writing my own OpCodes)

Ah ok, maybe I misunderstood the term “asm-blocks” you used in the first few posts?

Juliean said:
that it's still slower than native code

It will always be slower than native code written in C++, at least. Look at how long Microsoft has worked on C#, and it is still considered bad for huge computational tasks like verifying a Bitcoin transaction (which is horribly slow, I tried that).

Ok, so I might not have got what exactly your issue is in calling functions from your OpCode. My route to achieving that would have been to have a call instruction, or maybe two (one for calling a script and one for switching into C++), and finally some meta instructions that know about the signature of the function; or you use a generic wrapper method or a wrapper struct that is no more complicated to the user than a simple function pointer, like I did for my embedded console.

Especially if you limit the calling convention to only cdecl for now, this gets a lot easier with an implicit ctor added to the wrapper struct.

template<typename ret do_if(ORDER, _separator) variadic_decl(typename Args, ORDER)> struct StaticCallContext<ret (variadic_decl(Args, ORDER))>
{
    public:
        typedef ret (*FunctionPointer) (variadic_decl(Args, ORDER));

        ...

        // Invokes the statically-bound function pointer 'type': each argument
        // is deduced out of the void** array and the result is written back
        // into args[ORDER] (the slot after the last argument).
        template<FunctionPointer type> static force_inline void AnonymousFunctor(void* target, void** args)
        {
            *reinterpret_cast<typename SE::TypeTraits::Const::Remove<typename SE::TypeTraits::Reference::Remove<ret>::Result>::Result*>(args[ORDER]) = type(variadic_deduce(Args, args, ORDER));

            (void)target;
            (void)args;
        }
};

My code looks a bit template/macro heavy because I aim to stay on the '99 standard with barely any C++11 features. My deduce macro looks like this:

...
#define XP_VARIADIC_DEDUCE_ARGSVAL19(type_name, args_name) XP_VARIADIC_DEDUCE_ARGSVAL18(type_name, args_name), *reinterpret_cast<typename SE::TypeTraits::Reference::Remove<JOIN(type_name, 18)>::Result*>(args_name[18])
#define XP_VARIADIC_DEDUCE_ARGSVAL20(type_name, args_name) XP_VARIADIC_DEDUCE_ARGSVAL19(type_name, args_name), *reinterpret_cast<typename SE::TypeTraits::Reference::Remove<JOIN(type_name, 19)>::Result*>(args_name[19])
#define XP_VARIADIC_DEDUCE_ARGSVAL21(type_name, args_name) XP_VARIADIC_DEDUCE_ARGSVAL20(type_name, args_name), *reinterpret_cast<typename SE::TypeTraits::Reference::Remove<JOIN(type_name, 20)>::Result*>(args_name[20])

#define variadic_deduce(type_name, args_name, count) JOIN(XP_VARIADIC_DEDUCE_ARGSVAL, count)(type_name, args_name)

The code simply assumes an array of void* pointing to the arguments passed into the surrounding call, like I do in my flex-type (which is like C# dynamic), or a stack, or whatever location you save the parameters to. I don't perform further checks on signature or number of arguments, so maybe you want to do that in your code. My macro code then simply deduces the array according to the number and type of template arguments, which are named by the variadic_decl macro from 0 to _N (where N is the number of arguments); that's why the deduce macro is able to access them at each position and finally perform a reinterpret_cast.
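
For comparison: with C++14 available, the same void**-deduction collapses into a pack expansion over an index sequence. A minimal sketch of the idea (the Thunk name, the args[N] return-slot convention and the add() example are my own illustration, not the macro code above):

#include <cstddef>
#include <utility>

// Uniform thunk signature: args[0..N-1] point at the arguments,
// args[N] points at storage for the return value.
template<typename Ret, typename... Args>
struct Thunk
{
    typedef Ret (*Fn)(Args...);

    template<Fn fn>
    static void call(void** args)
    {
        dispatch<fn>(args, std::make_index_sequence<sizeof...(Args)>{});
    }

private:
    // The index pack reads each argument slot in order; for brevity this
    // assumes by-value parameters and a non-void, assignable return type.
    template<Fn fn, std::size_t... I>
    static void dispatch(void** args, std::index_sequence<I...>)
    {
        *static_cast<Ret*>(args[sizeof...(Args)]) =
            fn(*static_cast<Args*>(args[I])...);
    }
};

int add(int a, int b) { return a + b; }

// Every bound function ends up behind the same uniform signature:
void (*entry)(void**) = &Thunk<int, int, int>::call<&add>;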

I support different calling conventions as well as member-function pointers, so I had to split those helper methods into different templates. But as I wrote above, if you intend to support just a single calling convention, an implicit ctor of a template struct might simplify everything for you and your users.


Shaarigan said:
Ah ok, maybe I misunderstood the term “asm-blocks” you used in the first few posts?

Yeah, that was just for setting up the state for the native call, where my own CALL-opcode would not call a function pointer to a wrapper, but really the address of the native function itself.

Shaarigan said:
It will always be slower than native code written in C++, at least. Look at how long Microsoft has worked on C#, and it is still considered bad for huge computational tasks like verifying a Bitcoin transaction (which is horribly slow, I tried that).

That's totally fine 🙂 I always have the option to write performance-intensive parts in C++; I'm just OK with not being as horribly slow as what I had before. If I ever needed even more performance in builds, I could still think about code generation like IL2CPP does.

Shaarigan said:
Ok, so I might not have got what exactly your issue is in calling functions from your OpCode. My route to achieving that would have been to have a call instruction, or maybe two (one for calling a script and one for switching into C++), and finally some meta instructions that know about the signature of the function; or you use a generic wrapper method or a wrapper struct that is no more complicated to the user than a simple function pointer, like I did for my embedded console.

My “issue” is just that I didn't want to have a wrapper at all :D If I have a global function at address 0x074356af, then I wanted my CALL-opcode to be able to issue a CALL-instruction that invokes 0x074356af. But the more I think about it, the more OK I am with keeping my wrapper. There are lots of features that go beyond what just the type system knows (i.e. being able to accept arguments as std::pair/tuple), and having to do the generation of all that just so I can really have my native call happen is probably way too much for now. I'll just start using my existing wrapper with slight modifications, and have my OpCodes::Call simply invoke that, probably storing the call arguments on the stack.
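
For illustration, a wrapper-invoking Call handler could look roughly like the following; NativeThunk, CallTarget and the uint64_t slot stack are my assumptions about the shape, not the actual VM:

#include <cstddef>
#include <cstdint>
#include <vector>

// Uniform native-call signature: the thunk receives pointers into the
// VM stack for each argument, plus one slot for the return value.
using NativeThunk = void (*)(void** args);

struct CallTarget
{
    NativeThunk thunk;
    uint8_t     argCount;   // argument slots consumed from the VM stack
    bool        hasReturn;  // whether a result slot is pushed afterwards
};

// Sketch of an OpCodes::Call handler: gather pointers to the top
// argCount stack slots, reserve a return slot, invoke the thunk.
void executeCall(std::vector<uint64_t>& stack, const CallTarget& target)
{
    void* args[17]; // up to 16 arguments plus the return slot (sketch cap)
    const std::size_t base = stack.size() - target.argCount;
    for (std::size_t i = 0; i < target.argCount; ++i)
        args[i] = &stack[base + i];

    uint64_t returnSlot = 0;
    args[target.argCount] = &returnSlot;

    target.thunk(args);

    stack.resize(base);              // pop the arguments
    if (target.hasReturn)
        stack.push_back(returnSlot); // push the result
}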

Shaarigan said:
The code simply assumes an array of void* pointing to the arguments passed into the surrounding call, like I do in my flex-type (which is like C# dynamic), or a stack, or whatever location you save the parameters to. I don't perform further checks on signature or number of arguments, so maybe you want to do that in your code. My macro code then simply deduces the array according to the number and type of template arguments, which are named by the variadic_decl macro from 0 to _N (where N is the number of arguments); that's why the deduce macro is able to access them at each position and finally perform a reinterpret_cast.

I think my setup will need to be a little more complicated, as I natively support types that are smaller than one word (like bool and uint8_t). And I would also like primitive types to be put directly into the parameter-store (instead of having a void* to a bool). That would mean I have to deal with arguments being at variable locations depending on the arguments that come before; unless I simply require small types like bool to always be expanded to a word (though when I think about 64-bit, I'm not sure I want all small primitive types to expand to 8 bytes).
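
The "expand small types to a word" option boils down to a fixed-size slot plus a pair of pack/unpack helpers; a minimal sketch (the Slot type and helper names are mine):

#include <cstdint>
#include <cstring>

// Fixed-size parameter slot: every primitive is widened to one machine
// word, so argument i always lives at slot i and no per-call layout
// computation is needed. The cost: a bool occupies 8 bytes.
using Slot = uint64_t;

template<typename T>
Slot packSlot(T value)
{
    static_assert(sizeof(T) <= sizeof(Slot), "only word-sized primitives");
    Slot s = 0;
    std::memcpy(&s, &value, sizeof(T)); // memcpy avoids type-punning UB
    return s;
}

template<typename T>
T unpackSlot(Slot s)
{
    T value;
    std::memcpy(&value, &s, sizeof(T));
    return value;
}

The packed alternative instead needs a per-signature offset table computed once at bind time, which is more compact in memory but is exactly the variable-location bookkeeping described above.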

Shaarigan said:
I support different calling conventions as well as member-function pointers, so I had to split those helper methods into different templates. But as I wrote above, if you intend to support just a single calling convention, an implicit ctor of a template struct might simplify everything for you and your users.

Not sure I fully understand your code, but I don't see how I would need different templates/helper-methods for different calling conventions in my own system, as the variadic-arg expansion should work with any convention. Or are we now talking about how we ourselves handle the calls from bytecode, and not how the native functions are declared?

Juliean said:
but I don't see how I would need different templates/helper-methods for different calling conventions in my own system

Oh, it wasn't meant for you but for my generic delegate, which accepts both static and member-function calls using different calling conventions.

Juliean said:
are we now talking about how we ourselves handle the calls from bytecode, and not how the native functions are declared?

My intention was a little bit of both. But yes, as I used that code to generalize the call from my generic delegate class to an embedded function, regardless of how it is declared, what convention it uses, and whether it is a static or member-function call using this, you could maybe also wrap the call into such a generalized function and instead put that wrapper call onto your call-stack. This is how my delegate does it: instead of pointing to the function the user intended to call, it points to a wrapper function with always the same signature.

You don't take the struct, you just take a pointer to a template function that declares the original function to call as a template argument.

Juliean said:
My “issue” is just that I didn't want to have a wrapper at all :D If I have a global function at address 0x074356af, then I wanted my CALL-opcode to be able to issue a CALL-instruction that invokes 0x074356af

I thought about that issue for a very long time in the case of my dynamic delegates, and also for the command line embedded into C++ to be used in-game via a simple string-to-function conversion. TL;DR so far: it is not possible, at least not without a lot of template magic on either side, because you simply don't have control over the native call-stack that easily without doing some heavy memory shifting and maybe corrupting the stack at some point. So getting a unified call signature and doing some magic in the background is absolutely fine for me at least, and as long as your user doesn't have to write a wrapper function by hand, this should be totally fine.

Don't know if this sounds good to you, but for me (as a super picky programmer) this is a solution I'm fine using, and the impact is just the overhead of unpacking the parameters from your local OpCode stack to the native C++ call-stack.

Shaarigan said:
Oh, it wasn't meant for you but for my generic delegate, which accepts both static and member-function calls using different calling conventions.

Ah, I get it. I was doing something similar for my delegates, to support lambdas with a capture of at most the size of void*. My trick involves generating a fake lambda and calling that lambda's operator() with the original lambda's payload as the address:

template<typename Functor>
void StoreFunctor(Functor functor)
{
	static_assert(std::is_trivially_destructible_v<Functor>, "Functor must be trivially destructible to be stored in Delegate (meaning its capture consists only of trivially destructible types)");

	// Fake lambda with the same capture layout (a single pointer); its
	// operator() is what actually gets stored, and it will later run with
	// the original functor's payload bits in place of its own capture.
	const auto redirectCall = [ptr = &functor](Args... args) -> Return
	{
		const auto helper = sys::memoryRead<Functor*>(&ptr);
		const auto functor = *sys::memoryRead<Functor*>(&helper);

		return functor(std::forward<Args>(args)...);
	};

	// Pointer to the fake lambda's call operator...
	const auto op = &decltype(redirectCall)::operator();

	// ...paired with the original lambda's payload reinterpreted as the
	// object pointer, so invoking func on obj replays the capture.
	m_closure.obj = sys::memoryRead<void*>(functor);
	m_closure.func = sys::memoryRead<FuncPtr>(op);
}

I could probably update the code now with C++20 (memoryRead is pretty much std::bit_cast, and now I could use the lambda directly inside decltype), and I probably don't need two reads to get the functor, but I was getting a bit confused by all the memory-indirections and just left it in a working state :D
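
For reference, a C++17/20-era take on the same small-capture delegate without the fake-lambda trick might look like this sketch; SmallDelegate and all member names are mine, and it uses a captureless trampoline (which decays to a plain function pointer) instead of the pointer-to-member closure:

#include <cstring>
#include <type_traits>
#include <utility>

template<typename Return, typename... Args>
class SmallDelegate
{
public:
    template<typename Functor>
    void store(Functor functor)
    {
        static_assert(std::is_trivially_copyable_v<Functor>,
                      "capture must be trivially copyable");
        static_assert(sizeof(Functor) <= sizeof(void*),
                      "capture must fit into one pointer");

        // Smuggle the capture bytes into a pointer-sized slot.
        m_payload = nullptr;
        std::memcpy(&m_payload, &functor, sizeof(Functor));

        m_invoke = [](void* payload, Args... args) -> Return
        {
            // Rebuild the functor from the stored bits; sound because the
            // static_assert above guarantees trivial copyability.
            alignas(Functor) unsigned char buf[sizeof(Functor)];
            std::memcpy(buf, &payload, sizeof(Functor));
            return (*reinterpret_cast<Functor*>(buf))(std::forward<Args>(args)...);
        };
    }

    Return operator()(Args... args) const
    {
        return m_invoke(m_payload, std::forward<Args>(args)...);
    }

private:
    void* m_payload = nullptr;
    Return (*m_invoke)(void*, Args...) = nullptr;
};

Usage would be e.g. SmallDelegate<int, int> d; int base = 10; d.store([base](int x) { return base + x; }); after which d(5) yields 15.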

Shaarigan said:
My intention was a little bit of both. But yes, as I used that code to generalize the call from my generic delegate class to an embedded function, regardless of how it is declared, what convention it uses, and whether it is a static or member-function call using this, you could maybe also wrap the call into such a generalized function and instead put that wrapper call onto your call-stack. This is how my delegate does it: instead of pointing to the function the user intended to call, it points to a wrapper function with always the same signature. You don't take the struct, you just take a pointer to a template function that declares the original function to call as a template argument.

Yep, I think I'm going to do it that way 🙂 So far I did support binding lambdas with captures as well (generated at compilation), so I might need a separate opcode for that (unless it won't be necessary anymore, if everything I did that way before can just be expressed as bytecode; but I'll see).

Shaarigan said:
I thought about that issue for a very long time in the case of my dynamic delegates, and also for the command line embedded into C++ to be used in-game via a simple string-to-function conversion. TL;DR so far: it is not possible, at least not without a lot of template magic on either side, because you simply don't have control over the native call-stack that easily without doing some heavy memory shifting and maybe corrupting the stack at some point. So getting a unified call signature and doing some magic in the background is absolutely fine for me at least, and as long as your user doesn't have to write a wrapper function by hand, this should be totally fine.

Going from a) frob's answer, b) my own understanding of ASM, and c) the fact that I believe at least AngelScript and a few other languages do it, I'm fairly confident that it is at least possible. I could write inline assembly that accesses some elements from my stack and places them onto the application's stack or into some register. But I just don't know the “magic” that goes into doing all that while not using the same stack space or registers myself (pretty much what you describe here as well). So I won't bother with it for now.

Shaarigan said:
Don't know if this sounds good to you, but for me (as a super picky programmer) this is a solution I'm fine using, and the impact is just the overhead of unpacking the parameters from your local OpCode stack to the native C++ call-stack.

It's good enough, at the very least for now 🙂 Keep in mind that I did like half a day of research before I started, and my default mindset was that I was pretty much going to have to do native calls. After everything you guys said, I'm pretty OK with just keeping the wrapper (though I still wouldn't mind understanding the last part that I'm missing about the native calls).

Why are you building all this interpreter machinery? Can you be clearer about the use case?

Using Mono as an engine for scripts and mods is common and works well. You have to write in C#, but if you're running a bytecode interpreter, any speed advantage for C++ has already been lost. Also, C# doesn't have memory safety issues like C++.


Nagle said:
Why are you building all this interpreter machinery? Can you be clearer about the use case?

Again, the main reason is just that that's what I do: I write everything myself in that project. No particular reason, maybe that I don't like depending on other libraries and enjoy having full control, but that's more of a subjective thing than anything. You might as well ask “why did you write your own platform-GUI-rendering library and didn't use Qt? Why did you write your own YAML-parser and didn't use an existing one?”. No real reason, just that I like doing it that way, and actually enjoy writing stuff like that.
If I wanted to go into full justification mode, then I'd say that everything I've done that way has helped me become the programmer I am now. Writing my GUI taught me the basics of separation of concerns and writing code indirectly (signals/slots, delegates…), etc. So in hindsight at least, I can say I don't regret doing things that way.

Also, as for why using something like C# ranks even lower on my scale, there are actually a few technical reasons:

1) The main feature of my visual script is that you can write statements like “wait 0.5s” anywhere in code, and it will suspend execution of the entire current call (up until the C++ entry, of course) and resume exactly where it halted once that time has passed, including when called in loops, etc. This is coupled with the ability to write a “Parallel” block, which allows multiple resume-commands to run at once. While I think this could probably be expressed in C#/Mono, I'm not sure how exactly (something like a Unity coroutine, I assume), and I'm pretty sure it's easier for me to have an OpCode::Suspend and then simply store all my callframes myself (a sketch of that idea follows below).

2) I have a few features in my visual-code editor that require, or are made much easier by, full control over the VM: being able to set breakpoints that trigger when a command is reached, while halting the entire game minus the rendering until resumed; being able to inspect the values of local variables at command pins; representing a callstack/stacktrace that maps to my visual scripts' layout.

Nothing that I'm sure cannot be done, but at that point I already feel like I would just be bending C#/Mono backwards to my needs instead of making my own, simpler solution (see, that's what I meant by my subjective dislike of external libraries 🙂).
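
To illustrate point 1: once the VM owns its call frames, "wait 0.5s" reduces to copying them aside and handing them to a scheduler. Everything in this sketch (CallFrame, ScriptScheduler, the wake-time scheme) is my guess at a shape, not the actual engine:

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// A suspended script is just its saved frames plus a wake-up time;
// OpCode::Suspend copies the live frame stack aside and returns control
// to the engine, which resumes the script once the timer elapses.
struct CallFrame
{
    uint32_t functionId;
    uint32_t instructionPointer;
    std::vector<uint64_t> locals;
};

struct SuspendedScript
{
    double wakeTime;               // engine time at which to resume
    std::vector<CallFrame> frames; // whole call stack up to the C++ entry
};

class ScriptScheduler
{
public:
    void suspend(std::vector<CallFrame> frames, double now, double delay)
    {
        m_sleeping.push_back({ now + delay, std::move(frames) });
    }

    // Called once per engine tick: resume anything whose timer elapsed.
    template<typename ResumeFn>
    void tick(double now, ResumeFn resume)
    {
        for (std::size_t i = 0; i < m_sleeping.size(); )
        {
            if (m_sleeping[i].wakeTime <= now)
            {
                resume(std::move(m_sleeping[i].frames));
                m_sleeping.erase(m_sleeping.begin() + i);
            }
            else
                ++i;
        }
    }

private:
    std::vector<SuspendedScript> m_sleeping;
};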

If you're not going to use v8 or some kind of JITted WASM, then you are just looking for trouble. Don't be afraid of using Python either. Generally, don't use a scripting language that isn't compiled ahead-of-time (again, except JS or Python).

There is no magic formula here for AOT code where you just get everything handed to you, though. Lua makes bindings sort of easy through the various C++ wrappers; however, Lua is notoriously slow. Any AOT bytecode emulator, such as wasm-whatever or someone's awful x86 emulator (please don't use that), will have a system call interface that is going to be relatively hard to use.

You can look at my solutions for inspiration on how to solve these things, if you already have an emulator in mind:

You can use a lambda fold to place arguments into registers: https://github.com/fwsGonzo/libriscv/blob/master/lib/libriscv/machine_vmcall.hpp#L10

That will give you a sane call interface that you are used to. Simply machine.vmcall("myfunction", 1, 2, 3, 4);

On the syscall handling side, we do the same thing:

By converting the given types from registers based on the argument's index, we can automatically retrieve data: https://github.com/fwsGonzo/libriscv/blob/master/lib/libriscv/machine_inline.hpp#L162

So, you would do:

long mysyscallhandler(auto& machine) {
    auto [intval, string] = machine.sysargs<int, std::string>();
    return 55;
}

As far as emulators go, you will want to use either RISC-V or WASM. Both have very sane instruction formats and will run fast enough. It is crucial that the decoding step is as fast as possible; it's the slowest part of emulation. The only thing that matters in such environments is the quality of your system call interface. However, I have no idea how you are going to automatically generate it. If you have LLVM installed, you can generate it from the AST, similar to how you would, for example, transpile Python to C++.
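
To make "decoding is the slowest part" concrete: the hot loop of a simple bytecode interpreter looks roughly like this (my own minimal sketch, not libriscv's decoder). Every instruction pays for the fetch, the switch dispatch and a likely branch misprediction before any real work happens, which is why compact, uniform instruction encodings decode so much faster:

#include <cstddef>
#include <cstdint>
#include <vector>

enum class Op : uint8_t { Push, Add, Halt };

struct Instr
{
    Op      op;
    int64_t imm; // only used by Push in this sketch
};

// Fetch/decode/dispatch loop for a tiny stack machine.
int64_t run(const std::vector<Instr>& code)
{
    std::vector<int64_t> stack;
    for (std::size_t pc = 0; pc < code.size(); ++pc)
    {
        const Instr& in = code[pc];
        switch (in.op)
        {
        case Op::Push:
            stack.push_back(in.imm);
            break;
        case Op::Add:
        {
            const int64_t rhs = stack.back();
            stack.pop_back();
            stack.back() += rhs;
            break;
        }
        case Op::Halt:
            return stack.back();
        }
    }
    return 0;
}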

If you want to go all out, you have to roll your own runtime environment inside the VMs. The simplest way to do that is to build using newlib, which makes all its C functions weak. This will let you override memcpy, strcmp and so on, so that you can accelerate them using system calls. System calls in emulated machines have the performance characteristics of a simple function call, so use them liberally. You should also accelerate heap functionality by creating a custom heap outside the VM to manage the memory inside the VM.
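
As a sketch of the guest side of that newlib trick: because newlib's memcpy is a weak symbol, a guest build can replace it with a single ecall that the host emulator services natively. The syscall number here is hypothetical and the host must install a matching handler; this is my illustration, not libriscv's actual runtime:

#include <cstddef>

static constexpr long SYSCALL_MEMCPY = 500; // hypothetical number

// Overrides newlib's weak memcpy inside the RISC-V guest: arguments go
// into a0..a2, the syscall number into a7, then 'ecall' traps to the host.
extern "C" void* memcpy(void* dst, const void* src, std::size_t n)
{
    register void*       a0 asm("a0") = dst;
    register const void* a1 asm("a1") = src;
    register std::size_t a2 asm("a2") = n;
    register long        a7 asm("a7") = SYSCALL_MEMCPY;
    asm volatile("ecall"
                 : "+r"(a0)
                 : "r"(a1), "r"(a2), "r"(a7)
                 : "memory");
    return a0;
}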

Good luck 🙂

Kaptein said:
If you're not going to use v8 or some kind of JITted WASM, then you are just looking for trouble. Don't be afraid of using Python either. Generally, don't use a scripting language that isn't compiled ahead-of-time (again, except JS or Python).

Well, to reiterate, I'm not going to use any external scripting language or emulator whatsoever, and I think I have given enough reasons in this thread beyond just doing it because I want to (so if you are really baffled by this decision, you can read my posts in detail). I'm aware of, and don't care about, the potential downfalls of that approach. It has pretty much the same downsides that writing all my other low-level implementations had, which is OK by me.

Kaptein said:
You can look at my solutions for inspiration on how to solve these things, if you already have an emulator in mind:

Yeah, thanks for the links, definitely interesting to see how other people solve those things; even though for this specific case I kind of already have a setup in place. I only need to convert it from dynamically fetching data all over the place to using the new backend's stack (for simplicity's sake I'm starting with only a stack and no registers unless I encounter an issue that forces me to change that approach; I might later decide to further optimize things by introducing explicit registers).

Kaptein said:
As far as emulators go, you will want to use either RISC-V or WASM. Both have very sane instruction formats and will run fast enough. It is crucial that the decoding step is as fast as possible; it's the slowest part of emulation.

My only objective regarding speed right now is beating my old shitty backend, and there is no doubt I'm going to do that, even without diving heavily into optimization. The old backend had a huge overhead for literally everything. There was nothing like an intrinsic or system functions that are handled more optimally; everything from adding two ints to inverting a boolean had the overhead of being treated like a normal function call. Additionally, fetching arguments and handling returns involved lots of conditions, indirections and jumps through memory, plus a lot of dynamic allocations. So there is no doubt in my mind so far that even a rough draft of my own specific bytecode-language/emulator will yield drastically better performance (and even though I'm only about 15% done with the features needed before I can test that, I'm already seeing how drastically simplified things are all over the place).

Kaptein said:
If you want to go all out, you have to roll your own runtime environment inside the VMs. The simplest way to do that is to build using newlib, which makes all its C functions weak. This will let you override memcpy, strcmp and so on, so that you can accelerate them using system calls. System calls in emulated machines have the performance characteristics of a simple function call, so use them liberally. You should also accelerate heap functionality by creating a custom heap outside the VM to manage the memory inside the VM.

One thing to keep in mind here is that I'm using more of a DSL with my visual-scripting frontend, not a general-purpose scripting language like Lua. There is only a limited set of types and a limited way they can be combined. There is no way, and no need, for scripts to handle heap allocations at all (arrays and strings are just handled behind the scenes, as they are in the rest of the engine). Objects can only be instantiated engine-side, and the only thing the script backend has to do is ref-count selected objects to prevent crashes. This pretty much simplifies everything I need to do.
This also applies to the performance debate: I don't need close-to-native levels of speed. I don't need arithmetic and loops to be vectorized. The visual language is not intended to implement things like A* or other computationally intense algorithms; that would go into C++ anyway. I need a backend which doesn't completely suck ass, especially not with an almost geometric disparity between debug and non-debug builds (which, for example, I could solve quite easily by simply always compiling the VM cpp as release; I couldn't do that with my old backend, as the computationally complex parts were mostly part of the template instantiations created wherever a script function was registered).

Based on what you have said so far, I would recommend ANTLR, which is sadly written in Java. However, it's a powerful grammar tool that you can use to transpile your grammar into a real programming language, such as C++.

https://github.com/antlr/antlr4
https://github.com/antlr/grammars-v4

With that you can build your simple language. No need for a virtual machine.

This topic is closed to new replies.
