
CALL-instruction to constant 8-byte address


Ok, back with another JIT-related question.

In my custom machine code, I need to be able to embed calls to native functions that are part of the compiled program. The easiest working way I found is this:

 void JitGenerator::WriteCallFunctionPointer(const void* pAddress)
 {
 	WriteStoreConstant(std::uintptr_t(pAddress), RegisterType::RAX); // mov rax, imm64
 	WriteLine({ 0xff, 0xd0 }); // call rax
 }
 
 void test(void) {}
 
 WriteCallFunctionPointer(&test);

However, this is using an “indirect” call, which is slower (ranging from negligible to extreme depending on the branch prediction, as far as I've read). So I'd like to change this to use a direct call instead; however, I'm not 100% sure how I'd go about changing it. From the documentation (https://www.felixcloutier.com/x86/call), it seems I'm looking for:

E8 cd => Call near, relative, displacement relative to next instruction. 32-bit displacement sign extended to 64-bits in 64-bit mode.

So basically I'd take the address I have and convert it to an offset relative to the address of the start of the next instruction, like you'd do to reference constant data.
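
In code, I imagine the emitter would look roughly like this (just a sketch; GetCurrentWriteAddress() is a made-up helper for the address at which the next byte will be emitted):

 void JitGenerator::WriteCallDirect(const void* pAddress)
 {
 	// rel32 is measured from the end of this 5-byte call instruction.
 	const auto nextInstruction = std::intptr_t(GetCurrentWriteAddress()) + 5;
 	const auto offset = std::intptr_t(pAddress) - nextInstruction;
 	AE_ASSERT(offset >= INT32_MIN && offset <= INT32_MAX);
 
 	const auto rel32 = uint32_t(offset); // truncate to 32 bits
 	WriteLine({ 0xE8, uint8_t(rel32), uint8_t(rel32 >> 8), uint8_t(rel32 >> 16), uint8_t(rel32 >> 24) }); // call rel32 (little-endian)
 }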
But later on in the documentation, it says:

Near Call — A call to a procedure in the current code segment (the segment currently pointed to by the CS register), sometimes referred to as an intra-segment call.

So that's where I'm a bit lost. What exactly does CS contain? I'm feeling a bit dumb, but I've not been able to find a definitive answer. Is a “code segment” the address range for a given piece of executable code, i.e. the compiled exe, as opposed to the page I allocate for my JIT-code? If so, is there any better alternative than the indirect call? I'm assuming that "9A cp", while not specified as “indirect”, will be equally slow since it still has to fetch the address from the "ptr16:32" operand, but I'm not sure.

If CS actually means something different and I can just take the difference between the address of my function and the next instruction in the JIT-code, is there any way to ensure that the memory is always in a region that is near the executable? On a 64-bit machine with only a 32-bit displacement, if both values are 2^32 apart from each other then I couldn't use a near call either way.

Those are all just my presumptions anyway. Anybody with a bit more knowledge about this topic able to answer the question?

I'm not sure exactly what happens in the machine code myself. However I know for a fact that you can code a function pointer as a template parameter, as I've done this many times. This implies to me that it should use the fastest call possible. I don't know if that helps or not.
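
For example, something like this (just a quick sketch; the names are made up):

 void test() {}
 
 template <void (*Func)()>
 void Invoke()
 {
 	Func(); // Func is a compile-time constant, so the compiler can emit a direct call or inline it
 }
 
 Invoke<&test>();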


Gnollrunner said:
I'm not sure exactly what happens in the machine code myself. However I know for a fact that you can code a function pointer as a template parameter, as I've done this many times. This implies to me that it should use the fastest call possible. I don't know if that helps or not.

I'm assuming that for this case, the compiler will simply resolve the template parameter at compile time internally, and then obviously use the fastest method it has available. Though this still seems to imply to me that it'll only do that based on what the actual content of the function pointer is, and whether the “same CS” applies. I'll try to see if this somehow helps with deciphering what the compiler does, but I'm guessing it will just replace it with a normal call, as if you had called the function in that place, which probably doesn't solve my issue. But thanks so far anyways!

Near/Far calls are a throwback to 16-bit days.

I believe — but would want to double-check the authoritative Intel reference manuals — that the near call E8 in all modern, flat-memory processes only changes the ip/eip/rip register, so wherever your process was running, it will continue to run. If you were in the older segment:offset model, the CS register would be unchanged. In contrast, the far call does both, so you're pushing/popping two registers, and you're decoding a few additional bytes.

I doubt the performance difference would be significant here, especially on a modern OOO core. With branch prediction and an obviously unconditional jump, the added effort of popping CS is unlikely to stall the pipeline. You might decode an additional instruction or two, but that's saving perhaps a CPU cycle on decoding, so 0.2 ns or so? Without measuring and reading the reference manuals closely (a dangerous practice in CPU timings!), my guess is that if you're doing the jump, you're more likely to take a penalty from an instruction cache miss than anything else. This feels like an extreme micro-optimization, but if you know it's better in one scenario then go for it. My gut tells me the cache effects will dominate over the opcode choice. Either near or far, if you're jumping inside the instruction cache you'll likely get it amortized for free, and if you're outside the cache it will cost whatever the cache miss costs.

frob said:
I believe — but would want to double-check the authoritative Intel reference manuals — that the near call E8 in all modern, flat-memory processes only changes the ip/eip/rip register, so wherever your process was running, it will continue to run. If you were in the older segment:offset model, the CS register would be unchanged. In contrast, the far call does both, so you're pushing/popping two registers, and you're decoding a few additional bytes.

I'm not so much worried about the near/far difference, but more that I'm using the “indirect” call:

FF /2 => CALL r/m64 => Call near, absolute indirect, address given in r/m64.

Since there doesn't seem to be any direct call to an absolute 64-bit address. Though I'm just noticing that, with that encoding, I'm apparently already doing a near call, so converting it to a direct near call should be possible as long as the addresses are within range.
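
So I'd probably end up with something like this (a sketch; WriteCallDirect is the direct emitter sketched earlier, and IsInRel32Range() is a made-up range check):

 void JitGenerator::WriteCallFunctionPointer(const void* pAddress)
 {
 	if (IsInRel32Range(pAddress)) // target reachable via signed 32-bit displacement?
 	{
 		WriteCallDirect(pAddress); // direct near call, E8 rel32
 	}
 	else
 	{
 		WriteStoreConstant(std::uintptr_t(pAddress), RegisterType::RAX); // mov rax, imm64
 		WriteLine({ 0xff, 0xd0 }); // call rax, indirect fallback
 	}
 }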

frob said:
I doubt the performance difference would be significant here, especially on a modern OOO core. With branch prediction and an obviously unconditional jump, the added effort of popping CS is unlikely to stall the pipeline. You might decode an additional instruction or two, but that's saving perhaps a CPU cycle on decoding, so 0.2 ns or so? Without measuring and reading the reference manuals closely (a dangerous practice in CPU timings!), my guess is that if you're doing the jump, you're more likely to take a penalty from an instruction cache miss than anything else. This feels like an extreme micro-optimization, but if you know it's better in one scenario then go for it. My gut tells me the cache effects will dominate over the opcode choice. Either near or far, if you're jumping inside the instruction cache you'll likely get it amortized for free, and if you're outside the cache it will cost whatever the cache miss costs.

Yeah, it's probably a micro-optimization, but since I can implement it once in the JIT-compiler and it applies to all the function calls, it might be worth it. The whole JIT thing at this point is a bit of an over-optimization; my game does run fast enough on the user's machine even with the interpreter. I do have a few editor features that rely on a fixed-timestep turbo-mode for faster (automated) testing, which is why I'm even doing it right now in the first place.

You probably want to use E8 rather than FF; a near call within the one and only segment your program uses, with a directly specified address. Far calls into other segments are usually unneeded. Don't assume placing the address inline is cheaper than placing the address in a register.

Omae Wa Mou Shindeiru


LorenzoGatti said:
You probably want to use E8 rather than FF; a near call within the one and only segment your program uses, with a directly specified address. Far calls into other segments are usually unneeded. Don't assume placing the address inline is cheaper than placing the address in a register.

So does that mean that the custom pages I allocate for executing my JIT-code are also part of the “one and only” segment my program uses? That's the part I was not sure about: what actually defines that segment. If that's how it works then yes, I'll just be doing a near call.

Ok, so I tried it out, and it actually worked with the E8 near call and directly embedding the offset to the function. The main problem I had was making sure that the JIT-compiled code is actually in a range reachable via rel32. The solution I have seems a bit hacky, so maybe somebody knows a better way:

	// Find the highest native function address referenced by the JIT code and
	// back it off by half the int32_t range, so every function stays reachable
	// via a signed 32-bit displacement.
	const auto startAddress = [&]() -> void*
	{
		AE_ASSERT(!data.vFunctionPointers.IsEmpty());

		const auto highest = *sys::maxElement(data.vFunctionPointers, [](JitData::FunctionPointerData data)
		{
			return data.pFunction;
		});

		return (char*)highest.pFunction - sys::MAX_VALUE<int32_t> / 2;
	}();

	auto* pMemory = (char*)VirtualAlloc(startAddress, codeSize, MEM_COMMIT | MEM_RESERVE | MEM_TOP_DOWN, PAGE_READWRITE);

So I'm just finding the highest address of a native function called from the JIT-code, then offsetting it downwards by half of the maximum value an int32_t can hold. This seems to always give me a valid base address that is close enough to compute

rel32 = <NATIVE_FUNC> - <INSTRUCTION_ADDRESS>
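
A sanity check along these lines (just a sketch, reusing the structures from above) can verify that every native function really is reachable via rel32 from anywhere inside the allocated block:

 	for (const auto& entry : data.vFunctionPointers)
 	{
 		// Check both extremes: a call emitted at the start and at the end of the block.
 		const auto fromStart = (char*)entry.pFunction - pMemory;
 		const auto fromEnd = (char*)entry.pFunction - (pMemory + codeSize);
 		AE_ASSERT(fromStart >= INT32_MIN && fromStart <= INT32_MAX);
 		AE_ASSERT(fromEnd >= INT32_MIN && fromEnd <= INT32_MAX);
 	}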

I've also tried using

GetModuleHandle(NULL)

To get the base address of the executable, but this was still too far away to be able to store the difference between those addresses in an int32_t. Any other ideas how I could get a base address in the desired range?

(But as an aside, optimization or not, having just one instruction instead of two for each call makes it much easier to debug and read, so that's already a plus.)

With address randomization? No, you can't guarantee they'll be in any particular place.

I think the only way you could guarantee it is if you're only within a single module and generating the code all at once.

For anything else that isn't in the same blob of generated code, all bets are off. High-entropy address randomization is enabled by default for executables; you can disable it to make it low-entropy and also disable dynamic base addresses, but that can trigger security systems. Otherwise each library's addresses can be anywhere in the 64-bit address space.
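
For reference, on MSVC those are controlled by linker options:

 /DYNAMICBASE:NO      disable base address randomization for the module
 /HIGHENTROPYVA:NO    restrict ASLR to the low-entropy (32-bit) variant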

frob said:
I think the only way you could guarantee it is if you're only within a single module and generating the code all at once. High-entropy address randomization is enabled by default for executables; you can disable it to make it low-entropy and also disable dynamic base addresses, but that can trigger security systems. Otherwise process addresses can be anywhere in the 64-bit address space.

I was aware of the addresses being different; I was rather interested in whether there is a way to query the base address at runtime, unlike my rather hackish approach of evaluating just one of the many native function addresses. The process addresses should at least be in a relatively condensed range, shouldn't they? Because otherwise an application couldn't use the E8 encoding if addresses could randomly be more than a 32-bit range apart.

