
Double to float C++

Shawn Halayka · 2024-06-09T05:43:58

What arcane magic occurs when a double is cast as a float?


Started by taby May 13, 2024 04:27 PM

181 comments, last by JoeJ 6 months, 1 week ago

JoeJ

4,359

May 20, 2024 08:51 PM

taby said:
Any thoughts on how to achieve snapping-to without casting?

double plank = 0.0000000000000000000000000001;
double quantizedValue = round(value / plank) * plank;

But no. Nature does not do this. It can't be. : )
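A note on the snippet above: with a step of 1e-28 and a value near 1, the quotient is about 1e28, far above 2^53, so it is already an integer and round() leaves it unchanged. The result is then just the input again (up to rounding error), which may be why it does not match a float cast. A minimal sketch of snapping to float's 24-bit significand using only double arithmetic follows. The name snap_to_float_grid is hypothetical, and the sketch assumes the value lies within float's normal range and that the default round-to-nearest mode is active.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: round d to the nearest value that has a 24-bit
// significand (the precision of an IEEE 754 float) without casting to float.
// Assumes |d| is within float's normal range and round-to-nearest-even mode.
double snap_to_float_grid(double d)
{
	assert(std::isfinite(d));

	int exponent = 0;
	// d == m * 2^exponent with 0.5 <= |m| < 1 (m is 0 when d is 0).
	const double m = std::frexp(d, &exponent);

	// Scale so the 24 significant bits sit in the integer part (exact).
	const double scaled = std::ldexp(m, 24);

	// nearbyint honours the current rounding mode (ties-to-even by default).
	return std::ldexp(std::nearbyint(scaled), exponent - 24);
}
```

For inputs in that range, the result should compare equal to static_cast<double>(static_cast<float>(d)); overflow and float denormals are not handled here.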

taby

Author

1,539

May 20, 2024 11:01 PM

I tried your code, as well as some similar code. Neither reproduces the results.

https://stackoverflow.com/a/70221868/3634553?stw=2

Hmm…

Thank you joej!!!!

taby

Author

1,539

May 20, 2024 11:56 PM

This works, but it's still implicitly casting:

double truncate_normalized_double(const double d)
{
	float a = d - numeric_limits<float>::epsilon();
	float b = d + numeric_limits<float>::epsilon();

	float r1 = abs(d - a);
	float r2 = abs(d - b);

	if (r1 < r2)
		return static_cast<double>(a);
	else
		return static_cast<double>(b);
}

taby

Author

1,539

May 21, 2024 12:56 AM

This also works. I do the cast from double to float, which is not absolutely necessary – I could convert them using stringstreams manually, which is just slow.


double truncate_normalized_double(const double d)
{
	if (d <= 0.0)
		return 0.0f;
	else if (d >= 1.0)
		return 1.0f;

	float df = d;

	float tempf = nexttowardf(1.0f, df);

	while (tempf > df)
		tempf = nexttowardf(tempf, df);

	return static_cast<double>(tempf);
}

taby

Author

1,539

May 21, 2024 01:16 AM

As I'm sure you can imagine, this problem vexes me to no end. LOL

Thanks to everyone who had input here!

taby

Author

1,539

May 21, 2024 01:25 AM

I need a number library that lets you specify the bits. ttmath uses words, like 64-bit words on x64 architecture. I need to specify the exact number of bits.

JoeJ

4,359

May 21, 2024 06:40 AM

taby said:
I need a number library that lets you specify the bits.

Do it yourself:

float value = float(PI);
for (int shift = 0; shift < 23; shift++)
{
	uint32_t bits = (uint32_t&) value;
	uint32_t mantissa = bits & 0x007FFFFF;
	uint32_t signExp = bits & ~0x007FFFFF;
	mantissa &= -1<<shift;
	bits = (mantissa | signExp);
	float reduced = (float&) bits;
	ImGui::Text("PI reduced by %i bits: %f mantissa: %x", shift, reduced, mantissa);
}

The idea is to extract the mantissa and zero out the n rightmost bits.

You could do this for any double after reading up on how many bits it uses for the mantissa, and using uint64 of course.
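A minimal sketch of the double version, assuming the usual IEEE 754 layout (1 sign bit, 11 exponent bits, 52 stored mantissa bits). The function name reduce_mantissa is hypothetical, and std::memcpy is swapped in for the reference cast only to sidestep aliasing concerns; the masking idea is the same.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: zero out the lowest `dropped_bits` mantissa bits of a
// double. Assumes IEEE 754 binary64, i.e. 52 stored mantissa bits.
double reduce_mantissa(double value, unsigned dropped_bits)
{
	assert(dropped_bits <= 52);

	// Copy the bit pattern into a 64-bit integer.
	std::uint64_t bits = 0;
	std::memcpy(&bits, &value, sizeof bits);

	// All 64 bits set, shifted left: sign and exponent stay intact.
	bits &= ~std::uint64_t{0} << dropped_bits;

	// Copy the modified pattern back into a double.
	double reduced = 0.0;
	std::memcpy(&reduced, &bits, sizeof reduced);
	return reduced;
}
```

Dropping 29 bits leaves 23 mantissa bits, the same count a float stores, but note that this truncates toward zero rather than rounding to nearest the way a cast does.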

JoeJ

4,359

May 21, 2024 06:44 AM

My code was overcomplicated. No need to mask out the mantissa. Same result:

for (int shift = 0; shift < 23; shift++)
{
	uint32_t bits = (uint32_t&) value;
	bits &= -1<<shift;
	float reduced = (float&) bits;
	ImGui::Text("PI reduced by %i bits: %f", shift, reduced);
}

taby

Author

1,539

May 21, 2024 02:36 PM

I tried this, but it doesn't work. Any ideas what I'm doing wrong?

Edit: I updated the code… still doesn't work though. It only starts working like halfway through.

#include <iostream>
#include <iomanip>
#include <cmath>
#include <cstdint>
using namespace std;

int main(void)
{
	cout << setprecision(20) << endl;

	const double pi = 4.0 * atan(1.0);

	uint64_t mantissa_size = 52;

	for (uint64_t shift = 0; shift < mantissa_size; shift++)
	{
		uint64_t bits = (uint64_t&)pi;
		bits = bits & (4294967295 << shift);
		double reduced = (double&)bits;
		cout << shift << " " << reduced << endl;
	}

	return 0;
}
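One likely cause, offered as a sketch rather than a definitive diagnosis: the literal 4294967295 is 0x00000000FFFFFFFF once widened to 64 bits, so for small shifts the mask also clears the sign and exponent bits of the double. Only once the shift reaches 32 does the mask cover all of bits 52 to 63, which fits the "starts working halfway through" symptom (pi has a zero sign bit, so its output can already look right one step earlier). The helper names below are hypothetical and only compare the two masks.

```cpp
#include <cassert>
#include <cstdint>

// Mask as written in the post: only the low 32 bits are set before shifting.
inline std::uint64_t posted_mask(unsigned shift)
{
	assert(shift < 64);
	return std::uint64_t{4294967295u} << shift;
}

// Mask with all 64 bits set before shifting.
inline std::uint64_t full_mask(unsigned shift)
{
	assert(shift < 64);
	return ~std::uint64_t{0} << shift;
}

// In an IEEE 754 double, bit 63 is the sign and bits 62..52 are the exponent.
inline bool keeps_sign_and_exponent(std::uint64_t mask)
{
	const std::uint64_t sign_exp = 0xFFF0000000000000ull;
	return (mask & sign_exp) == sign_exp;
}
```

Replacing the literal in the loop with something like ~uint64_t(0) << shift keeps the upper 12 bits untouched for every shift from 0 to 51.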
