Double to float C++

Started by May 13, 2024 04:27 PM
181 comments, last by JoeJ 6 months, 1 week ago

taby said:
Any thoughts on how to achieve snapping-to without casting?

double plank = 0.0000000000000000000000000001;
double quantizedValue = round(value / plank) * plank;
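
A minimal sketch of that snapping idea as a function, assuming a hypothetical helper named quantize (not from the thread). One caveat worth keeping in mind: once value / step grows past 2^53, every double is already an integer, so round() changes nothing and the result is just value again, up to rounding error. That is the case for a step of 1e-28 and values near 1. A power-of-two step such as 2^-23 keeps the divide and multiply exact.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (name is an assumption): snap `value` to the
// nearest multiple of `step`, using the same formula as the snippet above.
double quantize(const double value, const double step)
{
    assert(step > 0.0);                       // a zero or negative step makes no sense here
    return std::round(value / step) * step;   // nearest grid point
}
```

For example, quantize(0.3, 0.25) snaps to 0.25, while quantize(1.0, 1e-28) cannot snap to anything coarser than double precision itself.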

But no. Nature does not do this. It can't be. : )

I tried your code, as well as some similar code. Neither reproduces the results.

https://stackoverflow.com/a/70221868/3634553?stw=2

Hmm…

Thank you joej!!!!

This works, but it's still implicitly casting:

double truncate_normalized_double(const double d)
{
	float a = d - numeric_limits<float>::epsilon();
	float b = d + numeric_limits<float>::epsilon();

	float r1 = abs(d - a);
	float r2 = abs(d - b);

	if (r1 < r2)
		return static_cast<double>(a);
	else
		return static_cast<double>(b);
}

This also works. It still casts from double to float, which isn't strictly necessary: I could convert manually using stringstreams, but that is just slow.

double truncate_normalized_double(const double d)
{
	if (d <= 0.0)
		return 0.0f;
	else if (d >= 1.0)
		return 1.0f;

	float df = d;

	float tempf = nexttowardf(1.0f, df);

	while (tempf > df)
		tempf = nexttowardf(tempf, df);

	return static_cast<double>(tempf);
}
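
One observation on the loop above: it walks tempf down until it equals df, so for inputs in (0, 1) it returns the same value as the plain cast, which rounds to nearest. If the goal is truncation in the sense of rounding toward zero, a minimal sketch could look like the following. The function name is an assumption, and it assumes the default round-to-nearest floating-point mode.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not from the thread): reduce a double in [0, 1]
// to float precision, rounding toward zero instead of to nearest.
double truncate_toward_zero_to_float(const double d)
{
    assert(d >= 0.0 && d <= 1.0);        // same normalized range as the post above

    float f = static_cast<float>(d);     // rounds to nearest under the default mode
    if (static_cast<double>(f) > d)      // the nearest float overshot d
        f = std::nextafter(f, 0.0f);     // step back one float toward zero
    return static_cast<double>(f);
}
```

This still uses a cast, but it needs at most one correction step instead of a loop over every float between 1.0f and df.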

As I'm sure you can imagine, this problem vexes me to no end. LOL

Thanks to everyone who had input here!

I need a number library that lets you specify the bits. ttmath uses words, like 64-bit words on x64 architecture. I need to specify the exact number of bits.

taby said:
I need a number library that lets you specify the bits.

Do it yourself:

float value = float(PI);
for (int shift = 0; shift < 23; shift++)
{
	uint32_t bits = (uint32_t&) value;        // reinterpret the float's bit pattern
	uint32_t mantissa = bits & 0x007FFFFFu;   // low 23 bits: mantissa
	uint32_t signExp = bits & ~0x007FFFFFu;   // high 9 bits: sign and exponent
	mantissa &= ~0u << shift;                 // unsigned mask; zeroes the lowest 'shift' bits
	bits = (mantissa | signExp);
	float reduced = (float&) bits;
	ImGui::Text("PI reduced by %i bits: %f mantissa: %x", shift, reduced, mantissa);
}

The idea is to extract the mantissa and zero out the n rightmost bits.

You could do this for any double after reading up on how many bits it uses for the mantissa, using uint64 ofc.

My code was overcomplicated. No need to mask out the mantissa. Same result:

for (int shift = 0; shift < 23; shift++)
{
	uint32_t bits = (uint32_t&) value;
	bits &= ~0u << shift;   // unsigned mask; zeroes the lowest 'shift' bits
	float reduced = (float&) bits;
	ImGui::Text("PI reduced by %i bits: %f", shift, reduced);
}

I tried this, but it doesn't work. Any ideas what I'm doing wrong?

Edit: I updated the code… still doesn't work though. It only starts working like halfway through.

#include <cmath>
#include <cstdint>
#include <iostream>
#include <iomanip>
using namespace std;

int main(void)
{
	cout << setprecision(20) << endl;

	const double pi = 4.0 * atan(1.0);

	uint64_t mantissa_size = 52;

	for (uint64_t shift = 0; shift < mantissa_size; shift++)
	{
		uint64_t bits = (uint64_t&)pi;
		bits = bits & (4294967295 << shift);
		double reduced = (double&)bits;
		cout << shift << " " << reduced << endl;
	}

	return 0;
}
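
A likely cause, offered as a sketch rather than a confirmed fix: 4294967295 is 0xFFFFFFFF, which has only 32 one-bits. For any shift below 32, that mask also clears the sign, the exponent, and the upper mantissa bits of the 64-bit pattern, which would match the output only looking right about halfway through. A mask with all 64 bits set avoids this. The helper name below is an assumption, and std::memcpy stands in for the reference cast to stay clear of aliasing problems.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: clear the lowest `shift` mantissa bits of a double.
double clear_low_mantissa_bits(const double value, const unsigned shift)
{
    assert(shift <= 52);                        // a double stores 52 mantissa bits

    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);    // copy out the bit pattern

    // All 64 bits set, then shifted: the lowest `shift` bits become zero
    // and the sign and exponent bits stay untouched.
    const std::uint64_t mask = ~std::uint64_t{0} << shift;
    bits &= mask;

    double reduced;
    std::memcpy(&reduced, &bits, sizeof reduced);
    return reduced;
}
```

Calling this in the loop in place of the three bit-twiddling lines should give sensible values from shift 0 onward. With shift = 29 the result keeps 23 mantissa bits, the same precision as a float, truncated toward zero.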

This topic is closed to new replies.