taby said:
bits &= -1 << shift;
Probably it treats the -1 as a 32-bit integer.
Try bits &= -1ull << shift;
Or bits &= uint64_t(0xFFFFFFFFFFFFFFFFull) << shift;
This sucks. I'm never sure and it often causes me bugs. More bits, more trouble.
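A minimal sketch of the width pitfall described above, assuming the mask is meant to be 64 bits wide; the shift count of 40 and the pi bit pattern are just illustrative values:

#include <cstdint>
#include <cstdio>

int main()
{
    uint64_t bits = 0x400921FB54442D18ull; // bit pattern of pi as a double
    int shift = 40;                        // a mantissa-sized shift count, > 31

    // -1 is a plain 32-bit int, so (-1 << shift) is evaluated in 32 bits.
    // A shift count of 40 is >= the type's width, which is undefined
    // behaviour; on x86 the count is typically taken mod 32, so the mask
    // comes out as garbage.
    //uint64_t bad = bits & (-1 << shift);

    // With an unsigned 64-bit operand the shift is well-defined for any
    // count below 64, and the low 40 bits are cleared as intended.
    uint64_t good = bits & (-1ull << shift);

    printf("%016llx\n", (unsigned long long)good); // 4009210000000000
    return 0;
}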
Sorry, yes, I was using the wrong value for the shifting. I’ll try it tonight when I get home. 🙂
thanks again joej!
Yes the problem was me. I got it working. Now to test if it works the way we want it to!
#include <iostream>
#include <iomanip>
#include <cmath>   // atan
#include <cstdint> // int64_t, uint64_t

using namespace std;

int main(void)
{
    cout << setprecision(30) << endl;

    double pi = 4.0 * atan(1.0);

    const int64_t mantissa_size = 52;
    uint64_t max = static_cast<uint64_t>(-1); // 2^64 - 1, i.e. all bits set

    for (int64_t shift = 0; shift < mantissa_size; shift++)
    {
        uint64_t bits = reinterpret_cast<uint64_t &>(pi); // type-pun the double's bits
        bits = bits & (max << shift);                     // zero out the low 'shift' bits
        double reduced = reinterpret_cast<double &>(bits);
        cout << shift << " " << reduced << endl;
    }

    return 0;
}
I tried it out. It still does not snap to the closest float. :(
In fact, if I shift before I cast back to float, it doesn't work. So, shifting can't be the solution. I really appreciate your hard work joej! Sorry man.
Interesting, because I'm quite certain that's what a conversion from double to float does: clipping the less significant bits, but of course from both the mantissa and the exponent.
Maybe there is rounding before the clip as well. Or your values are close to a change in exponent. Something like that is probably missing.
You can search for code examples that convert float to half (fp16). There is no CPU instruction for that, so examples should be plentiful.
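A hedged sketch of that difference, comparing plain truncation of the 29 excess mantissa bits against the actual double-to-float conversion (memcpy for the type punning is my choice, not from this thread):

#include <cstdint>
#include <cstring>
#include <cstdio>

int main()
{
    double d = 0.1; // not exactly representable, so the low mantissa bits are nonzero

    // Plain truncation: clear the 29 mantissa bits a float cannot hold.
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    bits &= ~((1ull << 29) - 1);
    double truncated;
    std::memcpy(&truncated, &bits, sizeof truncated);

    // The real conversion rounds to the nearest float instead of truncating.
    double converted = static_cast<double>(static_cast<float>(d));

    printf("truncated: %.20f\n", truncated); // 0.09999999403953552246...
    printf("converted: %.20f\n", converted); // 0.10000000149011611938...
    return 0;
}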
thanks again for the ideas. Yes I never thought to check fp16 conversion. You’re a saviour, man!
I found this code, which might be helpful:
https://gamedev.stackexchange.com/a/17329/149713
#define F16_EXPONENT_BITS 0x1F
#define F16_EXPONENT_SHIFT 10
#define F16_EXPONENT_BIAS 15
#define F16_MANTISSA_BITS 0x3ff
#define F16_MANTISSA_SHIFT (23 - F16_EXPONENT_SHIFT)
#define F16_MAX_EXPONENT (F16_EXPONENT_BITS << F16_EXPONENT_SHIFT)

GLushort F32toF16(GLfloat val)
{
    GLuint f32 = (*(GLuint *) &val);
    GLushort f16 = 0;

    /* Decode IEEE 754 little-endian 32-bit floating-point value */
    int sign = (f32 >> 16) & 0x8000;

    /* Map exponent to the range [-127,128] */
    int exponent = ((f32 >> 23) & 0xff) - 127;
    int mantissa = f32 & 0x007fffff;

    if (exponent == 128)
    { /* Infinity or NaN */
        f16 = sign | F16_MAX_EXPONENT;
        if (mantissa) f16 |= (mantissa & F16_MANTISSA_BITS);
    }
    else if (exponent > 15)
    { /* Overflow - flush to Infinity */
        f16 = sign | F16_MAX_EXPONENT;
    }
    else if (exponent > -15)
    { /* Representable value; note the mantissa is truncated, not rounded */
        exponent += F16_EXPONENT_BIAS;
        mantissa >>= F16_MANTISSA_SHIFT;
        f16 = sign | exponent << F16_EXPONENT_SHIFT | mantissa;
    }
    else
    { /* Underflow - flush to (signed) zero */
        f16 = sign;
    }

    return f16;
}
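For testing this outside an OpenGL build, a hedged driver sketch; the typedefs stand in for the GL types, and the expected outputs are the standard fp16 encodings of those values:

#include <cstdint>
#include <cstdio>

typedef uint16_t GLushort;
typedef uint32_t GLuint;
typedef float GLfloat;

// F32toF16 as given above...

int main()
{
    printf("%04x\n", F32toF16(1.0f)); // 3c00, fp16 for 1.0
    printf("%04x\n", F32toF16(0.5f)); // 3800, fp16 for 0.5
    printf("%04x\n", F32toF16(0.0f)); // 0000
    return 0;
}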
Edit:
This works great! frexp and copysign for the win!
#include <cmath> // frexp, pow, copysign

double truncate_normalized_double(double d)
{
    // Clamp to the normalized range [0, 1].
    if (d <= 0.0)
        return 0.0;
    else if (d >= 1.0)
        return 1.0;

    int exponent = 0;
    double result = frexp(d, &exponent); // d = result * 2^exponent, result in [0.5, 1)

    // Recombining the fraction and exponent reproduces d unchanged;
    // nothing is truncated between the frexp and the multiply.
    const double d_final = result * pow(2.0, static_cast<double>(exponent));

    return copysign(d_final, d); // copysign (not copysignf) for doubles; take the sign from d itself
}
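For what it's worth, a hedged sketch of how frexp and ldexp could actually chop low mantissa bits; truncate_mantissa and keep_bits are my names, not from this thread:

#include <cmath>
#include <cstdio>

// Keep only the top keep_bits significant bits of d, truncating the rest.
double truncate_mantissa(double d, int keep_bits)
{
    int exponent = 0;
    double frac = frexp(d, &exponent);           // d = frac * 2^exponent, |frac| in [0.5, 1)
    double scaled = ldexp(frac, keep_bits);      // move keep_bits of mantissa left of the point
    double chopped = trunc(scaled);              // drop everything below the point
    return ldexp(chopped, exponent - keep_bits); // scale back
}

int main()
{
    printf("%.20f\n", truncate_mantissa(0.1, 24)); // 0.1 with only 24 significant bits kept
    return 0;
}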
It’s still not what I need. To make things simple, the range is from 0 through 1, so the exponent is always zero. I’ll be working on it all night lol
Sorry, it doesn't quite work. Surely I'm missing something obvious!?
#include <iostream>
#include <iomanip>
#include <string>
#include <bitset>
#include <cstdint> // uint64_t

using namespace std;

// Build the 64-bit string of d with the low 32 bits forced to '0'.
void get_truncated_bit_string(double d, string &s)
{
    s = "";

    for (int i = 63; i >= 0; i--)
    {
        if (i <= 31)
            s += '0';
        else
            s += to_string((reinterpret_cast<uint64_t &>(d) >> i) & 1);
    }
}

// Build the full 64-bit string of d.
void get_double_bit_string(double d, string &s)
{
    s = "";

    for (int i = 63; i >= 0; i--)
        s += to_string((reinterpret_cast<uint64_t &>(d) >> i) & 1);
}

double truncate_normalized_double(double d)
{
    //return static_cast<double>(static_cast<float>(d));

    // Round-trip d's untruncated bits through a string and a bitset.
    string sd = "";
    get_double_bit_string(d, sd);
    cout << sd << endl;

    std::bitset<64> Bitset64(sd);
    uint64_t value = Bitset64.to_ullong();
    double dv = reinterpret_cast<double &>(value);

    // Only the printed string is truncated here -- dv itself still
    // carries all 64 of d's original bits.
    string sdv = "";
    get_truncated_bit_string(dv, sdv);
    cout << sdv << endl;

    // For comparison: the bits after an actual double -> float -> double trip.
    double df = static_cast<double>(static_cast<float>(d));
    string sdf = "";
    get_double_bit_string(df, sdf);
    cout << sdf << endl;

    return dv;
}

int main(void)
{
    cout << setprecision(20) << endl;

    for (double d = 0.0; d <= 1.0; d += 0.1)
        cout << truncate_normalized_double(d) << endl << endl;

    return 0;
}
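For comparison, a minimal sketch that clears the low 32 bits with a mask directly, instead of round-tripping through a string and a bitset; truncate_low_32 is my name, and memcpy is my choice for the type punning:

#include <cstdint>
#include <cstring>
#include <cstdio>

// Zero the 32 least significant bits of the double's representation.
double truncate_low_32(double d)
{
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);
    bits &= 0xFFFFFFFF00000000ull;
    double out;
    std::memcpy(&out, &bits, sizeof out);
    return out;
}

int main()
{
    for (double d = 0.0; d <= 1.0; d += 0.1)
        printf("%.20f -> %.20f\n", d, truncate_low_32(d));
    return 0;
}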