*"...I was expecting more of a fight to access the float hardware." - moi*

How young and naive I was back then! ;-) Sure, you can trivially extract the parts that make up a double (sign, exponent, & mantissa) but they are in binary. Which is fine if you want to display binary or hex (oh how I wish the "natural" base we had "picked" to do our arithmetic in were 4), but for decimal you have to convert a base 2 float to a base 10 float, which is fraught with precision issues. So fraught, apparently, that many programming language math packages got them subtly wrong. It seems that C++ now has the double string conversions right, but the crux of the issue for this project is output formatting, which I don't believe is sufficient in C++ to supply an HP calculator type display. In particular, I don't see a direct way to do engineering (mantissa % 10^3) type formatting, or specific formatting of hex floats in terms of precision or "decimal" places .

A double has 64 bits: 1 sign bit, 11 exponent bits, and 52 mantissa bits. For values that aren't subnormal, the mantissa is prepended with a 53rd MSb of '1'. Manipulating the 52 bit mantissa via doubles is problematic because the precision that the double can store either matches the mantissa (for subnorms) or is one bit less (norms). There are some "guard bits" in the float hardware, and these increase the precision of the result via rounding when shoehorning it back into the double storage space of 64 bits, but there is no way to insert anything into these guard bits as input.

Floats do a lot of nice stuff automatically, but float containers are rather vague and limited in C++. There is a "long double" but there is no guarantee that it has more precision than a double, which is kinda nuts. (I believe the double itself can fall back to just a float in some cases, which is really nuts, though apparently knowable.) The long double "usually" uses the 80 bit float hardware in an x86 processor, but all bets are off with portability. This is one major flaw in C++ IMO. I understand that it's based on C legacy, which did "best effort" with ints, longs, etc. regarding the hardware, but they should have totally nailed down the widths via typing (which they did after the fact, at least for the signed and unsigned integers: int32_t, uint64_t, etc.) so that precision and modulo behavior would track among the various underlying hardware, and then provided library implementations of things not directly supported by the target hardware. I mean, how hard / not obvious is this to know that it's the right thing to do? Math has to be 100% portable for algorithms not to break. Though, I guess when your main memory is 64k, it's hard to see the big picture or the needs of the future.

I could have used C++ functions that convert doubles to strings, extract the various fields into sub strings, then convert these back into separate numbers, then play with them, then convert them back into display strings, but that seems really Rube Goldberg-ian. So I bit the bullet and wrote my own double fields extractor, and it seems to be working OK. I used one long double, but if it falls back to a double the precision shouldn't be affected too much. It's not 100% accurate, but the results seem to be good to several least significant bits. Given the 53 bit mantissa of a double, there are log10(2^53) = 15.9546, or ~16 significant decimal digits in a double. The signed exponent is returned as a int32_t, the signed mantissa as a long double. A separate function looks for NaNs and infinities.

**[EDIT]** I can see now why many calculator SW writers often end up implementing their own BCD math libraries.

Here's the gist of the conversion:

Code:

d_fields(d_i, sf, ef, ff); // get d fields
int32_t exp_2 = ef - EF_BIAS; // remove exp_2 bias
double mant_2 = d_i; // the whole input double, actually
int32_t exp_10 = ceil(exp_2 * log10(2)); // initial exp_10
mant_2 *= pow(10.0, -exp_10); // base 2 => 10 conversion
d_fields(mant_2, sf, ef, ff); // get mant fields
mant_2 = (!ef) ? ff : ff + FF_MSB; // (ef == 0) : subnorm
exp_2 = ef - EF_BIAS;
long double mant_10 = (long double)mant_2 * powl(2, exp_2 - 52);
// adjust if not base 10 exponential form (mostly for subnorms)
while ( mant_10 = 10 ) { mant_10 /= 10; exp_10++; }
mant_o = powl(-1, sf) * mant_10;
exp_o = exp_10;