Let's stop the madness over here too.
From: 60 Miles North of San Diego, CA
Let's stop the madness over here too.
Floating Point Addition
Floating point addition is fairly nuanced. Because addition is essentially little endian, the decimals of the inputs need to be aligned before the addition can take place, which in the float world is accomplished by right shifting the significand of the input with the smaller exponent a distance equal to the difference of the exponents.
A good place to start is with normalizing the inputs before denorming the one with the smaller exponent. Zero is a denorm that can confound things here because it will test larger than a non-zero input with negative exponent, so we need to examine each input for zero significand and deal with it somehow. One way is to replace the exponent with a very negative value, so zero will always lose the battle of the exponents.
Next we need to compare the input exponents by subtracting one from the other. Because Hive shift distances are modulo, we need to somehow limit the shift range to the interval [-32:0]. One tricky way to do this is to zero out the significand if the shift distance is greater than -32. Shifting the value zero any distance will always return zero, and subsequently adding this zero to the other signifcant won't change it, which is what would happen if it were actually shifted -32 or beyond.
Then we do a signed addition on the significands with a check for over and underflow, followed by normalization of the result if necessary, and the usual bounds checks on the output exponent. A zero result should return both zero significand and zero exponent.
The output exponent is simply the largest of the two input exponents, given input and output normalization.
I'm reaching the point where a lot of this stuff is becoming boilerplate, which is good.
Found this fantastic paper this morning via Hacker News and am in the process of digesting it:
Read it and weep! Laundry lists of fail regarding every popular ISA (instruction set architecture) are laid bare in plain English. Literals being given too much opcode space in almost every ISA is an issue I've really grappled with (and IMO effectively dealt with via a tiered method) in Hive. Surprisingly even ARMv8 (the ISA everyone is currently hitching their wagons to) substantially sucks. The paper rightly tears the x86 ISA a new one, it's absolutely horrid.
Like some newer programming languages which pare things down by purposely omitting various features, I think it's super important to keep the processor as simple as humanly possible. ISA design is really hard work, but it's amazingly poorly done for the vast impact computing has on the world. And we really need open standards here.
[EDIT] I like the way they handle different width immediates, by breaking them up and reassembling the pieces in a way that doesn't require hardware muxing (though muxing here isn't really that onerous). I'd love to use that idea but can't. They also picked little endian, which tells me they aren't completely insane.
Logs are exponents - it's too bad the term exponential is taken.
Log base 2 is the exponent of 2 required to produce the input value. So it's a work backwards thing:
2^0 = 1 so log2(1) = 0
2^1 = 2 so log2(2) = 1
2^2 = 4 so log2(4) = 2
Inputs less than one produce negative results:
2^-1 = 1/2 so log2(1/2) = -1
2^-2 = 1/4 so log2(1/4) = -2
It seems log2(0) equals negative infinity (negative overflow).
If the input is a float with significand S and exponent E:
log2(S * 2^E) = log2(S) + log2(2^E) = log2(S) + E * log2(2) = log2(S) + E * 1 = log2(S) + E
log2(S * 2^E) = E + log2(S)
The output significand is basically the input exponent, plus a positive fractional offset caused by the input significand. The output exponent is simply the distance required to normalize the output significand. This works for any positive input.
As in the EXP2 case, observation of the LOG2 graph reveals the fact that we can scale a small portion to cover all input and output scenarios by compressing the input by appropriate powers of 2 (manipulation of the float exponent). First we subtract 1 from the input to move the curve down to the origin of both axes, and mulitply it by 2 to better fit the calcuation space. This also makes the polynomial approach fundamentally tractable. The polynomial coefficients are adjusted so that they also fit in the calculation space, which gives an polynomial output that is 1/2 the expected value.
What to do with negative inputs? I rashly decided to take the absolute value of the input. Zero input is the only scenario that can cause (negative) overflow, all other inputs should easily fit in the output space.
Here is an illustrative example of how to handle the input value 12.25 (where the significand width is 8 and the exponent width is 4):
START 01100010 0101
ABS & NORM 11000100 0100
CHVAR 10001000 0011
NORM (signed) 01100000 (SHL 5)
POLY 00010010 (SHL 5-7) (7 here instead of 8 to make up for poly output of 1/2 expected value)
EXP_o 0011 (8-5)
END 01110010 0011
So the result is (114 / 256) * (2^3) = 3.5625 (the actual precise result is 3.6147...)
The polynomial requires 11 terms for minimal error.
COS - Float (via polynomial)
Since cosine is cyclic, we have to figure out where in the particular cycle the input float value happens to be. This is accomplished by multiplying the input value by the inverse of the cycle (1/2*pi) which gives us a signed integer portion that represents the cycle count from zero, and a signed fractional portion which represents the location within the given cycle. We discard the integer portion because it gives us no intra cycle information, and present the fractional portion as input to the cosine polynomial as a signed integer. The polynomial result is also a signed integer, which is requires conversion to float.
Note that larger input values devote more information to the cycle count and less to the location within that cycle, so the significance of the result suffers. In the extreme, the result has no mathematical relation to the input. For instance, maximum input for the native cosine in Excel is (2^27)-1 and anything larger gives a #NUM! result, presumably because the underlying polynomial also employs something like 32 bit integers, and 32 - 27 = 5 remaiing bits of precision, which ain't much. In the larger sense this issue is due to the non-ideal mapping of the semi-logarithmic float domain to the linear domain of the integers and cosine. It's hard to know what to do with this scenario without signaling, but it's clear that declaring the result invalid after its precision sufficiently poops out is the right thing to do.
For implementation, I found it most straightforward to:
1. take the absolute value of the input and normalize it;
2. multiply by 2/pi in order to maximize the precision of the extended multiplication and subtract 2 from the exponent to compensate (*1/4);
3. unsigned shift the significand the distance of the exponent;
4. multiply this by the input sign in order to restore its signedness;
5. plug this into the COS2_POLY routine;
6. normalize the result as a float.
The shift in step 3 above should be limited so as not to exceed the hardware shift distances (+31,-32).
It isn't necessary to check for zero significand at the input as multiplying and shifting a zero will always yeild a zero result. However, after the normalization / conversion to float in step 6 above, the output exponent should be zeroed out if the output significand is zero (to squash the generation of "dirty zeros" here).
SIN - Float (via polynomial)
To implement sine float, the above routine simply calls SIN2_POLY in step 5 (which offsets the polynomial input by -0x4000,0000 and calls COS2_POLY).
I realize I haven't posted the spreadsheet I've been developing and testing these algorithms in:
Rather intentional as I never feel it's really complete enough, but it's getting there.
Hive Assembly Language (HAL)
The Hive simulator displays disassembled code as pseudo code, and I generally write code in my notebook in pseudo code form, so it seems the time is ripe to write an assembly language interface. First step is to make a *.mif to *.hal file converter (which provides fodder for the reverse conversion). Most of the functionality already exists in the simulator code, so it's easy peasy lemon squeezy (took me all of 4 hours or so, most of it spent making the output pretty). For instance, here is the unsigned integer div/mod routine in HAL converted from the MIF source:
0x560 (s0!=0) ? pc += 0x3 // DIV_MOD_U SUB START - check a
P0 := 0xff // max q
P1 := 0 // zero r
pc := P7 // return
s2 := lzc(s0) // x := pow[lzc[a]]
P2 := 1 << s2
s0 *= 0x3f // na := -a
s0 *= s2 // 1x Newton Raphson
P0 *u= s2
P2 += P0
s0 *= s2 // 2x
P0 *u= s2
P2 += P0
s0 *= s2 // 3x
P0 *u= s2
P2 += P0
s0 *= s2 // 4x
P0 *u= s2
P2 += P0
P0 *= s2 // 5x
P0 *u= s2
P2 += P0
P2 *u= s1 // q := b * x
s2 *= s0 // r := b - (q * a)
P1 -= P2
s1 -= s0 // 1x adjust q & r
(P1<0) ? pc += 0x6
P1 -= s0
P2 += 0x1
s1 -= s0 // 2x
(P1<0) ? pc += 0x2
P1 -= s0
P2 += 0x1
P0 := P2 // move q
0x582 pc := P7 // DIV_MOD_U SUB END
(It looks pretty, oh so pretty...)
Now for the harder part: implement a token parser for this language and have it kick out a MIF file. Though a lot of that type of code already exists in the simulator to parse MIF files and the simulator command line.
Two weeks later (!) I've got the HAL to MIF parser finished, which means I can now develop in Hive assembly language. This is nice because assembly is pretty much how I already write and document the code, and the manual (and therefore error-prone) translation steps can be eliminated. Along with the usual end-of-line comments, HAL also supports block comments, so the commenting can be more in-depth and descriptive, and whole chunks of code can be quickly swapped in and out for debug, or kept in reserve for future use. Currently the *.hal <=> *.mif conversion is done via command prompt, I'm not sure yet how much I'll eventually integrate it into the simulator.
The main hold-up was figuring out how to deal with the various representations of literals. A signed 8 bit value is an immediate field in the OP_BYT opcode, signed and unsigned 16 bit values are single in-line literals following the OP_LIT_S and OP_LIT_U opcodes respectively, and 32 bit values are two in-line literals following the OP_LIT opcode. I wanted to use the same HAL notation for all three (e.g.: s2 := 0x89) and let the assembler figure out the best fit based on the value, which turned out to be rather tricky. If the value fits a signed byte and there is no extra pop (OP_BYT is a single operand instruction) then OP_BYT is used. If not, then if the value fits in signed 16 bits OP_LIT_S is used. If not, then if the value fits in unsigned 16 bits OP_LIT_U is used. Otherwise it's 32 bits and OP_LIT is used. This goes smoothly from signed to unsigned to moot, so it's easy to remember. To not break existing code that may rely on relative and absolute addressing, the parser adds NOPs as needed.
Another hold-up was in moving the interrupt controller closer to the opcode decoder and removing register set support for it. It is now controlled via one-hot (one bit per thread) dedicated opcodes. The interrupt controller has been one of the most volatile elements during the design of Hive, I'm hoping it stays put now.
Finally, I got rid of the explicit OP_NOP and moved JMP_8 down to the base of the opcode space. JMP_8 with a zero immediate value gives an opcode that is zero, and a zero jump doesn't do anything at all. There are a couple of other candidates that also do nothing under various circumstances, but having unconditional JMP_8 at the bottom of the opcode space and JMP_8* conditionals at the top neatly bookends things. However, the lack of an explicit NOP threw a wrench into the simulator opcode decode, so that took some thought and work to get past.
With most of these structural changes, the core as synthesized for the FPGA has actually been shrinking a bit (utilizing fewer resources) which is not exactly expected, but is nice. I'm thinking of moving to the EP4CE10 device, which would double the main memory space (16 BRAMS => 32, for a whopping 32KB ;-) and can be had on eBay for <$40.