# Let's Design and Build a (mostly) Digital Theremin!

(^^^: yeah, one of my co-workers used to call me "perfessor" - though I don't feel my mental meaderings are really at that level...)

**Floats...**

Still thinking about floats, though it's winding down. Had to really look into this feeling I was getting that there was some optimal representation for both storage and calculation, but I now believe that feeling is groundless. In fact, it is the differing needs of storage and calculation that make a single optimally efficient float representation for both an unlikely prospect.

If you implement floats in software you see a lot of the same stuff going on almost regardless of the function. Zeros are denorms, so there's the checking for input zeros and the handling of those special cases. If there are two input values there may be some comparisons between them to pick the right path. Then there's the central function itself performed via iteration, polynomial, etc. The output often needs to be normalized, and if output zero is possible it must be dealt with somehow (conceptually quite a tough nut in a fundamentally exponential representation). Float packing and unpacking overhead can take as much time as the algorithm itself, really slowing things down, so you either have special hardware do pretty much everything or nothing. But even though the steps for the various functions are very similar, they aren't completely similar, and this really complicates the floating point hardware.

Floats are nasty but almost impossible to avoid. And I'm in the position where I could actually add the hardware if I wanted to. The fact that an unpacked float is composed of three values (sign, exponent, magnitude) makes any ALU or other specialized hardware that operates on them slow and problematic from an I/O standpoint. For example, adding two unpacked floats via hardware would require 6 writes and 3 reads, or 9 cycles minimum - whereas the software approach takes ~19 cycles maximum.

So I believe the best software solution is to employ a float representation that is primarily efficient when it comes to calculation, and pack/unpack them to/from memory if space is a problem.

Almost all algorithms operate on magnitude & sign separately, and this includes those which you might imagine would work better with natural 2's complement I/O, such as addition and multiplication. For instance, the addition of two non-negative numbers can overflow but not underflow, but subtraction is the opposite. At the start of all this I recall spending weeks looking for a division algorithm that would natively work on 2's complement numbers, only to come up short. I thought those who recommended the sign & magnitude approach (convert to unsigned, divide, restore the sign) were possibly just being lazy, but that approach is actually most efficient and consistent with most (possibly all) other functional algorithms.

Anyway, one thing I believe I've resolved (for my stuff) is the representation of zero. Doing it via a large negative exponent is the most correct mathematically, but I'm going with a zero magnitude here. The zero sign is out because it's redundant, and redundancy just means one more thing to check and assign special case values to. Several days of work for a minor tweak, I believe that's deep in diminishing returns territory. I must say though that this exercise has really expanded my fundamental understanding of numerical representation. E.g.: one's complement is a very natural way to pack the sign and magnitude, but it's a poor basis in which approach the calculation. (Kind of hard to believe early computers actually used it for integer math, the "wrap around carry" must be rather slow in hardware.)

=========

To more efficiently support float exponential calculations, I looked into adding a variable width signed saturation opcode to the Hive ALU. A fixed width saturation is doable in two clocks, but variable would probably need three clocks and would consume a fair amount of hardware. Fixed could either be 6 bit or 16 bit. Saturating 6 bit values with fixed 16 bit hardware would require a left shift of 10, saturate, right signed shift of 10. I'm kind of leaning toward the 6 bit (to limit shift distance). I also looked into removing the variable width sign extension to make room for an immediate saturation, but realized it can be used to detect saturation. And I also looked into hardware limiting the shift distances, but that's slow and not a panacea for all cases. Hmmm...

=========

Ran across an interesting 2014 article yesterday by Walter Bright (the author of D) on writing your own language: LINK.

Oooh, Mr. Bright is an interesting guy! * "I'll note here that working on stuff that I needed has fared quite a bit better than working on stuff that I was told others need."* (link) Almost nothing I've worked on for employers has even made it to market - not the way to spend one's life.

Dew said: *“Almost nothing I've worked on for employers has even made it to market - not the way to spend one's life.”*

You are doing it again and I am not saying this to be mean. If ever I pick on you it is as my little brother. You have great knowledge and what frustrates me I was rarely able to tap into it.

I did assembly in Z80 and 8080 processors back in the day because processors were slow and memory was expensive. Someone can know how to use every opcode available but if they have poor logic their skills will not amount to much.

As you may remember I have been writing VBA code in Excel the last 18 months. You kind of put it down like some people put down Walmart. Everything has its purpose.

I am going to send you a private IM.

Christopher

*"I did assembly in Z80 and 8080 processors back in the day because processors were slow and memory was expensive. Someone can know how to use every opcode available but if they have poor logic their skills will not amount to much." - Christopher*

I agree completely. Digital/logical/arithmetic design is a deceptively deep art. It's even trickier when you get to specify the opcodes.

*"As you may remember I have been writing VBA code in Excel the last 18 months. You kind of put it down like some people put down Walmart. Everything has its purpose."*

VBA "works" but is clunky, verbose, and not fun at all (for me) to code in. There are much "better" scripting languages out there, MS picked VBA for whatever dull corporate reasons they do everything. The suits are always killing any fun there is to be had.

And Walmart sucks! ;-)

So I'm farting around with the floating point functions generated by the "Megawizard Plug-in Manager" in the FPGA tool (Quartus 9.1sp2). This spits out canned, pre-optimized verilog components, saving us the hard work of reinventing the wheel. One can adjust the depth of the pipelining registers to trade latency for speed. To get the floating point add up to the processor core frequency (~200MHz) the logic has to be 13 clocks deep, which consumes 994 LEs (logic elements) and runs at 212.68MHz. The floating point multiply must be at least 10 clocks deep, which consumes 364 LEs and runs at 224.57MHz.

The add needing ~3x the LEs of the multiply is partially a book keeping thing, as most of the "action" in the multiply is going on in the hardware multipliers, which are attached to the block RAM elements rather than the LEs. But it's also partially the reality of addition being more involved in terms of float setup and teardown (one of the inputs must be shifted pre-add, and the add itself may actually be a subtract). The core is currently < 3k LEs, adding just the floating point add would increase this by 33%. And the floating point multiply in software only takes 8 clocks max, which isn't terribly onerous, so I'm not too inclined to add that either.

I don't think I'm at the point of increasing the Hive core pipeline depth to accommodate this kind of floating point hardware. One could, I suppose, supply the floating point inputs via the stacks and read the results via the register set interface. This would give a bit of wrap-around in the core pipe and another clock or two in which to perform the calculations. No easy answers here, you can see why the floating point units in modern processor ALUs are independent of the integer units.

**Hardware Limiting of Shifts, Rotates, and Powers of 2**

How to handle hardware shifts, rotates, and powers of 2 is something I've been hazy about from the git-go. The default I've gone with is to treat the shift distance as modulo (bit mod 5 for unsigned powers of 2, and bit mod 6 for signed shifts and rotates) so everything wraps around at the shift distance boundaries. If you want hard limits then it's up to you to implement them in software. I think this is actually OK for rotate, but not for the rest.

Now that I've got some assembly under my belt, I think it's probably best to hardware limit shifts and powers of 2. Limiting unsigned left shift distance values beyond -32 (which makes this a right shift) should clearly produce zero (for a 32 bit machine). What about *signed* left shifts beyond -32? If the value being shifted is not negative then the answer is clearly the same (zero). If the shifted value is negative then a signed left shift of -31 results in -1 or -2, a shift of -32 gives -1, and so it seems we have our pick of -1 or zero. What about unsigned left shifts beyond +31? Well, given the binary value 100...000 a shift of +1 will give zero. So zero results aren't all that unexpected even though we think of left shifts as multiplication (this is what always gave me brain freeze). Certainly from a logical perspective we would expect a zero result for a shift distance of +32 or more, so let's do that here. And since signed left shifts are no different than unsigned left shifts for a given positive shift distance, zero covers the signed left shift case too.

How about powers of 2, where we shift a single one a given distance left from the least significant position? Using the criteria above for shift distances greater than +31, the result should be zero. For distances less than zero we might want to repeat the pattern, going from 100...000 (shift -1) down to 000...001 (shift -32), and then transition to a zero result for shift distances less than -32. This allows us to manually implement right shifts and the like via multiplication.

Is this "going to zero on both ends" the best approach to shifting? Or should we simply limit (saturate) the input distance to the range [-32:31] which is the natural range of the shift in binary? Well, limiting the input shift value would happen at a timing pinch point in the multiplier logic, and it doesn't cover cases beyond the limits. And I think the "going to zero" is actually more correct in most scenarios, and the zero can be introduced (muxed in) at the end of multiplication, so we have time to properly detect it. Should one want the previous modulo behavior, one can make the input shift distance modulo (via two shifts, or the sex/zex opcodes). Saturating to [-32:31] would have been done manually, and the new hardware limits are outside of this, so it won't behave differently with the limits in place.

Detecting out-of-range to zero is pretty easy, just bit reduction ORing of the bits [31:5] and bit reduction ANDing of the same. Register these, then XOR and pipe the result (for shifts and powers of 2 operations only) to the zero mux.

After implementing the above, I will most likely be removing the SEX, ZEX, and saturation opcodes. The first two took up too much room and can be replaced with double shifts, the latter will be rendered largely moot. SEX and ZEX kinda bugged me in that the input parameter was a bit position rather than a bit width, so good riddance to them from that angle as well.

[EDIT] With these changes the floating point add is down to 14 cycles max. as I don't believe zero inputs need to be singled out for special handling.

**Byte Me!**

I'm fairly happy with the opcode encode / decode packing at this point, but I really don't like how cumbersome it is to do byte read & write operations. And it would be nice to have a bit more granularity when it comes to opcode width. The choices at this point are 32 bit (for a/b comparison 16 bit jumps) and 16 bit (everything else) opcodes. (16 & 32 width is actually an ARM thing as well.)

It seems the siren song of byte-based memory access is pulling me onto the rocks. First things first, I modified Hive main memory to be byte-based rather than 16-bit based (i.e. the address indexes bytes). This required the installation of full byte barrel shifters at the I/O ports, where there were partial barrel shifters already. This slowed things down a bit and added a bit of bloat. It also cut the memory space and relative jump distances in half (good reasons not to do this!).

Now I'm looking at how this might shake out in terms of opcode changes. It will probably make code less efficient (more program space required for the same functionality) but it vastly increase the space and make the decode simpler. It can make assembly coding a bit easier as well with more inherent moves incorporated into the functionality. I'm thinking of just four types of opcodes:

1. 0 operand, 16 bits: [15:8] operation; [7:0] immediate for things like selection of stack for pop, unconditional jump distance.

2. 2 operand, 16 bits: [15:8] operation; [7:4] stack B select; [3:0] stack A select. The vast majority of ops.

3. 2 operand, 24 bits: [23:16] immediate; [15:8] operation; [7:4] stack B select; [3:0] stack A select. Stuff that needs a 5, 6, or 8 bit immediate, including rotates, adds, multiplies, and conditional jumps.

4. 2 operand, 32 bits: [31:16] immediate; [15:8] operation; [7:4] stack B select; [3:0] stack A select. For conditional jumps.

Not the most efficient packing, but there aren't various width immediates all over the place, just 8 bits located in the least significant byte position, or 8/16 bits at the most significant position. This also allows for B indexing of registers, and non-indexed memory access (which were a couple of other things that were bothering me a little).

Modifying the hardware (SV code) should be fairly straightforward. Modifying the simulator and assembler will probably be a bear, but I've got to do it in order to test the hardware.

This will be little-endian of course, big-ending is asinine. (ARM actually has a big/little endian hardware switch in one of the config registers - which is just one more setting to forget and break everything for days - isn't software supposedly *the* malleable thing in this hot mess?)

You may notice that there are no 8 bit opcodes. Other than things like NOP (no operation) I couldn't come up with a justification for this category intruding on the decode of the other categories.

You must be logged in to post a reply. Please log in or register for a new account.