It's rather eerie that a Taylor series polynomial can approximate as many sine (or cosine) cycles as you want:
sin(x) = (x^1 / 1!) - (x^3 / 3!) + (x^5 / 5!) - (x^7 / 7!) + ...
Because it only has odd powers of x, sine is an odd function. So the polynomial, truncated to however many terms which provide sufficient accuracy, may be evaluated with half the usual effort. But the coefficients will have to be carefully adjusted to best fit the truncated series that we settle on.
First we have to decide what form the input and output will take. To make the function more generally useful, and to maximize resolution, we can normalize the input angle z so that the full range of input produces one full rotation (x = z * pi / 2), and normalize the output to produce the full range of values over one rotation. Both input and output are signed, and in a perfect world the output sign would follow the input sign naturally, but doing so wouldn't leave enough headroom for the calculations, which must instead be performed in an unsigned manner over a single quadrant.
I used trial and error to obtain the coefficients, which requires a fair amount of time and patience. It helps to train yourself by solving the simplest case first and then working your way up. In all cases the first "hump" in the error curve from input zero to positive should be negative, and there should be as many "turns" in the error curve as there are terms. For a 32 bit process that employs a Hive type "extended unsigned" multiplies (i.e. the upper 32 bits of a 32 x 32 unsigned multiplication) the errors are as follows:
one x term: -45M
two x terms: +/-1.3M
three x terms: +/-175k
four x terms: +/-15k
five x terms: +/-10
six x terms: +/-2
For 32 bit representations it seems worth going to 6 terms as the error here is negligible. It might be possible to reduce the error to +/-1 as there don't seem to be many input values that give error larger than +/1, but I ran out of patience. One must be careful to produce a slightly negative error at the end of the polynomial or the output will overflow.
Mathematical operations are minimized by interleaving the power operations in with the coefficient operations:
y = Ax - Bx^3 + Cx^5 - Dx^7 + Ex^9 - Fx^11
y = x[A - x^2(B - x^2[C - x^2(D - x^2[E - x^2(F)])])]
Before the initial squaring, the input is shifted left two positions, which removes the extraneous quadrant info, maximizes the resolution of the calculations, and provides sufficient headroom for the A coefficient. After this, the input signed to unsigned is handled via a subtract 1 followed by a bit NOT of the input if the second most input MSb is set. The subtract 1 step is skipped if the input is the full scale negative value, which neatly accomodates it, and convenenetly in the area of the graph where it is flattest and therefore the output is changing the least. Output unsigned to signed is handled similarly if the input MSb is set, but here the coefficients can guarantee that there will be no overflow.
To get Cosine from this we simply offset the input value by +1/4 or -3/4 full scale (add 0b01 to, or subtract 0b11 from, the input MSbs).
Note that the output is not necessarily monotonic, particularly for the higer term cases, because the LSbs have truncation noise.
I've got it coded up in the Hive simulator, results match the Excel spreadsheet. 24 instructions minimum, 29 maximum (depending on the input value). Sine is one of those things that's hard to tell when you're really done figuring out all the nuances.