Teensy 4.0 600MHz ARM Cortex M-7 MCU - ideal for digital MCU based theremin?

From: Northern NJ, USA

Joined: 2/17/2012

"Phase shift sensor seems worse than classic heterodyne." - Buggins

Worse how? BTW I do appreciate you keeping us updated on this project!

"This design should provide precise output, w/o aliasing unlike D-trigger mixers. (Btw, does XOR+LP mixer have aliasing?)"

A D flip-flop mixer will only change output with clock edge, so yes, XOR+LP should be better, it always was in my simulations. The tricky part was the LPF, to get the ripple low enough to then threshold the result without "blips" you need high order, low Q, so a cascade of RC (or equivalently first order digital LP clocked at a high rate). The common cutoff frequency has to be placed so that the amplitude of the highest fundamental doesn't get too attenuated. I looked into this a lot as a path to linearization, but abandoned it when I found the math method I'm using now. It's kind of a can of worms, and can tie you to certain operating points / specific L values, so I was rather glad to see it in the rear view mirror. But I wish you luck!

Posted: 2/20/2020 11:48:58 AM 73

From: Porto, Portugal

Joined: 3/16/2017

"Phase shift sensor seems worse than classic heterodyne." - Buggins
Worse how? BTW I do appreciate you keeping us updated on this project!

The best "zooming" I managed to reach with the phase shift sensor is 2.5% of the reference frequency period covering 0.025pF (hand to antenna distance >= 33cm). This is with a balanced Q: big enough for good zooming, but still able to cover the 1.5pF C_hand range and to allow tuning F_ref near the resonant frequency (a limitation from timer precision, 240MHz bus clock).

Inside of this 2.5%, C_hand to phase shift relation is almost linear.
So the phase shift measured with a 240MHz timer spans 0..6 counts out of the 240 counts in a full 1MHz period. Tracking both edges would double this range: from a single F_ref cycle we will get a value in the range 0..12.
To get better resolution, we need to average measured values. For a 1ms averaging window, we have 0..12000. Resolution would be 0.025/12000 = 0.000002pF, which corresponds to about 2mm distance precision near 80cm hand to antenna distance.
Sounds not bad. Why is it worse than heterodyning?
Averaging requires "dithered" measured values. The measuring timer and F_ref are synchronous, so there is a big chance that the measured intervals will show no visible dithering out of the box. Mixing in additional noise after the phase shift output buffer could help, but it's unclear how to implement it.
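The resolution estimate above can be reproduced with a few lines of arithmetic. This is only a sketch using the numbers quoted in the post; it assumes the readings are dithered well enough for averaging to work, which is exactly the open question.

```python
# Rough arithmetic for the phase-shift sensor resolution (numbers from the post).
F_TIMER_HZ = 240e6      # timer / bus clock
F_REF_HZ = 1e6          # reference frequency
PHASE_RANGE = 0.025     # usable phase shift as a fraction of one period
C_RANGE_PF = 0.025      # capacitance change covered by that phase range
WINDOW_S = 1e-3         # averaging window

counts_per_period = F_TIMER_HZ / F_REF_HZ            # 240 timer counts
counts_one_edge = counts_per_period * PHASE_RANGE    # about 6 counts
counts_both_edges = 2 * counts_one_edge              # about 12 counts
cycles_in_window = F_REF_HZ * WINDOW_S               # 1000 reference cycles
total_counts = counts_both_edges * cycles_in_window  # about 12000 counts
resolution_pf = C_RANGE_PF / total_counts            # about 2.1e-6 pF
```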

At first sight, the phase shift sensor is a great idea: only two buffers + LC tank, no other components, and the reference frequency and shifted signal occupy only 2 pins.
But due to timer resolution limitations and the unclear dithering prospects, the phase shift approach does not seem to be a good solution for an MCU based sensor, even with 240MHz timers.

The conclusion may be a bit different for higher resolution timers, e.g. an FPGA based implementation.
Let's consider Xilinx 7 series hardware. It's possible to use OSERDES to produce oversampled output and generate F_ref with 1200MHz (150MHz*8) precision, 5 times better than Teensy 4 overclocked to 960MHz. This relaxes the F_ref quantization limit, which otherwise forces F_ref to be low enough to allow tuning near resonance, so higher Q inductors (at higher frequencies) may be used. Higher Q gives a better zoom factor (non-linearity).
A simple phase shift measurement on ISERDES gives 1200MHz precision. With xN oversampling (measuring the signal several times with different delays) we can increase precision N times (4800MHz for x4 oversampling, 9600MHz for x8).
ISERDES + x4 oversampling gives 20 times better sensitivity. With higher sensitivity, noise will probably work better for dithering.

"This design should provide precise output, w/o aliasing unlike D-trigger mixers. (Btw, does XOR+LP mixer have aliasing?)"
A D flip-flop mixer will only change output with clock edge, so yes, XOR+LP should be better, it always was in my simulations. The tricky part was the LPF, to get the ripple low enough to then threshold the result without "blips" you need high order, low Q, so a cascade of RC (or equivalently first order digital LP clocked at a high rate). The common cutoff frequency has to be placed so that the amplitude of the highest fundamental doesn't get too attenuated. I looked into this a lot as a path to linearization, but abandoned it when I found the math method I'm using now. It's kind of a can of worms, and can tie you to certain operating points / specific L values, so I was rather glad to see it in the rear view mirror. But I wish you luck!

When using a ~50% reference frequency signal duty cycle we would get output similar to an XOR mixer, probably with a simpler LP filter (the analog switch samples the source signal for half of the period).

An ideal LP filter after XOR should give an ideal triangle wave. To exclude bouncing where the output buffer crosses its threshold, we could use a buffer with a Schmitt trigger input, although it would still be sensitive to noise.
An analog switch based mixer morphs the triangle into almost a square wave, sharpening the edges.
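To illustrate the triangle wave claim for the XOR mixer, here is a minimal simulation sketch. The periods (100 and 102 samples), the boxcar window used as a stand-in low-pass filter, and the function names are all assumptions chosen for illustration, not values from a real design.

```python
def square(n, period):
    """50% duty square wave sampled at integer time step n."""
    return 1 if (n % period) < period // 2 else 0

def xor_mixer_lowpassed(period_a=100, period_b=102, window=100, n_samples=6000):
    """XOR two square waves, then smooth with a moving average of `window` samples."""
    xor_bits = [square(n, period_a) ^ square(n, period_b) for n in range(n_samples)]
    running = sum(xor_bits[:window])
    out = [running / window]
    for i in range(1, n_samples - window + 1):
        # Slide the window by one sample: add the newest bit, drop the oldest.
        running += xor_bits[i + window - 1] - xor_bits[i - 1]
        out.append(running / window)
    return out

filtered = xor_mixer_lowpassed()
# The smoothed output ramps from near 0 (edges aligned) to near 1 (half a period apart)
# and back again over one beat period of 5100 samples.
```

A single moving average is a crude filter; the cascade of RC stages discussed in the thread would suppress ripple better, at the cost of more delay.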

Posted: 2/20/2020 3:57:35 PM 74

From: Northern NJ, USA

Joined: 2/17/2012

"Best "zoomining" (with balanced Q - big enough for good zooming, but still able to cover 1.5pF C_hand range, and to be able tuning F_ref near resonant frequency - limitations from timer precision - 240MHz bus clock) I managed to reach with phase shift sensor is 2.5% of reference frequency period covering 0.025pF (>=33cm hand to antenna distance)." - Buggins

Yes, one big issue with the phase shift method is you can't have Q too high or too low. It's not very difficult to make a high Q arrangement and then explicitly damp it some, but rather like transistor Q it is an ill-specified parameter, so maybe you don't want to rely on it too heavily for critical functioning.

Another issue I presume is how to position the operating point on the phase "ski slope" so you get the response you expect. And the "ski slope" is non-linear, so I presume this positioning would also affect playing linearity?

Other than the simplicity, I do like the way the phase shift method can be made to give high resolution in the far field and poop out in the near field, which is just the opposite of the PLL and subtract method I'm using. But the PLL method can be done with fairly minimal hardware, probably less than what I'm using, and it pretty much completely removes dependence on L and Q. And the PLL method completely decouples linearity and sends it to the math department down the hall.

"Averaging requires "dithered" measured values. Measuring timer and F_ref both are synchronous. So there is a big chance that there will not be visible dithering for measured intervals out of the box. Mixing additional noise after phase shift output buffer could help, but it's unclear how to implement it."

Yes, dithering is critical when the sampling frequency is rather low. But introducing it lowers the SNR, and therefore the precision of your averaging. I was using pseudo random noise added to the PLL accumulator thresholding, but the high frequency components of the noise were getting filtered out by the LC. Now I'm using a triangle wave that is synchronous to the sampling period (which places a zero there) and low frequency enough that it isn't entirely LC filtered out, but high enough to get filtered out by the sampling side. There is a noise subtraction technique where if you are generating the noise (dither) you are injecting at the input you just subtract it at the output and dramatically improve SNR, but the LC filters so much you unfortunately can't really use it in this scenario I don't think. High Q oscillation is a tough thing to budge with dither.
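A toy example of why averaging needs dither: without it, every quantized reading is identical and the average gains nothing. The sketch below uses a deterministic ramp one count wide as the dither (a hypothetical stand-in for the triangle wave mentioned above) and integer arithmetic in thousandths of a count.

```python
def mean_quantized(true_milli, dither_milli_values):
    """Average of floor-quantized readings; all inputs are in 1/1000 of a count."""
    readings = [(true_milli + d) // 1000 for d in dither_milli_values]
    return sum(readings) / len(readings)

TRUE_MILLI = 3300  # true value: 3.3 counts

# No dither: every reading truncates to 3, so averaging recovers nothing.
no_dither = mean_quantized(TRUE_MILLI, [0] * 1000)

# Ramp dither spanning exactly one count: 30% of the readings become 4.
ramp_dither = mean_quantized(TRUE_MILLI, range(1000))
```

Here `no_dither` stays at 3.0 while `ramp_dither` recovers 3.3. In a real sensor the dither is filtered by the LC tank, as described above, so the gain is less clean than in this idealized case.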

Posted: 2/20/2020 4:32:12 PM 75

From: Porto, Portugal

Joined: 3/16/2017

Another issue I presume is how to position the operating point on the phase "ski slope" so you get the response you expect. And the "ski slope" is non-linear, so I presume this positioning would also affect playing linearity?

The influence of the offset between F_ref and the LC resonant frequency can be removed by additional calculation (this offset and Q can be calculated during calibration).
The limited F_ref resolution (quantization, because only F_bus/divider values are possible) actually limits usage of high frequencies, for which it's easy to build an extra high Q inductor.
But this issue can be bypassed by using external PLL ICs with better resolution (e.g. controlled via SPI interface).

Actually, if MCU has good enough ADC, phase shift can be measured using intermediate analog form (instead of time interval we can measure voltage).
(F_ref XOR F_shifted) -> LP filter -> ADC.

E.g. 2% phase shift range gives 2% voltage change at LP filter output, and measured using 16bit ADC gives 10 bits of useful data (for all distances > 33cm)
33cm = 10bits
43cm = 8bits
53cm = 6bits
63cm = 4bits
Averaging several ADC measurements will give some additional bits. But anyway, ranges of 70cm and beyond should be mostly unworkable for this approach.
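The table above follows from a simple estimate, sketched here. The 2 bits lost per 10cm is read off the table itself and is an assumption of this sketch; the function name and parameters are illustrative only.

```python
import math

def useful_adc_bits(distance_cm, adc_bits=16, phase_range=0.02,
                    ref_distance_cm=33.0, bits_lost_per_10cm=2.0):
    """Estimate useful ADC bits when only `phase_range` of full scale is used."""
    # Bits available at the reference distance: log2 of the ADC codes in the range.
    bits_at_ref = math.log2(phase_range * 2 ** adc_bits)
    # Each extra 10 cm of hand distance costs a fixed number of bits.
    return bits_at_ref - bits_lost_per_10cm * (distance_cm - ref_distance_cm) / 10.0

bits_by_distance = {d: int(useful_adc_bits(d)) for d in (33, 43, 53, 63)}
```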

Other than the simplicity, I do like the way the phase shift method can be made to give high resolution in the far field and poop out in the near field, which is just the opposite of the PLL and subtract method I'm using. But the PLL method can be done with fairly minimal hardware, probably less than what I'm using, and it pretty much completely removes dependence on L and Q. And the PLL method completely decouples linearity and sends it to the math department down the hall.

Problem here is that "zoom" is not high enough - only ~20-100 times. (Change C_hand by 0.025pF gives 2% of phase shift while 0.002% near antenna). Heterodyning gives much better zooming / damping, as I learned from recent simulations.

Posted: 2/20/2020 5:04:53 PM 76

From: Northern NJ, USA

Joined: 2/17/2012

"Limited F_ref (quantization because only F_bus/divider are possible) actually limits usage of high frequencies for which it's easy to build extra high Q inductor." - Buggins

It's quite a luxury doing things digitally largely because of this, being able to use smaller L, higher Q coils.

"But this issue can be bypassed by usage of external PLL ICs with better resolution (e.g. controlled via SPI interface)."

Or maybe an external FPGA where you custom design everything exactly for the purpose? We need tiny, cheap, full-featured FPGAs with built-in configurators, MAX2 was close but too expensive. China is making FPGAs now, maybe they'll come up with something. At work we would use a configurator during early development and switch to processor pump for production. If you have enough Flash memory free on the MCU this is an option to lower the cost of an FPGA. It also makes the FPGA load part of the SW load, which is nice.

"Actually, if MCU has good enough ADC, phase shift can be measured using intermediate analog form (instead of time interval we can measure voltage).

(F_ref XOR F_shifted) -> LP filter -> ADC."

But then you're back in the noisy analog world doing precision measurements. Which is why I was trying to do the heterodyning digitally.

"Problem here is that "zoom" is not high enough - only ~20-100 times. (Change C_hand by 0.025pF gives 2% of phase shift while 0.002% near antenna). Heterodyning gives much better zooming / damping, as I learned from recent simulations."

I agree, and like your word "zoom" for this. Heterodyning can dramatically magnify a difference in the frequency domain that you then must somehow mix / filter / capture. For low powered MCU / ~toy Theremin this is good because the low frequency result is more easily measured. I'm pretty sure there is some way to do it in a high quality way but I couldn't find it (I worked on it for months, mostly in Excel sims, and even had some SV code ready to go). In the end it was so confining that I was relieved when I was able to drop it and move on.

Posted: 2/21/2020 6:13:26 AM 77

From: Porto, Portugal

Joined: 3/16/2017

I agree, and like your word "zoom" for this. Heterodyning can dramatically magnify a difference in the frequency domain that you then must somehow mix / filter / capture. For low powered MCU / ~toy Theremin this is good because the low frequency result is more easily measured. I'm pretty sure there is some way to do it in a high quality way but I couldn't find it (I worked on it for months, mostly in Excel sims, and even had some SV code ready to go). In the end it was so confining that I was relieved when I was able to drop it and move on.

Or maybe an external FPGA where you custom design everything exactly for the purpose? We need tiny, cheap, full-featured FPGAs with built-in configurators, MAX2 was close but too expensive. China is making FPGAs now, maybe they'll come up with something. At work we would use a configurator during early development and switch to processor pump for production. If you have enough Flash memory free on the MCU this is an option to lower the cost of an FPGA. It also makes the FPGA load part of the SW load, which is nice.

If goal is to design a project which is cheap and easy to build by everyone, MCU is good choice even if part of processing is done in analog.

Although, easy to buy cheap enough board should be suitable solution, too.

BTW, my order of a cheap Chinese QMTECH Zynq Z7010 board will be received soon.

We do know that heterodyning is very good method to get high sensitivity at far hand distances.
XOR mixer most likely does its work well with the only issue that it requires good LP filter and noise can affect LP output zero crossing detection.
What if we try to implement high precision XOR heterodyne in FPGA?
No analog noise, only one introduced by digital filters.
Zero crossing can be detected with higher precision using interpolation (time of crossing will be obtained with sub-clock precision).

Even straightforward XOR mixer emulation is possible.
We can read oscillator signal from FPGA pin.
We can emulate reference oscillator signal inside FPGA.
One bit per clock cycle is not enough, it's better to use oversampling when reading pin and generating reference frequency signal.
Reference frequency should have bigger precision, otherwise it would be impossible to have desired small (F_ref-F_osc).
E.g. we can use deserializer which can provide N deserialized bits per clock cycle. On Zynq we can use ISERDESE2 module in DDR mode, giving 8 bits of source signal each 150MHz clock cycle.
Reference frequency generator should produce 8 bits each clock cycle. It will give 8 times better precision - 1200MHz instead of 150MHz.
So, each clock cycle we have 8 input bits and 8 reference frequency bits.

logic [7:0] osc_deserialized; // from oscillator
logic [7:0] fref_deserialized; // from reference clock generator
Let's calculate for each cycle xor_out = SET_BIT_COUNT(XOR(osc_deserialized, fref_deserialized)): a value in the range 0..8, the count of set bits in the 8-bit bitwise XOR result.
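As a quick software model of that step (a sketch only; the function name is made up, and real hardware would do this with a small adder tree), the per-cycle mixer output is the population count of the XOR of the two 8-bit words, which can take values 0 through 8:

```python
def xor_popcount(osc_bits, fref_bits):
    """Number of bit positions where the two 8-bit samples differ (0..8)."""
    return bin((osc_bits ^ fref_bits) & 0xFF).count("1")
```

For example, identical words give 0, fully inverted words give 8, and `xor_popcount(0b11110000, 0b11100001)` gives 2.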

Then it can be passed to several stages of LP filter (simple FIR with bit shifts?)

Each cycle we will have new output from LP filter.

Then this value will be processed by edge detector.

Code:

prev_sample; // previous LP output (delayed by 1 cycle)
new_sample; // new LP output
// thresholds may be different (like in schmitt)
if (prev_sample < threshold && new_sample >= pos_threshold) {
    // pos edge detected
    // interpolate position based on prev_sample/new_sample - it should have sub cycle precision
} else if (prev_sample >= threshold && new_sample < neg_threshold) {
    // neg edge detected
    // interpolate position based on prev_sample/new_sample - it should have sub cycle precision
}
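A runnable model of this edge detector might look like the sketch below. It is one possible reading of the pseudocode: it keeps an explicit high/low state to implement the hysteresis, assumes pos_threshold >= neg_threshold, and uses linear interpolation between the two samples for the sub-cycle position. None of these choices are confirmed by the post.

```python
def detect_edges(samples, pos_threshold, neg_threshold):
    """Return a list of (kind, time) tuples; time is in sample units, fractional."""
    edges = []
    state_high = samples[0] >= pos_threshold
    for i in range(1, len(samples)):
        prev, new = samples[i - 1], samples[i]
        if not state_high and new >= pos_threshold:
            # Linear interpolation of the upward crossing between prev and new.
            frac = (pos_threshold - prev) / (new - prev)
            edges.append(("pos", i - 1 + frac))
            state_high = True
        elif state_high and new < neg_threshold:
            # Linear interpolation of the downward crossing between prev and new.
            frac = (prev - neg_threshold) / (prev - new)
            edges.append(("neg", i - 1 + frac))
            state_high = False
    return edges

example_edges = detect_edges([0, 1, 2, 3, 4, 3, 2, 1, 0], 2.5, 1.5)
```

With this triangle-like input the detector reports a positive edge at time 2.5 and a negative edge at time 6.5.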

On each pos or neg edge we can calculate new value for measured period (two separate values - for pos and neg edges).

We can apply LP filter to measured T values in each clock cycle to make them smooth.

I believe performance of such digital XOR mixer should be better than analog one and good enough for usage.

Posted: 2/22/2020 11:29:54 AM 78

From: Porto, Portugal

Joined: 3/16/2017

Finally I've found how to implement high precision theremin sensor based on Teensy 4 MCU.
No heterodyne and therefore no reference frequency generation is needed anymore.

We have signal from simple LC oscillator which produces square wave at 0.8..2MHz depending on coil.
Good oscillator varies its frequency by ~7% for far (or infinity) and near hand range (assuming C_ant is 8pF, and C_hand varies in 0..1.5pF range).
The capacitance introduced by the hand decreases exponentially with increasing distance between hand and antenna. In the calculations below, let's use 3.5 times damping for each additional 10cm of distance. So each 10cm eats 1.8 bits of the measured signal value. At 100cm distance, 18 bits are lost.
Our goal is to have playable sensitivity at reasonable range (80..100cm?)
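The bit budget above is just a base-2 logarithm of the damping factor. A short check, using the 3.5x per 10cm figure assumed in the post:

```python
import math

DAMPING_PER_10CM = 3.5  # assumed attenuation of C_hand per additional 10 cm

bits_per_10cm = math.log2(DAMPING_PER_10CM)  # about 1.8 bits

def bits_lost(distance_cm):
    """Bits of signal lost at the given hand distance under the assumed damping."""
    return bits_per_10cm * distance_cm / 10.0
```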

With a square wave signal, the only data we can collect is the time positions of each rising and falling edge.

What hardware capabilities does Teensy 4 provide which can be used to read data from sensor?
There are two hardware modules which can latch the timer count on a rising or falling edge at a pin.
The first is QuadTimer, which can be configured to latch the timer counter value on a pin signal edge. It provides 16 bit counter values.
The other, the eFlexPWM module, can latch counter value + cycle counter (effectively giving 16 + 4 bits of counter value for the full 16bit PWM period).
Both devices can produce a DMA request, and DMA can be configured to put data read from the capture registers into a circular memory buffer.
Both QuadTimer and FlexPWM are suitable, but with 20 bits of counter there will probably be fewer issues with counter overflow.
Let's assume we have configured the hardware to latch the timer counter on every rising and falling edge and put the values into a circular buffer. It's possible to get the current position of the DMA destination pointer, to find the most recent data.
Timers and input latches operate at BUS frequency, which equals 1/4 of the CPU frequency.

With default F_CPU = 600MHz, F_BUS = 150MHz
With overclocked F_CPU = 960MHz, F_BUS = 240MHz (heatsink required).

For audio playback, most likely we will use framed DMA mode with the I2S interface.
Teensyduino core implements calling of software interrupt when next frame has to be filled (previous just started playing).
For 1ms frame, synthesizer code will be called with 1ms delay from current playback position, but by time prepared frame is ready, there will be 2ms delay.

During 1 ms we have to read sensors, calculate pitch and volume based on sensor data, then synthesize 1ms of sound.
We have to interpolate pitch and volume control values between frames (linear interpolation is ok).

Everything is ok once we manage to solve one simple task: take recent data points from the sensor and calculate an output value which has enough bits to calculate hand position with high enough precision.
For each new frame we have an array of recent N timer counter values latched at signal edges.
With 240MHz F_BUS and ~1MHz oscillator we will see sequence of values incremented by 240..260 for same edge (depending on C_hand).
If we track both edges, sequential values will be incremented by 120..130 (if the oscillator duty cycle differs from 50%, the falling edge will be shifted from the midpoint between consecutive rising edges).

So, for single signal period we have 8 bit value, but only 4 lower bits will vary with moving of hand.

Of course, it's not enough for any practical usage.
We need averaging of some kind.

First try: we have t0, t1, t2, t3, t4, t5 ... - captured on rising edge timer counter values where t0 is recent one.
t1-t0 is last period value (period1).
t2-t1 - previous (period2)
t3-t2 - period3
t4-t3 - period4
then we can average them:
(period1+period2+period3+period4)/4 -- it will give us x4 times better precision (2 more bits of useful data).
But stop. ( t1-t0 + t2-t1 + t3-t2 + t4-t3 ) = (t4 - t0)
If we subtract last value with value in past by n measured cycles (t(n)-t(0)), we will actually have value averaged for n cycles - with log2(n) additional bits of useful information (we always can divide (t(n)-t(0))/n for floating point average, but simple sum contains the same information).
By taking the difference between counter values with distance 1024, we are getting 10 bits of data. What is the price for such an improvement? It's latency. t(1024)-t(0) gives an averaged value corresponding to a moment 1024/2 oscillator cycles in the past. (With a 1MHz oscillator and offset 1024 we add 0.5ms latency.)
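The telescoping identity above is easy to verify numerically. In this sketch the timestamps are made-up example values, stored oldest first (the post indexes them newest first, which only flips the sign), and counter wraparound is ignored.

```python
def sum_of_periods(t, n):
    """Sum of n consecutive period measurements from edge timestamps t."""
    return sum(t[i + 1] - t[i] for i in range(n))

def span(t, n):
    """Single subtraction covering the same n periods."""
    return t[n] - t[0]

timestamps = [0, 243, 487, 730, 974]  # hypothetical captured counter values
```

Both `sum_of_periods(timestamps, 4)` and `span(timestamps, 4)` give 974: all the intermediate timestamps cancel, so only the end points carry information.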

Now we have 4+10 bits.

But stop. We used only 2 values from huge array of collected data in DMA buffer.
We only used t[0] and t[1024].
What about t[1] and t[1025], t[i] and t[i+1024], t[1023] and t[2047]?
Each of these pairs gives another 14bit value with a different averaging point offset (this point varies between 0.5ms and 1.5ms in the past).

If we calculated and averaged all these pairs, we will get 10 more bits: 4+10+10=24 bits total.
Cost: sensor latency is now 1ms instead of 0.5ms. CPU load: we calculated 1024 pairs instead of one.
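A small simulation of the pair averaging, as a sketch under stated assumptions: the true period is taken to be 243.3 timer counts, captures are truncated to whole counts, and 1000 pairs are used here (rather than 1024) so the 10-edge quantization pattern divides evenly. All names are illustrative.

```python
def edge_times(n_edges, period_tenths=2433):
    """Timer counts latched at successive edges; true period is 243.3 counts."""
    return [(i * period_tenths) // 10 for i in range(n_edges)]

def single_span_period(t, lag):
    """Period estimate from one pair of captures `lag` edges apart."""
    return (t[lag] - t[0]) / lag

def averaged_span_period(t, lag, pairs):
    """Period estimate averaged over `pairs` overlapping spans of length `lag`."""
    total = sum(t[i + lag] - t[i] for i in range(pairs))
    return total / (pairs * lag)

t = edge_times(2048)
single = single_span_period(t, 1024)           # about 243.2998
averaged = averaged_span_period(t, 1024, 1000)
```

In this idealized case the averaged estimate lands on 243.3 while the single span is off by about 0.0002 counts. The gain depends on the fractional part of the period spreading the quantization error; with a period that is an exact whole number of counts, averaging more pairs would add nothing.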

We used only the rising edge of the oscillator signal. What could the falling edge points give us?
Either 1 more bit at the cost of x2 more calculations, or no new bits but /2 lower latency.

Ok, let's pay 2x CPU time, and now we have 4+10+10+1 = 25 bits

Is it enough? After subtracting the 18 bits eaten by 100cm distance, we still have 7 bits left. It's more than enough even at 1m distance, and we didn't even take into account inter-frame interpolation.

Formulas can be regrouped, e.g.
((t4 - t0) + (t5 - t1) + (t6 - t2) + (t7 - t3)) / 4
can be written as
(t4+t5+t6+t7)/4 - (t0+t1+t2+t3)/4

(t4+t5+t6+t7)/4 and (t0+t1+t2+t3)/4 are averaged positions of edge with distance 4 between them, 2 and 6 osc cycles in past.
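The regrouping can be checked directly. The sketch below compares the mean of overlapping spans against the difference of two block means, using made-up timestamps stored oldest first.

```python
def mean_of_spans(t, lag):
    """Mean of the spans t[i+lag] - t[i] for i in 0..lag-1."""
    return sum(t[i + lag] - t[i] for i in range(lag)) / lag

def difference_of_block_means(t, lag):
    """Mean of the newer block of `lag` timestamps minus mean of the older block."""
    return sum(t[lag:2 * lag]) / lag - sum(t[:lag]) / lag

stamps = [0, 243, 487, 730, 973, 1217, 1460, 1703]  # hypothetical captures
```

Both functions return 973.25 for `lag=4`. The second form suggests a cheaper implementation: keep two running sums and update each with one addition and one subtraction per new edge, instead of recomputing every pair.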

Epic win! A high sensitivity sensor design with ~2ms total latency can be implemented on Teensy 4 w/o additional hardware.

FPGA notes.

BRAM is needed to store history of recent captured data.
Every oscillator edge we will store new value, and recalculate new sensor result.
With 1MHz oscillator and 200MHz FPGA clock we have 200 FPGA cycles per one oscillator period, 100 cycles per edge.
So we can use <100 data points for updating sensor interval.
E.g. 32 + 32, with delay (interval between captures) 1024 or less.
More bits will be achieved by FIR filtering of this value.

Most likely, FPGA may sample input signal with higher resolution. E.g. instead of 240MHz in Teensy4, it can be 1200MHz using Xilinx Series 7 ISERDESE2 DDR mode. Adding oversampling based on delay lines, it can be increased by x2, x4, x8 times. Several source bits more from oscillator - we can use smaller delay (lower latency).

Posted: 2/22/2020 6:28:31 PM 79

From: Northern NJ, USA

Joined: 2/17/2012

Congratulations! Though I think you just independently discovered the higher order CIC filter!

As you describe the scenario, if you have 8 bit period measurement, with 4 bits changing or relevant, with these coming in every ~1MHz period, if you use them directly you have 4 bits of information. If you filter them with a perfect, brick wall, infinite order low-pass filter set to cutoff at 1kHz, you increase the bits by 10 (1MHz / 1kHz = 1000 = ~2^10) but half of these are effectively noise, so in the end you have 4+5 = 9 bits. Perfectly filter down another decade to 100Hz (a reasonable axis BW figure) and you get another 3.3 bits, half of which are noise, or 4+5+1.6 = 10.6 bits. I believe this is the absolute best you can do, regardless of how you do the filtering. If you have to dither at all then the bits may be reduced, particularly if you use pseudo-random noise rather than cyclic dither (cyclic dither can be somewhat removed by setting the dither frequency to some integer multiple of the sampling frequency, placing a zero there).
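The estimate in the paragraph above reduces to one line of arithmetic. This sketch simply encodes the rule of thumb as stated (half of the log2 oversampling ratio counts as real information); it is not a derivation, and the function name is hypothetical.

```python
import math

def effective_bits(raw_bits, f_in_hz, f_cutoff_hz):
    """Raw bits plus half of log2(input rate / filter cutoff)."""
    return raw_bits + 0.5 * math.log2(f_in_hz / f_cutoff_hz)

bits_at_1khz = effective_bits(4, 1e6, 1e3)   # about 9.0 bits
bits_at_100hz = effective_bits(4, 1e6, 1e2)  # about 10.6 bits
```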

The CIC works best for really large decimation, or sample rate reduction, as you are doing here. It is routinely found as the first filtering stage of delta-sigma A/D converters because they operate in a highly oversampled regime. For best results, the input and output should be synchronous to each other, that is the decimation factor should be constant, otherwise I believe you will be introducing noise and perhaps aliasing. It has a lot of pass-band droop, and unless you go to at least third order the filtering cutoff action may not be sufficient, but every order increase by one doubles the bit width of the accumulators, which quickly gets out of hand in hardware (though I know you are doing this in software). So, even though the CIC uses no multipliers (arguably its biggest advantage in hardware), the super wide accumulations can easily slow things down to the point where the hardware can't keep up with the input data rate. Or the accumulations may need to span multiple registers in your processor, requiring much more in the way of real-time.
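For readers who haven't met the CIC, here is a minimal software sketch of a decimating one: a chain of integrators at the input rate, then a chain of combs (differences) at the output rate, with a differential delay of one output sample. Python integers never overflow, so the accumulator width problem is hidden here; the usual estimate for register growth is order * log2(rate) bits above the input width.

```python
def cic_decimate(samples, rate, order):
    """Order-N CIC decimator: N integrators at input rate, N combs at output rate."""
    integrators = [0] * order
    comb_delays = [0] * order
    out = []
    for n, x in enumerate(samples):
        acc = x
        for k in range(order):
            integrators[k] += acc      # running sum at the fast input rate
            acc = integrators[k]
        if (n + 1) % rate == 0:        # keep every rate-th integrator output
            y = acc
            for k in range(order):
                # Comb stage: subtract the previous value seen by this stage.
                y, comb_delays[k] = y - comb_delays[k], y
            out.append(y)
    return out

dc_response = cic_decimate([1] * 64, rate=8, order=2)
```

For a constant input of 1 the output settles to rate**order (64 here) after `order` output samples, which is the DC gain that has to be scaled out afterwards.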

When I was looking at period measurement I tried like a demon to get the CIC to work (it seemed at the time like magic), but I just couldn't get around the fact that the period measurement was asynchronous to my sampling rate of 48kHz. I also ran into a trivial solution that I wasted some time on for a while, which is: when you combine all the period measurements and subtract the end from the beginning, what do you get? A constant number that is equal to your sampling time! In a way, you are really only interested in how many samples you have in the buffer, not how long each one is, and for your filtering / averaging process to work right it would need to be able to handle fractions of samples, something I don't believe the CIC does well.

When I went to DPLL and frequency measurement I was finally able to use a third order CIC because I could then arrange things so the filter I/O were synchronous with each other. I was doing the fast accumulation cascade in FPGA hardware and the much slower difference cascade in software, which worked out great (the CIC allows you to bunch these operations together, thus dramatically lowering data storage requirements). The NCO phase accumulator even formed the first CIC integrator! But third order takes really wide accumulators, and I found I could use that same hardware to do a cascade of four "fast" 1st order filters, which give a much better cutoff response for decimation, and I can sample the output at whatever rate I like, thus decoupling all the synchronous nonsense. If you don't care too much about the exact cutoff frequency, IIR filters can use simple shifts rather than multiplies, and drastic sample rate reduction is often one of these scenarios (albeit with unwanted phase shift for audio work).

It may well be that a second or third order CIC in software is your best solution here, I don't know. But the CIC is a solution to a fairly narrow set of DSP situations, and it has multiple severe limitations when compared to plain old filtering. The same processor cycles could do IIR or FIR or some combination, and you would have much more control over the filter response, whereas CIC response is pretty much one-size-fits-all.

Also, CIC operations can be pared down somewhat in width without hurting SNR too much, though I've never seen a good explanation of exactly how to do it, and so have always felt uncomfortable doing so in my CIC filters. I believe the design process is very similar to the way we are "throwing out 1/2 of the averaging / filtering bit increase as noise". You only perform the calculations out to a width which makes the resulting truncation noise less than the averaging noise width. The same thing happens when you cascade IIR stages: the accumulator value invariably gets truncated between stages, but it is a more obvious process.

IIRC the original CIC paper was written by an engineer who didn't have a lot of DSP experience, which makes reading it something of a blessing and a curse.

[EDIT] There is now a link here from the D-Lev thread.

Also wanted to say that the bit increase from a second filtering (e.g. second order filter) won't yield the same benefit as the first filtering (e.g. first order filter). The bit increase is diminishing returns until you hit the limit described above. This is how I understand it anyway. I could be somewhat wrong with the specifics, but that's the general trend. Otherwise we could just keep adding filtering or averaging stages until our precision was effectively perfect, something that we know can't happen for a given bandwidth. The bandwidth narrowing itself is what yields the bit increase, and subsequent stages can only incrementally reduce the stop band area by increasing the cutoff slope. There are hard limits to "how much you can know" vs. "when you can know it" given noisy data. The resolution / speed trade-off is similar to the Heisenberg Uncertainty Principle. And the "noise" in all of this is sample uncertainty.

Posted: 2/22/2020 8:01:09 PM 80

From: Porto, Portugal

Joined: 3/16/2017