# Analog to Digital Conversion & \(\Sigma\Delta\) Quantization

*Update: an earlier version of this post incorrectly characterized the contribution from Daubechies and DeVore.
The family of quantizing functions \(\rho_r\) that they use is not the greedy scheme detailed below but is
slightly more complicated. A special thanks goes out to Rayan Saab for a conversation we had about this detail.*

A necessary step in the signal processing pipeline involves taking analog signals, or, mathematically speaking, real valued functions defined on \(\mathbb{R}\) and representing them digitally as bit strings. We take this for granted when we listen to a song on our phone or watch a movie on our computer. It’s the kind of thing that some mathematicians like to ignore and leave for the engineers to figure out. And yet, there’s a lot of serious math that goes on behind the scenes to make sure that song you’re listening to doesn’t go haywire as your phone reads through the bitstream encoding it.

To set things up, suppose we have a function \(f: [-T, T] \to \mathbb{R}\) which represents
some natural signal we’d like to encode. Since humans can only see wavelengths between
\(390\) and \(700\) nm and can only hear frequencies between 20 and 20,000 Hz it’s safe
to assume \(f\) is *bandlimited*, or that its Fourier transform is compactly supported
in some set \([-\Omega, \Omega]\). There is a classic result known as the Shannon-Nyquist
sampling theorem which states that \(f\) can be encoded without *distortion*, or error, given samples
\( \{f(x_i)\} \) sampled at a rate of \( (2\Omega)^{-1} \). In other words, if we were to think in terms of
seconds as our unit of time, we’d need \(2\Omega\) samples per second for perfect reconstruction. Specifically, the theorem
states that
\[f(t) = \sum_{n\in\mathbb{Z}}f\left(\frac{n\pi}{\Omega}\right) \text{sinc}(\Omega t - n\pi),\]
where \(\text{sinc}(x) = \frac{\sin(x)}{x}\). It can be shown that this lower bound on the sampling
rate is sharp. That is, there exist bandlimited functions which cannot be reconstructed by this scheme if the sampling
rate is less than \((2\Omega)^{-1}\). It is for this reason that this critical threshold rate is referred to as the
Nyquist rate. Practically speaking, most sensors sample at a rate above the Nyquist rate to prevent aliasing and increase resolution.

It’s worth mentioning that this reconstruction scheme is far from optimal. The reason being that sinc decays like \(x^{-1}\), so any error in measuring \( f\left(\frac{n\pi}{\Omega}\right)\) is going to accumulate even at higher frequencies. There is a rich field of math which explores similar ways of reconstructing signals called frame theory. In particular, Gabor frames and wavelets are great examples of how frame theory has come to shape modern day signal processing.

What’s important is that we’ve reduced a real-valued function down to (finitely many) point
samples, i.e. a vector over \(\mathbb{R}\). In some sense, this is why in fields like compressed
sensing the signals of interest are always vectors instead of continuous functions. Generalizing the discussion above,
in place of pointwise evaluations we suppose we have *linear* measurements \( y = Ax \), where
\( x \in \mathbb{R}^N \), and \( A \in \mathbb{R}^{m\times N} \) is some linear map.
We’re still not done quantizing, since these measurements take on values
in the continuum. One way or another, we’ll need to estimate \(y\) from vectors in a discrete set, sometimes
called an *alphabet*, \(\mathcal{A}\). The natural first guess is to make a uniformly spaced grid over
\( [-B,B] \) for some chosen \(B\) and just round each component of \(y\).

The above scheme is often referred to as Memoryless Scalar Quantization (MSQ). We’ll see in a bit why it’s called memoryless.
Now, if we want our grid to have resolution \( \delta \), we’ll need \( m\log\left(\frac{B}{\delta}\right) \) bits.
The number of bits is sometimes called the *rate*. For fixed \(m\) and letting \( \mathcal{R} = m\log\left(\frac{B}{\delta}\right)\), we find
that the quantization error for one component is bounded by \( \delta = B 2^{-\frac{\mathcal{R}}{m}}\). In other words, the distortion from quantizing using MSQ decays
exponentially with the rate. So we might think this is the best we could do and call it a day. What if, however, we were working
with a circuit which had low storage capacity and could only expend, say, 8 bits. Are we stuck with a mediocre distortion bound?

Fortunately the answer to the above question is no. Just as oversampling improves resolution in analog to digital conversion increasing \(m\) will lead to lower distortion in recovering \(x\). Goyal and coauthors showed that for MSQ the error in reconstructing \(x\) as a function of \(m\) cannot decay faster than \(O(m^{-1})\). There are, however, quantization schemes which, like MSQ, enjoy an exponentially decreasing relationship between distortion and rate but also have a distortion which decays like \(O(m^{-r})\) where \(r\) is any positive integer. Unlike MSQ these encoding schemes “remember” quantization error.

These other quantization schemes fall under the category of noise shaping.
I will focus particularly on \(\Sigma\Delta\) quantization. Whereas MSQ quantizes \(y_i\) simply based on the value
that \(y_i\) takes, \(\Sigma\Delta\) quantizers feature a state variable \(u\) which encodes the quantization
error of the previous \(r\) components of \(y\) and uses this state variable to quantize the values of \(y\).
Given a fixed positive integer \(r\), an alphabet \(\mathcal{A}\), and a function \(\rho_r: \mathbb{R}^{2r} \to \mathbb{R}\),
\(y_i\) is quantized as
\[ \begin{align}
q_i &= Q_{\mathcal{A}}(\rho_r(u_{i-1},…, u_{i-r}, y_{i},…, y_{i-r+1})) \newline
(\Delta^r u)_i &= y_i - q_i
\end{align}
\]
where \( (\Delta u)_i = u_i - u_{i-1}\) is the difference operator and \(Q_{\mathcal{A}}\) rounds to the nearest
point in \(\mathcal{A}\). For logistical reasons, one needs to choose \(\rho_r\) carefully to ensure that
\( \|u\|_{\infty} \leq C \). Quantization schemes which admit this property are called *stable*. Daubechies and DeVore
were the first to prove that a particular family of \(\rho_r\) is stable in the context of bandlimited functions, even in the case where \(\mathcal{A}\) is fixed.
Apart from these, there are a handful of other \(\rho_r\) that are known to be stable in the \(\Sigma\Delta\) literature depending on whether the alphabet \(\mathcal{A}\) is fixed or allowed to grow.
For the sake of concreteness, we’ll focus on the particular family
\[ \rho_r(u_{i-1},…, u_{i-r}, y_{i},…, y_{i-r+1}) = y_i + \sum_{j=1}^{r} (-1)^{j-1} {r\choose j} u_{i-j}. \]
This scheme ensures that \(\|u\|_{\infty} \leq \frac{\delta}{2}\) if one uses the alphabet \(\{\pm(2j + 1)\delta/2, \,\, j\in\{0,…, L-1\}\}\)
and \(L\) is chosen so that \(L \geq 2\lceil\frac{\|y\|_{\infty}}{\delta}\rceil + 2^r + 1\). That is, this scheme is stable if the alphabet
is allowed to grow exponentially with respect to \(r\).

This is all rather opaque, but for the simple case of \(r = 1\), solving the recurrence relation for the \(u\) terms yields \[ \begin{align} q_i = Q_{\mathcal{A}}\left(\sum_{j=1}^{r+1} y_{i-j+1} - \sum_{j=1}^{r} q_{i-j} \right) \end{align}\] and for \( r = 2 \) we have \[ \begin{align} q_i = Q_{\mathcal{A}}\left(\sum_{j=1}^{r+1} j y_{i-j+1} - \sum_{j=1}^{r} (j+1) q_{i-j} \right) \end{align}\] As someone who dabbles in statistics, I am tempted to interpret this in terms of quantizing the difference of \(0^{th}\) moments for \(r = 1\) and the difference in means for \(r = 2\). Unfortunately this explanation is disingenuous and doesn’t hold up for \(r > 2\). It appears to be the case that the coefficients in the summands are simplicial numbers which, to the best of my knowledge, have no important applications in quadrature rules or estimating moments/cumulants.

I think engineers view this in terms of high-pass and low-pass filters. Namely, if the difference between \(y_{i-1}\) and \(q_{i-1}\) is significant due to quantization error, the rate of change of \(\sum_{j=1}^{r} (-1)^{j-1} {r\choose j} u_{i-j}\) isn’t likely to be as large as that of the fluctuation in \(u\). Appealing to mathematical intuition, this is simply because integration “smoothes out” small perturbations, and \(\sum_{j=1}^{r} (-1)^{j-1} {r\choose j} u_{i-j}\) is acting as a quadrature rule for integrating \(u\). As such, \(q\) is going to be composed of lower frequencies compared to the quantization error encoded in \(u\). Pushing error into higher frequencies is, after all, the name of the game in noise shaping.

Long story short, noise shaping techniques like \(\Sigma\Delta\) enjoy favorable distortion decay
as you oversample. Namely, if you use a \(r^{th}\) order \(\Sigma\Delta\) scheme with an appropriately chosen alphabet
(see Section 2 in this paper for a more thorough treatment; importantly, the one bit alphabet and uniform grids
as mentioned above are included as examples), then the reconstruction
error (in Euclidean norm) for recovering \(x\in \mathbb{R}^N\) from \(q\) decays like \(O(m^{-r})\).
Let me repeat: this is true even in the extreme case where your alphabet is \(\{-1,1\}\), i.e.
*one bit \(\Sigma\Delta\) quantization preserves information
about the length of the vector \(y\)* while MSQ clearly fails to do so.

If by now you’re on the \(\Sigma\Delta\) bandwagon and want to incorporate it into any of your projects, feel free to use some MATLAB code I wrote. For recent applications of \(\Sigma\Delta\) quantization in compressed sensing, I’ll shamelessly promote this work on recovering low rank matrices with quantization and this work which uses \(\Sigma\Delta\) quantization with structured random matrices in compressed sensing.