<h1>Robots &amp; Calculus</h1>
<p><em>2020-11-20</em></p>
<p>In 2017, I built a robot arm from the <a href="https://inmoov.fr/">InMoov project</a> for my multivariable calculus students. This class starts with vectors and ends with differential forms, manifolds, and Stokes’ theorem. I was inspired by my fatigue of qualifying exam courses and the lineage of mathematicians at UGA who also built robot arms (shout out to <a href="https://dmillard.github.io/">David Millard</a> and <a href="https://robotics.uga.edu/cantarella-jason/">Jason Cantarella</a>).</p>
<p>Robotics turns out to be a surprisingly good motivation for a lot of topics in multivariable calculus. I had originally set out to use the robot arm as a motivator for manifolds, since I had a hard time wrapping my head around what a manifold was when I first learned about them. The classic quote by topologist Shmuel Weinberger succinctly describes what most math students come to realize in their first course on manifolds: “manifolds are a bit like pornography: hard to define, but you know one when you see one.” There are plenty of contexts where manifolds naturally appear, such as differential geometry and mathematical physics. Having taught math for 5+ years now, though, I can tell you that very few first and second year math students daydream about differential geometry or mathematical physics. On the other hand, I have found that students do get excited about modern and tangible applications of math.</p>
<table>
<tbody>
<tr>
<td><img src="/assets/images/2020_robot/manifold.jpg" alt="manifold_defn" /></td>
</tr>
<tr>
<td>Fig. 1: The formal definition(s) of a manifold.</td>
</tr>
</tbody>
</table>
<h2 id="what-i-learned-from-building">What I Learned From Building</h2>
<p>I’ll admit I went into this project blindly. I started in January thinking I would have plenty of time to 3D print the pieces, assemble them, and have a demo ready by March. This ended up being an overly optimistic timeline. My responsibilities as a student and a teaching assistant aside, the logistics of 3D printing the pieces were more complicated than I first thought. All my 3D printing happened at UCSD’s <a href="https://library.ucsd.edu/computing-and-technology/digital-media-lab/index.html">Digital Media Lab</a>. This is an incredible service provided through the campus library and it allowed me to print all of my parts at no cost. However, I was competing with lots of engineering students for access to these printers. Furthermore, I could only reserve a printer for 3 hours a day. Anyone who has done any 3D printing can tell you that 3 hours is not that much time to print even modestly sized objects, and some prints can be very error-prone. The print could fail to adhere to the build plate. The heating element in the extruder could drop a few degrees and the filament could clog the extruder. The build plate could be uneven. The extruder could be miscalibrated, or slip. A clumsy passerby could accidentally hit your print. Some prints are awkward for a 3D printer parameterized with Cartesian coordinates (think spheres and thin, curved sheets). Any of these errors could cause issues at any time, especially near the end of a very long print.</p>
<p>The logistics notwithstanding, I ended up learning a lot about 3D printers because I wasn’t guaranteed a dedicated printer. The models that the Digital Media Lab had at the time included a few Makerbot Replicators, an Ultimaker 2+, a Prusa i3 MK2, and an HE3D K280. For the record, when I eventually purchase a 3D printer, the Prusa will be near, if not at, the top of my list. Not only was it the cheapest option among the ones I listed, it also gave some of the highest quality prints for my robot arm. The Ultimaker 2+ came pretty close but the price tag was 3x that of the Prusa. Apparently I’m not alone in thinking that the Prusa is superior. As of the time of this writing the Digital Media Lab has replaced all of their 3D printers with Prusas.</p>
<p>Naturally there were issues even once I had all of the parts printed. Hours were spent carefully sanding pieces so that they fit together just right (“beware the build up of frictional forces” my Digital Media Lab Sensei <a href="https://www.linkedin.com/in/migueldv/">Miguel de Villa</a> often reminded me). The fishing line that the servomotors pulled to contract the fingers might snap or come undone. Even though I hadn’t anticipated any of these issues, a part of me enjoyed these small challenges. These kinds of issues are entirely different from the problems that were occupying my time in my numerical analysis qual class. These hardware issues reminded me to appreciate good engineering.</p>
<table>
<tbody>
<tr>
<td><img src="/assets/images/2020_robot/robot_half_complete.JPG" alt="robot_half_complete" /></td>
</tr>
<tr>
<td>Fig. 2: A near complete robot arm.</td>
</tr>
</tbody>
</table>
<h2 id="what-i-hope-my-students-took-away">What (I Hope) My Students Took Away</h2>
<p>By the time the calculus class got to manifolds I had finished printing the right arm for the InMoov robot and had the left arm half assembled. The right arm was sufficient to talk to my students about a lot of concepts they learned over the year long sequence. We spent time talking about how manifolds model configuration spaces for robots, and how vector fields and a little bit of electricity and magnetism explain how the servomotors powering the arm operate. We even got to talk a bit about my woes with Cartesian coordinates when 3D printing quadrants of the forearm, which are naturally more cylindrical than cubical.</p>
<p>My students really got a kick out of the robot arm. Unlike the analogous course I took at UGA, this batch of multivariable calculus students had a pretty diverse set of majors, including math, physics, pre-med, biology, and engineering. I never worried about motivating the course for the physics students since physics is comfortably couched in many calculus problems. I did worry about motivating calculus for those students who weren’t pursuing degrees in the physical sciences, or maybe hadn’t yet decided what they wanted to study. I remember being pretty clueless about what I wanted to do up until halfway through undergrad. The only reason I had enrolled in math classes was to satisfy prerequisites for my then physics major classes. It wasn’t until I took numerical analysis at UGA that I understood just how broadly applicable subjects like calculus and linear algebra were. That class was truly magical, in no small part due to how well Jason Cantarella designed it. It was project based, and the students competed against one another. Our optimization project had us code up various non-convex optimization algorithms to find the shortest path across an actual mountain range parameterized by topographic coordinates. Our differential equations project was to solve for the initial firing velocity and the angle at which to <a href="https://www.youtube.com/watch?v=58MmOpSm4LY">fire a rail gun</a> from Pearl Harbor to hit a boat many miles west in the Pacific. We ended our numerical linear algebra studies by designing an image compression algorithm (<a href="https://fredhohman.com/">my friend Fred</a> won that one). Learning the finite element method, Nelder-Mead, and the singular value decomposition was one thing. Using those tools to solve legitimately interesting problems stirred my spirit.</p>
<p>So I guess I was trying to recreate for my students that sense of wonder I had in undergrad. Like great teaching, that sense of wonder is hard to quantify. If I happened to inspire, however briefly, anyone in that quarter of MATH 31CH then I’ll consider my mission accomplished.</p>
<h1>Brain Waves</h1>
<p><em>2020-10-05</em></p>
<p>I was invited to work with <a href="https://voyteklab.com/">Brad Voytek’s lab</a> in the Cognitive Science department during the summer of 2020. Voytek Lab consists of 16 lab members ranging from undergraduate researchers to postdocs. Their interests are varied, but they primarily study neural oscillations from a computational neuroscience and experimental approach. In particular, they’re interested in quantifying various statistics of neural oscillations and discerning whether certain statistics correlate with physiological traits.</p>
<p>Brad contacted me to see if I could help with a few things in their Python package <a href="https://neurodsp-tools.github.io/neurodsp/">NeuroDSP</a>. Beyond doing a basic technical audit for their digital signal processing toolkit, other ideas that were tossed around were looking at phase spectra of neuro signals, burst detection (eerily <a href="https://elybrand.github.io/indoor_localization/">reminiscent of the indoor localization project</a>), and fleshing out their simulations for unit testing and other methods comparisons. I ended up spending most of my time on the latter, partly because I quickly realized that neuro signals exhibit incredibly complex structure.</p>
<h2 id="what-you-can-and-cant-learn-from-the-periodogram">What You Can and Can’t Learn from the Periodogram</h2>
<p>Any course in graduate time series analysis will spend lots of time analyzing ARMA models and stationary time series. In my case, the argument for reducing from nonstationary to stationary time series didn’t even take a whole lecture and effectively amounted to centering the time series, and then looking at the periodogram and subtracting off cosines until the spectrum lacks any obvious peaks. If you want to get fancy, you could fit local linear models for trend estimation instead of assuming a constant trend.</p>
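As a toy illustration of that classical recipe, here is a minimal NumPy sketch: center the series, remove a linear trend, then scan the periodogram for peaks. The signal below is synthetic and chosen purely for convenience, not data from any real analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
t = np.arange(n)
f0 = 50 / n                                    # an oscillation sitting on an exact DFT bin
x = 0.01 * t + np.cos(2 * np.pi * f0 * t) + 0.5 * rng.standard_normal(n)

# center and remove a linear trend before looking at the spectrum
trend = np.polyval(np.polyfit(t, x, 1), t)
resid = x - trend

# periodogram of the detrended series; the peak lands at the oscillation frequency
psd = np.abs(np.fft.rfft(resid)) ** 2 / n
freqs = np.fft.rfftfreq(n)
peak = freqs[1:][np.argmax(psd[1:])]           # skip the DC bin
```

In the idealized setting of a single stationary cosine plus white noise, one pass of this procedure is enough; the rest of the post is about why real neural data is not this kind.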
<p>Forget for a moment that any interesting time series is interesting precisely because of its nonstationarities. This by now classical signal processing approach of looking at the periodogram for indications of oscillatory behavior is not without its strengths, but it certainly falls short in many contexts. Its strength of course is all of the theory that comes with the Fourier transform. All periodic signals can be decomposed as sums of sines and cosines, and trigonometric oscillations like sine and cosine are easy to spot as spikes in the periodogram. Issues arise when you have sharp, nearly discontinuous oscillations like a square wave or a sawtooth wave. Looking only at the periodogram, you’ll find peaks at harmonics of the fundamental frequency of these waves, a broadband effect of but a single oscillating signal. The problem in practice, though, is that you don’t get to see the full spectrum. The frequency range you observe is limited by your sampling rate, so as an inverse problem you’re left to wonder: given this periodogram with spikes at harmonic frequencies, am I looking at a finite sum of cosines, or something else?</p>
<p>The more glaring issue is that the periodogram ignores half of the information provided by the Fourier transform, namely phase. Without it, you’ll have a hard time telling the difference between an impulse (a Dirac spike) and pure white noise. That these two signals have radically different temporal structures yet qualitatively the same power spectrum should raise alarms about ignoring phase data. That phase is just as critical as amplitude is evidenced by the fact that <a href="https://en.wikipedia.org/wiki/Phase_retrieval">phase retrieval</a> is an area of active research to this day.</p>
<table>
<tbody>
<tr>
<td><img src="/assets/images/2020_voytek/dirac_vs_noise.png" alt="dirac_vs_whitenoise" /></td>
</tr>
<tr>
<td>Fig. 1: White noise and an impulse both have flat power spectra, yet their temporal structure and their phase distributions are quite different.</td>
</tr>
</tbody>
</table>
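The contrast in Fig. 1 takes only a few lines of NumPy to reproduce: the impulse’s amplitude spectrum is exactly flat and its phase identically zero, while white noise is flat only on average with scattered phases.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
impulse = np.zeros(n)
impulse[0] = 1.0                       # a Dirac spike at the origin
noise = rng.standard_normal(n)         # white noise

# amplitude spectra: exactly flat for the impulse, flat only on average for noise
imp_mag = np.abs(np.fft.fft(impulse))
noise_mag = np.abs(np.fft.fft(noise))

# phase spectra: identically zero for the impulse, scattered for noise
imp_phase = np.angle(np.fft.fft(impulse))
noise_phase = np.angle(np.fft.fft(noise))
```

Looking at `imp_mag` versus `noise_mag` alone, the two are qualitatively the same; the phases are where the temporal structure hides.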
<h2 id="whats-your-favorite-anagram-of-fractal-noise">What’s Your Favorite Anagram of Fractal Noise?</h2>
<p>The examples I’ve raised so far are ultimately non-issues in the absence of noise. Even ignoring phase, if you looked at the time series data itself you could discern between sums of cosines and a square wave, an impulse and white noise. Noise obfuscates all of this, particularly neural noise. When I started working with Brad’s then Ph.D. student <a href="https://tomdonoghue.github.io/">Tom Donoghue</a>, we spent a lot of time talking about “the aperiodic component” of neuro signals. There is no precise definition of what the aperiodic component is, but empirical evidence suggests that measurements of brain activity from, say, EEG readings, tend to feature a signal component, believed to be noise, whose power spectrum obeys a simple power law
\[|\hat{x}(\omega)|^2 \propto \frac{1}{\omega^{\chi}}\]
or a multiple power law
\[|\hat{x}(\omega)|^2 \propto \frac{1}{\omega^{\chi_1}(\omega^{\chi_2} + k)}.\]</p>
<p>Mother nature seems to have a knack for power laws, so I guess this shouldn’t be terribly surprising. They appear in various laws of physics, in <a href="https://en.wikipedia.org/wiki/Pareto_distribution">hydrological data, atmospheric data, distribution of wealth and settlement sizes</a>. As a statement about power spectra, there’s nothing immediately concerning here. However, under the assumption that the aperiodic component is a random process, the complexity of neural noise begins to rear its head in the autocorrelation function. Provided the time series is stationary, the autocorrelation function is a univariate function of the lag and is related to the power spectrum via the Fourier transform. As it so happens, Fourier maps power laws to power laws, and autocorrelation functions that obey a power law are known as “long memory processes.” Curiously, long memory processes are intimately related to self-similar processes which exhibit fractal like behavior.</p>
<p>By this point I had begun to grasp the gravity of the problem of detecting neural oscillations. Beyond dealing with time varying oscillations like bursts and potentially non-sinusoidal oscillations, you’re mixing these signals with a fractal noise process whose autocorrelation function may not even be absolutely summable or, worse, may not even exist due to nonstationarity. Both cases are alarming because the periodogram may not even be a <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3377491">consistent estimator of the power spectrum and discrete simulations of these kinds of processes introduce a host of subtle, but critical issues</a>.</p>
<p>I’ll admit that, when I was building simulations for neural noise, I was not fully aware of the kinds of complications and subtleties that arise based on discretization and what statistics you choose to match. For example, it’s fairly straightforward to construct a simulation of a fractional gaussian noise process whose empirical autocorrelation function is close to the true autocorrelation function. It turns out, however, <a href="https://iopscience.iop.org/article/10.1088/0967-3334/23/1/201/meta?casa_token=RzseljC4TgQAAAAA:gPve6M-KSufbS31cGSuYfSZPUTpeI1mqFa2PpyVeVBEOKyONy_1XAxmKgKPbovcw6hykIWnx">that this introduces bias in the empirical periodogram</a>. Nearly all of the computational neuroscience literature I came across assumed without proof that time series whose periodogram exhibited a power law either fell into the fractional gaussian noise model or the fractional brownian motion model depending on the exponent of the power law, so I spent my time understanding these processes. I had a hard time finding a proof of such a claim in a mathematical context. I later found out that there is a litany of processes which also exhibit power spectra with power laws such as <a href="https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.59.2503">chaotic Hamiltonian systems</a>, <a href="https://iopscience.iop.org/article/10.1209/0295-5075/26/8/003/pdf">periodically driven bi-stable systems</a>, <a href="https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.117.080601">running maxima of Brownian motion</a>, <a href="https://pubs.acs.org/doi/pdf/10.1021/acs.nanolett.5b04372">ionic nanopore currents</a>, to name a few. A lot of scientific research has been done on the brain, and yet we still know so little about the generating process of neural oscillations and neural noise. Without an informing prior, it’s hard to choose a generating process for simulations.</p>
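For reference, fractional gaussian noise can be sampled exactly by circulant embedding of its autocovariance (the Davies-Harte construction). The sketch below is my own bare-bones version of that construction, written for this post rather than taken from NeuroDSP.

```python
import numpy as np

def fgn_autocov(k, H):
    """Autocovariance of fractional gaussian noise at integer lags k (unit variance)."""
    k = np.abs(np.asarray(k, dtype=float))
    return 0.5 * ((k + 1) ** (2 * H) - 2 * k ** (2 * H) + np.abs(k - 1) ** (2 * H))

def simulate_fgn(n, H, seed=0):
    """Exact fGn sample via circulant embedding of the covariance matrix."""
    rng = np.random.default_rng(seed)
    gamma = fgn_autocov(np.arange(n), H)
    # first row of the circulant embedding, length 2n - 2
    row = np.concatenate([gamma, gamma[-2:0:-1]])
    lam = np.fft.fft(row).real      # eigenvalues of the circulant matrix
    lam = np.maximum(lam, 0.0)      # clip tiny negative round-off
    m = len(row)
    z = rng.standard_normal(m) + 1j * rng.standard_normal(m)
    # the real part of this FFT has exactly the target covariance
    return np.fft.fft(np.sqrt(lam / m) * z).real[:n]

x = simulate_fgn(4096, H=0.7)
```

For \(H > 1/2\) the samples are positively correlated at short lags, which is the long-memory regime discussed above; for fractional brownian motion one would cumulatively sum the increments.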
<h2 id="is-it-all-just-central-limit-theorem">Is It All Just Central Limit Theorem?</h2>
<p>Realizing how deep and cavernous this rabbit hole was, I decided to keep my simulations pretty simple. NeuroDSP now features simulations for fractional gaussian noise and fractional brownian motion. Beyond that, it also has simulations to construct signals with an arbitrary power spectrum shape by summing a bunch of weighted cosines with uniform random phase shifts. There’s nothing deep about this sum-of-sines approach, but its simplicity allows for precise parametrization of power spectra which is very useful for accuracy testing various computational neuroscience methods that claim to detect aperiodic signal statistics based off the periodogram. NeuroDSP had two other ways of simulating aperiodic noise using filtered white noise. At the end of the summer I decided to look at what the qualitative differences between these various simulations were in the Fourier domain. Much to my surprise, the differences were very hard to spot from amplitude, phase, and time series data alone.</p>
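That sum-of-sines construction fits in a few lines. Here is a toy version (not NeuroDSP’s implementation): amplitudes at the DFT bin frequencies are chosen so that the periodogram follows \(1/f^{\chi}\) exactly, and phases are drawn uniformly at random.

```python
import numpy as np

def sim_powerlaw_sines(n, chi, seed=0):
    """Signal whose periodogram follows 1/f^chi, built from random-phase cosines."""
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    freqs = np.arange(1, n // 2) / n          # DFT bin frequencies, skipping DC and Nyquist
    amps = freqs ** (-chi / 2)                # power = amplitude^2 ∝ 1/f^chi
    phases = rng.uniform(0, 2 * np.pi, freqs.size)
    args = 2 * np.pi * np.outer(freqs, t) + phases[:, None]
    return (amps[:, None] * np.cos(args)).sum(axis=0)

x = sim_powerlaw_sines(2048, chi=2.0)

# since every cosine sits on an exact DFT bin, the log-log slope recovers -chi
f = np.arange(1, 2048 // 2) / 2048
psd = np.abs(np.fft.rfft(x))[1:2048 // 2] ** 2
slope = np.polyfit(np.log(f), np.log(psd), 1)[0]
```

Because each cosine lands on an exact bin, the spectral shape is matched exactly rather than approximately, which is what makes this simulation useful as ground truth for accuracy testing.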
<table>
<tbody>
<tr>
<td><img src="/assets/images/2020_voytek/sines_synapse.png" alt="sines_vs_synapse" /></td>
</tr>
<tr>
<td>Fig. 2: Sum of sinusoids signal plotted against a simulation of a synaptic current. Both feature multiple power law structure in the periodogram, also known as multifractal behavior.</td>
</tr>
<tr>
<td><img src="/assets/images/2020_voytek/fgn_rotation.png" alt="fgn_vs_rotation" /></td>
</tr>
<tr>
<td>Fig. 3: Fractional gaussian noise plotted against spectrally rotated noise. The latter is generated by applying a rotation to white noise in the Fourier domain, taking the inverse Fourier transform, and taking the real part of that complex-valued time series.</td>
</tr>
</tbody>
</table>
<p>Part of this was by construction, I guess. The fact that all of the phase distributions came out uniform was a bit of a surprise, at least in the fractional noise settings. I couldn’t find any resources that gave analytic distributions for the phase of fractional gaussian noise or fractional brownian motion, but I didn’t look too deeply since I was trying to stay away from rabbit holes with the limited time I had. I can see it for fractional gaussian noise, since having an absolutely summable autocorrelation function ensures that the coordinates in the Fourier series are approximately independent complex gaussians because of the central limit theorem. Of course, it could be the case that in the setting where the autocorrelation function is not absolutely summable, a generalized central limit theorem is kicking in and the limiting distribution is some Lévy stable distribution that is not gaussian.</p>
<p>By the time my contract with Voytek Lab ended, I was left with more questions about long memory processes than I came in with. As is now clear to me, the implications of understanding these kinds of stochastic processes better reach beyond a mathematical breakthrough. That’s a big reason why I like working with other scientists. Beyond the excitement of collaborating with other people, it’s easy to slip into the misleading mentality that most of the practical issues of math have been sorted out and the rest is just engineering. After all, humanity has had a long time to think about math ever since the Babylonians stumbled upon arithmetic. This is particularly true if you spend most of your time proving theorems about other people’s problems like phase retrieval, blind deconvolution, and deep learning, all while staying behind the closed doors of the math department. Standing in stark contrast to this is the wild west of fields like Cognitive Science, which are quite young in comparison, incredibly interdisciplinary, and asking fundamental questions about how our brains work. If anything, the fact that cognitive science is grappling with complex problems like fractal time series should be a call for more collaboration between mathematicians and other fields of science.</p>
<h1>An Overview of Blind Deconvolution</h1>
<p><em>2019-06-02</em></p>
<h2 id="motivation">Motivation</h2>
<p>I remember as a first year student at UGA I was enrolled in a multivariable mathematics
class. At the time that class was impossibly difficult for me, but looking back I’m
grateful for enrolling in it. A lot of key ideas that I would see later in my academic
career had seeds that were planted during that first year. For posterity, the lectures of this course were recorded and made <a href="https://www.youtube.com/channel/UCp9W-et2Zbx7u5_VMiXGtPQ">publicly available on YouTube</a> courtesy of Patty Wagner, yours truly, and a few of my then colleagues. The professor of that course,
<a href="https://www.math.uga.edu/directory/people/theodore-shifrin">Ted Shifrin</a>, who is a good friend and mentor of mine now, is famously known by his students for incorporating geometry into a mathematical discussion whenever he finds the chance to do so. An extreme example of this is the abstract algebra text he authored, titled “Abstract Algebra: A Geometric Approach.”</p>
<p>I particularly remember working on some homework problems involving conic sections and coming across
a problem which outlined the mathematics behind what are known as “<a href="https://www.mathematicalsynergistics.com/gallery/2d-whispering-gallery/">whispering galleries</a>.”
Effectively these are rooms in the shape of an ellipse where two speakers can whisper to
one another from across the room and hear each other perfectly clearly. This is possible due
to the nature of how sound waves propagate and reflect off surfaces. If you click on the
link, you’ll see a neat gif illustrating this phenomenon.</p>
<p>It’s worth re-emphasizing that whispering galleries only work due to the very special geometry of the room. Anyone who has tried having a conversation with a friend or colleague from across any normal room knows that a whisper likely isn’t going to be heard. On the other extreme one can imagine being inside a large vacant room and shouting towards the other person. The problem with this latter scenario is that, depending on the material that the room is composed of, you’re likely going to hear a lot of echoes which makes discerning what was said difficult.</p>
<p>If you’ll humor me, imagine someone gives you a recording of a person talking in some unknown room with lots of echoes and reverberations and asks you what was said. You might be able to piece together bits of it, but if the echoes are bad enough the recording might be entirely uninterpretable. This problem of getting rid of echoes is formally known as <em>blind deconvolution</em>.</p>
<p>The above example may seem artificial but, in fact, the scope of problems which can be classified as blind deconvolution is expansive. One important example is medical and astronomical imaging where the goal is to deblur a noisy image. Another example is communications engineering such as for the Internet of Things and for 5G networks where multiple devices are communicating with a single base station.</p>
<h2 id="a-mathematical-model">A Mathematical Model</h2>
<p>Let’s return to the example of a person speaking to a single microphone in a room with poor acoustic design. As the person speaks the sound waves from the speech emanate across the room, echoing off walls, and after some time reach the microphone. Let’s call the function encoding the amplitude of the sound wave of speech \(x(\cdot)\), where the argument of the function is time.</p>
<p>Now, if the microphone and the person were in a room with no wall then the microphone at time \(t\) would simply hear \(x(t)\) or perhaps a delayed signal \(x(t-\tau)\) accounting for the time it takes for the sound wave to reach the microphone. When we add walls the picture is much more complicated.</p>
<table>
<tbody>
<tr>
<td><img src="/assets/images/2019_blind_deconvolution/bdc_nowall.jpg" alt="nowall" /></td>
</tr>
<tr>
<td>Fig. 1: A cartoon illustrating how sound propagates in free space.</td>
</tr>
<tr>
<td><img src="/assets/images/2019_blind_deconvolution/bdc_1wall.jpg" alt="1wall" /></td>
</tr>
<tr>
<td>Fig. 2: A cartoon illustrating how sound propagates in a space with one wall. Notice there will be a single echo off this wall reflecting back to the microphone.</td>
</tr>
<tr>
<td><img src="/assets/images/2019_blind_deconvolution/bdc_4wall.jpg" alt="4wall" /></td>
</tr>
<tr>
<td>Fig. 3: A cartoon illustrating how sound propagates in an enclosed room with a rectangular layout. Only the direct propagation and “single echo” propagations are sketched, but in general sound can echo off many walls before reaching the microphone. This is the infamous multipath propagation phenomenon that made the <a href="https://elybrand.github.io//indoor_localization/">indoor localization project</a> so challenging.</td>
</tr>
</tbody>
</table>
<p>When sound waves echo off a surface they will attenuate or dampen as a function of the material they echo off of. Let’s model that by multiplication by some scalar. That is, after reflecting off of some fixed wall the echoed sound wave is now \(a \cdot x(\cdot)\). So if the person and the microphone were in an open space with one wall nearby then at time \(t\) the microphone would hear the sound travelling directly from the speaker \(x(t)\) but also the sound coming from the echo off the wall of an earlier part of speech \(a \cdot x(t-\tau)\), where the shift comes from the time it takes the sound wave to travel the extra distance from the speaker to the wall and the wall to the microphone. In other words, if we denote the amplitude of the soundwave received by the microphone at time \(t\) as \(y(t)\), we have \(y(t) = x(t) + ax(t-\tau)\).</p>
<p>Now when there are multiple walls there are many ways for the sound wave \(x(\cdot)\) to echo off and reflect back to the microphone. Technically speaking the sound wave could echo off an arbitrarily large number of walls before reaching the microphone. In effect that means at time \(t\) the microphone would receive</p>
<p>\[ \begin{align}
y(t) = \sum_{i=1}^{n(t)} a_i x(t - \tau_i).
\end{align}\]</p>
<p>I’m a digital signal processor, so let’s assume that we’ve discretized time and that our original signal \(x(\cdot)\) is now represented by a vector in \(\mathbb{R}^L\). If we store our attenuation coefficients, or our filter, in a vector \(a \in \mathbb{R}^L\) then we can rewrite the equation above as</p>
<p>\[ \begin{align}
y_j = \sum_{i=1}^{L} a_i x_{i\ominus j} = (a * x)_j,
\end{align}\]</p>
<p>where \(\ominus\) denotes circulant subtraction modulo \(L\) and \(*\) denotes circulant convolution.</p>
<p>So this explains part of the reason why this problem is called blind deconvolution. The reason for the word “blind” is that a priori <em>we know neither \(a\) nor \(x\)</em>! We don’t know \(a\) because we don’t have information regarding the composition of the room, and we don’t know \(x\) because that’s the speech we’re trying to retrieve from the recording \(y\).</p>
<h2 id="a-compressed-sensing-approach">A Compressed Sensing Approach</h2>
<p>Whenever mathematicians see convolutions, their spidey senses immediately tingle, telling them that the natural thing to do is take a Fourier transform. Indeed, any first year math graduate student who has taken real analysis can tell you that the Fourier transform turns convolution into point-wise multiplication, namely</p>
<p>\[ \begin{align}
\hat{y}_j := (Fy)_j = (Fa)_j (Fx)_j =: (\hat{a} \odot \hat{x})_j.
\end{align}\]</p>
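This identity is easy to sanity-check numerically. The snippet below uses the standard indexing convention \((a * x)_j = \sum_i a_i x_{j \ominus i}\), which is the one NumPy’s FFT diagonalizes; it also checks that rescaling the two factors by \(\alpha\) and \(\alpha^{-1}\) leaves \(y\) untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 64
a = rng.standard_normal(L)    # the unknown filter
x = rng.standard_normal(L)    # the unknown signal

# circular convolution computed directly from the definition
y = np.array([sum(a[i] * x[(j - i) % L] for i in range(L)) for j in range(L)])

# the Fourier transform turns circular convolution into pointwise multiplication
y_hat = np.fft.fft(a) * np.fft.fft(x)
direct = np.fft.fft(y)

# the scaling ambiguity: (alpha * a_hat) ⊙ (x_hat / alpha) produces the same y
alpha = 3.0
y_scaled = np.fft.ifft((alpha * np.fft.fft(a)) * (np.fft.fft(x) / alpha)).real
```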
<p>Even with this nice reformulation of the problem it’s still viciously ill-posed for a couple of reasons. First, if \(\alpha\) is any non-zero complex number, then \(\hat{y} = (\alpha \hat{a}) \odot (\alpha^{-1} \hat{x})\). This ambiguity is inevitable so the best we could do is hope to recover our vectors up to scaling.</p>
<p>More concerning is the fact that we have \(L\) equations yet \(2L\) unknowns. One might expect that making sparsity assumptions on \(a, x\) would resolve the ill-posedness of the problem. However, this opens another can of worms. The set of \(k\)-sparse vectors in the canonical basis is invariant under shifting of coordinates and these shifts viciously cooperate with the convolution operator in the following way. If \(S_{\tau}\) denotes the circulant shift-by-\(\tau\) operator, then</p>
<p>\[ \begin{align}
y_j = \sum_{i=1}^{L} a_i x_{i\ominus j} = \sum_{i=1}^{L} a_{i\ominus \tau} x_{i\ominus (j+\tau)} = (S_{\tau} a * S_{\tau}x)_j.
\end{align}\]</p>
<p>In other words, under canonical-basis sparsity assumptions we can only hope to recover our vectors up to scaling and circulant shifts.</p>
<p>It’s for these reasons that it is often assumed \(a, x\) lie in known low-dimensional subspaces, i.e. \(\hat{a} = Bh\) and \(\hat{x} = Aw\), where \(B \in \mathbb{C}^{L\times K}\) and \(A \in \mathbb{C}^{L\times N}.\) Under these assumptions, letting \(b_{\ell}, a_{\ell}\) denote the rows of \(B,A\) respectively, we have</p>
<p>\[ \begin{align}
\hat{y}_{\ell} = b_{\ell}^* h w^* a_{\ell} = \langle hw^* , b_{\ell}a_{\ell}^* \rangle,
\end{align}\]</p>
<p>where the above inner product is the Frobenius or Hilbert-Schmidt inner product. In other words, we’ve now realized the blind deconvolution problem as a linear inverse problem for the rank 1 matrix \(hw^*\). Low-rank matrix recovery is by now a well-studied problem, and compressed sensing tells us that a rank 1 \(K\times N\) matrix can be recovered from \(O(K+N)\) random linear measurements with high probability by solving</p>
<p>\[ \begin{align}
\min_{Z} \quad & \|Z\|_* \newline
\textrm{s.t.} \quad & \langle A_j, Z \rangle = \langle A_j, X \rangle, \,\, \text{for all } j.
\end{align}\]</p>
<p>Here \( \|\cdot\|_*\) denotes the nuclear norm of a matrix and the \(A_j\) are typically random matrices, e.g. Gaussians. Our problem doesn’t quite fit this mold, however. Indeed, even under assumptions of randomness on our vectors \(b_{\ell}, a_{\ell}\) we have to account for the fact that we are given <em>rank 1</em> linear “measurements”. Analogues of the compressed sensing result above exist in this setting, fortunately.</p>
<p>The reason I write “measurements” is because the blind deconvolution problem is different from the usual signal acquisition paradigm that I work under in that we aren’t really acquiring measurements of the matrix \(hw^*\). Rather, we’re assuming something about the embedding dimension of our matrices \(A, B\) encoding the subspace constraints. More succinctly, we trade oversampling for over-parametrization.</p>
<p>As usual, a lot of work goes into relaxing the randomness assumptions. It seems unrealistic to assume that our original signals \(a, x\) are, say, Gaussian embeddings of lower dimensional vectors. To the best of my knowledge, the best that we can currently do is assume that one of the embedding matrices, say \(A\), is Gaussian and the other satisfies some incoherence properties with the vector \(h\). Recall that our measurements are of the form \(b_{\ell}^* h w^* a_{\ell}\). If we are to allow deterministic vectors \(b_{\ell}\), then we have to ensure that they capture enough “information” about \(h\). For example, we’d be toast if the vectors \(b_{\ell}\) all happened to be orthogonal or near-orthogonal to the vector \(h\). For this reason the amount of over-parametrization needed to recover \(hw^*\) necessarily depends on the parameter \( \mu_{h}^2 := \max_{\ell \in [L]} | \langle b_{\ell} , h \rangle |^2 \). Intuitively speaking, the larger \(\mu_{h}^2\) is the less diffuse \(Bh\) is.</p>
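<p>To make the coherence parameter concrete, here is a small numpy illustration (dimensions and seed are arbitrary): for generic random rows \(\mu_h^2\) stays bounded away from zero, while rows engineered to be orthogonal to \(h\) drive it to zero, which is exactly the failure mode described above.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
L, K = 128, 8

h = rng.standard_normal(K)
h /= np.linalg.norm(h)

def mu_sq(B, h):
    """Coherence mu_h^2 = max over rows b_l of |<b_l, h>|^2."""
    return np.max(np.abs(B @ h) ** 2)

# Generic random rows: correlations with h are spread out but nonzero.
B_random = rng.standard_normal((L, K)) / np.sqrt(L)

# Adversarial rows, all (numerically) orthogonal to h: the rows carry
# essentially no information about h, and mu_h^2 collapses to zero.
Q = np.linalg.qr(np.column_stack([h, rng.standard_normal((K, K - 1))]))[0]
B_orth = rng.standard_normal((L, K - 1)) @ Q[:, 1:].T / np.sqrt(L)

print(mu_sq(B_random, h))  # strictly positive
print(mu_sq(B_orth, h))    # numerically zero
```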
<h2 id="existing-work-and-outstanding-questions">Existing Work and Outstanding Questions</h2>
<p>To the best of my knowledge, the first people to cast blind deconvolution in the compressed sensing context were <a href="https://arxiv.org/pdf/1211.5608.pdf">Ali Ahmed, Ben Recht, and Justin Romberg</a>. They cast the problem as we saw above using nuclear norm minimization. The perk of this is that we can import compressed sensing results and use a convex program to get exact recovery of \(hw^*\) with high probability provided \(L \geq O(\mu_h^2 (K+N) \log^3(L))\).</p>
<p>While the reconstruction algorithm is easy to formulate, for problems where \(L\) is large running nuclear norm minimization can be quite slow. Mimicking the approach of the Wirtinger Flow algorithm, <a href="https://arxiv.org/pdf/1606.04933.pdf">Thomas Strohmer and coauthors</a> proved that one may recover \(hw^*\) when \(A\) is Gaussian and \(B = F\,\,\begin{bmatrix}I_K &0\end{bmatrix}^*\) using a non-convex penalized least squares algorithm provided \(L \geq O(\mu_h^2(K+N)\log^2(L))\). I should add that it was Strohmer who first introduced me to this problem when he came to give a talk at UCSD back in the spring of 2018. This paper blends together a lot of very powerful tools in statistical/non-convex optimization, and high dimensional probability. For a nice survey of these techniques in the low-rank matrix setting you can consult <a href="https://arxiv.org/pdf/1711.10467.pdf">this paper</a> by Yuxin Chen and coauthors.</p>
<p>The blind deconvolution literature is still fairly young and I have undoubtedly missed citing many other relevant works and recent developments. Ali Ahmed, John Wright, and Paul Hand come to mind for researchers who have contributed to the blind deconvolution literature. Nevertheless, I want to end this post by outlining some of the outstanding challenges that our current understanding faces.</p>
<p>To the best of my knowledge, every work on the blind deconvolution problem as described above crucially relies on either one or both of the matrices \(A, B\) being Gaussian. Extending the kinds of random ensembles in our model appears to be incredibly non-trivial, as many special features about the Gaussian distribution are used particularly in the non-convex approaches. Furthermore, unlike in the signal acquisition setting where a practitioner might have control over the measurement matrices, the over-parametrized model in blind deconvolution is forced upon us by the nature of the signals in question. More realistic models for our matrices might stem from random wavelet subspaces, or random Fourier subspaces. The problem with these approaches is that these assumptions say something about the <em>columns</em> of our matrix \(A\). If our random matrix \(A\) doesn’t enjoy having independent rows then it’s not clear how compressed sensing techniques are applicable. Despite this, numerical experiments suggest that Strohmer’s algorithm performs well with random wavelet subspaces.</p>
<p>Another issue that I have not yet seen addressed in the literature is how stable the estimates for \(hw^*\) are to model mismatch. Indeed, under this paradigm it’s assumed we know the subspaces parametrized by \(A,B\) exactly. I imagine the non-convex approaches will be particularly sensitive to this, as these rely on a clever initialization using the adjoint of our “measurement operator” acting on \(hw^*\). Proving that these initializations lie in an appropriate basin of attraction around the ground truth usually requires an absurdly large amount of over-parametrization despite the critical lower bound on \(L\) being asymptotically tight in terms of \(K, N, \mu_h^2\). Strohmer et al.’s numerical experiments suggest the gap between how large \(L\) needs to be for the reconstruction to succeed empirically versus how large it needs to be for the theorems to kick in is not negligible.</p>
<p>In fact, I wonder if the spectral initialization they use is necessary. <a href="https://www.princeton.edu/~yc5/publications/random_init_PR.pdf">Recent work</a> by Yuxin Chen and coauthors shows that in the phase retrieval setting one may recover the signal in question from Gaussian measurements using random initialization and gradient descent on a non-convex least squares loss function. The sheer size of this manuscript goes to show how difficult analyzing bilinear inverse problems with non-convex optimization is even under “nice” randomness assumptions.</p>
<h1>Indoor Localization with Wireless Networks</h1>
<p>2018-08-21</p>
<p>During the summer of 2018, I had the opportunity to work for the
telecommunications corporation NEC in Sendai, Japan on a project that
focused on indoor localization using wireless networks. This partnership
was made possible by IPAM’s <a href="https://www.ipam.ucla.edu/programs/student-research-programs/graduate-level-research-in-industrial-projects-for-students-grips-sendai-2018/">GRIPS</a>
program which pairs graduate
students in mathematics with corporations in places like Berlin, Germany
and Sendai, Japan to solve problems in biotechnology, transportation,
and telecommunications. Unlike the usual signal processing paradigm I
work in where the aim is to recover a signal given certain measurements,
localization focuses on recovering the location from which a signal is
transmitted given certain data. Most people are familiar with the Global
Positioning System (GPS), which makes navigating roadways with Google
Maps possible. Most people are also undoubtedly aware of the
shortcomings of GPS in sheltered environments like parking decks or
office buildings. The reason this is the case is that GPS relies on your
smartphone having line of sight, or an unobstructed view, of at least
four satellites used for GPS. When this is not the case, localization
error can be several meters large.</p>
<p>Certainly for most navigation purposes GPS suffices, but it’s also easy
to imagine instances when we need navigation in environments where our
smartphone does not have line of sight with any GPS satellites. For
example, you might wish to find your favorite store in a large indoor
shopping complex. The applications of indoor localization go far beyond
this simple example though. Advertisers in particular are interested in
accurate localization estimates for creating localized advertisements.
Emergency response teams are also interested in localization, as they
could find victims trapped inside a partially destroyed or obstructed
building. Perhaps the most impactful application of indoor localization
is with the internet of things (IoT). NEC is most interested in this
setting, as they work with security firms and production facilities who
are interested in automating certain tasks.</p>
<p>There is a rich literature on indoor localization. <a href="https://ieeexplore.ieee.org/document/4343996/">Liu</a> and
<a href="https://ieeexplore.ieee.org/document/7782316/">Davidson</a> give a nice summary of what has been done so far. A
non-exhaustive list of the most prominent data that one can use for
localizing includes time of flight (ToF), angle of arrival (AoA), and
received signal strength (RSS). ToF is, as you might expect, the time
it takes for the signal to travel from the transmitting device to the
receiver. AoA is the angle that the transmission makes relative to the
receiving antenna(s). Received signal strength is, as the name suggests,
a measure of signal strength that the receiver obtains from a particular
transmission, typically from a wireless access point or router. There
are other technologies that one can use such as <a href="https://www.usenix.org/system/files/conference/nsdi16/nsdi16-paper-vasisht.pdf">channel state
information</a> or Bluetooth. However, current wireless network
infrastructure limits our options down to RSS. RSS is ubiquitous because
any device adhering to the IEEE 802.11 protocol embeds RSS data in the
<a href="https://en.wikipedia.org/wiki/Network_packet">packets</a>, the basic units of wireless communication, that it sends.</p>
<p>The problem with RSS is that it is an incredibly sensitive variable.
First and foremost, it’s not a standardized measurement in terms of
units. Different devices can have different units for RSS. Further
complicating matters is that signals transmitted in indoor environments
are prone to a variety of environmental factors which can distort RSS at
a surprising scale. <a href="https://en.wikipedia.org/wiki/Multipath_propagation">Multipath propagation</a>, or the phenomenon
where a signal “bounces” off the walls, can cause a signal to interfere
with itself in non-obvious ways. Not only does it depend on the geometry
of the indoor environment, but the composition of materials also
matters. Different materials can cause a signal to attenuate at
different rates. This is particularly important when a signal
transmission does not have line of sight with a receiving antenna. How
the transmitting antenna is oriented, or how you hold your smartphone,
with respect to the receiving antenna also has a non-trivial impact on
RSS values.</p>
<p>To the best of my knowledge, there are two methods using RSS which allow
one to estimate the location of a signal transmission. The first is a
method known as fingerprinting. Fingerprinting is a method which
consists of an offline phase and an online phase. During the offline
phase, RSS measurements are recorded at predetermined locations called
reference points in a particular indoor setting. These measurements
along with their distance to each wireless access point are then stored
in a database which is sometimes referred to as a “radio map”.
During the online phase, RSS values are compared to those
stored in the database and the position of a transmission is estimated
based on which database entry is most similar to the online measurement.
These measurements are understandably very tedious and expensive to
collect. Further, the database of RSS measurements degrades in quality
as the indoor environment changes. Nevertheless, fingerprinting appears
to be the state of the art in terms of localization methods which rely
solely on RSS.</p>
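<p>The online phase of fingerprinting reduces to a nearest-neighbor search against the radio map. Here is a minimal sketch; the radio map, access point count, and live reading below are all made up for illustration:</p>

```python
import numpy as np

# Hypothetical radio map: RSS (dBm) from 3 access points at 4 reference points.
reference_points = np.array([[0.0, 0.0], [0.0, 4.0], [4.0, 0.0], [4.0, 4.0]])
radio_map = np.array([
    [-40.0, -65.0, -70.0],
    [-60.0, -45.0, -72.0],
    [-63.0, -70.0, -48.0],
    [-68.0, -55.0, -52.0],
])

def fingerprint_locate(rss_online, radio_map, reference_points):
    """Online phase: return the reference point whose stored RSS vector
    is closest (in Euclidean distance) to the live measurement."""
    idx = np.argmin(np.linalg.norm(radio_map - rss_online, axis=1))
    return reference_points[idx]

# A live reading that most resembles the second reference point.
print(fingerprint_locate(np.array([-58.0, -47.0, -70.0]), radio_map, reference_points))
```

Real systems use more robust similarity measures and interpolate between reference points, but the offline/online split is the same.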
<p>The other paradigm of localizing with RSS includes approaches which use path
loss models. Path loss models are motivated by the physics of signal
attenuation in simple environments. To give a specific example, the
Friis free space equation says that the power of a transmission with no
multipath propagation and with line of sight decays according to the law
\[ \begin{align} P_{loss} = \left(\frac{c}{4\pi f d}\right)^2,
\end{align}\] where \(c\) is the speed of light, \(f\) is the
frequency of the transmission, and \(d\) is the distance the
transmission travels. More general path loss models might incorporate
random variables to model signal <a href="https://en.wikipedia.org/wiki/Fading">fading</a> or might adjust the
exponent in the above expression to account for cases when there is no
line of sight or when there is multipath propagation. For reasonable
devices, such as the Raspberry Pi 3, where RSS is a function of the
signal power it’s easy to use this formula to estimate the distance a
transmission from a wireless access point travels. The simplicity of
path loss models is both a blessing and a curse. First and foremost,
they require very little calibration to specific indoor environments.
Distance estimation is also very easy to compute with just a little bit
of algebra. However, path loss models often understate the complexity of
indoor environments and consequently suffer localization errors of
several meters.</p>
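<p>As a concrete illustration, the Friis equation is trivial to evaluate and to invert for distance. The sketch below assumes line of sight and no multipath, as the equation requires; the 2.4 GHz carrier is just an example value:</p>

```python
import math

C = 299_792_458.0  # speed of light, m/s

def friis_loss(d, f):
    """Free-space power ratio P_received / P_transmitted at distance d (m)
    and carrier frequency f (Hz): (c / (4 pi f d))^2."""
    return (C / (4.0 * math.pi * f * d)) ** 2

def distance_from_loss(p_loss, f):
    """Invert the Friis equation to recover the distance travelled."""
    return C / (4.0 * math.pi * f * math.sqrt(p_loss))

# Round trip: a 2.4 GHz transmission over 10 m.
d = distance_from_loss(friis_loss(10.0, 2.4e9), 2.4e9)
print(d)  # 10.0 (up to floating point)
```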
<h2 id="a-dynamic-path-loss-model">A Dynamic Path Loss Model</h2>
<p>The laundry list of factors which influence RSS measurements along with
the difficulty of getting sub-meter localization error with the above
methods has led many researchers to conclude that the future of indoor
localization will likely rely on additional forms of data to estimate a
transmission’s position. Nevertheless, NEC tasked my colleagues and me
with producing a new approach to indoor localization using RSS
measurements in just 2 short months. Our group recognized that
fingerprinting was ultimately an unsustainable solution so we decided to
focus on improving the shortcomings of path loss models. In our case,
NEC provided us with Raspberry Pi 3’s and a particular router which gave
us RSS measurements in dBm. This was nice since signal power and RSS in
this case are related by the simple transformation \(RSS =
10\log_{10}(P)\). Skipping some details which you can find in the
surveys linked to above, this meant that our basic path loss model at time \(t\) is of
the form
\[ \begin{align} RSS(t) = T_x - 10n\log_{10}(d(t)), \end{align}\]
where \(T_x\) is the transmission power in dBm from 1 meter’s
distance and \(n\) is the path loss exponent which is to be chosen
based on the indoor environment. Engineers seem to have some <a href="http://www.wirelesscommunication.nl/reference/chaptr03/indoor.htm">rule of
thumb</a> which governs how to choose \(n\) in various contexts.
However, existing models choose \(n\) once and it is fixed thereafter.</p>
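<p>Under this model, distance estimation is a one-line inversion. In the sketch below the transmission power \(T_x = -30\) dBm and exponent \(n = 2\) are illustrative values, not calibrated ones:</p>

```python
import math

def rss_dbm(d, tx_power=-30.0, n=2.0):
    """Path loss model RSS = T_x - 10 n log10(d), with tx_power the
    (assumed) RSS in dBm at 1 m and n the path loss exponent."""
    return tx_power - 10.0 * n * math.log10(d)

def distance_estimate(rss, tx_power=-30.0, n=2.0):
    """Invert the model: d = 10^((T_x - RSS) / (10 n))."""
    return 10.0 ** ((tx_power - rss) / (10.0 * n))

# With n = 2 (a free-space rule of thumb), a reading 20 dB below T_x
# corresponds to a distance of 10 m.
print(distance_estimate(-50.0))  # 10.0
```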
<p>Since indoor environments often consist of many rooms with different
layouts, we conjectured that a path loss model which dynamically adjusts
its path loss exponent may better capture the complexity of certain
indoor settings. The question of when and how to adjust the path loss
exponent was based on some experiments that we conducted in our office.
Our office consisted of six rooms which we partitioned into three
sections named S1, S2, and S3 listed in decreasing order based on their
average distance to the lone access point, or router, we had (pardon the pun) access
to. We laid a uniform grid of reference points spaced 2
meters apart from each other and collected measurements on a subset of
these reference points. Below are a few figures which illustrate some of
the paths that we walked, indicated in red arrows, and the recorded RSS
measurements at various times. In the figures, AP is the location of the
access point and RP is the location of the Raspberry Pi that we used to
monitor, or “sniff”, packets. The other Raspberry Pi was held in our
hands as we walked along the paths.</p>
<table>
<thead>
<tr>
<th style="text-align: center">Route</th>
<th style="text-align: center">RSS Measurements</th>
<th> </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="/assets/images/2018_indoor_localization/Route_180723_S1.png" alt="S1path" /></td>
<td style="text-align: center"><img src="/assets/images/2018_indoor_localization/S1_jumps.png" alt="S1RSS" /></td>
<td> </td>
</tr>
<tr>
<td style="text-align: center"><img src="/assets/images/2018_indoor_localization/Route_180713_S2.png" alt="S2path" /></td>
<td style="text-align: center"><img src="/assets/images/2018_indoor_localization/raw_RSSI_S2.png" alt="S2RSS" /></td>
<td> </td>
</tr>
<tr>
<td style="text-align: center"><img src="/assets/images/2018_indoor_localization/Route_180723_S3.png" alt="S3path" /></td>
<td style="text-align: center"><img src="/assets/images/2018_indoor_localization/S3_jumps.png" alt="S3RSS" /></td>
<td> </td>
</tr>
</tbody>
</table>
<p>The first two figures substantiate the claim that RSS is indeed a rather
volatile beast. We were alarmed to see that readings could drop
as low as -60dBm just a few meters from the access point. These
measurements were as low as, if not lower than, those we recorded in the room
farthest removed from the access point. Further, the drops in RSS in the
first plot were due simply to obstructing the line of sight by a
computer monitor and one of our colleague’s bodies!</p>
<p>The model our group ultimately decided upon adjusts the path loss
exponent at times when the RSS curve experienced sharp jumps. This
dynamic model can be broken down into two pieces, namely jump detection
and distance estimation.</p>
<h2 id="jump-detection">Jump Detection</h2>
<p>The hardest part of jump detection is defining what a jump is. I’m not
sure we ever decided on a formal definition. A formal definition would
likely be overly restrictive or too vague to be useful anyways. Me being
me though, <a href="https://en.wikipedia.org/wiki/Wavelet">wavelet coefficients</a> were the first candidates for
detecting discontinuities in RSS of various sizes and over various time
intervals. The above plots demonstrate that RSS naturally fluctuates
about some piecewise-smooth curve governed by the path a user takes while
holding their Raspberry Pi. To ignore these small and spurious jumps we
needed some way of denoising our signal. Our first guess was to apply
some low-pass filters to our data, but the problem with applying
low-pass filters is that we have to choose a stop band, or which
frequencies to threshold out. Unfortunately, when we compared the power
spectra of various RSS curves it appeared to be the case that the
severity of the signal’s non-line of sight strongly influences the
spectrum of our signal. This suggested that we might have to choose a
dynamic stop band which would further add to our quickly growing list of
parameters to choose.</p>
<p>The simplest and most effective model for denoising ended up being
<a href="https://academic.oup.com/biomet/article/81/3/425/256924">Donoho and Johnstone’s <em>RiskShrink</em></a> which shrinks empirical
wavelet coefficients under the assumption that the signal is perturbed
by Gaussian noise. A comparison of filtered and unfiltered RSS curves is shown
below. For this example, wavelet coefficients at scale \(k=4\) were filtered.
Since this experiment was sampled at a rate of \(2\)Hz, \(k=4\) corresponds to a
time scale of 8 seconds.</p>
<p><img src="/assets/images/2018_indoor_localization/S1_1_filtered.png" alt="S1filtered" /></p>
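<p>The denoising step can be sketched in plain numpy. This is a simplified stand-in rather than the code we used: it applies one level of an orthonormal Haar transform and soft-thresholds the detail coefficients at the universal threshold \(\sigma\sqrt{2\log m}\) (i.e. VisuShrink; RiskShrink’s minimax thresholds are smaller, but the mechanics are identical). All signal values below are synthetic.</p>

```python
import numpy as np

def haar_split(x):
    """One level of the orthonormal Haar transform of an even-length signal."""
    pairs = x.reshape(-1, 2)
    approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)
    return approx, detail

def haar_merge(approx, detail):
    """Invert haar_split."""
    out = np.empty(2 * len(approx))
    out[0::2] = (approx + detail) / np.sqrt(2.0)
    out[1::2] = (approx - detail) / np.sqrt(2.0)
    return out

def denoise(x, sigma):
    """Soft-threshold the Haar detail coefficients at sigma * sqrt(2 log m)."""
    approx, detail = haar_split(x)
    lam = sigma * np.sqrt(2.0 * np.log(len(x)))
    detail = np.sign(detail) * np.maximum(np.abs(detail) - lam, 0.0)
    return haar_merge(approx, detail)

rng = np.random.default_rng(0)
t = np.linspace(0.0, 60.0, 128)           # roughly 2 Hz samples over a minute
clean = np.where(t < 30.0, -45.0, -60.0)  # an RSS curve with one jump, in dBm
noisy = clean + rng.normal(0.0, 1.0, t.size)
smoothed = denoise(noisy, sigma=1.0)      # small fluctuations die, the jump survives
```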
<p>Next we needed a procedure for selecting significant wavelet coefficients from
the filtered RSS curve which we would label as jumps. We used the difference between
the unfiltered and filtered RSS curves as a model for noise and chose the
\((100-\alpha)\)th percentile of the noise’s wavelet coefficient magnitudes as the threshold for
significant wavelet coefficients. Choosing \(\alpha\) by cross validation is probably
the best option but we had so few experiments–39 to be exact–that we stuck with a value
of \(\alpha = 20\). In practice this seemed to work quite well on the handful of experiments
we conducted where we carefully monitored the transitions between line of sight and
non-line of sight settings. Below are two figures which
plot RSS measurements from a particular experiment along with the
wavelet coefficients of the filtered and unfiltered RSS curves as well as the
threshold plotted in red.</p>
<p><img src="/assets/images/2018_indoor_localization/S3_close_open_RSS.png" alt="S3RSSfilt" />
<img src="/assets/images/2018_indoor_localization/S3_close_open_wave_coeffs.png" alt="S3coeffs" /></p>
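<p>A toy version of this labelling rule, using single-scale Haar detail coefficients and \(\alpha = 20\), might look as follows; the signals below are synthetic, not our experimental data:</p>

```python
import numpy as np

def haar_detail(x):
    """Scale-1 orthonormal Haar detail coefficients of an even-length signal."""
    pairs = x.reshape(-1, 2)
    return (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)

def detect_jumps(raw_rss, filtered_rss, alpha=20.0):
    """Treat raw - filtered as noise; flag detail coefficients of the
    filtered curve exceeding the (100 - alpha)th percentile of the
    noise's coefficient magnitudes."""
    noise = raw_rss - filtered_rss
    threshold = np.percentile(np.abs(haar_detail(noise)), 100.0 - alpha)
    return np.flatnonzero(np.abs(haar_detail(filtered_rss)) > threshold)

rng = np.random.default_rng(3)
filtered = np.where(np.arange(64) < 31, -45.0, -62.0)  # one clean 17 dBm jump
raw = filtered + rng.normal(0.0, 1.5, 64)
print(detect_jumps(raw, filtered))  # [15]  (the pair straddling the jump)
```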
<h2 id="comparing-dynamic-model-to-static-model">Comparing Dynamic Model to Static Model</h2>
<p>We could have spent more time on perfecting jump detection, but with what little time
we had remaining we decided to move onto estimating distance from our access point using
our dynamic model. Recall that our model assumes
\[ \begin{align} RSS(t) = T_x - 10n\log_{10}(d(t)), \end{align}\]
where \(n = n(t)\) is really a function of previous RSS measurements. We essentially
adjust \(n\) linearly with respect to the net change in RSS during a jump. That is, for some predetermined
\(\beta\) and an interval \([t_0, t_1]\) over which a jump occurs we set
\[ \begin{align} n(t_1) = n(t_0) - \beta(RSS(t_1) - RSS(t_0)).\end{align}\]
Other than \(\beta\), one has to choose an initial path loss exponent.</p>
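<p>The update rule itself is only a few lines of Python; the step size \(\beta = 0.05\) below is a made-up illustrative value, not the one we trained:</p>

```python
def update_exponent(n0, rss_start, rss_end, beta=0.05):
    """Across a detected jump on [t0, t1], adjust the path loss exponent
    linearly in the net RSS change: n(t1) = n(t0) - beta * (RSS(t1) - RSS(t0)).
    beta is illustrative here and would be chosen by training in practice."""
    return n0 - beta * (rss_end - rss_start)

# A 15 dBm drop (e.g. losing line of sight) raises the exponent, so
# subsequent distance estimates assume faster attenuation.
n1 = update_exponent(2.0, -45.0, -60.0)
print(n1)  # 2.75
```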
<p>To compare how much of an improvement the dynamic model offers over static path loss
models, or those with a fixed path loss exponent, we initialized both dynamic and static path loss
models with a path loss exponent that was chosen by cross-validation under \(\ell_2\)
loss. More formally, letting \(f_i, g_i\) denote the vectors of predicted and
true distances respectively with length \(m_i\), we define the risk to be
\[ \begin{align}
\sigma &= N^{-1} \sum_{i=1}^{N} L(f_i, g_i),\newline
L(f_i, g_i) &= \left(m_{i}^{-1} \sum_{j=1}^{m_i} (f_i(j) - g_i(j))^2\right)^{1/2}
\end{align}\].</p>
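<p>In words, the risk is the average over experiments of each experiment’s root mean squared error, which is a few lines of numpy (the distances below are toy values):</p>

```python
import numpy as np

def risk(predicted, actual):
    """sigma = N^{-1} sum_i L(f_i, g_i), where L(f_i, g_i) is the RMSE of
    the i-th experiment; experiments may have different lengths m_i."""
    return np.mean([
        np.sqrt(np.mean((np.asarray(f) - np.asarray(g)) ** 2))
        for f, g in zip(predicted, actual)
    ])

# Two toy experiments of different lengths (distances in metres).
preds = [[1.0, 2.0, 3.0], [4.0, 6.0]]
truth = [[1.0, 2.0, 4.0], [4.0, 4.0]]
print(risk(preds, truth))  # mean of sqrt(1/3) and sqrt(2)
```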
<p>We were initially surprised to find, when we trained our dynamic path loss model, that it offered
a meager improvement in risk over the static model of just \(7\) millimeters. Upon further
reflection we realized that this is likely because our experiments form a biased
data set. Only 6 of our 39 experiments had a path that transitioned between regions of
good and poor signal coverage. All 6 of those regions happened in S2 of our office.
See the below figure for a heat map–generated by an algorithm developed by my colleagues Shizuki Goto, Ryoichiro Hayasaka, and Hannah Horneh–
of wireless signal strength as well as a precise definition of what S2 of our office is.</p>
<div style="text-align:center">
<p><img src="/assets/images/2018_indoor_localization/S2_coverage.png" alt="S3RSSfilt" height="50%" width="50%" style="text-align:center" /></p>
</div>
<p>When we retrained our dynamic model on each section of our office, we found that the dynamic
model offered the following improvements in risk over the static model.</p>
<table>
<thead>
<tr>
<th style="text-align: center">Region</th>
<th style="text-align: center">Static Risk</th>
<th style="text-align: center">Dynamic Risk</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">S1</td>
<td style="text-align: center">1.44m</td>
<td style="text-align: center">1.40m</td>
</tr>
<tr>
<td style="text-align: center">S2</td>
<td style="text-align: center">3.71m</td>
<td style="text-align: center">2.19m</td>
</tr>
<tr>
<td style="text-align: center">S3</td>
<td style="text-align: center">1.13m</td>
<td style="text-align: center">0.95m</td>
</tr>
</tbody>
</table>
<p>The actual performance is of course a bit more nuanced than what these estimates for the
risk suggest. While the dynamic model on average performs better than the static model in
terms of risk, it’s worth illustrating some extreme examples to show what works and what
needs improving. Take a look at the following experiments along with the predicted and
true distances.</p>
<table>
<thead>
<tr>
<th style="text-align: center">Route</th>
<th style="text-align: center">Distance Estimation</th>
<th> </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="/assets/images/2018_indoor_localization/Route_180723_S1.png" alt="S1dist" /></td>
<td style="text-align: center"><img src="/assets/images/2018_indoor_localization/S1_distance_est_ok.png" alt="S1RSS" /></td>
<td> </td>
</tr>
<tr>
<td style="text-align: center"><img src="/assets/images/2018_indoor_localization/Route_180713_S2.png" alt="S2path" /></td>
<td style="text-align: center"><img src="/assets/images/2018_indoor_localization/S2_distance_est_good.png" alt="S2RSS" /></td>
<td> </td>
</tr>
<tr>
<td style="text-align: center"><img src="/assets/images/2018_indoor_localization/Route_180723_S3.png" alt="S3path" /></td>
<td style="text-align: center"><img src="/assets/images/2018_indoor_localization/S3_distance_est_bad.png" alt="S3RSS" /></td>
<td> </td>
</tr>
</tbody>
</table>
<p>For the experiment in the first figure,
the Raspberry Pi was held above one of our colleague’s heads to maintain line of sight with
the access point. As expected, the dynamic model deviates very little from the
static model. For the experiment in the second figure, our jump detection algorithm correctly
detects a sharp jump due to a room change. Because the wireless coverage in the two rooms
differs dramatically, adjusting the path loss exponent leads to a much more accurate distance
prediction. Finally, the experiment in the third figure never obtains line of sight.
There are some small jumps that occur due to orienting the Raspberry Pi in the direction of
the access point and changing rooms. However, our jump detection algorithm misses a jump
around 55 seconds which causes an error in the dynamic model’s distance estimation. Unfortunately
this error propagates forward in time.</p>
<h2 id="concluding-remarks">Concluding Remarks</h2>
<p>It appears then, perhaps unsurprisingly, that our dynamic path loss model is only as good
as the jump detection algorithm. I think there are improvements that could be made to make
the advantages of the dynamic path loss model more pronounced. In particular, I think
having more access points and therefore more RSS curves could allow coordinated jump
detection. Since the dynamic path loss model improves upon the static model when a transmitter
crosses the boundary of “good coverage” for an access point, having more boundaries seems like
an obvious way to increase the performance of the dynamic model. Besides, for most commercial
indoor settings there will be several access points scattered throughout.</p>
<p>Another improvement that could be made is detecting jumps across finer time scales.
For simplicity our jump detection algorithm only looked at classifying
wavelet coefficients at one scale, roughly corresponding to 5 seconds. Looking at a variety of time scales might prevent the
jump detection algorithm from missing jumps. The caveat, of course, is that the signal-to-noise
ratio at finer scales will be much lower than at coarser scales.</p>
<p>For anyone who is interested in the details of this project as well as the code used for
jump detection and distance estimation you may find a project write up as well as MATLAB
code on my GitHub account <a href="https://github.com/elybrand/2018Sendai_Indoor_Localization">here</a>.</p>
<p>After working on this project for two months, I’m inclined to agree with the existing literature that
the future of indoor localization will require methods that rely on additional forms of data other than RSS.
There appear to be fundamental limitations to how much information can be extracted from RSS. While I think
methods like dynamic path loss models can mitigate the error due to erratic discontinuities in RSS measurements,
fingerprinting and path-loss models are far from sub-meter accuracy which is required in certain settings
like factory automation.</p>
<h1>Analog to Digital Conversion &amp; \(\Sigma\Delta\) Quantization</h1>
<p>2018-03-29</p>
<p><em>Update: an earlier version of this post incorrectly characterized the contribution from Daubechies and DeVore.
The family of quantizing functions \(\rho_r\) that they use is not the greedy scheme detailed below but is
slightly more complicated. A special thanks goes out to Rayan Saab for a conversation we had about this detail.</em></p>
<p>A necessary step in the signal processing pipeline involves taking analog signals, or, mathematically speaking,
real valued functions defined on \(\mathbb{R}\) and representing them digitally as bit strings. We take this for
granted when we listen to a song on our phone or watch a movie on our computer. It’s the kind of thing that some mathematicians
like to ignore and leave for the engineers to figure out. And yet, there’s a lot of serious math that goes on
behind the scenes to make sure that song you’re listening to doesn’t go haywire as your phone reads through the bitstream
encoding it.</p>
<p>To set things up, suppose we have a function \(f: [-T, T] \to \mathbb{R}\) which represents
some natural signal we’d like to encode. Since humans can only see wavelengths between
\(390\) and \(700\) nm and can only hear frequencies between 20 and 20,000 Hz, it’s safe
to assume \(f\) is <em>bandlimited</em>, or that its Fourier transform is compactly supported
in some set \([-\Omega, \Omega]\), with \(\Omega\) measured in Hz. There is a classic result known as the <a href="https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem">Shannon-Nyquist</a>
sampling theorem which states that \(f\) can be encoded without <em>distortion</em>, or error, given samples
\( \{f(x_i)\} \) taken at intervals of \( (2\Omega)^{-1} \) seconds. In other words, if we were to think in terms of
seconds as our unit of time, we’d need \(2\Omega\) samples per second for perfect reconstruction. Specifically, the theorem
states that
\[f(t) = \sum_{n\in\mathbb{Z}}f\left(\frac{n}{2\Omega}\right) \text{sinc}(2\pi\Omega t - n\pi),\]
where \(\text{sinc}(x) = \frac{\sin(x)}{x}\). It can be shown that this lower bound on the sampling
rate is sharp. That is, there exist bandlimited functions which cannot be reconstructed by this scheme if the sampling
rate is less than \(2\Omega\) samples per second. It is for this reason that this critical threshold rate is referred to as the
Nyquist rate. Practically speaking, most sensors sample at a rate above the Nyquist rate to prevent aliasing and increase resolution.</p>
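<p>Here is a numpy sketch of the interpolation formula, taking \(\Omega\) to be the bandwidth in Hz and sampling at the Nyquist spacing \(T = (2\Omega)^{-1}\). Note that <code>np.sinc</code> is the <em>normalized</em> sinc \(\sin(\pi x)/(\pi x)\), so the kernel is written as <code>np.sinc((t - n*T)/T)</code>. The series must be truncated in practice, which leaves a small error because sinc decays slowly:</p>

```python
import numpy as np

def shannon_reconstruct(samples, n, bandwidth, t):
    """Shannon-Nyquist interpolation of a signal bandlimited to `bandwidth`
    Hz from samples f(n T), T = 1 / (2 * bandwidth):
        f(t) = sum_n f(n T) * sinc_normalized((t - n T) / T)."""
    T = 1.0 / (2.0 * bandwidth)
    return np.sum(samples * np.sinc((t - n * T) / T))

bandwidth = 4.0                  # Hz; the signal below is a 1 Hz cosine
n = np.arange(-400, 401)         # truncated index set
samples = np.cos(2.0 * np.pi * n / (2.0 * bandwidth))

t = 0.3                          # a point strictly between sample times
approx = shannon_reconstruct(samples, n, bandwidth, t)
exact = np.cos(2.0 * np.pi * t)  # truncation error is small but nonzero
```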
<p>It’s worth mentioning that this reconstruction scheme is far from optimal. The reason is that
sinc decays like \(x^{-1}\), so an error in measuring any one sample
contaminates the reconstruction even far away in time. There is a rich field of math which explores similar
ways of reconstructing signals called <a href="https://en.wikipedia.org/wiki/Frame_(linear_algebra)">frame theory</a>.
In particular, <a href="https://en.wikipedia.org/wiki/Overcompleteness">Gabor frames and wavelets</a> are great examples of how frame theory has come to shape
modern day signal processing.</p>
<p>What’s important is that we’ve reduced a real-valued function down to (finitely many) point
samples, i.e. a vector over \(\mathbb{R}\). In some sense, this is why in fields like compressed
sensing the signals of interest are always vectors instead of continuous functions. Generalizing the discussion above,
in place of pointwise evaluations we suppose we have <em>linear</em> measurements \( y = Ax \), where
\( x \in \mathbb{R}^N \), and \( A \in \mathbb{R}^{m\times N} \) is some linear map.
We’re still not done quantizing, since these measurements take on values
in the continuum. One way or another, we’ll need to estimate \(y\) from vectors in a discrete set, sometimes
called an <em>alphabet</em>, \(\mathcal{A}\). The natural first guess is to make a uniformly spaced grid over
\( [-B,B] \) for some chosen \(B\) and just round each component of \(y\).</p>
<p>The above scheme is often referred to as Memoryless Scalar Quantization (MSQ). We’ll see in a bit why it’s called memoryless.
Now, if we want our grid to have resolution \( \delta \), we’ll need \( m\log_2\left(\frac{B}{\delta}\right) \) bits.
The number of bits is sometimes called the <em>rate</em>. For fixed \(m\) and letting \( \mathcal{R} = m\log_2\left(\frac{B}{\delta}\right)\), we find
that the quantization error for one component is bounded by \( \delta = B 2^{-\frac{\mathcal{R}}{m}}\). In other words, the distortion from quantizing using MSQ decays
exponentially with the rate. So we might think this is the best we could do and call it a day. What if, however, we were working
with a circuit which had low storage capacity and could only expend, say, 8 bits? Are we stuck with a mediocre distortion bound?</p>
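As a numerical sanity check on the bound above, here is a short Python sketch of MSQ (the range \(B\), number of measurements \(m\), and bit budget are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

B = 1.0                 # components of y are assumed to lie in [-B, B]
m = 8                   # number of measurements
bits_per_comp = 4
delta = B * 2.0 ** (-bits_per_comp)  # grid resolution delta = B 2^{-R/m}
rate = m * bits_per_comp             # total rate R = m log2(B / delta)

def msq(y, delta):
    # Memoryless scalar quantization: round each component independently
    # to the nearest multiple of delta, ignoring all other components.
    return delta * np.round(y / delta)

y = rng.uniform(-B, B, size=m)
q = msq(y, delta)
worst_err = np.max(np.abs(y - q))    # per-component error is at most delta/2
```

Each extra bit per component halves \(\delta\), which is the exponential rate-distortion decay described above.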
<p>Fortunately the answer to the above question is no. Just as oversampling improves resolution in analog-to-digital conversion,
increasing \(m\) will lead to lower distortion in recovering \(x\). <a href="http://ieeexplore.ieee.org/abstract/document/650985/">Goyal and coauthors showed</a>
that for MSQ the error in reconstructing \(x\) as a function of \(m\) cannot decay faster than \(O(m^{-1})\). There are, however, quantization schemes which, like MSQ, enjoy
an exponentially decreasing relationship between distortion and rate but also have a distortion which decays like \(O(m^{-r})\) where
\(r\) is any positive integer. Unlike MSQ these encoding schemes “remember” quantization error.</p>
<p>These other quantization schemes fall under the category of <a href="https://en.wikipedia.org/wiki/Noise_shaping">noise shaping</a>.
I will focus particularly on \(\Sigma\Delta\) quantization. Whereas MSQ quantizes \(y_i\) simply based on the value
that \(y_i\) takes, \(\Sigma\Delta\) quantizers feature a state variable \(u\) which encodes the quantization
error of the previous \(r\) components of \(y\) and uses this state variable to quantize the values of \(y\).
Given a fixed positive integer \(r\), an alphabet \(\mathcal{A}\), and a function \(\rho_r: \mathbb{R}^{2r} \to \mathbb{R}\),
\(y_i\) is quantized as
\[ \begin{align}
q_i &= Q_{\mathcal{A}}(\rho_r(u_{i-1},…, u_{i-r}, y_{i},…, y_{i-r+1})) \newline
(\Delta^r u)_i &= y_i - q_i
\end{align}
\]
where \( (\Delta u)_i = u_i - u_{i-1}\) is the difference operator and \(Q_{\mathcal{A}}\) rounds to the nearest
point in \(\mathcal{A}\). For the error analysis to go through, one needs to choose \(\rho_r\) carefully to ensure that
\( \|u\|_{\infty} \leq C \) for some constant \(C\). Quantization schemes which admit this property are called <em>stable</em>. <a href="https://services.math.duke.edu/~ingrid/publications/annals-v158-n2-p09.pdf">Daubechies and DeVore</a>
were the first to prove that a particular family of \(\rho_r\) is stable in the context of bandlimited functions, even in the case where \(\mathcal{A}\) is fixed.
Apart from these, there are a handful of other \(\rho_r\) that are known to be stable in the \(\Sigma\Delta\) literature depending on whether the alphabet \(\mathcal{A}\) is fixed or allowed to grow.
For the sake of concreteness, we’ll focus on the particular family
\[ \rho_r(u_{i-1},…, u_{i-r}, y_{i},…, y_{i-r+1}) = y_i + \sum_{j=1}^{r} (-1)^{j-1} {r\choose j} u_{i-j}. \]
This scheme ensures that \(\|u\|_{\infty} \leq \frac{\delta}{2}\) if one uses the alphabet \(\{\pm(2j + 1)\delta/2, \,\, j\in\{0,…, L-1\}\}\)
and \(L\) is chosen so that \(L \geq 2\lceil\frac{\|y\|_{\infty}}{\delta}\rceil + 2^r + 1\). That is, this scheme is stable if the alphabet
is allowed to grow exponentially with respect to \(r\).</p>
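Here is a Python sketch of the greedy scheme above; the test signal, step size \(\delta\), and nearest-point quantizer are my own choices, with the alphabet sized according to the stability condition just quoted.

```python
import numpy as np
from math import comb, ceil

def greedy_sigma_delta(y, r, delta):
    # Alphabet {±(2j+1)δ/2 : j = 0, ..., L-1} with L chosen per the
    # stability condition L >= 2*ceil(||y||_inf / delta) + 2^r + 1.
    L = 2 * ceil(np.max(np.abs(y)) / delta) + 2 ** r + 1
    half = np.array([(2 * j + 1) * delta / 2 for j in range(L)])
    alphabet = np.concatenate([-half[::-1], half])

    m = len(y)
    u = np.zeros(m + r)   # state variable, with r zeros of initial history
    q = np.zeros(m)
    for i in range(m):
        # rho_r(u_{i-1},...,u_{i-r}, y_i) = y_i + sum_j (-1)^{j-1} C(r,j) u_{i-j}
        w = y[i] + sum((-1) ** (j - 1) * comb(r, j) * u[r + i - j]
                       for j in range(1, r + 1))
        q[i] = alphabet[np.argmin(np.abs(alphabet - w))]  # Q_A: round to A
        u[r + i] = w - q[i]   # rearranged form of (Delta^r u)_i = y_i - q_i
    return q, u[r:]

rng = np.random.default_rng(1)
y = rng.uniform(-1.0, 1.0, size=256)
delta = 0.5
q, u = greedy_sigma_delta(y, r=2, delta=delta)
```

Numerically one can check \(\|u\|_{\infty} \leq \frac{\delta}{2}\) on the output, which is exactly the stability guarantee quoted above.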
<p>This is all rather opaque, but for the simple case of \(r = 1\), solving the recurrence relation for the \(u\) terms
yields
\[ \begin{align}
q_i = Q_{\mathcal{A}}\left(\sum_{j=1}^{r+1} y_{i-j+1} - \sum_{j=1}^{r} q_{i-j} \right)
\end{align}\]
and for \( r = 2 \) we have
\[ \begin{align}
q_i = Q_{\mathcal{A}}\left(\sum_{j=1}^{r+1} j y_{i-j+1} - \sum_{j=1}^{r} (j+1) q_{i-j} \right)
\end{align}\]
As someone who dabbles in statistics, I am tempted to interpret this in terms of quantizing the difference of \(0^{th}\)
moments for \(r = 1\) and the difference in means for \(r = 2\). Unfortunately this interpretation is misleading
and doesn’t hold up for \(r > 2\). It appears that the coefficients in the summands are
<a href="http://oeis.org/wiki/Simplicial_polytopic_numbers">simplicial numbers</a> which, to the best of my knowledge, have no
important applications in quadrature rules or estimating moments/cumulants.</p>
<p>I think engineers view this in terms of high-pass and low-pass filters. Namely, even if the difference between \(y_{i-1}\)
and \(q_{i-1}\) is significant due to quantization error, the sum \(\sum_{j=1}^{r} (-1)^{j-1} {r\choose j} u_{i-j}\) is unlikely
to fluctuate as rapidly as \(u\) itself. Appealing to mathematical intuition, this is simply because integration
“smooths out” small perturbations, and \(\sum_{j=1}^{r} (-1)^{j-1} {r\choose j} u_{i-j}\) acts as a quadrature rule for integrating \(u\). As such, \(q\)
is going to be composed of lower frequencies compared to the quantization error encoded in \(u\). Pushing error into higher frequencies is, after all,
the name of the game in noise shaping.</p>
<p>Long story short, noise shaping techniques like \(\Sigma\Delta\) enjoy favorable distortion decay
as you oversample. Namely, if you use an \(r^{th}\) order \(\Sigma\Delta\) scheme with an appropriately chosen alphabet
(see Section 2 in <a href="https://arxiv.org/pdf/1306.4549.pdf">this paper</a> for a more thorough treatment; importantly, the one bit alphabet and uniform grids
as mentioned above are included as examples), then the reconstruction
error (in Euclidean norm) for recovering \(x\in \mathbb{R}^N\) from \(q\) decays like \(O(m^{-r})\).
Let me repeat: this is true even in the extreme case where your alphabet is \(\{-1,1\}\), i.e.
<em>one bit \(\Sigma\Delta\) quantization preserves information
about the length of the vector \(y\)</em> while MSQ clearly fails to do so.</p>
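A toy illustration of that last point (my own setup, using a constant signal rather than a compressed sensing reconstruction): first-order one-bit greedy \(\Sigma\Delta\) recovers the level of a constant input from the average of its bits with \(O(m^{-1})\) error, while one-bit MSQ retains only its sign.

```python
import numpy as np

def one_bit_sigma_delta(y):
    # First-order greedy scheme with alphabet {-1, 1}:
    #   q_i = sign(u_{i-1} + y_i),  u_i = u_{i-1} + y_i - q_i.
    # For |y_i| <= 1 this is stable, with |u_i| <= 1 for all i.
    u = 0.0
    q = []
    for yi in y:
        w = u + yi
        qi = 1.0 if w >= 0 else -1.0
        q.append(qi)
        u = w - qi
    return np.array(q)

c, m = 0.3, 1000
y = np.full(m, c)

q_sd = one_bit_sigma_delta(y)
sd_estimate = q_sd.mean()         # recovers c up to O(1/m) error
msq_estimate = np.sign(y).mean()  # one-bit MSQ: every bit equals sign(c)
```

Summing the state recursion telescopes to \(\sum_i (y_i - q_i) = u_m - u_0\), which together with stability gives \(|c - \frac{1}{m}\sum_i q_i| \leq \frac{2}{m}\).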
<p>If by now you’re on the \(\Sigma\Delta\) bandwagon and want to incorporate it into any of your projects, feel free to use some
<a href="https://github.com/elybrand/SigmaDelta">MATLAB code</a> I wrote. For recent applications of \(\Sigma\Delta\) quantization
in compressed sensing, I’ll shamelessly promote <a href="https://arxiv.org/abs/1709.09803">this work</a> on recovering low rank matrices with quantization
and <a href="https://arxiv.org/abs/1801.08639">this work</a> which uses \(\Sigma\Delta\) quantization with structured random matrices in compressed sensing.</p>
<p>Update: an earlier version of this post incorrectly characterized the contribution from Daubechies and DeVore. The family of quantizing functions \(\rho_r\) that they use is not the greedy scheme detailed above but is slightly more complicated. A special thanks goes out to Rayan Saab for a conversation we had about this detail.</p>
Eric Lybrand

Do Neural Networks Learn Shearlets? (2017-11-09, https://elybrand.github.io//sparse_net)

<p>In the spring quarter of 2017, the signal processing group at UCSD decided to
base our quarterly seminar on recent advancements in deep learning. Applied harmonic analysis
was in the air, as <a href="https://ccom.ucsd.edu/~acloninger/index.html">Alex Cloninger</a> (now at UCSD)
and coauthors just a year prior had published a <a href="https://arxiv.org/abs/1509.07385">manuscript</a>
which constructs a (sparse) 4-layer neural network to approximate functions on manifolds with wavelets.
About halfway through the quarter, <a href="http://www.tu-berlin.de/?108957">Gitta Kutyniok</a> and coauthors
posted a <a href="https://arxiv.org/abs/1705.01714">manuscript</a> on arXiv and it quickly worked its way into
our rotation of papers to present. As a nice follow-up, Kutyniok came to speak about this work at UCSD back in August.
That, and the gentle encouragement from my friend <a href="http://www.ngurnani.com/">Nish</a> motivated
me to write this post. I’ll start with a summary of their approach.</p>
<h2 id="a-very-brief-summary">A Very Brief Summary</h2>
<p>The title is “Optimal Approximation with Sparsely Connected Deep Neural Networks”. It has
all of the buzz words that signal processing folk like myself enjoy: sparsity, \(L^2\),
representation systems, and rate-distortion theory. The basic set-up is as follows. Suppose
you have a collection of functions, \(\mathcal{C} \subseteq L^2(\Omega)\), defined on \( \Omega \subset \mathbb{R}^n\), which you’d like to
approximate. Let’s say you’re willing to tolerate \(\varepsilon > 0\) error. Then for
\(f \in \mathcal{C}\) and some optimization procedure for training a neural network \(\Phi\),
how many edges in \(\Phi \) are necessary to satisfy \(\|f - \Phi\|_2 \leq \varepsilon\)?
In other words, how complex does an approximating neural network have to be
in terms of the complexity of the function class you’re trying to approximate?</p>
<p>What makes their results impactful is that the lower bound on the complexity of the
desired neural network is <em>independent</em> of the learning algorithm employed. The main motif of the paper
is that neural networks can operate as representation systems. In the context of harmonic analysis, representation systems
act as spanning sets of \(L^2(\Omega)\).
Common representation systems include trigonometric polynomials
in the case of Fourier analysis and, more generally, wavelets. Representation systems
are ubiquitous in part because of their practical importance in compressing signals. Instead
of storing pointwise samples of a function, you could store, say, a thousand of its Fourier coefficients and
get improved bounds on the reconstruction error. JPEG 2000 is famously known for using wavelets.</p>
<p>This brings us to the second motif of the paper, namely that neural networks and representation systems are means of
encoding functions as bit strings. If \( \{ \varphi_i \} \subseteq L^2(\Omega) \) is a representation system,
then you could encode a function \( f \in \mathcal{C} \) by taking an \(M\)-term approximation
\( \sum_{i=1}^{M} c_i \varphi_i \) and storing the coefficients \(c_i\) in a bit string. A similar procedure can be done with neural networks.
Whereas you took \(M\)-term approximations in the representation systems setting,
here you take neural networks which have at most \(M\) non-zero
edge weights. To store a neural network as a bit string you store each of the edge weights,
and then come up with some scheme which encodes the topology of the network. Section 2.3 outlines
this procedure in detail.</p>
<p>More generally, binary encoders are functions \(E: \mathcal{C} \to \{ 0,1 \}^{\ell} \) where
\(\ell \in \mathbb{N}\), and decoders are functions \(D: \{0,1\}^{\ell} \to L^2(\Omega) \).
For a given distortion \(\varepsilon > 0\), the minimax code length \(L(\mathcal{C}, \varepsilon) \)
is the minimal \( \ell \) for which there is some
encoder-decoder pair \((E,D)\) so that \( \sup_{f \in \mathcal{C}} \|D(E(f)) - f\|_2 \leq \varepsilon \).
One can ask how this code-length grows as you shrink \(\varepsilon \). This behavior is captured
by the quantity \( \gamma^*(\mathcal{C}) \), which is the infimal \( \gamma \in \mathbb{R} \)
such that \(L(\mathcal{C}, \varepsilon) = O(\varepsilon^{-\gamma}) \).</p>
<p>The manuscript defines analogous error decay rates for the best \(M\)-term approximations by representation systems
and neural networks, looking at how the distortion decays as \(M\) increases.
As a consequence of neural networks being a special class of encoders, we see that
a neural network \(\Phi \) satisfying \( \|f - \Phi\|_2 < \varepsilon \) for \(f \in \mathcal{C} \)
must necessarily have at least \( O(\varepsilon^{-\gamma^*(\mathcal{C})}) \) non-zero edge weights. Theorem
2.7 offers a proof of this claim.</p>
<p>The remainder of the paper is dedicated to answering the following question: for what signal classes is the above
lower bound sharp? The answer to this question comes in two pieces. The first comes from characterizing what kinds of
representation systems are optimal for approximating a signal class \(\mathcal{C}\). The second piece then comes from
realizing that you can go from an optimal \(M\)-term approximation in a representation system to a neural network
with \(O(M)\) edges by simply approximating each function in the \(M\)-term approximation. Details are in the proof
of Theorem 3.4.</p>
<p>It turns out that a large collection of signal classes, namely those optimally represented by the <em>affine systems</em> defined in the paper,
enjoy this optimal lower bound on the number of edges needed for approximating neural networks.
Affine systems are basically wavelet frames, or collections of functions which are affine scalings
and translations of, to use a term that a few of my colleagues cringe at, a “mother” function. One can guess
why such systems show up. The main idea appeared in the paper by Cloninger and coauthors cited at the beginning of this post,
where you first construct the mother function out of some affine combination of your rectifier functions, and the layers
downstream scale and translate it. The Cloninger paper builds a trapezoidal bump function out of
ReLUs. I will remark that noticeably absent from the collection of affine systems are Gabor frames, but this is beside the point.</p>
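For intuition, here is a generic version of that construction in Python: four ReLUs whose affine combination is a trapezoidal bump (the breakpoints are my own illustrative choices, not the exact network from the Cloninger paper).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def trapezoid(x):
    # Affine combination of four ReLUs: zero outside [-2, 2], linear ramps
    # on [-2, -1] and [1, 2], and a flat plateau of height 1 on [-1, 1].
    return relu(x + 2) - relu(x + 1) - relu(x - 1) + relu(x - 2)

x = np.linspace(-4.0, 4.0, 9)
vals = trapezoid(x)   # plateau values are exactly 1, tails exactly 0
```

Layers downstream of such a bump can then scale and translate it, which is how something like an affine system can emerge from a ReLU network.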
<p>The theoretical portion of the manuscript ends with the particular affine system of shearlets, which, in the sense of Definition 2.5, are known
to optimally represent cartoon-like functions; see <a href="https://arxiv.org/abs/1702.03559">this manuscript</a>. Cartoon-like functions were introduced by David Donoho
in <a href="https://statweb.stanford.edu/~donoho/Reports/1998/SCA.pdf">his paper</a> on sparse components of images.
Intuitively, they are piecewise smooth functions on the unit square \([0,1]^2\) where the boundaries
between the piecewise regions are smooth. Theorem 5.6 proves that the approximation error with \(M\)-edge neural
networks obeys the same decay rate as those enjoyed by shearlet approximations.</p>
<h2 id="reproducing-the-numerical-results">Reproducing the Numerical Results</h2>
<p>I began this blog post by mentioning that the main motif of the “Optimal Approximation” manuscript
is that neural networks can behave like representation systems, but it remains to be seen if they actually
learn representations that mathematicians have found to be optimal for certain signal classes. There is a numerical experiments section at the end of the paper which considers
classifying regions in the unit square in \(\mathbb{R}^2\) with linear and quadratic decision boundaries.
The authors specify the network topology which is inspired by Cloninger’s network. Its main feature is that it is a bunch of sub-networks
running in parallel. Each sub-network can be thought of as a function in a representation system which the aggregate
network will learn. The network is trained with stochastic gradient descent with the usual backpropagation and \(\ell^2\) loss.
All weights except those in the second layer are trainable. The reason for fixing the weights in the second layer is to encourage the
first two layers to learn something like a bump function. See the paragraph below Figure 3 for a more complete explanation.</p>
<p>Fast forward to the figures and your eyes will behold the graphs of subnetworks which appear to have learned shearlets! If you’re like me,
you start to carefully read through the experiments section again: how many edges were in the aggregate networks which learned
these shearlet like functions? how did they initialize their trainable weights? how many training samples did they use? how much did
these shearlet subnetworks “contribute” to classifying samples? why does the input layer have four inputs if they are classifying
points in \(\mathbb{R}^2\)?</p>
<p>I don’t have definitive answers to these questions, except maybe the last one. To the best of my knowledge, this code is not publicly available. Even if it were,
I have been told that the experiments were run in MATLAB. This explains the extra two inputs in the input layer. I suspect that these act as the bias terms.
In any case, I have written code which performs their numerical experiments in Keras. You can download the jupyter notebook
<a href="https://github.com/elybrand/sparse_net">here</a>. I encourage you to play around with it yourself: change the decision boundary to something more complicated, fiddle with the number of subnetworks,
adjust the number of samples used in the training set.</p>
<p>I restricted myself to looking at the quadratic decision boundary since that’s what the manuscript considered. I generated
three random integers between 0 and 100 for the seeds of the tensorflow and numpy random number generators. Using all possible
combinations of the seeds, I trained networks with 15, 30, and 45 subnetworks on a training set of 2500 samples from a uniform grid
on the unit square with a decision boundary given by \( p(x) = x^2 - 0.4 \). I used a batch size of 10 and trained over 10 epochs.
After training, the top 10 subnetworks with the largest weights in absolute value in the last layer are graphed in decreasing order.
This is the proxy that the manuscript uses to measure the significance of the subnetworks. Here is a table and some figures outlining what I saw.</p>
<table>
<thead>
<tr>
<th style="text-align: center">NUM_SUBNETS</th>
<th style="text-align: center">Numpy Seed</th>
<th style="text-align: center">Tensorflow Seed</th>
<th style="text-align: right">Notable Shearlet Looking Networks, by Rank</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">15</td>
<td style="text-align: center">47</td>
<td style="text-align: center">47</td>
<td style="text-align: right">2, 4, 8, 9, 10</td>
</tr>
<tr>
<td style="text-align: center">15</td>
<td style="text-align: center">47</td>
<td style="text-align: center">82</td>
<td style="text-align: right">9, 10</td>
</tr>
<tr>
<td style="text-align: center">15</td>
<td style="text-align: center">47</td>
<td style="text-align: center">96</td>
<td style="text-align: right">Unremarkable</td>
</tr>
<tr>
<td style="text-align: center">15</td>
<td style="text-align: center">82</td>
<td style="text-align: center">47</td>
<td style="text-align: right">2, 3, 10</td>
</tr>
<tr>
<td style="text-align: center">15</td>
<td style="text-align: center">82</td>
<td style="text-align: center">82</td>
<td style="text-align: right">Unremarkable</td>
</tr>
<tr>
<td style="text-align: center">15</td>
<td style="text-align: center">82</td>
<td style="text-align: center">96</td>
<td style="text-align: right">2, 3, 7, 8</td>
</tr>
<tr>
<td style="text-align: center">15</td>
<td style="text-align: center">96</td>
<td style="text-align: center">47</td>
<td style="text-align: right">3, 4, 7</td>
</tr>
<tr>
<td style="text-align: center">15</td>
<td style="text-align: center">96</td>
<td style="text-align: center">82</td>
<td style="text-align: right">3, 6</td>
</tr>
<tr>
<td style="text-align: center">15</td>
<td style="text-align: center">96</td>
<td style="text-align: center">96</td>
<td style="text-align: right">Accuracy &lt; 0.5</td>
</tr>
<tr>
<td style="text-align: center">30</td>
<td style="text-align: center">47</td>
<td style="text-align: center">47</td>
<td style="text-align: right">3, 7</td>
</tr>
<tr>
<td style="text-align: center">30</td>
<td style="text-align: center">47</td>
<td style="text-align: center">82</td>
<td style="text-align: right">4, 5, 7</td>
</tr>
<tr>
<td style="text-align: center">30</td>
<td style="text-align: center">47</td>
<td style="text-align: center">96</td>
<td style="text-align: right">3, 5, 6, 8</td>
</tr>
<tr>
<td style="text-align: center">30</td>
<td style="text-align: center">82</td>
<td style="text-align: center">47</td>
<td style="text-align: right">2</td>
</tr>
<tr>
<td style="text-align: center">30</td>
<td style="text-align: center">82</td>
<td style="text-align: center">82</td>
<td style="text-align: right">1, 5 (almost)</td>
</tr>
<tr>
<td style="text-align: center">30</td>
<td style="text-align: center">82</td>
<td style="text-align: center">96</td>
<td style="text-align: right">7</td>
</tr>
<tr>
<td style="text-align: center">30</td>
<td style="text-align: center">96</td>
<td style="text-align: center">47</td>
<td style="text-align: right">1, 3, 8</td>
</tr>
<tr>
<td style="text-align: center">30</td>
<td style="text-align: center">96</td>
<td style="text-align: center">82</td>
<td style="text-align: right">Unremarkable</td>
</tr>
<tr>
<td style="text-align: center">30</td>
<td style="text-align: center">96</td>
<td style="text-align: center">96</td>
<td style="text-align: right">6</td>
</tr>
<tr>
<td style="text-align: center">45</td>
<td style="text-align: center">47</td>
<td style="text-align: center">47</td>
<td style="text-align: right">7</td>
</tr>
<tr>
<td style="text-align: center">45</td>
<td style="text-align: center">47</td>
<td style="text-align: center">82</td>
<td style="text-align: right">Unremarkable</td>
</tr>
<tr>
<td style="text-align: center">45</td>
<td style="text-align: center">47</td>
<td style="text-align: center">96</td>
<td style="text-align: right">Unremarkable</td>
</tr>
<tr>
<td style="text-align: center">45</td>
<td style="text-align: center">82</td>
<td style="text-align: center">47</td>
<td style="text-align: right">1, 3, 5, 7</td>
</tr>
<tr>
<td style="text-align: center">45</td>
<td style="text-align: center">82</td>
<td style="text-align: center">82</td>
<td style="text-align: right">9, 10</td>
</tr>
<tr>
<td style="text-align: center">45</td>
<td style="text-align: center">82</td>
<td style="text-align: center">96</td>
<td style="text-align: right">5</td>
</tr>
<tr>
<td style="text-align: center">45</td>
<td style="text-align: center">96</td>
<td style="text-align: center">47</td>
<td style="text-align: right">6, 8</td>
</tr>
<tr>
<td style="text-align: center">45</td>
<td style="text-align: center">96</td>
<td style="text-align: center">82</td>
<td style="text-align: right">2, 4</td>
</tr>
<tr>
<td style="text-align: center">45</td>
<td style="text-align: center">96</td>
<td style="text-align: center">96</td>
<td style="text-align: right">Accuracy &lt; 0.5</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th style="text-align: center">15 Subnetworks</th>
<th style="text-align: center">30 Subnetworks</th>
<th style="text-align: center">45 Subnetworks</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="/assets/images/2017_sparse_net/ex15_47_47_2.png" alt="ex15_47_47_2" /> np: 47, TF: 47</td>
<td style="text-align: center"><img src="/assets/images/2017_sparse_net/ex30_47_96_8.png" alt="ex30_47_96_8" /> np: 47, TF: 96</td>
<td style="text-align: center"><img src="/assets/images/2017_sparse_net/ex45_82_47_1.png" alt="ex45_82_47_1" /> np: 82, TF: 47</td>
</tr>
<tr>
<td style="text-align: center"><img src="/assets/images/2017_sparse_net/ex15_47_47_8.png" alt="ex15_47_47_8" /> np: 47, TF: 47</td>
<td style="text-align: center"><img src="/assets/images/2017_sparse_net/ex30_82_47_2.png" alt="ex30_82_47_2" /> np: 82, TF: 47</td>
<td style="text-align: center"><img src="/assets/images/2017_sparse_net/ex45_82_47_3.png" alt="ex45_82_47_3" /> np: 82, TF: 47</td>
</tr>
<tr>
<td style="text-align: center"><img src="/assets/images/2017_sparse_net/ex15_82_47_2.png" alt="ex15_82_47_2" /> np: 82, TF: 47</td>
<td style="text-align: center"><img src="/assets/images/2017_sparse_net/ex30_82_96_7.png" alt="ex30_82_96_7" /> np: 82, TF: 96</td>
<td style="text-align: center"><img src="/assets/images/2017_sparse_net/ex45_82_82_2.png" alt="ex45_82_82_2" /> np: 82, TF: 82</td>
</tr>
<tr>
<td style="text-align: center"><img src="/assets/images/2017_sparse_net/ex15_82_47_3.png" alt="ex15_82_47_3" /> np: 82, TF: 47</td>
<td style="text-align: center"><img src="/assets/images/2017_sparse_net/ex30_96_47_3.png" alt="ex30_96_47_3" /> np: 96, TF: 47</td>
<td style="text-align: center"><img src="/assets/images/2017_sparse_net/ex45_82_96_7.png" alt="ex45_82_96_7" /> np: 82, TF: 96</td>
</tr>
</tbody>
</table>
<p>I don’t claim that every subnetwork output I marked as being notable is indeed a shearlet.
Basically what I was looking for was, as the name suggests, sheared trapezoids. In particular,
the manuscript mentions that shearlets “on high scales”, or which are strongly anisotropic,
should appear on or near points of singularity; see <a href="https://www.math.tu-berlin.de/fileadmin/i26_fg-kutyniok/Kutyniok/Papers/ShearletsContDiscr.pdf">this</a> and <a href="http://www3.math.tu-berlin.de/numerik/mt/www.shearlet.org/papers/SparseShearSIAMfinal.pdf">this</a>.
By that metric, more of these
appeared in the experiments I ran than I expected. To answer the question posed in the title of this blog post, I think
that neural networks sometimes learn functions which look like shearlets when you have a quadratic decision boundary.
I have some reservations about whether this holds true for general cartoon-like functions. If it were true that
learning shearlets for cartoon-like decision boundaries were provably optimal, regardless of network topology, and
that cartoon-like functions offer good approximations for natural images, then this would, in my opinion, greatly undercut the success of
breakthroughs in image classification with deep learning.</p>
<p>I think the more interesting point of this manuscript is that neural networks do just as well as representation systems, if not better.
The better question to explore is what types of functions are learned by neural networks when they’re trained to approximate
a signal class \(\mathcal{C}\). I’m not sure whether we’re equipped mathematically to answer such questions yet. Nevertheless, I think
Bölcskei, Grohs, Kutyniok, and Petersen have laid excellent groundwork for mathematicians to build off of in answering questions about
deep learning.</p>
Eric Lybrand