250624 Compressed Sensing Chapter 1 note

Fundamental Concepts from Set Theory & Linear Algebra

  • Subset
    A set A is a subset of another set B if all elements of A are also elements of B. This is denoted as $A \subset B$.

    • Example from the text: The chapter discusses selecting a portion of a signal’s coefficients. If you have a signal with n coefficients, the set of all coefficient indices is $\{1, 2, ..., n\}$. An index set $\Lambda$ that points to the locations of the k-largest entries is a subset of the full set of indices, so $\Lambda \subset \{1, 2, ..., n\}$.
  • Vector Space
    A vector space is a collection of objects called vectors, which can be added together and multiplied by scalars (numbers). The key idea is that if you take any two vectors from the space and add them, you get another vector that is still within that space. Similarly, scaling a vector keeps it within the space. The text models signals as vectors in a vector space.

  • Span
    The span of a set of vectors is the set of all possible vectors you can create by taking linear combinations of them. In other words, if you have vectors $v_1, v_2, ..., v_k$, their span is the set of all vectors $w$ that can be written as $w = c_1v_1 + c_2v_2 + ... + c_kv_k$ for some scalar coefficients $c_i$.

  • Linear Independence
    A set of vectors is linearly independent if no vector in the set can be written as a linear combination of the others. A more formal way to say this is that the only way to get the zero vector from their linear combination ($c_1v_1 + c_2v_2 + ... + c_kv_k = 0$) is if all the scalar coefficients ($c_1, c_2, ...$) are zero.

  • Subspace
    A subspace is a vector space that is contained within another, larger vector space. To be a subspace, a set of vectors must satisfy three rules:

    1. It must contain the zero vector.
    2. It must be “closed under addition”: if you add any two vectors from the set, their sum is also in the set.
    3. It must be “closed under scalar multiplication”: if you multiply any vector in the set by a scalar, the result is also in the set.
    • Example from the text: The set of all 2-sparse signals in $\mathbb{R}^3$ is not a single subspace, because adding two 2-sparse signals can result in a 3-sparse signal. Instead, it’s a union of subspaces, where each subspace is a plane defined by two coordinate axes (like the x-y plane, x-z plane, etc.).
  • Null Space
    The null space of a matrix $A$, denoted $\mathcal{N}(A)$, is the set of all vectors $z$ that give the zero vector when multiplied by $A$:

    $$\mathcal{N}(A) = \{z : Az = 0\}$$

    • Relevance in CS: In compressed sensing, if you have two different sparse signals, $x$ and $x'$, that produce the same measurement $y$ (i.e., $Ax = Ax' = y$), then their difference, $h = x - x'$, must be in the null space of $A$ because $A(x-x')=0$. To guarantee unique recovery, the null space of $A$ should not contain any sparse vectors (other than the zero vector).
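
As a quick numerical illustration (my own sketch, not from the text; the matrix below is an arbitrary assumption), the following checks that the difference of two signals with identical measurements lies in $\mathcal{N}(A)$:

```python
import numpy as np

# A hypothetical 3x5 sensing matrix (more columns than rows, so N(A) is non-trivial)
A = np.array([[1., 0., 1., 0., 1.],
              [0., 1., 1., 1., 0.],
              [1., 1., 0., 1., 1.]])

x = np.array([1., 0., 0., 0., 0.])  # a 1-sparse signal

# Take a vector from the null space of A (last right singular vector of this rank-3 A)
_, _, Vt = np.linalg.svd(A)
z = Vt[-1]

x_prime = x + z  # a different signal with exactly the same measurements

print(np.allclose(A @ x, A @ x_prime))      # True: Ax = Ax'
print(np.allclose(A @ (x - x_prime), 0.0))  # True: h = x - x' lies in N(A)
```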

$\mathbb{R}^n$ (n-dimensional Euclidean Space)

  • Definition: The symbol $\mathbb{R}^n$ refers to the n-dimensional Euclidean space. It is the set of all vectors that consist of n real-number elements.
  • Breakdown of the Notation:
    • The $\mathbb{R}$ stands for the set of all real numbers.
    • The superscript n indicates the dimension, meaning each vector in this set is an ordered list of n real numbers, like $[x_1, x_2, ..., x_n]$.
  • Context from the Text: The chapter states that signals can be viewed as vectors in an n-dimensional Euclidean space, which is denoted by $\mathbb{R}^n$. For example, $\mathbb{R}^2$ is the 2D plane, and $\mathbb{R}^3$ is 3D space.

Basis

  • Definition: A set of vectors $\{\phi_{i}\}_{i=1}^{n}$ is called a basis for $\mathbb{R}^n$ if the vectors in the set span $\mathbb{R}^n$ and are linearly independent. This means two things:
    1. Span the space: The vectors in the set can be scaled and added in some combination to create any vector in the entire space.
    2. Linearly Independent: There are no redundant vectors in the set. No vector in the basis can be created from a combination of the others.
  • Analogy: A basis is like a set of fundamental directions for a space. For the 2D plane ($\mathbb{R}^2$), the standard basis is the pair of vectors [1, 0] (the x-direction) and [0, 1] (the y-direction). You can get to any point on the plane by moving some amount in the x-direction and some amount in the y-direction.

Dimension

  • Definition: The dimension of a vector space is the number of vectors in any of its bases.

  • Key Property: For any given vector space, although there can be many different sets of basis vectors, every basis for that space will have the exact same number of vectors. This unique number defines the dimension of the space.

    • For example, any basis for the 2D plane ($\mathbb{R}^2$) must have exactly two vectors, so its dimension is 2.
    • A line passing through the origin is a 1-dimensional subspace because its basis consists of only one vector.
  • Rank
    The rank of a matrix is the maximum number of linearly independent columns (or, equivalently, rows) in the matrix. It represents the dimension of the subspace spanned by its columns.

    • Relevance in CS: The concept of rank is central to the low-rank matrix model. Just as a sparse vector has few non-zero entries, a low-rank matrix has “few” linearly independent columns/rows, meaning it contains less information than its ambient dimensions suggest. This structure is another type of simplicity that can be exploited for recovery from limited data.

Affine space

An affine space is essentially a vector subspace that has been shifted so that it does not necessarily pass through the origin.

Here’s a breakdown:

  • Vector Subspace: In $\mathbb{R}^2$, a one-dimensional vector subspace is a straight line that must pass through the origin (the point (0,0)).
  • Affine Space: An affine space is a generalization. It can be a point, a line, or a plane that does not need to go through the origin. It’s created by taking a vector subspace and adding a fixed vector to every point in it, effectively “translating” it.

The “one-dimensional affine space A” is illustrated in Figure 1.2 as a straight line that does not pass through the origin. The goal is to find the point on this line that is closest to the signal vector $x$.


A bit more math used in Compressed Sensing

Vector Space Concepts

$l_p$ Norms

For a vector $x \in \mathbb{R}^n$, the $l_p$ norm is defined as:

$$||x||_p = \begin{cases} \left(\sum_{i=1}^{n}|x_i|^p\right)^{\frac{1}{p}}, & p \in [1, \infty) \\ \max_{i=1,2,...,n}|x_i|, & p=\infty \end{cases}$$

where:

  • $x_i$ represents the i-th element of the vector $x$.

$l_0$ “Norm”

The $l_0$ “norm” counts the number of non-zero elements in a vector:

$$||x||_0 := |\text{supp}(x)|$$

where:

  • $\text{supp}(x)$ is the “support” of the vector $x$, which is the set of indices of its non-zero elements, i.e., $\{i : x_i \ne 0\}$.
  • $|\cdot|$ denotes the cardinality (the number of elements) of the set.
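
A small numerical sketch (using numpy, not from the chapter) of how these norms can be computed:

```python
import numpy as np

x = np.array([3.0, 0.0, -4.0, 0.0, 1.0])

# l_p norms for p in [1, inf)
l1 = np.sum(np.abs(x))               # ||x||_1 = 8.0
l2 = np.sqrt(np.sum(np.abs(x)**2))   # ||x||_2 ~= 5.1
linf = np.max(np.abs(x))             # ||x||_inf = 4.0

# l_0 "norm": the size of the support {i : x_i != 0}
l0 = np.count_nonzero(x)             # 3

print(l1, l2, linf, l0)
```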

Inner Product

The standard inner product in $\mathbb{R}^n$ is defined as:

$$\langle x,z \rangle = z^T x = \sum_{i=1}^{n} x_i z_i$$

where:

  • $z^T$ denotes the transpose of the vector $z$.

Basis and Frames

  • Basis: A set of vectors $\{\phi_i\}_{i=1}^{n}$ is a basis for $\mathbb{R}^n$ if they span the space and are linearly independent. Any vector $x$ has a unique representation $x = \sum_{i=1}^{n} c_i \phi_i$. In matrix form, this is $x = \Phi c$, where:
    • $\Phi$ is the $n \times n$ matrix whose columns are the basis vectors $\phi_i$.
    • $c$ is the vector containing the coefficients $c_i$.
  • Orthonormal Basis: A basis is orthonormal if $\langle \phi_i, \phi_j \rangle = \begin{cases} 1, & i=j \\ 0, & i \ne j \end{cases}$.
  • Frame: A set of vectors $\{\phi_i\}_{i=1}^{n}$ in $\mathbb{R}^d$ ($d<n$) is a frame if for any vector $x \in \mathbb{R}^d$, the following holds for constants $0 < A \le B < \infty$:

    $$A||x||_2^2 \le ||\Phi^T x||_2^2 \le B||x||_2^2$$

    • $A$ and $B$ are called the frame bounds. If $A=B$, the frame is tight.
  • Dual Frame: A frame $\tilde{\Phi}$ is a dual frame of $\Phi$ if $\Phi\tilde{\Phi}^T = \tilde{\Phi}\Phi^T = I$, where $I$ is the identity matrix. The canonical dual frame is given by $\tilde{\Phi} = (\Phi\Phi^T)^{-1}\Phi$, where $(\cdot)^{-1}$ denotes the matrix inverse.
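
A brief numpy sketch (the frame below is an arbitrary assumption, not an example from the text) illustrating the canonical dual frame and the identity $\tilde{\Phi}\Phi^T = I$:

```python
import numpy as np

# A hypothetical frame of n = 3 vectors in R^d with d = 2 (the columns of Phi)
Phi = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0]])

# Canonical dual frame: (Phi Phi^T)^{-1} Phi
Phi_dual = np.linalg.inv(Phi @ Phi.T) @ Phi

# Dual-frame identity: Phi @ Phi_dual^T equals the d x d identity
print(np.allclose(Phi @ Phi_dual.T, np.eye(2)))  # True

# Any x in R^2 is recovered from its frame coefficients Phi^T x via the dual frame
x = np.array([2.0, -1.0])
coeffs = Phi.T @ x
print(np.allclose(Phi_dual @ coeffs, x))         # True
```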

Signal Models

Sparse and Compressible Signals

  • k-sparse signal: A signal $x$ is k-sparse if it has at most $k$ non-zero entries, i.e., $||x||_0 \le k$. The set of all k-sparse signals is denoted by $\Sigma_k = \{x: ||x||_0 \le k\}$.
  • Compressible signal: A signal is compressible if it can be well-approximated by a sparse signal. The approximation error is given by:

    $$\sigma_k(x)_p = \min_{\hat{x} \in \Sigma_k} ||x-\hat{x}||_p$$

    • Here, $\hat{x}$ is the sparse approximation of the signal $x$ from the set $\Sigma_k$ (a numerical sketch of the best k-term approximation follows this list).
  • Power Law Decay: A signal is compressible if its sorted coefficients $|c_i|$ decay according to a power law:

    $$|c_i| \le C_1 i^{-q}$$

    • $C_1$ and $q$ are positive constants.
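
A minimal numpy sketch (my own illustration, not from the text) of the best k-term approximation achieving $\sigma_k(x)_p$: keep the k largest-magnitude entries and zero out the rest.

```python
import numpy as np

def best_k_term(x, k):
    """Return the best k-sparse approximation of x (keep the k largest-magnitude entries)."""
    x_hat = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]   # indices of the k largest |x_i|
    x_hat[idx] = x[idx]
    return x_hat

x = np.array([5.0, -0.1, 2.0, 0.05, -0.3])
x_hat = best_k_term(x, k=2)

# sigma_k(x)_2: the l2 error of the best k-term approximation
sigma_k = np.linalg.norm(x - x_hat, ord=2)
print(x_hat, sigma_k)
```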

Union of Subspaces and Low-Rank Models

  • Union of Subspaces: A signal $x$ lies in a union of $M$ subspaces if:

    $$x \in \mathcal{U} = \bigcup_{i=1}^{M} \mathcal{U}_i$$

    • $\mathcal{U}_i$ are the individual subspaces, and $M$ is the total number of subspaces in the union.
  • Low-Rank Matrix: The set of low-rank matrices is defined as:

    $$\mathcal{L}_r = \{M \in \mathbb{R}^{n_1 \times n_2} : \text{rank}(M) \le r\}$$

    • $M$ is a matrix of size $n_1 \times n_2$.
    • $r$ is the maximum rank of the matrices in the set.
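
As an illustration (my own, not from the chapter), the rank of a matrix can be checked numerically, and the best rank-r approximation is obtained by truncating its SVD:

```python
import numpy as np

# A hypothetical 4x5 matrix built to have rank 2
u = np.array([[1.0], [2.0], [0.0], [1.0]])
v = np.array([[1.0, 0.0, 1.0, -1.0, 2.0]])
w = np.array([[0.0], [1.0], [1.0], [3.0]])
z = np.array([[2.0, 1.0, 0.0, 0.0, 1.0]])
M = u @ v + w @ z

print(np.linalg.matrix_rank(M))  # 2

# Best rank-r approximation via the truncated SVD
r = 1
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
print(np.linalg.norm(M - M_r))   # approximation error (Frobenius norm)
```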

Signal Recovery

$l_1$ Minimization

This involves solving one of the following optimization problems:

  • Noise-free case (Basis Pursuit):

    $$\hat{x} = \arg\min_{z} ||z||_1 \quad \text{subject to} \quad Az=y$$

  • Bounded noise case:

    $$\hat{x} = \arg\min_{z} ||z||_1 \quad \text{subject to} \quad ||Az-y||_2 \le \epsilon$$

  • Dantzig Selector:

    $$\hat{x} = \arg\min_{z} ||z||_1 \quad \text{subject to} \quad ||A^T(Az-y)||_\infty \le \lambda$$

  • LASSO (unconstrained form):

    $$\hat{x} = \arg\min_z \frac{1}{2} ||Az-y||_2^2 + \lambda||z||_1$$

In these formulas:

  • $\hat{x}$ is the estimated signal.
  • $\arg\min_{z}$ finds the vector $z$ that minimizes the given objective function.
  • $\epsilon$ is a constant representing the upper bound on the norm of the noise.
  • $\lambda$ is a regularization parameter that balances the trade-off between the data fidelity term ($||Az-y||_2^2$) and the sparsity term ($||z||_1$).
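
Many solvers exist for these problems. As one simple illustration (my own sketch, not an algorithm from the chapter), the unconstrained LASSO form can be approached with iterative soft-thresholding (ISTA):

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding: the proximal operator of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam, n_iter=500):
    """Minimize 0.5*||Az - y||_2^2 + lam*||z||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ z - y)
        z = soft_threshold(z - grad / L, lam / L)
    return z

# Toy example with an assumed random Gaussian sensing matrix
rng = np.random.default_rng(0)
n, m, k = 50, 20, 3
A = rng.standard_normal((m, n)) / np.sqrt(m)
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x

x_hat = ista(A, y, lam=0.01)
print(np.linalg.norm(x_hat - x))  # recovery error
```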

Recovery Guarantees

  • Theorem 1.8 (Noise-Free Recovery): If $A$ satisfies the RIP of order 2k with $\delta_{2k} < \sqrt{2}-1$, the solution $\hat{x}$ to the noise-free $l_1$ minimization problem obeys:

    $$||\hat{x}-x||_2 \le C_0 \frac{\sigma_k(x)_1}{\sqrt{k}}$$

  • Theorem 1.9 (Bounded Noise): Under similar RIP conditions and with noise $e$ where $||e||_2 \le \epsilon$, the solution obeys:

    $$||\hat{x}-x||_2 \le C_0 \frac{\sigma_k(x)_1}{\sqrt{k}} + C_2 \epsilon$$

In these theorems:

  • $C_0$ and $C_2$ are constants that depend on the RIP constant $\delta_{2k}$.
  • $e$ represents a noise vector corrupting the measurements.

Multiple Measurement Vector (MMV) Problem

This problem involves recovering a set of $l$ sparse vectors $\{x_i\}_{i=1}^l$ that share a common support. These vectors form the columns of a matrix $X$ which is row-sparse.

  • Uniqueness Condition (Theorem 1.15): A row-sparse matrix $X$ is uniquely determined by $Y=AX$ if:

    $$|\text{supp}(X)| < \frac{\text{spark}(A)-1+\text{rank}(X)}{2}$$

    • $Y$ is the $m \times l$ matrix of measurements, with columns being the measurements $y_i$ for each signal $x_i$.
  • $l_{p,q}$ Norms: Used for matrix recovery, defined as $||X||_{p,q} = (\sum_i ||x^i||_p^q)^{1/q}$, where $x^i$ is the i-th row of the matrix $X$.
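
A small numpy sketch (my own illustration) of the $l_{p,q}$ mixed norm for a row-sparse matrix, taking $p=2$, $q=1$ as an example:

```python
import numpy as np

def lpq_norm(X, p=2, q=1):
    """Mixed l_{p,q} norm: l_p norm of each row, then the l_q norm of those values."""
    row_norms = np.linalg.norm(X, ord=p, axis=1)
    return np.linalg.norm(row_norms, ord=q)

# A row-sparse matrix: only rows 0 and 3 are non-zero (common support across columns)
X = np.zeros((5, 3))
X[0] = [1.0, -2.0, 0.5]
X[3] = [0.0, 4.0, 1.0]

print(lpq_norm(X, p=2, q=1))  # the l_{2,1} norm, which promotes row sparsity
```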

Understand matrix properties in compressed sensing

The chapter introduces several properties of the sensing matrix $A$, each with different strengths, guarantees, and requirements. Organizing them is the best way to see the relationships between them.

Here is an organization of the sensing matrix properties discussed in the chapter, from the most fundamental to the most powerful.

1. Spark

This is the most fundamental condition for guaranteeing a unique sparse solution.

Definition

The spark of a matrix $A$, denoted $\text{spark}(A)$, is the smallest number of columns of $A$ that are linearly dependent.

I find the definition on Wikipedia more precise:
https://en.wikipedia.org/wiki/Spark_(mathematics)
The spark of an $m \times n$ matrix $A$ is the smallest integer $k$ such that there exists a set of $k$ columns of $A$ that are linearly dependent.

What it Guarantees

Theorem 1.1: For any vector $y \in \mathbb{R}^m$, there is at most one signal $x \in \Sigma_k$ such that $y=Ax$ if and only if $\text{spark}(A) > 2k$.

It provides the definitive “if and only if” condition for the unique recovery of exactly k-sparse signals in a noise-free setting.

Key Requirement

  • For unique recovery of any k-sparse signal, the condition is $\text{spark}(A) > 2k$.

  • Requirement on $m$: It is easy to see that $\text{spark}(A) \in [2, m+1]$. The condition $\text{spark}(A) > 2k$ therefore immediately implies a lower bound on the number of measurements: $m \ge 2k$.

Practicality

Calculating the spark for a general matrix is NP-hard and computationally infeasible for large matrices. (https://en.wikipedia.org/wiki/Spark_(mathematics))
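
For very small matrices the spark can still be found by brute force over column subsets, which is only meant to illustrate the definition; the cost grows combinatorially, which is exactly why this is infeasible in general. A sketch (my own, not from the text):

```python
import numpy as np
from itertools import combinations

def spark(A, tol=1e-10):
    """Brute-force spark: the smallest number of linearly dependent columns of A."""
    m, n = A.shape
    for size in range(1, n + 1):
        for cols in combinations(range(n), size):
            sub = A[:, cols]
            if np.linalg.matrix_rank(sub, tol=tol) < size:  # these columns are dependent
                return size
    return n + 1  # convention when all columns are independent (only possible if n <= m)

A = np.array([[1.0, 0.0, 1.0,  1.0],
              [0.0, 1.0, 1.0, -1.0]])
print(spark(A))  # 3: no single column or pair is dependent, but any 3 columns in R^2 are
```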

2. Null Space Property (NSP)

The NSP is a more refined condition on the null space of $A$ that is necessary for robust recovery.

Definition

A matrix $A$ has the Null Space Property of order $k$ if there exists a constant $C>0$ such that the following holds for all $h \in \mathcal{N}(A)$ and all index sets $\Lambda$ with $|\Lambda| \le k$:

$$||h_\Lambda||_2 \le C \frac{||h_{\Lambda^c}||_1}{\sqrt{k}}$$

  • $\mathcal{N}(A)$ is the null space of matrix $A$, which is the set of all vectors $z$ for which $Az=0$.
  • $\Lambda$ is a subset of indices $\{1, 2, ..., n\}$.
  • $\Lambda^c$ is the complement of $\Lambda$.
  • $h_\Lambda$ is a vector formed by keeping the entries of $h$ indexed by $\Lambda$ and setting all other entries to zero.
  • $h_{\Lambda^c}$ is a vector formed by keeping the entries of $h$ indexed by $\Lambda^c$ and setting all other entries to zero.

In other words, a matrix $A$ satisfying the NSP of order $k$ means that vectors in its null space are not overly concentrated on a small set of indices.

What it Guarantees

The NSP is presented as a necessary condition for any stable recovery algorithm.

Theorem 1.2: Let $A : \mathbb{R}^n \rightarrow \mathbb{R}^m$ denote a sensing matrix and $\Delta : \mathbb{R}^m \rightarrow \mathbb{R}^n$ denote an arbitrary recovery algorithm. If the pair $(A, \Delta)$ satisfies

$$\left\| \Delta(Ax)-x \right\|_{2} \le C \frac{\sigma_{k}(x)_{1}}{\sqrt{k}}$$

for all $x$, then $A$ satisfies the NSP of order $2k$.

It shows that if any algorithm can robustly recover signals from compressive measurements, then the matrix $A$ must satisfy the NSP of order $2k$.

Practicality

Like the spark, it is generally hard to verify directly for a given matrix.

3. Restricted Isometry Property (RIP)

The RIP is the most powerful property discussed, forming the basis for many of the strongest theoretical guarantees in compressed sensing.

Definition

A matrix $A$ satisfies the RIP of order $k$ if there exists a $\delta_k \in (0,1)$ such that for all k-sparse signals $x \in \Sigma_k$:

$$(1-\delta_k)||x||_2^2 \le ||Ax||_2^2 \le (1+\delta_k)||x||_2^2$$

  • $\delta_k$ is the “restricted isometry constant” of order $k$.

In other words, a matrix $A$ satisfies the RIP of order $k$ if it approximately preserves the Euclidean length of all k-sparse vectors.
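
Verifying the RIP exactly is infeasible, but the near-isometry it describes can be observed empirically. The sketch below (my own illustration, with an assumed random Gaussian matrix; it is an empirical check over sampled vectors, not a proof of the RIP) measures $||Ax||_2^2 / ||x||_2^2$ over many random k-sparse vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 200, 80, 5

# Random Gaussian sensing matrix, scaled so that E[||Ax||_2^2] = ||x||_2^2
A = rng.standard_normal((m, n)) / np.sqrt(m)

ratios = []
for _ in range(2000):
    x = np.zeros(n)
    support = rng.choice(n, k, replace=False)
    x[support] = rng.standard_normal(k)
    ratios.append(np.linalg.norm(A @ x) ** 2 / np.linalg.norm(x) ** 2)

# The spread of these ratios around 1 hints at how small delta_k could be
print(min(ratios), max(ratios))
```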

What it Guarantees

The RIP is a sufficient condition for many different algorithms (like $l_1$ minimization) to provide stable and robust recovery of sparse signals, even in the presence of noise.

Key Requirements & Properties

1. Necessity for Stability

C-stable definition: Let $A : \mathbb{R}^n \rightarrow \mathbb{R}^m$ denote a sensing matrix and $\Delta : \mathbb{R}^m \rightarrow \mathbb{R}^n$ denote a recovery algorithm. We say that the pair $(A, \Delta)$ is C-stable if for any $x \in \Sigma_{k}$ and any $e \in \mathbb{R}^{m}$ we have that

$$\left\| \Delta(Ax+e)-x \right\|_{2} \le C \left\| e \right\|_{2}$$

This definition simply says that if we add a small amount of noise to the measurements, then the impact of this on the recovered signal should not be arbitrarily large.

Theorem 1.3: If a pair $(A, \Delta)$ is C-stable, then

$$\frac{1}{C} \left\| x \right\|_{2} \le \left\| Ax \right\|_{2}$$

for all $x \in \Sigma_{2k}$.

Theorem 1.3 shows that the RIP’s lower bound is necessary for any recovery algorithm to be stable against measurement noise.

2. Lower Bound on mm

Theorem 1.4: Let $A$ be an $m \times n$ matrix that satisfies the RIP of order $2k$ with constant $\delta_{2k} \in \left(0, \frac{1}{2}\right]$. Then

$$m \ge Ck\log \left(\frac{n}{k}\right)$$

where $C$ is a constant depending only on $\delta_{2k}$.

The restriction to $\delta_{2k} \le 1/2$ is arbitrary and is made merely for convenience; minor modifications to the argument establish bounds for $\delta_{2k} \le \delta_{\max}$ for any $\delta_{\max} < 1$.

Theorem 1.4 provides a precise lower bound, $m \ge Ck \log(n/k)$ (i.e., $m = \Omega(k \log(n/k))$), that is necessary to satisfy the RIP; it tells us the fundamental minimum number of measurements needed.

3. Relationship to NSP

Theorem 1.5: Suppose that $A$ satisfies the RIP of order $2k$ with $\delta_{2k} < \sqrt{2} - 1$. Then $A$ satisfies the NSP of order $2k$ with constant

$$C = \frac{2}{1 - (1 + \sqrt{2})\delta_{2k}}$$

RIP is a stronger condition than NSP. Theorem 1.5 shows that if a matrix satisfies the RIP (with a sufficiently small constant), it automatically satisfies the NSP.

Practicality

It is computationally infeasible to verify the RIP for large matrices. However, its main power comes from the fact that random matrices can be proven to satisfy the RIP with very high probability when $m$ is on the order of $k \log(n/k)$.

4. Coherence

Coherence is the most practical and easily computable property, though it often leads to stricter requirements.

Definition

The coherence $\mu(A)$ of a matrix $A$ is the largest absolute inner product between any two distinct normalized columns $a_i, a_j$:

$$\mu(A) = \max_{1 \le i < j \le n} \frac{|\langle a_i, a_j \rangle|}{||a_i||_2 ||a_j||_2}$$

  • $a_i$ and $a_j$ are columns of the matrix $A$.

What it Guarantees

It provides simple, checkable conditions for unique recovery and stability.

Key Requirements & Properties

1. Uniqueness Condition

Theorem 1.7: If

$$k < \frac{1}{2} \left( 1 + \frac{1}{\mu (A)} \right)$$

then for each measurement vector $y \in \mathbb{R}^m$ there exists at most one signal $x \in \Sigma_{k}$ such that $y = Ax$.

This uses coherence to provide a sufficient condition for uniqueness. It often requires $k$ to be much smaller than what RIP would allow.

It is possible to show that the coherence of a matrix is always in the range $\mu(A) \in \left[\sqrt{\frac{n-m}{m(n-1)}}, 1 \right]$; the lower bound is known as the Welch bound.

Theorem 1.7, together with the Welch bound, provides an upper bound on the level of sparsity $k$ that guarantees uniqueness using coherence: $k = O(\sqrt{m})$.
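
A short numpy sketch (my own illustration) that computes the coherence of a matrix and compares it to the Welch bound:

```python
import numpy as np

def coherence(A):
    """Largest absolute inner product between distinct normalized columns of A."""
    A_norm = A / np.linalg.norm(A, axis=0)   # normalize each column
    G = np.abs(A_norm.T @ A_norm)            # Gram matrix of normalized columns
    np.fill_diagonal(G, 0.0)                 # ignore the diagonal (i = j)
    return G.max()

rng = np.random.default_rng(0)
m, n = 20, 60
A = rng.standard_normal((m, n))

welch = np.sqrt((n - m) / (m * (n - 1)))     # Welch lower bound on mu(A)
print(coherence(A), welch)
```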

2. Relationship to Spark

Lemma 1.4: For any matrix $A$,

$$\text{spark}(A) \ge 1 + \frac{1}{\mu (A)}$$

Coherence provides a lower bound on the spark.

3. Relationship to RIP

Lemma 1.5: If $A$ has unit-norm columns and coherence $\mu = \mu(A)$, then $A$ satisfies the RIP of order $k$ with $\delta_{k} = (k-1)\mu$ for all $k < 1/\mu$.

Coherence can be used to establish a (sometimes loose) RIP constant for a matrix.

Practicality

Coherence is the only property listed that is easy to compute for any given matrix.

Summary Table

| Property | Definition | What it Guarantees | Key Requirement(s) on $k$ or $m$ | Practicality |
| --- | --- | --- | --- | --- |
| Spark | Smallest # of linearly dependent columns | Necessary and sufficient condition: uniqueness for k-sparse signals | $\text{spark}(A) > 2k$ (implies $m \ge 2k$) | Hard to compute |
| NSP | Null space vectors are not “sparse” | Necessary for any robust/stable recovery | Must hold for order $2k$ | Hard to verify directly |
| RIP | Near-isometry for sparse vectors | Sufficient for stable/robust recovery for many algorithms | $m = O(k \log(n/k))$ | Hard to verify, but some random matrices satisfy it with high probability |
| Coherence ($\mu$) | Max inner product between columns | Uniqueness and stability, but with stricter conditions | $k$ must be small, typically $O(1/\mu)$ | Easy to compute |

Understand different error bounds of signal recovery via l1 minimization

Here is an organization of the signal recovery theorems for $l_1$ minimization, grouped by the main assumption made about the sensing matrix $A$.

Group 1: Guarantees Based on the Restricted Isometry Property (RIP)

These theorems rely on the Restricted Isometry Property (RIP), a strong theoretical condition ensuring that the matrix A nearly preserves the length of sparse vectors.

1.1 The Ideal Case: Noise-Free Recovery

This is the baseline scenario where measurements are perfect.

  • Theorem: Theorem 1.8
  • Conditions: $A$ satisfies the RIP of order 2k, and the measurements are exact ($y=Ax$).
  • Algorithm: Basis Pursuit ($\min ||z||_1$ subject to $Az=y$).
  • Error Bound: The recovery error is bounded by how “compressible” the original signal x is:

    $$||\hat{x}-x||_{2}\le C_{0}\frac{\sigma_{k}(x)_{1}}{\sqrt{k}}$$

    • If the signal x is truly k-sparse, then its approximation error $\sigma_k(x)_1$ is zero, and the recovery is exact ($||\hat{x}-x||_2 = 0$).

1.2 The Realistic Case: Bounded Noise

This scenario assumes the measurement noise is contained within a known energy level.

  • Theorem: Theorem 1.9
  • Conditions: $A$ satisfies RIP. The measurement noise e is bounded by $||e||_2 \le \epsilon$.
  • Algorithm: A quadratically constrained version of Basis Pursuit ($\min ||z||_1$ subject to $||Az-y||_2 \le \epsilon$).
  • Error Bound: The error has two parts: one from the signal’s compressibility and one from the noise:

    $$||\hat{x}-x||_{2}\le C_{0}\frac{\sigma_{k}(x)_{1}}{\sqrt{k}}+C_{2}\epsilon$$

1.3 A Specific Noise Model: Gaussian Noise

This is a very common scenario in practice. There are two main approaches.

  • Theorem: Corollary 1.1 (derived from Theorem 1.9)

    • Conditions: $A$ satisfies RIP, and the noise is i.i.d. Gaussian.
    • Guarantee: The error is proportional to the square root of the number of measurements m and to the noise level $\sigma$: $||\hat{x}-x||_2 \propto \sqrt{m}\sigma$.
  • Theorem: Corollary 1.2 (derived from the Dantzig Selector, Theorem 1.10)

    • Conditions: $A$ has unit-norm columns, satisfies RIP, and the noise is i.i.d. Gaussian.
    • Guarantee: The error is proportional to $\sqrt{k \log n}$ and to the noise level $\sigma$: $||\hat{x}-x||_2 \propto \sqrt{k \log n}\sigma$. This is often a much tighter and more useful bound, as $k \log n$ is typically much smaller than $m$.

Group 2: Guarantees Based on Coherence

These theorems rely on coherence ($\mu$), which measures the maximum correlation between any two columns of $A$. Coherence is easier to calculate than RIP but often leads to stricter conditions on the signal’s sparsity $k$.

  • Theorem: Theorem 1.11

    • Conditions: $A$ has a certain coherence $\mu$, and the sparsity k must be small relative to it ($k < (1/\mu+1)/4$).
    • Error Bound: Provides a worst-case error bound for bounded noise:

      $$||x-\hat{x}||_{2}\le\frac{||e||_{2}+\epsilon}{\sqrt{1-\mu(4k-1)}}$$

  • Theorem: Theorem 1.12 (using the LASSO algorithm)

    • Conditions: $A$ has coherence $\mu$, the sparsity k is small ($k \le 1/(3\mu)$), and the noise is Gaussian.
    • Error Bound: In addition to an $l_2$ error bound, this provides a powerful guarantee on recovering the correct sparse structure: with high probability, the recovered non-zero entries are a subset of the true non-zero entries ($\text{supp}(\hat{x}) \subset \text{supp}(x)$).

Group 3: Theorems on the Nature of the Error Bounds

This final group explains why the error bounds from the RIP-based theorems look the way they do.

  • The “Pessimistic” Result:

    • Theorem: Theorem 1.13
    • What it says: It is impossible for any recovery algorithm to provide a “simple” error bound of the form $||\hat{x}-x||_2 \le C\sigma_k(x)_2$ unless you take a huge number of measurements ($m \approx n$). This explains why the theorems use the more complicated $\sigma_k(x)_1/\sqrt{k}$ term.
  • The “Optimistic” Result (with Randomness):

    • Theorem: Theorem 1.14
    • What it says: This result shows a way around the limitation of Theorem 1.13. If you use a randomized sensing matrix $A$, you can achieve the “simple” error bound $||\hat{x}-x||_2 \le C\sigma_k(x)_2$ with high probability, while still using very few measurements. This highlights a key advantage of using random matrices in compressed sensing.

Deterministic or Probabilistic Guarantees

Key Recovery Formulations

The following theorems provide guarantees for solutions to sparse recovery problems. These problems are typically formulated as optimizations, with the goal of finding a signal $x$ that is both sparse and consistent with the measurements $y$. The chapter focuses on the following formulations.

  • The Ideal $l_0$ Minimization (Computationally Intractable)
    This is the direct statement of finding the sparsest solution, but it is NP-hard and cannot be solved efficiently for large problems.

    $$\hat{x} = \arg\min_{z} ||z||_0 \quad \text{subject to} \quad z\in\mathcal{B}(y) \quad \text{(1.10)}$$

  • The Constrained $l_1$ Minimization (Basis Pursuit-style)
    This is the practical, convex relaxation of the $l_0$ problem and is the focus of most of the theorems. The constraint set $\mathcal{B}(y)$ changes based on the assumptions about noise.

    $$\hat{x} = \arg\min_{z} ||z||_1 \quad \text{subject to} \quad z\in\mathcal{B}(y) \quad \text{(1.12)}$$

  • The Unconstrained $l_1$ Minimization (LASSO)
    This is an alternative, unconstrained formulation that is equivalent to the constrained version for a specific choice of the regularization parameter $\lambda$.

    $$\hat{x} = \arg\min_z \frac{1}{2} ||Az-y||_2^2 + \lambda ||z||_1 \quad \text{(1.15)}$$

Note that for some choice of the parameter $\lambda$, the optimization problem (1.15) will yield the same result as the constrained version of the problem given by (1.12) with $\mathcal{B}(y)=\{z:||Az-y||_{2}\le\epsilon\}$.
However, in general the value of $\lambda$ which makes these problems equivalent is unknown a priori.


Types of Guarantees

The chapter presents two main flavors of guarantees:

  1. Deterministic Guarantees: These are absolute “if-then” statements. If you have a matrix $A$ with a specific property (like RIP), then a certain error bound is guaranteed to hold. The challenge is that deterministically constructing a matrix $A$ with these properties for large systems can be very difficult.
  2. Probabilistic Guarantees: These state that an error bound will hold with high probability, where the probability comes from the random choice of the matrix $A$ or the random nature of the noise. The most common approach in compressed sensing is to choose the matrix $A$ randomly. The guarantee then is that with very high probability, the chosen matrix will have the desired properties (like RIP) and the recovery will be successful.

Group 1: Guarantees Based on the Restricted Isometry Property (RIP)

These theorems rely on the RIP, a strong theoretical condition ensuring the matrix $A$ nearly preserves the length of sparse vectors.

1.1 The Ideal Case: Noise-Free Recovery

This is the baseline scenario where measurements are perfect.

Theorem 1.8

Suppose that $A$ satisfies the RIP of order $2k$ with $\delta_{2k}<\sqrt{2}-1$ and we obtain measurements of the form $y=Ax$. Then when $\mathcal{B}(y)=\{z: Az=y\}$, the solution $\hat{x}$ to (1.12) obeys

$$||\hat{x}-x||_{2}\le C_{0}\frac{\sigma_{k}(x)_{1}}{\sqrt{k}}$$

where

$$C_0 = 2 \frac{1 - (1 - \sqrt{2}) \delta_{2k}}{1 - (1 + \sqrt{2}) \delta_{2k}}$$

  • Type of Guarantee: Deterministic.
  • Conditions: $A$ satisfies the RIP of order $2k$, and the measurements are exact ($y=Ax$).

Error Bound

$$||\hat{x}-x||_{2}\le C_{0}\frac{\sigma_{k}(x)_{1}}{\sqrt{k}}$$

If the signal x is truly k-sparse, then its approximation error $\sigma_k(x)_1$ is zero, and the recovery is exact ($||\hat{x}-x||_2 = 0$).


1.2 The Realistic Case: Bounded Noise

This scenario assumes the measurement noise is contained within a known energy level.

Theorem 1.9

Suppose that $A$ satisfies the RIP of order $2k$ with $\delta_{2k}<\sqrt{2}-1$ and let $y=Ax+e$ where $||e||_{2}\le\epsilon$. Then when $\mathcal{B}(y)=\{z:||Az-y||_{2}\le\epsilon\}$, the solution $\hat{x}$ to (1.12) obeys

$$||\hat{x}-x||_{2}\le C_{0}\frac{\sigma_{k}(x)_{1}}{\sqrt{k}}+C_{2}\epsilon$$

where

$$C_0 = 2 \frac{1 - (1 - \sqrt{2}) \delta_{2k}}{1 - (1 + \sqrt{2}) \delta_{2k}}, \quad C_2 = 4 \frac{\sqrt{1 + \delta_{2k}}}{1 - (1 + \sqrt{2}) \delta_{2k}}$$

  • Type of Guarantee: Deterministic.
  • Conditions: $A$ satisfies RIP. The measurement noise $e$ is bounded by $||e||_2 \le \epsilon$.

Error Bound

The error has two parts: one from the signal’s compressibility and one from the noise.

$$||\hat{x}-x||_{2}\le C_{0}\frac{\sigma_{k}(x)_{1}}{\sqrt{k}}+C_{2}\epsilon$$


Theorem 1.10

Suppose that $A$ satisfies the RIP of order $2k$ with $\delta_{2k}<\sqrt{2}-1$ and we obtain measurements of the form $y=Ax+e$ where $||A^{T}e||_{\infty}\le\lambda$. Then when $\mathcal{B}(y)=\{z:||A^{T}(Az-y)||_{\infty}\le\lambda\}$, the solution $\hat{x}$ to (1.12) obeys

$$||\hat{x}-x||_{2}\le C_{0}\frac{\sigma_{k}(x)_{1}}{\sqrt{k}}+C_{3}\sqrt{k}\lambda$$

where

$$C_0 = 2 \frac{1 - (1 - \sqrt{2}) \delta_{2k}}{1 - (1 + \sqrt{2}) \delta_{2k}}, \quad C_3 = \frac{4 \sqrt{2}}{1 - (1 + \sqrt{2}) \delta_{2k}}$$

  • Type of Guarantee: Deterministic.
  • Conditions: $A$ satisfies RIP. The noise is bounded such that $||A^T e||_\infty \le \lambda$.
  • Algorithm: The Dantzig Selector (Eq. 1.12 with $\mathcal{B}(y)=\{z:||A^{T}(Az-y)||_{\infty}\le\lambda\}$).
  • Error Bound: $||\hat{x}-x||_{2}\le C_{0}\frac{\sigma_{k}(x)_{1}}{\sqrt{k}}+C_{3}\sqrt{k}\lambda$

1.3 A Specific Noise Model: Gaussian Noise

This is a very common scenario in practice. There are two main approaches.

Corollary 1.1 (derived from Theorem 1.9)

Suppose that $A$ satisfies the RIP of order $2k$ with $\delta_{2k}<\sqrt{2}-1$. Furthermore, suppose that $x\in\Sigma_{k}$ and that we obtain measurements of the form $y=Ax+e$ where the entries of $e$ are i.i.d. $\mathcal{N}(0,\sigma^{2})$. Then when $\mathcal{B}(y)=\{z:||Az-y||_{2}\le 2\sqrt{m}\sigma\}$, the solution $\hat{x}$ to (1.12) obeys

$$||\hat{x}-x||_{2}\le8\frac{\sqrt{1+\delta_{2k}}}{1-(1+\sqrt{2})\delta_{2k}}\sqrt{m}\sigma$$

with probability at least $1-\exp(-c_{0}m)$, where $c_0 > 0$ is a constant.

  • Type of Guarantee: Probabilistic (over the random draw of the noise).
  • Conditions: $A$ satisfies RIP, the noise is i.i.d. Gaussian, and $x$ is k-sparse.

Error Bound

The error is proportional to the square root of the number of measurements $m$ and to the noise level $\sigma$:

$$||\hat{x}-x||_{2}\le8\frac{\sqrt{1+\delta_{2k}}}{1-(1+\sqrt{2})\delta_{2k}}\sqrt{m}\sigma$$


Corollary 1.2 (derived from the Dantzig Selector, Theorem 1.10)

Suppose that $A$ has unit-norm columns and satisfies the RIP of order $2k$ with $\delta_{2k}<\sqrt{2}-1$. Furthermore, suppose that $x\in\Sigma_{k}$ and that we obtain measurements of the form $y=Ax+e$ where the entries of $e$ are i.i.d. $\mathcal{N}(0,\sigma^{2})$. Then when $\mathcal{B}(y)=\{z:||A^{T}(Az-y)||_{\infty}\le 2\sqrt{\log n}\,\sigma\}$, the solution $\hat{x}$ to (1.12) obeys

$$||\hat{x}-x||_{2}\le 4\sqrt{2}\frac{\sqrt{1+\delta_{2k}}}{1-(1+\sqrt{2})\delta_{2k}}\sqrt{k \log n}\,\sigma$$

with probability at least $1-\frac{1}{n}$.

  • Type of Guarantee: Probabilistic (over the random draw of the noise).
  • Conditions: $A$ satisfies RIP, has unit-norm columns, the noise is i.i.d. Gaussian, and $x$ is k-sparse.

Error Bound

The error is proportional to the square root of the sparsity $k$, the square root of $\log n$, and the noise level $\sigma$: $||\hat{x}-x||_2 \propto \sqrt{k \log n}\,\sigma$. This can be a more useful bound, as $k \log n$ may be much smaller than $m$.

$$||\hat{x}-x||_{2}\le4\sqrt{2}\frac{\sqrt{1+\delta_{2k}}}{1-(1+\sqrt{2})\delta_{2k}}\sqrt{k \log n}\,\sigma$$


Group 2: Guarantees Based on Coherence

These theorems rely on coherence ($\mu$), which measures the maximum correlation between any two columns of $A$. Coherence is easier to calculate than RIP but often leads to stricter conditions on the signal’s sparsity $k$ (the signal must be sparser).

Theorem 1.11

Suppose that $A$ has coherence $\mu$ and that $x\in\Sigma_{k}$ with $k<(1/\mu+1)/4$. Furthermore, suppose that we obtain measurements of the form $y=Ax+e$. Then when $\mathcal{B}(y)=\{z:||Az-y||_{2}\le\epsilon\}$, the solution $\hat{x}$ to (1.12) obeys

$$||x-\hat{x}||_{2}\le\frac{||e||_{2}+\epsilon}{\sqrt{1-\mu(4k-1)}}$$

  • Type of Guarantee: Deterministic.
  • Conditions: $k$ is small ($k < (1/\mu+1)/4$), and $x$ is k-sparse.

Error Bound

$$||x-\hat{x}||_{2}\le\frac{||e||_{2}+\epsilon}{\sqrt{1-\mu(4k-1)}}$$

This provides a worst-case error bound for bounded noise, which will typically overestimate the actual error.


Theorem 1.12

Suppose that $A$ has coherence $\mu$ and that $x\in \Sigma_k$ with $k\le1/(3\mu)$. Furthermore, suppose that we obtain measurements of the form $y=Ax+e$ where the entries of $e$ are i.i.d. $\mathcal{N}(0,\sigma^{2})$. Set

$$\lambda = \sqrt{8\sigma^2(1+\alpha) \log (n-k)}$$

for some fairly small value $\alpha>0$. Then with probability exceeding

$$\left(1-\frac{1}{(n-k)^{\alpha}}\right)(1-\exp (-k/7))$$

the solution $\hat{x}$ to (1.15) is unique, $\text{supp}(\hat{x})\subset \text{supp}(x)$, and it satisfies the following error bound.

  • Type of Guarantee: Probabilistic (over the random draw of the noise).
  • Conditions: $k$ is small ($k \le 1/(3\mu)$), the noise is i.i.d. Gaussian, and $x$ is k-sparse.
  • Algorithm: LASSO (Eq. 1.15).

Error Bound

$$||\hat{x}-x||_{2}^{2} \le \left(\sqrt{3}+3\sqrt{2(1+\alpha)\log(n-k)}\right)^{2}k\sigma^{2}$$

In addition to an $l_2$ error bound, this provides a powerful guarantee on recovering the correct sparse structure: with high probability, the recovered non-zero entries are a subset of the true non-zero entries ($\text{supp}(\hat{x}) \subset \text{supp}(x)$).


Group 3: Theorems on the Nature of the Error Bounds

This final group explains why the error bounds from the RIP-based theorems look the way they do.

Theorem 1.13 (The “Pessimistic” Result)

Suppose that $A$ is an $m\times n$ matrix and that $\Delta:\mathbb{R}^{m}\rightarrow\mathbb{R}^{n}$ is a recovery algorithm that satisfies

$$||x-\Delta(Ax)||_{2}\le C\sigma_{k}(x)_{2}$$

for some $k\ge1$. Then $m > \left(1-\sqrt{1-1/C^{2}}\right)n$.

  • Type of Guarantee: Deterministic.
  • What it says: It is impossible for any algorithm with a deterministic matrix $A$ to satisfy the “simple” error bound $||\hat{x}-x||_{2}\le C\sigma_{k}(x)_{2}$ unless you take a huge number of measurements ($m \approx n$). This explains why the theorems use the $\sigma_k(x)_1/\sqrt{k}$ term.

Theorem 1.14 (The “Optimistic” Result with Randomness)

Let $x\in\mathbb{R}^{n}$ be fixed. Set $\delta_{2k}<\sqrt{2}-1$. Suppose that $A$ is an $m\times n$ sub-gaussian random matrix with $m=O(k \log (n/k)/\delta_{2k}^{2})$. Suppose we obtain measurements of the form $y=Ax$. Set $\epsilon=2\sigma_{k}(x)_{2}$. Then with probability exceeding $1-2\exp(-c_{1}\delta_{2k}^{2}m)-\exp(-c_{0}m)$, when $\mathcal{B}(y)=\{z:||Az-y||_{2}\le\epsilon\}$, the solution $\hat{x}$ to (1.12) obeys the following bound.

  • Type of Guarantee: Probabilistic (over the random choice of the matrix A).
  • What it says: If you use a random sensing matrix $A$, you can achieve the “simple” error bound with high probability, using far fewer measurements.
  • Error Bound:

$$||\hat{x}-x||_{2}\le\frac{8\sqrt{1+\delta_{2k}}-(1+\sqrt{2})\delta_{2k}}{1-(1+\sqrt{2})\delta_{2k}}\sigma_{k}(x)_{2}$$

Instance-Optimal Guarantees

An “instance-optimal” guarantee means the quality of the recovery error bound adapts to the specific signal instance $x$ that you are measuring. Instead of a single worst-case error for all signals, the error bound is a function of the properties of that particular signal, such as how compressible it is (e.g., as measured by $\sigma_k(x)_1$ or $\sigma_k(x)_2$).

In the deterministic case, examples are Theorems 1.8, 1.9, and 1.10.

This concept is then combined with probabilistic guarantees in two different ways.

The Two “Flavors” of Probabilistic Guarantees

The distinction comes down to whether one “good” random matrix $A$ works for all signals, or if the guarantee only applies to a specific signal-matrix pair.

1. The Stronger Guarantee: A “Universal” Random Matrix

This is the typical approach.

  • The Process: You generate a random matrix $A$ once.
  • The Guarantee: With very high probability, this single matrix $A$ you generated is “good.” A “good” matrix is one that satisfies a deterministic guarantee (like the RIP-based theorems) for all possible signals $x$.

2. The Weaker Guarantee: “Instance-Optimal in Probability”

This is what Theorem 1.14 provides.

  • The Process: You are first given a specific signal instance $x$. Then, you generate a random matrix $A$ to measure it.
  • The Guarantee: With high probability, the recovery of that specific signal $x$ using that specific matrix $A$ will be successful. The guarantee is probabilistic and only applies to the instance you started with. It doesn’t promise anything about how that same matrix $A$ would perform on a different signal.

In summary, “instance-optimal in probability” is the weaker type of probabilistic guarantee where the success probability is tied to a specific signal instance, and you might need to draw a new random matrix for each new signal.

Non Instance-Optimal Guarantees

Examples are Corollaries 1.1, 1.2 and Theorems 1.11, 1.12.

These guarantees provide a single, uniform error bound that applies to the entire class of k-sparse signals. As long as $x$ is k-sparse, the error bound is the same regardless of which specific k-sparse signal it is. The bound depends on parameters like $m$, $n$, $k$, and the noise level, but not on the structure of $x$ beyond its k-sparsity.