250624 Compressed Sensing Chapter 1 note

Fundamental Concepts from Set Theory & Linear Algebra

  • Subset
    A set A is a subset of another set B if all elements of A are also elements of B. This is denoted as $A \subset B$.

    • Example from the text: The chapter discusses selecting a portion of a signal’s coefficients. If you have a signal with n coefficients, the set of all coefficient indices is $\{1, 2, ..., n\}$. An index set $\Lambda$ that points to the locations of the k-largest entries is a subset of the full set of indices, so $\Lambda \subset \{1, 2, ..., n\}$.
  • Vector Space
    A vector space is a collection of objects called vectors, which can be added together and multiplied by scalars (numbers). The key idea is that if you take any two vectors from the space and add them, you get another vector that is still within that space. Similarly, scaling a vector keeps it within the space. The text models signals as vectors in a vector space.

  • Span
    The span of a set of vectors is the set of all possible vectors you can create by taking linear combinations of them. In other words, if you have vectors $v_1, v_2, ..., v_k$, their span is the set of all vectors $w$ that can be written as $w = c_1v_1 + c_2v_2 + ... + c_kv_k$ for some scalar coefficients $c_i$.

  • Linear Independence
    A set of vectors is linearly independent if no vector in the set can be written as a linear combination of the others. A more formal way to say this is that the only way to get the zero vector from their linear combination ($c_1v_1 + c_2v_2 + ... + c_kv_k = 0$) is if all the scalar coefficients ($c_1, c_2, ...$) are zero.

  • Subspace
    A subspace is a vector space that is contained within another, larger vector space. To be a subspace, a set of vectors must satisfy three rules:

    1. It must contain the zero vector.
    2. It must be “closed under addition”: if you add any two vectors from the set, their sum is also in the set.
    3. It must be “closed under scalar multiplication”: if you multiply any vector in the set by a scalar, the result is also in the set.
    • Example from the text: The set of all 2-sparse signals in $\mathbb{R}^3$ is not a single subspace, because adding two 2-sparse signals can result in a 3-sparse signal. Instead, it’s a union of subspaces, where each subspace is a plane defined by two coordinate axes (like the x-y plane, x-z plane, etc.).
  • Null Space
    The null space of a matrix $A$, denoted $\mathcal{N}(A)$, is the set of all vectors $z$ that give the zero vector when multiplied by $A$:

    $$\mathcal{N}(A) = \{z : Az = 0\}$$

    • Relevance in CS: In compressed sensing, if you have two different sparse signals, $x$ and $x'$, that produce the same measurement $y$ (i.e., $Ax = Ax' = y$), then their difference, $h = x - x'$, must be in the null space of $A$ because $A(x-x')=0$. To guarantee unique recovery, the null space of $A$ should not contain any sparse vectors (other than the zero vector).
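
As a quick numerical illustration (my own sketch, not from the text; the matrix below is an arbitrary assumption), the following checks that the difference of two signals with identical measurements lies in $\mathcal{N}(A)$:

```python
import numpy as np

# A hypothetical 3x5 sensing matrix (more columns than rows, so N(A) is non-trivial)
A = np.array([[1., 0., 1., 0., 1.],
              [0., 1., 1., 1., 0.],
              [1., 1., 0., 1., 1.]])

x = np.array([1., 0., 0., 0., 0.])  # a 1-sparse signal

# Take a vector from the null space of A (last right singular vector of this rank-3 A)
_, _, Vt = np.linalg.svd(A)
z = Vt[-1]

x_prime = x + z  # a different signal with exactly the same measurements

print(np.allclose(A @ x, A @ x_prime))      # True: Ax = Ax'
print(np.allclose(A @ (x - x_prime), 0.0))  # True: h = x - x' lies in N(A)
```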

$\mathbb{R}^n$ (n-dimensional Euclidean Space)

  • Definition: The symbol $\mathbb{R}^n$ refers to the n-dimensional Euclidean space. It is the set of all vectors that consist of n real-number elements.
  • Breakdown of the Notation:
    • The $\mathbb{R}$ stands for the set of all real numbers.
    • The superscript n indicates the dimension, meaning each vector in this set is an ordered list of n real numbers, like $[x_1, x_2, ..., x_n]$.
  • Context from the Text: The chapter states that signals can be viewed as vectors in an n-dimensional Euclidean space, which is denoted by $\mathbb{R}^n$. For example, $\mathbb{R}^2$ is the 2D plane, and $\mathbb{R}^3$ is 3D space.

Basis

  • Definition: A set of vectors $\{\phi_{i}\}_{i=1}^{n}$ is called a basis for $\mathbb{R}^n$ if the vectors in the set span $\mathbb{R}^n$ and are linearly independent. This means two things:
    1. Span the space: The vectors in the set can be scaled and added in some combination to create any vector in the entire space.
    2. Linearly Independent: There are no redundant vectors in the set. No vector in the basis can be created from a combination of the others.
  • Analogy: A basis is like a set of fundamental directions for a space. For the 2D plane ($\mathbb{R}^2$), the standard basis is the pair of vectors [1, 0] (the x-direction) and [0, 1] (the y-direction). You can get to any point on the plane by moving some amount in the x-direction and some amount in the y-direction.

Dimension

  • Definition: The dimension of a vector space is the number of vectors in any of its bases.

  • Key Property: For any given vector space, although there can be many different sets of basis vectors, every basis for that space will have the exact same number of vectors. This unique number defines the dimension of the space.

    • For example, any basis for the 2D plane ($\mathbb{R}^2$) must have exactly two vectors, so its dimension is 2.
    • A line passing through the origin is a 1-dimensional subspace because its basis consists of only one vector.
  • Rank
    The rank of a matrix is the maximum number of linearly independent columns (or, equivalently, rows) in the matrix. It represents the dimension of the subspace spanned by its columns.

    • Relevance in CS: The concept of rank is central to the low-rank matrix model. Just as a sparse vector has few non-zero entries, a low-rank matrix has “few” linearly independent columns/rows, meaning it contains less information than its ambient dimensions suggest. This structure is another type of simplicity that can be exploited for recovery from limited data.

Affine space

An affine space is essentially a vector subspace that has been shifted so that it does not necessarily pass through the origin.

Here’s a breakdown:

  • Vector Subspace: In $\mathbb{R}^2$, a one-dimensional vector subspace is a straight line that must pass through the origin (the point (0,0)).
  • Affine Space: An affine space is a generalization. It can be a point, a line, or a plane that does not need to go through the origin. It’s created by taking a vector subspace and adding a fixed vector to every point in it, effectively “translating” it.

The “one-dimensional affine space A” is illustrated in Figure 1.2 as a straight line that does not pass through the origin. The goal is to find the point on this line that is closest to the signal vector $x$.


A bit more math used in Compressed Sensing

Vector Space Concepts

$l_p$ Norms

For a vector $x \in \mathbb{R}^n$, the $l_p$ norm is defined as:

$$||x||_p = \begin{cases} \left(\sum_{i=1}^{n}|x_i|^p\right)^{\frac{1}{p}}, & p \in [1, \infty) \\ \max_{i=1,2,...,n}|x_i|, & p=\infty \end{cases}$$

where:

  • $x_i$ represents the i-th element of the vector $x$.

$l_0$ “Norm”

The $l_0$ “norm” counts the number of non-zero elements in a vector:

$$||x||_0 := |\text{supp}(x)|$$

where:

  • $\text{supp}(x)$ is the “support” of the vector $x$, which is the set of indices of its non-zero elements, i.e., $\{i : x_i \ne 0\}$.
  • $|\cdot|$ denotes the cardinality (the number of elements) of the set.
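
A small numerical sketch (using numpy, not from the chapter) of how these norms can be computed:

```python
import numpy as np

x = np.array([3.0, 0.0, -4.0, 0.0, 1.0])

# l_p norms for p in [1, inf)
l1 = np.sum(np.abs(x))               # ||x||_1 = 8.0
l2 = np.sqrt(np.sum(np.abs(x)**2))   # ||x||_2 ~= 5.1
linf = np.max(np.abs(x))             # ||x||_inf = 4.0

# l_0 "norm": the size of the support {i : x_i != 0}
l0 = np.count_nonzero(x)             # 3

print(l1, l2, linf, l0)
```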

Inner Product

The standard inner product in $\mathbb{R}^n$ is defined as:

$$\langle x,z \rangle = z^T x = \sum_{i=1}^{n} x_i z_i$$

where:

  • $z^T$ denotes the transpose of the vector $z$.

Basis and Frames

  • Basis: A set of vectors $\{\phi_i\}_{i=1}^{n}$ is a basis for $\mathbb{R}^n$ if they span the space and are linearly independent. Any vector $x$ has a unique representation $x = \sum_{i=1}^{n} c_i \phi_i$. In matrix form, this is $x = \Phi c$, where:
    • $\Phi$ is the $n \times n$ matrix whose columns are the basis vectors $\phi_i$.
    • $c$ is the vector containing the coefficients $c_i$.
  • Orthonormal Basis: A basis is orthonormal if $\langle \phi_i, \phi_j \rangle = \begin{cases} 1, & i=j \\ 0, & i \ne j \end{cases}$.
  • Frame: A set of vectors $\{\phi_i\}_{i=1}^{n}$ in $\mathbb{R}^d$ ($d<n$) is a frame if for any vector $x \in \mathbb{R}^d$, the following holds for constants $0 < A \le B < \infty$:

    $$A||x||_2^2 \le ||\Phi^T x||_2^2 \le B||x||_2^2$$

    • $A$ and $B$ are called the frame bounds. If $A=B$, the frame is tight.
  • Dual Frame: A frame $\tilde{\Phi}$ is a dual frame of $\Phi$ if $\Phi\tilde{\Phi}^T = \tilde{\Phi}\Phi^T = I$, where $I$ is the identity matrix. The canonical dual frame is given by $\tilde{\Phi} = (\Phi\Phi^T)^{-1}\Phi$, where $(\cdot)^{-1}$ denotes the matrix inverse.
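
A brief numpy sketch (the frame below is an arbitrary assumption, not an example from the text) illustrating the canonical dual frame and the identity $\tilde{\Phi}\Phi^T = I$:

```python
import numpy as np

# A hypothetical frame of n = 3 vectors in R^d with d = 2 (the columns of Phi)
Phi = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0]])

# Canonical dual frame: (Phi Phi^T)^{-1} Phi
Phi_dual = np.linalg.inv(Phi @ Phi.T) @ Phi

# Dual-frame identity: Phi @ Phi_dual^T equals the d x d identity
print(np.allclose(Phi @ Phi_dual.T, np.eye(2)))  # True

# Any x in R^2 is recovered from its frame coefficients Phi^T x via the dual frame
x = np.array([2.0, -1.0])
coeffs = Phi.T @ x
print(np.allclose(Phi_dual @ coeffs, x))         # True
```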

Signal Models

Sparse and Compressible Signals

  • k-sparse signal: A signal $x$ is k-sparse if it has at most $k$ non-zero entries, i.e., $||x||_0 \le k$. The set of all k-sparse signals is denoted by $\Sigma_k = \{x: ||x||_0 \le k\}$.
  • Compressible signal: A signal is compressible if it can be well-approximated by a sparse signal. The approximation error is given by:

    $$\sigma_k(x)_p = \min_{\hat{x} \in \Sigma_k} ||x-\hat{x}||_p$$

    • Here, $\hat{x}$ is the sparse approximation of the signal $x$ from the set $\Sigma_k$ (a numerical sketch of the best k-term approximation follows this list).
  • Power Law Decay: A signal is compressible if its sorted coefficients $|c_i|$ decay according to a power law:

    $$|c_i| \le C_1 i^{-q}$$

    • $C_1$ and $q$ are positive constants.
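
A minimal numpy sketch (my own illustration, not from the text) of the best k-term approximation achieving $\sigma_k(x)_p$: keep the k largest-magnitude entries and zero out the rest.

```python
import numpy as np

def best_k_term(x, k):
    """Return the best k-sparse approximation of x (keep the k largest-magnitude entries)."""
    x_hat = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]   # indices of the k largest |x_i|
    x_hat[idx] = x[idx]
    return x_hat

x = np.array([5.0, -0.1, 2.0, 0.05, -0.3])
x_hat = best_k_term(x, k=2)

# sigma_k(x)_2: the l2 error of the best k-term approximation
sigma_k = np.linalg.norm(x - x_hat, ord=2)
print(x_hat, sigma_k)
```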

Union of Subspaces and Low-Rank Models

  • Union of Subspaces: A signal $x$ lies in a union of $M$ subspaces if:

    $$x \in \mathcal{U} = \bigcup_{i=1}^{M} \mathcal{U}_i$$

    • $\mathcal{U}_i$ are the individual subspaces, and $M$ is the total number of subspaces in the union.
  • Low-Rank Matrix: The set of low-rank matrices is defined as:

    $$\mathcal{L}_r = \{M \in \mathbb{R}^{n_1 \times n_2} : \text{rank}(M) \le r\}$$

    • $M$ is a matrix of size $n_1 \times n_2$.
    • $r$ is the maximum rank of the matrices in the set.
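
As an illustration (my own, not from the chapter), the rank of a matrix can be checked numerically, and the best rank-r approximation is obtained by truncating its SVD:

```python
import numpy as np

# A hypothetical 4x5 matrix built to have rank 2
u = np.array([[1.0], [2.0], [0.0], [1.0]])
v = np.array([[1.0, 0.0, 1.0, -1.0, 2.0]])
w = np.array([[0.0], [1.0], [1.0], [3.0]])
z = np.array([[2.0, 1.0, 0.0, 0.0, 1.0]])
M = u @ v + w @ z

print(np.linalg.matrix_rank(M))  # 2

# Best rank-r approximation via the truncated SVD
r = 1
U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
print(np.linalg.norm(M - M_r))   # approximation error (Frobenius norm)
```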

Signal Recovery

$l_1$ Minimization

This involves solving one of the following optimization problems:

  • Noise-free case (Basis Pursuit):

    $$\hat{x} = \arg\min_{z} ||z||_1 \quad \text{subject to} \quad Az=y$$

  • Bounded noise case:

    $$\hat{x} = \arg\min_{z} ||z||_1 \quad \text{subject to} \quad ||Az-y||_2 \le \epsilon$$

  • Dantzig Selector:

    $$\hat{x} = \arg\min_{z} ||z||_1 \quad \text{subject to} \quad ||A^T(Az-y)||_\infty \le \lambda$$

  • LASSO (unconstrained form):

    $$\hat{x} = \arg\min_z \frac{1}{2} ||Az-y||_2^2 + \lambda||z||_1$$

In these formulas:

  • $\hat{x}$ is the estimated signal.
  • $\arg\min_{z}$ finds the vector $z$ that minimizes the given objective function.
  • $\epsilon$ is a constant representing the upper bound on the norm of the noise.
  • $\lambda$ is a regularization parameter that balances the trade-off between the data fidelity term ($||Az-y||_2^2$) and the sparsity term ($||z||_1$).
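
Many solvers exist for these problems. As one simple illustration (my own sketch, not an algorithm from the chapter), the unconstrained LASSO form can be approached with iterative soft-thresholding (ISTA):

```python
import numpy as np

def soft_threshold(v, t):
    """Entrywise soft-thresholding: the proximal operator of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam, n_iter=500):
    """Minimize 0.5*||Az - y||_2^2 + lam*||z||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ z - y)
        z = soft_threshold(z - grad / L, lam / L)
    return z

# Toy example with an assumed random Gaussian sensing matrix
rng = np.random.default_rng(0)
n, m, k = 50, 20, 3
A = rng.standard_normal((m, n)) / np.sqrt(m)
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x

x_hat = ista(A, y, lam=0.01)
print(np.linalg.norm(x_hat - x))  # recovery error
```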

Recovery Guarantees

  • Theorem 1.8 (Noise-Free Recovery): If $A$ satisfies the RIP of order 2k with $\delta_{2k} < \sqrt{2}-1$, the solution $\hat{x}$ to the noise-free $l_1$ minimization problem obeys:

    $$||\hat{x}-x||_2 \le C_0 \frac{\sigma_k(x)_1}{\sqrt{k}}$$

  • Theorem 1.9 (Bounded Noise): Under similar RIP conditions and with noise $e$ where $||e||_2 \le \epsilon$, the solution obeys:

    $$||\hat{x}-x||_2 \le C_0 \frac{\sigma_k(x)_1}{\sqrt{k}} + C_2 \epsilon$$

In these theorems:

  • $C_0$ and $C_2$ are constants that depend on the RIP constant $\delta_{2k}$.
  • $e$ represents a noise vector corrupting the measurements.

Multiple Measurement Vector (MMV) Problem

This problem involves recovering a set of $l$ sparse vectors $\{x_i\}_{i=1}^l$ that share a common support. These vectors form the columns of a matrix $X$ which is row-sparse.

  • Uniqueness Condition (Theorem 1.15): A row-sparse matrix $X$ is uniquely determined by $Y=AX$ if:

    $$|\text{supp}(X)| < \frac{\text{spark}(A)-1+\text{rank}(X)}{2}$$

    • $Y$ is the $m \times l$ matrix of measurements, with columns being the measurements $y_i$ for each signal $x_i$.
  • $l_{p,q}$ Norms: Used for matrix recovery, defined as $||X||_{p,q} = (\sum_i ||x^i||_p^q)^{1/q}$, where $x^i$ is the i-th row of the matrix $X$.
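
A small numpy sketch (my own illustration) of the $l_{p,q}$ mixed norm for a row-sparse matrix, taking $p=2$, $q=1$ as an example:

```python
import numpy as np

def lpq_norm(X, p=2, q=1):
    """Mixed l_{p,q} norm: l_p norm of each row, then the l_q norm of those values."""
    row_norms = np.linalg.norm(X, ord=p, axis=1)
    return np.linalg.norm(row_norms, ord=q)

# A row-sparse matrix: only rows 0 and 3 are non-zero (common support across columns)
X = np.zeros((5, 3))
X[0] = [1.0, -2.0, 0.5]
X[3] = [0.0, 4.0, 1.0]

print(lpq_norm(X, p=2, q=1))  # the l_{2,1} norm, which promotes row sparsity
```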

Understand matrix properties in compressed sensing

The chapter introduces several properties of the sensing matrix $A$, each with different strengths, guarantees, and requirements. Organizing them is the best way to see the relationships between them.

Here is an organization of the sensing matrix properties discussed in the chapter, from the most fundamental to the most powerful.

1. Spark

This is the most fundamental condition for guaranteeing a unique sparse solution.

Definition

The spark of a matrix $A$, denoted $\text{spark}(A)$, is the smallest number of columns of $A$ that are linearly dependent.

I find the definition on Wikipedia more precise:
https://en.wikipedia.org/wiki/Spark_(mathematics)
The spark of an $m \times n$ matrix $A$ is the smallest integer $k$ such that there exists a set of $k$ columns of $A$ that are linearly dependent.

What it Guarantees

Theorem 1.1: For any vector $y \in \mathbb{R}^m$, there is at most one signal $x \in \Sigma_k$ such that $y=Ax$ if and only if $\text{spark}(A) > 2k$.

It provides the definitive “if and only if” condition for the unique recovery of exactly k-sparse signals in a noise-free setting.

Key Requirement

  • For unique recovery of any k-sparse signal, the condition is $\text{spark}(A) > 2k$.

  • Requirement on $m$: It is easy to see that $\text{spark}(A) \in [2, m+1]$. The condition $\text{spark}(A) > 2k$ therefore immediately implies a lower bound on the number of measurements: $m \ge 2k$.

Practicality

Calculating the spark for a general matrix is NP-hard and computationally infeasible for large matrices. (https://en.wikipedia.org/wiki/Spark_(mathematics))
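
For very small matrices the spark can still be found by brute force over column subsets, which is only meant to illustrate the definition; the cost grows combinatorially, which is exactly why this is infeasible in general. A sketch (my own, not from the text):

```python
import numpy as np
from itertools import combinations

def spark(A, tol=1e-10):
    """Brute-force spark: the smallest number of linearly dependent columns of A."""
    m, n = A.shape
    for size in range(1, n + 1):
        for cols in combinations(range(n), size):
            sub = A[:, cols]
            if np.linalg.matrix_rank(sub, tol=tol) < size:  # these columns are dependent
                return size
    return n + 1  # convention when all columns are independent (only possible if n <= m)

A = np.array([[1.0, 0.0, 1.0,  1.0],
              [0.0, 1.0, 1.0, -1.0]])
print(spark(A))  # 3: no single column or pair is dependent, but any 3 columns in R^2 are
```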

2. Null Space Property (NSP)

The NSP is a more refined condition on the null space of $A$ that is necessary for robust recovery.

Definition

A matrix $A$ has the Null Space Property of order $k$ if there exists a constant $C>0$ such that the following holds for all $h \in \mathcal{N}(A)$ and all index sets $\Lambda$ with $|\Lambda| \le k$:

$$||h_\Lambda||_2 \le C \frac{||h_{\Lambda^c}||_1}{\sqrt{k}}$$

  • $\mathcal{N}(A)$ is the null space of matrix $A$, which is the set of all vectors $z$ for which $Az=0$.
  • $\Lambda$ is a subset of indices $\{1, 2, ..., n\}$.
  • $\Lambda^c$ is the complement of $\Lambda$.
  • $h_\Lambda$ is a vector formed by keeping the entries of $h$ indexed by $\Lambda$ and setting all other entries to zero.
  • $h_{\Lambda^c}$ is a vector formed by keeping the entries of $h$ indexed by $\Lambda^c$ and setting all other entries to zero.

In other words, a matrix $A$ satisfying the NSP of order $k$ means that vectors in its null space are not overly concentrated on a small set of indices.

What it Guarantees

The NSP is presented as a necessary condition for any stable recovery algorithm.

Theorem 1.2: Let $A : \mathbb{R}^n \rightarrow \mathbb{R}^m$ denote a sensing matrix and $\Delta : \mathbb{R}^m \rightarrow \mathbb{R}^n$ denote an arbitrary recovery algorithm. If the pair $(A, \Delta)$ satisfies

$$\left\| \Delta(Ax)-x \right\|_{2} \le C \frac{\sigma_{k}(x)_{1}}{\sqrt{k}}$$

for all $x$, then $A$ satisfies the NSP of order $2k$.

It shows that if any algorithm can robustly recover signals from compressive measurements, then the matrix $A$ must satisfy the NSP of order $2k$.

Practicality

Like the spark, it is generally hard to verify directly for a given matrix.

3. Restricted Isometry Property (RIP)

The RIP is the most powerful property discussed, forming the basis for many of the strongest theoretical guarantees in compressed sensing.

Definition

A matrix $A$ satisfies the RIP of order $k$ if there exists a $\delta_k \in (0,1)$ such that for all k-sparse signals $x \in \Sigma_k$:

$$(1-\delta_k)||x||_2^2 \le ||Ax||_2^2 \le (1+\delta_k)||x||_2^2$$

  • $\delta_k$ is the “restricted isometry constant” of order $k$.

In other words, a matrix $A$ satisfies the RIP of order $k$ if it approximately preserves the Euclidean length of all k-sparse vectors.
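
Verifying the RIP exactly is infeasible, but the near-isometry it describes can be observed empirically. The sketch below (my own illustration, with an assumed random Gaussian matrix; it is an empirical check over sampled vectors, not a proof of the RIP) measures $||Ax||_2^2 / ||x||_2^2$ over many random k-sparse vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 200, 80, 5

# Random Gaussian sensing matrix, scaled so that E[||Ax||_2^2] = ||x||_2^2
A = rng.standard_normal((m, n)) / np.sqrt(m)

ratios = []
for _ in range(2000):
    x = np.zeros(n)
    support = rng.choice(n, k, replace=False)
    x[support] = rng.standard_normal(k)
    ratios.append(np.linalg.norm(A @ x) ** 2 / np.linalg.norm(x) ** 2)

# The spread of these ratios around 1 hints at how small delta_k could be
print(min(ratios), max(ratios))
```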

What it Guarantees

The RIP is a sufficient condition for many different algorithms (like $l_1$ minimization) to provide stable and robust recovery of sparse signals, even in the presence of noise.

Key Requirements & Properties

1. Necessity for Stability

C-stable definition: Let $A : \mathbb{R}^n \rightarrow \mathbb{R}^m$ denote a sensing matrix and $\Delta : \mathbb{R}^m \rightarrow \mathbb{R}^n$ denote a recovery algorithm. We say that the pair $(A, \Delta)$ is C-stable if for any $x \in \Sigma_{k}$ and any $e \in \mathbb{R}^{m}$ we have that

$$\left\| \Delta(Ax+e)-x \right\|_{2} \le C \left\| e \right\|_{2}$$

This definition simply says that if we add a small amount of noise to the measurements, then the impact of this on the recovered signal should not be arbitrarily large.

Theorem 1.3: If a pair $(A, \Delta)$ is C-stable, then

$$\frac{1}{C} \left\| x \right\|_{2} \le \left\| Ax \right\|_{2}$$

for all $x \in \Sigma_{2k}$.

Theorem 1.3 shows that the RIP’s lower bound is necessary for any recovery algorithm to be stable against measurement noise.

2. Lower Bound on mm

Theorem 1.4: Let $A$ be an $m \times n$ matrix that satisfies the RIP of order $2k$ with constant $\delta_{2k} \in \left(0, \frac{1}{2}\right]$. Then

$$m \ge Ck\log \left(\frac{n}{k}\right)$$

where $C$ is a constant depending only on $\delta_{2k}$.

The restriction to $\delta_{2k} \le 1/2$ is arbitrary and is made merely for convenience; minor modifications to the argument establish bounds for $\delta_{2k} \le \delta_{\max}$ for any $\delta_{\max} < 1$.

Theorem 1.4 provides a precise lower bound, $m \ge Ck \log(n/k)$ (i.e., $m = \Omega(k \log(n/k))$), that is necessary to satisfy the RIP; it tells us the fundamental minimum number of measurements needed.

3. Relationship to NSP

Theorem 1.5: Suppose that $A$ satisfies the RIP of order $2k$ with $\delta_{2k} < \sqrt{2} - 1$. Then $A$ satisfies the NSP of order $2k$ with constant

$$C = \frac{2}{1 - (1 + \sqrt{2})\delta_{2k}}$$

RIP is a stronger condition than NSP. Theorem 1.5 shows that if a matrix satisfies the RIP (with a sufficiently small constant), it automatically satisfies the NSP.

Practicality

It is computationally infeasible to verify the RIP for large matrices. However, its main power comes from the fact that random matrices can be proven to satisfy the RIP with very high probability when $m$ is on the order of $k \log(n/k)$.

4. Coherence

Coherence is the most practical and easily computable property, though it often leads to stricter requirements.

Definition

The coherence $\mu(A)$ of a matrix $A$ is the largest absolute inner product between any two distinct normalized columns $a_i, a_j$:

$$\mu(A) = \max_{1 \le i < j \le n} \frac{|\langle a_i, a_j \rangle|}{||a_i||_2 ||a_j||_2}$$

  • $a_i$ and $a_j$ are columns of the matrix $A$.

What it Guarantees

It provides simple, checkable conditions for unique recovery and stability.

Key Requirements & Properties

1. Uniqueness Condition

Theorem 1.7: If

$$k < \frac{1}{2} \left( 1 + \frac{1}{\mu (A)} \right)$$

then for each measurement vector $y \in \mathbb{R}^m$ there exists at most one signal $x \in \Sigma_{k}$ such that $y = Ax$.

This uses coherence to provide a sufficient condition for uniqueness. It often requires $k$ to be much smaller than what RIP would allow.

It is possible to show that the coherence of a matrix is always in the range $\mu(A) \in \left[\sqrt{\frac{n-m}{m(n-1)}}, 1 \right]$; the lower bound is known as the Welch bound.

Theorem 1.7, together with the Welch bound, provides an upper bound on the level of sparsity $k$ that guarantees uniqueness using coherence: $k = O(\sqrt{m})$.
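
A short numpy sketch (my own illustration) that computes the coherence of a matrix and compares it to the Welch bound:

```python
import numpy as np

def coherence(A):
    """Largest absolute inner product between distinct normalized columns of A."""
    A_norm = A / np.linalg.norm(A, axis=0)   # normalize each column
    G = np.abs(A_norm.T @ A_norm)            # Gram matrix of normalized columns
    np.fill_diagonal(G, 0.0)                 # ignore the diagonal (i = j)
    return G.max()

rng = np.random.default_rng(0)
m, n = 20, 60
A = rng.standard_normal((m, n))

welch = np.sqrt((n - m) / (m * (n - 1)))     # Welch lower bound on mu(A)
print(coherence(A), welch)
```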

2. Relationship to Spark

Lemma 1.4: For any matrix $A$,

$$\text{spark}(A) \ge 1 + \frac{1}{\mu (A)}$$

Coherence provides a lower bound on the spark.

3. Relationship to RIP

Lemma 1.5: If $A$ has unit-norm columns and coherence $\mu = \mu(A)$, then $A$ satisfies the RIP of order $k$ with $\delta_{k} = (k-1)\mu$ for all $k < 1/\mu$.

Coherence can be used to establish a (sometimes loose) RIP constant for a matrix.

Practicality

Coherence is the only property listed that is easy to compute for any given matrix.

Summary Table

| Property | Definition | What it Guarantees | Key Requirement(s) on $k$ or $m$ | Practicality |
| --- | --- | --- | --- | --- |
| Spark | Smallest # of linearly dependent columns | Necessary and sufficient condition: uniqueness for k-sparse signals | $\text{spark}(A) > 2k$ (implies $m \ge 2k$) | Hard to compute |
| NSP | Null space vectors are not “sparse” | Necessary for any robust/stable recovery | Must hold for order $2k$ | Hard to verify directly |
| RIP | Near-isometry for sparse vectors | Sufficient for stable/robust recovery for many algorithms | $m = O(k \log(n/k))$ | Hard to verify, but some random matrices satisfy it with high probability |
| Coherence ($\mu$) | Max inner product between columns | Uniqueness and stability, but with stricter conditions | $k$ must be small, typically $O(1/\mu)$ | Easy to compute |

Understand different error bounds of signal recovery via l1 minimization

Here is an organization of the signal recovery theorems for $l_1$ minimization, grouped by the main assumption made about the sensing matrix $A$.

Group 1: Guarantees Based on the Restricted Isometry Property (RIP)

These theorems rely on the Restricted Isometry Property (RIP), a strong theoretical condition ensuring that the matrix A nearly preserves the length of sparse vectors.

1.1 The Ideal Case: Noise-Free Recovery

This is the baseline scenario where measurements are perfect.

  • Theorem: Theorem 1.8
  • Conditions: $A$ satisfies the RIP of order 2k, and the measurements are exact ($y=Ax$).
  • Algorithm: Basis Pursuit ($\min ||z||_1$ subject to $Az=y$).
  • Error Bound: The recovery error is bounded by how “compressible” the original signal x is:

    $$||\hat{x}-x||_{2}\le C_{0}\frac{\sigma_{k}(x)_{1}}{\sqrt{k}}$$

    • If the signal x is truly k-sparse, then its approximation error $\sigma_k(x)_1$ is zero, and the recovery is exact ($||\hat{x}-x||_2 = 0$).

1.2 The Realistic Case: Bounded Noise

This scenario assumes the measurement noise is contained within a known energy level.

  • Theorem: Theorem 1.9
  • Conditions: $A$ satisfies RIP. The measurement noise e is bounded by $||e||_2 \le \epsilon$.
  • Algorithm: A quadratically constrained version of Basis Pursuit ($\min ||z||_1$ subject to $||Az-y||_2 \le \epsilon$).
  • Error Bound: The error has two parts: one from the signal’s compressibility and one from the noise:

    $$||\hat{x}-x||_{2}\le C_{0}\frac{\sigma_{k}(x)_{1}}{\sqrt{k}}+C_{2}\epsilon$$

1.3 A Specific Noise Model: Gaussian Noise

This is a very common scenario in practice. There are two main approaches.

  • Theorem: Corollary 1.1 (derived from Theorem 1.9)

    • Conditions: $A$ satisfies RIP, and the noise is i.i.d. Gaussian.
    • Guarantee: The error is proportional to the square root of the number of measurements m and to the noise level $\sigma$: $||\hat{x}-x||_2 \propto \sqrt{m}\sigma$.
  • Theorem: Corollary 1.2 (derived from the Dantzig Selector, Theorem 1.10)

    • Conditions: $A$ has unit-norm columns, satisfies RIP, and the noise is i.i.d. Gaussian.
    • Guarantee: The error is proportional to $\sqrt{k \log n}$ and to the noise level $\sigma$: $||\hat{x}-x||_2 \propto \sqrt{k \log n}\sigma$. This is often a much tighter and more useful bound, as $k \log n$ is typically much smaller than $m$.

Group 2: Guarantees Based on Coherence

These theorems rely on coherence ($\mu$), which measures the maximum correlation between any two columns of $A$. Coherence is easier to calculate than RIP but often leads to stricter conditions on the signal’s sparsity $k$.

  • Theorem: Theorem 1.11

    • Conditions: $A$ has a certain coherence $\mu$, and the sparsity k must be small relative to it ($k < (1/\mu+1)/4$).
    • Error Bound: Provides a worst-case error bound for bounded noise:

      $$||x-\hat{x}||_{2}\le\frac{||e||_{2}+\epsilon}{\sqrt{1-\mu(4k-1)}}$$

  • Theorem: Theorem 1.12 (using the LASSO algorithm)

    • Conditions: $A$ has coherence $\mu$, the sparsity k is small ($k \le 1/(3\mu)$), and the noise is Gaussian.
    • Error Bound: In addition to an $l_2$ error bound, this provides a powerful guarantee on recovering the correct sparse structure: with high probability, the recovered non-zero entries are a subset of the true non-zero entries ($\text{supp}(\hat{x}) \subset \text{supp}(x)$).

Group 3: Theorems on the Nature of the Error Bounds

This final group explains why the error bounds from the RIP-based theorems look the way they do.

  • The “Pessimistic” Result:

    • Theorem: Theorem 1.13
    • What it says: It is impossible for any recovery algorithm to provide a “simple” error bound of the form $||\hat{x}-x||_2 \le C\sigma_k(x)_2$ unless you take a huge number of measurements ($m \approx n$). This explains why the theorems use the more complicated $\sigma_k(x)_1/\sqrt{k}$ term.
  • The “Optimistic” Result (with Randomness):

    • Theorem: Theorem 1.14
    • What it says: This result shows a way around the limitation of Theorem 1.13. If you use a randomized sensing matrix $A$, you can achieve the “simple” error bound $||\hat{x}-x||_2 \le C\sigma_k(x)_2$ with high probability, while still using very few measurements. This highlights a key advantage of using random matrices in compressed sensing.

Deterministic or Probabilistic Guarantees

Key Recovery Formulations

The following theorems provide guarantees for solutions to sparse recovery problems. These problems are typically formulated as optimizations, with the goal of finding a signal $x$ that is both sparse and consistent with the measurements $y$. The chapter focuses on the following formulations.

  • The Ideal $l_0$ Minimization (Computationally Intractable)
    This is the direct statement of finding the sparsest solution, but it is NP-hard and cannot be solved efficiently for large problems.

    $$\hat{x} = \arg\min_{z} ||z||_0 \quad \text{subject to} \quad z\in\mathcal{B}(y) \quad \text{(1.10)}$$

  • The Constrained $l_1$ Minimization (Basis Pursuit-style)
    This is the practical, convex relaxation of the $l_0$ problem and is the focus of most of the theorems. The constraint set $\mathcal{B}(y)$ changes based on the assumptions about noise.

    $$\hat{x} = \arg\min_{z} ||z||_1 \quad \text{subject to} \quad z\in\mathcal{B}(y) \quad \text{(1.12)}$$

  • The Unconstrained $l_1$ Minimization (LASSO)
    This is an alternative, unconstrained formulation that is equivalent to the constrained version for a specific choice of the regularization parameter $\lambda$.

    $$\hat{x} = \arg\min_z \frac{1}{2} ||Az-y||_2^2 + \lambda ||z||_1 \quad \text{(1.15)}$$

Note that for some choice of the parameter $\lambda$, the optimization problem (1.15) will yield the same result as the constrained version of the problem given by (1.12) with $\mathcal{B}(y)=\{z:||Az-y||_{2}\le\epsilon\}$.
However, in general the value of $\lambda$ which makes these problems equivalent is unknown a priori.


Types of Guarantees

The chapter presents two main flavors of guarantees:

  1. Deterministic Guarantees: These are absolute “if-then” statements. If you have a matrix $A$ with a specific property (like RIP), then a certain error bound is guaranteed to hold. The challenge is that deterministically constructing a matrix $A$ with these properties for large systems can be very difficult.
  2. Probabilistic Guarantees: These state that an error bound will hold with high probability, where the probability comes from the random choice of the matrix $A$ or the random nature of the noise. The most common approach in compressed sensing is to choose the matrix $A$ randomly. The guarantee then is that with very high probability, the chosen matrix will have the desired properties (like RIP) and the recovery will be successful.

Group 1: Guarantees Based on the Restricted Isometry Property (RIP)

These theorems rely on the RIP, a strong theoretical condition ensuring the matrix $A$ nearly preserves the length of sparse vectors.

1.1 The Ideal Case: Noise-Free Recovery

This is the baseline scenario where measurements are perfect.

Theorem 1.8

Suppose that $A$ satisfies the RIP of order $2k$ with $\delta_{2k}<\sqrt{2}-1$ and we obtain measurements of the form $y=Ax$. Then when $\mathcal{B}(y)=\{z: Az=y\}$, the solution $\hat{x}$ to (1.12) obeys

$$||\hat{x}-x||_{2}\le C_{0}\frac{\sigma_{k}(x)_{1}}{\sqrt{k}}$$

where

$$C_0 = 2 \frac{1 - (1 - \sqrt{2}) \delta_{2k}}{1 - (1 + \sqrt{2}) \delta_{2k}}$$

  • Type of Guarantee: Deterministic.
  • Conditions: $A$ satisfies the RIP of order $2k$, and the measurements are exact ($y=Ax$).

Error Bound

$$||\hat{x}-x||_{2}\le C_{0}\frac{\sigma_{k}(x)_{1}}{\sqrt{k}}$$

If the signal x is truly k-sparse, then its approximation error $\sigma_k(x)_1$ is zero, and the recovery is exact ($||\hat{x}-x||_2 = 0$).


1.2 The Realistic Case: Bounded Noise

This scenario assumes the measurement noise is contained within a known energy level.

Theorem 1.9

Suppose that $A$ satisfies the RIP of order $2k$ with $\delta_{2k}<\sqrt{2}-1$ and let $y=Ax+e$ where $||e||_{2}\le\epsilon$. Then when $\mathcal{B}(y)=\{z:||Az-y||_{2}\le\epsilon\}$, the solution $\hat{x}$ to (1.12) obeys

$$||\hat{x}-x||_{2}\le C_{0}\frac{\sigma_{k}(x)_{1}}{\sqrt{k}}+C_{2}\epsilon$$

where

$$C_0 = 2 \frac{1 - (1 - \sqrt{2}) \delta_{2k}}{1 - (1 + \sqrt{2}) \delta_{2k}}, \quad C_2 = 4 \frac{\sqrt{1 + \delta_{2k}}}{1 - (1 + \sqrt{2}) \delta_{2k}}$$

  • Type of Guarantee: Deterministic.
  • Conditions: $A$ satisfies RIP. The measurement noise $e$ is bounded by $||e||_2 \le \epsilon$.

Error Bound

The error has two parts: one from the signal’s compressibility and one from the noise.

$$||\hat{x}-x||_{2}\le C_{0}\frac{\sigma_{k}(x)_{1}}{\sqrt{k}}+C_{2}\epsilon$$


Theorem 1.10

Suppose that $A$ satisfies the RIP of order $2k$ with $\delta_{2k}<\sqrt{2}-1$ and we obtain measurements of the form $y=Ax+e$ where $||A^{T}e||_{\infty}\le\lambda$. Then when $\mathcal{B}(y)=\{z:||A^{T}(Az-y)||_{\infty}\le\lambda\}$, the solution $\hat{x}$ to (1.12) obeys

$$||\hat{x}-x||_{2}\le C_{0}\frac{\sigma_{k}(x)_{1}}{\sqrt{k}}+C_{3}\sqrt{k}\lambda$$

where

$$C_0 = 2 \frac{1 - (1 - \sqrt{2}) \delta_{2k}}{1 - (1 + \sqrt{2}) \delta_{2k}}, \quad C_3 = \frac{4 \sqrt{2}}{1 - (1 + \sqrt{2}) \delta_{2k}}$$

  • Type of Guarantee: Deterministic.
  • Conditions: $A$ satisfies RIP. The noise is bounded such that $||A^T e||_\infty \le \lambda$.
  • Algorithm: The Dantzig Selector (Eq. 1.12 with $\mathcal{B}(y)=\{z:||A^{T}(Az-y)||_{\infty}\le\lambda\}$).
  • Error Bound: $||\hat{x}-x||_{2}\le C_{0}\frac{\sigma_{k}(x)_{1}}{\sqrt{k}}+C_{3}\sqrt{k}\lambda$

1.3 A Specific Noise Model: Gaussian Noise

This is a very common scenario in practice. There are two main approaches.

Corollary 1.1 (derived from Theorem 1.9)

Suppose that $A$ satisfies the RIP of order $2k$ with $\delta_{2k}<\sqrt{2}-1$. Furthermore, suppose that $x\in\Sigma_{k}$ and that we obtain measurements of the form $y=Ax+e$ where the entries of $e$ are i.i.d. $\mathcal{N}(0,\sigma^{2})$. Then when $\mathcal{B}(y)=\{z:||Az-y||_{2}\le 2\sqrt{m}\sigma\}$, the solution $\hat{x}$ to (1.12) obeys

$$||\hat{x}-x||_{2}\le8\frac{\sqrt{1+\delta_{2k}}}{1-(1+\sqrt{2})\delta_{2k}}\sqrt{m}\sigma$$

with probability at least $1-\exp(-c_{0}m)$, where $c_0 > 0$ is a constant.

  • Type of Guarantee: Probabilistic (over the random draw of the noise).
  • Conditions: $A$ satisfies RIP, the noise is i.i.d. Gaussian, and $x$ is k-sparse.

Error Bound

The error is proportional to the square root of the number of measurements $m$ and to the noise level $\sigma$:

$$||\hat{x}-x||_{2}\le8\frac{\sqrt{1+\delta_{2k}}}{1-(1+\sqrt{2})\delta_{2k}}\sqrt{m}\sigma$$


Corollary 1.2 (derived from the Dantzig Selector, Theorem 1.10)

Suppose that $A$ has unit-norm columns and satisfies the RIP of order $2k$ with $\delta_{2k}<\sqrt{2}-1$. Furthermore, suppose that $x\in\Sigma_{k}$ and that we obtain measurements of the form $y=Ax+e$ where the entries of $e$ are i.i.d. $\mathcal{N}(0,\sigma^{2})$. Then when $\mathcal{B}(y)=\{z:||A^{T}(Az-y)||_{\infty}\le 2\sqrt{\log n}\,\sigma\}$, the solution $\hat{x}$ to (1.12) obeys

$$||\hat{x}-x||_{2}\le 4\sqrt{2}\frac{\sqrt{1+\delta_{2k}}}{1-(1+\sqrt{2})\delta_{2k}}\sqrt{k \log n}\,\sigma$$

with probability at least $1-\frac{1}{n}$.

  • Type of Guarantee: Probabilistic (over the random draw of the noise).
  • Conditions: $A$ satisfies RIP, has unit-norm columns, the noise is i.i.d. Gaussian, and $x$ is k-sparse.

Error Bound

The error is proportional to the square root of the sparsity $k$, the square root of $\log n$, and the noise level $\sigma$: $||\hat{x}-x||_2 \propto \sqrt{k \log n}\,\sigma$. This can be a more useful bound, as $k \log n$ may be much smaller than $m$.

$$||\hat{x}-x||_{2}\le4\sqrt{2}\frac{\sqrt{1+\delta_{2k}}}{1-(1+\sqrt{2})\delta_{2k}}\sqrt{k \log n}\,\sigma$$


Group 2: Guarantees Based on Coherence

These theorems rely on coherence ($\mu$), which measures the maximum correlation between any two columns of $A$. Coherence is easier to calculate than RIP but often leads to stricter conditions on the signal’s sparsity $k$ (the signal must be sparser).

Theorem 1.11

Suppose that $A$ has coherence $\mu$ and that $x\in\Sigma_{k}$ with $k<(1/\mu+1)/4$. Furthermore, suppose that we obtain measurements of the form $y=Ax+e$. Then when $\mathcal{B}(y)=\{z:||Az-y||_{2}\le\epsilon\}$, the solution $\hat{x}$ to (1.12) obeys

$$||x-\hat{x}||_{2}\le\frac{||e||_{2}+\epsilon}{\sqrt{1-\mu(4k-1)}}$$

  • Type of Guarantee: Deterministic.
  • Conditions: $k$ is small ($k < (1/\mu+1)/4$), and $x$ is k-sparse.

Error Bound

$$||x-\hat{x}||_{2}\le\frac{||e||_{2}+\epsilon}{\sqrt{1-\mu(4k-1)}}$$

This provides a worst-case error bound for bounded noise, which will typically overestimate the actual error.


Theorem 1.12

Suppose that $A$ has coherence $\mu$ and that $x\in \Sigma_k$ with $k\le1/(3\mu)$. Furthermore, suppose that we obtain measurements of the form $y=Ax+e$ where the entries of $e$ are i.i.d. $\mathcal{N}(0,\sigma^{2})$. Set

$$\lambda = \sqrt{8\sigma^2(1+\alpha) \log (n-k)}$$

for some fairly small value $\alpha>0$. Then with probability exceeding

$$\left(1-\frac{1}{(n-k)^{\alpha}}\right)(1-\exp (-k/7))$$

the solution $\hat{x}$ to (1.15) is unique, $\text{supp}(\hat{x})\subset \text{supp}(x)$, and it satisfies the following error bound.

  • Type of Guarantee: Probabilistic (over the random draw of the noise).
  • Conditions: $k$ is small ($k \le 1/(3\mu)$), the noise is i.i.d. Gaussian, and $x$ is k-sparse.
  • Algorithm: LASSO (Eq. 1.15).

Error Bound

$$||\hat{x}-x||_{2}^{2} \le \left(\sqrt{3}+3\sqrt{2(1+\alpha)\log(n-k)}\right)^{2}k\sigma^{2}$$

In addition to an $l_2$ error bound, this provides a powerful guarantee on recovering the correct sparse structure: with high probability, the recovered non-zero entries are a subset of the true non-zero entries ($\text{supp}(\hat{x}) \subset \text{supp}(x)$).


Group 3: Theorems on the Nature of the Error Bounds

This final group explains why the error bounds from the RIP-based theorems look the way they do.

Theorem 1.13 (The “Pessimistic” Result)

Suppose that $A$ is an $m\times n$ matrix and that $\Delta:\mathbb{R}^{m}\rightarrow\mathbb{R}^{n}$ is a recovery algorithm that satisfies

$$||x-\Delta(Ax)||_{2}\le C\sigma_{k}(x)_{2}$$

for some $k\ge1$. Then $m > \left(1-\sqrt{1-1/C^{2}}\right)n$.

  • Type of Guarantee: Deterministic.
  • What it says: It is impossible for any algorithm with a deterministic matrix $A$ to satisfy the “simple” error bound $||\hat{x}-x||_{2}\le C\sigma_{k}(x)_{2}$ unless you take a huge number of measurements ($m \approx n$). This explains why the theorems use the $\sigma_k(x)_1/\sqrt{k}$ term.

Theorem 1.14 (The “Optimistic” Result with Randomness)

Let $x\in\mathbb{R}^{n}$ be fixed. Set $\delta_{2k}<\sqrt{2}-1$. Suppose that $A$ is an $m\times n$ sub-gaussian random matrix with $m=O(k \log (n/k)/\delta_{2k}^{2})$. Suppose we obtain measurements of the form $y=Ax$. Set $\epsilon=2\sigma_{k}(x)_{2}$. Then with probability exceeding $1-2\exp(-c_{1}\delta_{2k}^{2}m)-\exp(-c_{0}m)$, when $\mathcal{B}(y)=\{z:||Az-y||_{2}\le\epsilon\}$, the solution $\hat{x}$ to (1.12) obeys the following bound.

  • Type of Guarantee: Probabilistic (over the random choice of the matrix A).
  • What it says: If you use a random sensing matrix $A$, you can achieve the “simple” error bound with high probability, using far fewer measurements.
  • Error Bound:

$$||\hat{x}-x||_{2}\le\frac{8\sqrt{1+\delta_{2k}}-(1+\sqrt{2})\delta_{2k}}{1-(1+\sqrt{2})\delta_{2k}}\sigma_{k}(x)_{2}$$

Instance-Optimal Guarantees

An “instance-optimal” guarantee means the quality of the recovery error bound adapts to the specific signal instance $x$ that you are measuring. Instead of a single worst-case error for all signals, the error bound is a function of the properties of that particular signal, such as how compressible it is (e.g., as measured by $\sigma_k(x)_1$ or $\sigma_k(x)_2$).

In the deterministic case, examples are Theorems 1.8, 1.9, and 1.10.

This concept is then combined with probabilistic guarantees in two different ways.

The Two “Flavors” of Probabilistic Guarantees

The distinction comes down to whether one “good” random matrix $A$ works for all signals, or if the guarantee only applies to a specific signal-matrix pair.

1. The Stronger Guarantee: A “Universal” Random Matrix

This is the typical approach.

  • The Process: You generate a random matrix $A$ once.
  • The Guarantee: With very high probability, this single matrix $A$ you generated is “good.” A “good” matrix is one that satisfies a deterministic guarantee (like the RIP-based theorems) for all possible signals $x$.

2. The Weaker Guarantee: “Instance-Optimal in Probability”

This is what Theorem 1.14 provides.

  • The Process: You are first given a specific signal instance $x$. Then, you generate a random matrix $A$ to measure it.
  • The Guarantee: With high probability, the recovery of that specific signal $x$ using that specific matrix $A$ will be successful. The guarantee is probabilistic and only applies to the instance you started with. It doesn’t promise anything about how that same matrix $A$ would perform on a different signal.

In summary, “instance-optimal in probability” is the weaker type of probabilistic guarantee where the success probability is tied to a specific signal instance, and you might need to draw a new random matrix for each new signal.

Non Instance-Optimal Guarantees

Examples are Corollaries 1.1, 1.2 and Theorems 1.11, 1.12.

These guarantees provide a single, uniform error bound that applies to the entire class of k-sparse signals. As long as $x$ is k-sparse, the error bound is the same regardless of which specific k-sparse signal it is. The bound depends on parameters like $m$, $n$, $k$, and the noise level, but not on the structure of $x$ beyond its k-sparsity.