
Convex Optimization with Abstract Linear Operators

Steven Diamond and Stephen Boyd


Dept. of Computer Science and Electrical Engineering, Stanford University
{stevend2, boyd}@stanford.edu

Abstract


We introduce a convex optimization modeling framework that transforms a convex optimization problem expressed in a form natural and convenient for the user into an equivalent cone program in a way that preserves fast linear transforms in the original problem. By representing linear functions in the transformation process not as matrices, but as graphs that encode composition of abstract linear operators, we arrive at a matrix-free cone program, i.e., one whose data matrix is represented by an abstract linear operator and its adjoint. This cone program can then be solved by a matrix-free cone solver. By combining the matrix-free modeling framework and cone solver, we obtain a general method for efficiently solving convex optimization problems involving fast linear transforms.

1. Introduction

Convex optimization modeling systems like YALMIP [38], CVX [28], CVXPY [16], and Convex.jl [47] provide an automated framework for converting a convex optimization problem expressed in a natural human-readable form into the standard form required by a generic solver, calling the solver, and transforming the solution back to the human-readable form. This allows users to form and solve convex optimization problems quickly and efficiently. These systems easily handle problems with a few thousand variables, as well as much larger problems (say, with hundreds of thousands of variables) with enough sparsity structure, which generic solvers can exploit.

The overhead of the problem transformation, and the additional variables and constraints introduced in the transformation process, result in longer solve times than can be obtained with a custom algorithm tailored specifically for the particular problem. Perhaps surprisingly, the additional solve time (compared to a custom solver) for a modeling system coupled to a generic solver is often not as much as one might imagine, at least for modest sized problems. In many cases the convenience of easily expressing the problem makes up for the increased solve time using a convex optimization modeling system.

Many convex optimization problems in applications like signal and image processing, or medical imaging, involve hundreds of thousands or many millions of variables, and so are well out of the range that current modeling systems can handle. There are two reasons for this. First, the standard form problem that would be created is too large to store on a single machine, and second, even if it could be stored, standard interior-point solvers would be too slow to solve it. Yet many of these problems are readily solved on a single machine by custom solvers, which exploit fast linear transforms in the problems. The key to these custom solvers is to use the fast transforms directly, never forming the associated matrix. For this reason these algorithms are sometimes referred to as matrix-free solvers.

The literature on matrix-free solvers in signal and image processing is extensive; see, e.g., [3, 4, 10, 9, 25, 49]. There has been particular interest in matrix-free solvers for LASSO and basis pursuit denoising problems [4, 11, 22, 18, 31, 48]. The most general matrix-free solvers target semidefinite programs [32] or quadratic programs and related problems [43, 26]. The software closest to a convex optimization modeling system for matrix-free problems is TFOCS, which allows users to specify many types of convex problems and solve them using a variety of matrix-free first-order methods [5].

To better understand the advantages of matrix-free solvers, consider the nonnegative deconvolution problem

    minimize    ‖c ∗ x − b‖₂
    subject to  x ≥ 0,                                                        (1)

where x ∈ R^n is the optimization variable, c ∈ R^n and b ∈ R^{2n−1} are problem data, and ∗ denotes convolution. Note that the problem data has size O(n). There are many custom matrix-free methods for efficiently solving this problem, with O(n) memory and a few hundred iterations, each of which costs O(n log n) floating point operations (flops). It is entirely practical to solve instances of this problem of size n = 10^7 on a single computer [36].

Existing convex optimization modeling systems fall far short of the efficiency of matrix-free solvers on problem (1).

These modeling systems target a standard form in which a problem's linear structure is represented as a sparse matrix. As a result, linear functions must be converted into explicit matrix multiplication. In particular, the operation of convolving by c will be represented as multiplication by a (2n − 1) × n Toeplitz matrix C. A modeling system will thus transform problem (1) into the problem

    minimize    ‖Cx − b‖₂
    subject to  x ≥ 0,                                                        (2)

as part of the conversion into standard form.

Once the transformation from (1) to (2) has taken place, there is no hope of solving the problem efficiently. The explicit matrix representation of C requires O(n^2) memory. A typical interior-point method for solving the transformed problem will take a few tens of iterations, each requiring O(n^3) flops. For this reason existing convex optimization modeling systems will struggle to solve instances of problem (1) with n = 10^4, and when they are able to solve the problem, they will be dramatically slower than custom matrix-free methods.

The key to matrix-free methods is to exploit fast algorithms for evaluating a linear function and its adjoint. We call an implementation of a linear function that allows us to evaluate the function and its adjoint a forward-adjoint oracle (FAO). In this paper we describe a new algorithm for converting convex optimization problems into standard form while preserving fast linear transforms. The algorithm expresses the standard form's linear structure as an abstract linear operator (specifically, a graph of FAOs) rather than as an explicit sparse matrix.

Our new algorithm yields a convex optimization modeling system that can take advantage of fast linear transforms, and can be used to solve large problems such as those arising in image and signal processing and other areas, with millions of variables. This allows users to rapidly prototype and implement new convex optimization based methods for large-scale problems. As with current modeling systems, the goal is not to attain (or beat) the performance of a custom solver tuned for the specific problem; rather it is to make the specification of the problem straightforward, while increasing solve times only moderately.

Due to space limitations, we cannot give the full details of our approach. A longer paper still in development contains the details, as well as additional references and numerical examples [15].

2. Forward-adjoint oracles

2.1. Definition

A general linear function f : R^n → R^m can be represented on a computer as a dense matrix A ∈ R^{m×n} using O(mn) bytes. We can evaluate f(x) on an input x ∈ R^n in O(mn) flops by computing the matrix-vector multiplication Ax. We can likewise evaluate the adjoint f*(y) = A^T y on an input y ∈ R^m in O(mn) flops by computing A^T y.

Many linear functions arising in applications have structure that allows the function and its adjoint to be evaluated in fewer than O(mn) flops or using fewer than O(mn) bytes of data. The algorithms and data structures used to evaluate such a function and its adjoint can differ wildly. It is thus useful to abstract away the details and view linear functions as forward-adjoint oracles (FAOs), i.e., a tuple Γ = (f, Φ_f, Φ_{f*}) where f is a linear function, Φ_f is an algorithm for evaluating f, and Φ_{f*} is an algorithm for evaluating f*. For simplicity we assume that the algorithms Φ_f and Φ_{f*} in an FAO read from an input array and write to an output array (which can be the same as the input array). We use n to denote the length of the input array and m to denote the length of the output array.

While we focus on linear functions from R^n into R^m, the same techniques can be used to handle linear functions involving complex arguments or values, i.e., from C^n into C^m, from R^n into C^m, or from C^n into R^m, using the standard embedding of complex n-vectors into real 2n-vectors. This is useful for problems in which complex data arise naturally (e.g., in signal processing and communications), and also in some cases that involve only real data, where complex intermediate results appear (typically via an FFT).

2.2. Examples

In this section we describe some useful FAOs. In many of the examples the domain or range are naturally viewed as matrices or Cartesian products rather than as vectors in R^n and R^m. Matrices are treated as vectors by stacking the columns into a single vector; Cartesian products are treated as vectors by stacking the components. For the purpose of determining the adjoint, we still regard these FAOs as functions from R^n into R^m.

Multiplication by a sparse matrix. Multiplication by a sparse matrix A ∈ R^{m×n}, i.e., a matrix with many zero entries, is represented by the FAO Γ = (f, Φ_f, Φ_{f*}), where f(x) = Ax. The adjoint f*(u) = A^T u is also multiplication by a sparse matrix. The algorithms Φ_f and Φ_{f*} are the standard algorithms for multiplying by a sparse matrix in (for example) compressed sparse row format. Evaluating Φ_f and Φ_{f*} requires O(nnz(A)) flops and O(nnz(A)) bytes of data to store A and A^T, where nnz(A) is the number of nonzero elements of the sparse matrix [14, Chap. 2].
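For concreteness, a minimal Python sketch of such an FAO (ours, not the paper's implementation) represents the oracle as a pair of callables; the dot-product identity ⟨f(x), y⟩ = ⟨x, f*(y)⟩, which any correct forward/adjoint pair must satisfy, gives a quick check:

    import numpy as np
    import scipy.sparse as sp

    def sparse_fao(A):
        """FAO for multiplication by a sparse matrix A: forward x -> A x and
        adjoint y -> A^T y, each costing O(nnz(A)) flops."""
        A = sp.csr_matrix(A)
        AT = sp.csr_matrix(A.T)            # store A^T as well for a fast adjoint
        return (lambda x: A @ x), (lambda y: AT @ y)

    # Dot-product test of the adjoint.
    A = sp.random(300, 200, density=0.01, format="csr")
    f, fstar = sparse_fao(A)
    x, y = np.random.randn(200), np.random.randn(300)
    assert abs(f(x) @ y - x @ fstar(y)) < 1e-8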
Multiplication by a low-rank matrix. Multiplication by a matrix A ∈ R^{m×n} with rank k, where k ≪ m and k ≪ n, is represented by the FAO Γ = (f, Φ_f, Φ_{f*}), where f(x) = Ax. The matrix A can be factored as A = BC, where B ∈ R^{m×k} and C ∈ R^{k×n}. The adjoint f*(u) = C^T B^T u is also multiplication by a rank k matrix. The algorithm Φ_f evaluates f(x) by first evaluating z = Cx and then evaluating f(x) = Bz. Similarly, Φ_{f*} multiplies by B^T and then C^T. The algorithms Φ_f and Φ_{f*} require O(k(m + n)) flops and use O(k(m + n)) bytes of data to store B and C and their transposes. Multiplication by a low-rank matrix occurs in many applications, and it is often possible to approximate multiplication by a full rank matrix with multiplication by a low-rank one, using the singular value decomposition or methods such as sketching [34].

Discrete Fourier transform. The discrete Fourier transform (DFT) is represented by the FAO Γ = (f, Φ_f, Φ_{f*}), where f : R^{2p} → R^{2p} is given by

    f(x)_k     = (1/√p) Σ_{j=1}^{p} [ ℜ(ω_p^{(j−1)(k−1)}) x_j − ℑ(ω_p^{(j−1)(k−1)}) x_{j+p} ]
                                                                              (3)
    f(x)_{k+p} = (1/√p) Σ_{j=1}^{p} [ ℑ(ω_p^{(j−1)(k−1)}) x_j + ℜ(ω_p^{(j−1)(k−1)}) x_{j+p} ]

for k = 1, …, p. Here ω_p = e^{−2πi/p}. The adjoint f* is the inverse DFT. The algorithm Φ_f is the fast Fourier transform (FFT), while Φ_{f*} is the inverse FFT. The algorithms can be evaluated in O((m + n) log(m + n)) flops, using only O(1) bytes of data to store the dimensions of f's input and output [13, 37]. Here m = n = 2p.

The 2-D DFT has the same computational complexity. In its FAO representation Γ = (f, Φ_f, Φ_{f*}), the algorithms Φ_f and Φ_{f*} also require O((m + n) log(m + n)) flops, using only O(1) bytes of data to store the dimensions of f's input and output [35, 37].

Convolution. Convolution with a kernel c ∈ R^p is defined as f : R^n → R^m, where

    f(x)_k = Σ_{i+j=k+1} c_i x_j,    k = 1, …, m.                             (4)

Different variants of convolution restrict the indices i, j to different ranges, or interpret vector elements outside their natural ranges as zero or using periodic (circular) indexing.

Standard (column) convolution takes m = n + p − 1, and defines c_i and x_j in (4) as zero when the index is outside its range. In this case the associated matrix Col(c) ∈ R^{(n+p−1)×n} is Toeplitz, with each column a shifted version of c:

    Col(c) = [ c_1                 ]
             [ c_2    ⋱            ]
             [  ⋮     ⋱    c_1     ]
             [ c_p          c_2    ]
             [        ⋱      ⋮     ]
             [              c_p    ]

Another standard form, row convolution, restricts the indices in (4) to the range k = p, …, n. For simplicity we assume that n ≥ p. In this case the associated matrix Row(c) ∈ R^{(n−p+1)×n} is Toeplitz, with each row a shifted version of c, in reverse order:

    Row(c) = [ c_p  c_{p−1}  …  c_1                        ]
             [          ⋱        ⋱         ⋱              ]
             [                c_p  c_{p−1}  …  c_1         ]

The matrices Col(c) and Row(c) are related by the equalities

    Col(c)^T = Row(rev(c)),
    Row(c)^T = Col(rev(c)),                                                   (5)

where rev(c)_k = c_{p−k+1} reverses the order of the entries of c.

Column convolution with c ∈ R^p is represented by the FAO Γ = (f, Φ_f, Φ_{f*}), where f : R^n → R^{n+p−1} is given by f(x) = Col(c)x. The adjoint f* is row convolution with rev(c), i.e., f*(u) = Row(rev(c))u. The algorithms Φ_f and Φ_{f*} use the DFT to transform convolution into elementwise multiplication and require O((m + n) log(m + n)) flops [37]. Here m = n + p − 1. If the kernel is small (i.e., p ≪ n), Φ_f and Φ_{f*} instead evaluate (4) directly in O(np) flops. In either case, the algorithms Φ_f and Φ_{f*} use O(p) bytes of data to store c and rev(c).

The 2-D analogue of column convolution has the same computational complexity as the 1-D case. The adjoint of 2-D column convolution is a 2-D analogue of row convolution. 2-D column convolution with a kernel C ∈ R^{p×q} has an FAO representation Γ = (f, Φ_f, Φ_{f*}) where the algorithms Φ_f and Φ_{f*} require O(min{(m + n) log(m + n), pqn}) flops and use O(pq) bytes of data to store C [37, Chap. 4]. Often the kernel is parameterized (e.g., a Gaussian kernel), in which case more compact representations of C are possible [20, Chap. 7].
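A small NumPy sketch of the column convolution FAO (ours; it uses NumPy's direct convolution and correlation routines, whereas the algorithms described above switch to the DFT for large kernels):

    import numpy as np

    def conv_fao(c):
        """Column convolution FAO: forward x -> Col(c) x (full convolution of
        length n + p - 1); the adjoint u -> Row(rev(c)) u is 'valid'
        correlation of u with c, of length n."""
        forward = lambda x: np.convolve(c, x)
        adjoint = lambda u: np.correlate(u, c, mode="valid")
        return forward, adjoint

    n, p = 100, 7
    c = np.random.randn(p)
    f, fstar = conv_fao(c)
    x, u = np.random.randn(n), np.random.randn(n + p - 1)
    assert np.isclose(f(x) @ u, x @ fstar(u))    # <f(x), u> = <x, f*(u)>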
Fast transforms. There are many other linear functions for which the function and its adjoint can be computed efficiently. These typically have m = n, and are called transforms. Examples include the discrete wavelet, Hartley, Haar, and Walsh-Hadamard transforms, which can be evaluated in O(n) or O(n log n) flops (and the same for their adjoints). Due to space limitations we omit the details here.

Matrix product. Multiplication on the left by a matrix A ∈ R^{s×p} and on the right by a matrix B ∈ R^{q×t} is represented by the FAO Γ = (f, Φ_f, Φ_{f*}), where f : R^{p×q} → R^{s×t} is given by f(X) = AXB. The adjoint f*(U) = A^T U B^T is also a matrix product. The linear functions f and f* can be represented as matrix multiplication (with vector input and output) using Kronecker products.

There are two ways to implement Φ_f efficiently, corresponding to different orders of operations in multiplying out AXB. In one method we multiply by A first and B second, for a total of O(s(pq + qt)) flops (assuming that A and B are dense). In the other method we multiply by B first and A second, for a total of O(p(qt + st)) flops. The former method is more efficient if

    1/t + 1/p < 1/s + 1/q.

Similarly, there are two ways to implement Φ_{f*}, one requiring O(s(pq + qt)) flops and the other requiring O(p(qt + st)) flops. The algorithms Φ_f and Φ_{f*} use O(sp + qt) bytes of data to store A and B and their transposes. When p = q = s = t, the flop count for Φ_f and Φ_{f*} simplifies to O((m + n)^{1.5}) flops. Here m = n = pq. (When the matrices A or B are sparse, evaluating f(X) and f*(U) can be done even more efficiently.)

Sum and copy. The function sum : R^m × ··· × R^m → R^m with k inputs is represented by the FAO Γ = (f, Φ_f, Φ_{f*}), where f(x_1, …, x_k) = x_1 + ··· + x_k. The adjoint f* is the function copy : R^m → R^m × ··· × R^m, which outputs k copies of its input. The algorithms Φ_f and Φ_{f*} require O(m + n) flops to sum and copy their input, respectively, using only O(1) bytes of data to store the dimensions of f's input and output. Here n = km.

2.3. Compositions

In this section we consider compositions of FAOs. In fact we have already discussed several linear functions that are naturally and efficiently represented as compositions, such as multiplication by a low-rank matrix and the matrix product. Here though we present a data structure and algorithm for efficiently evaluating any composition and its adjoint, which gives us an FAO representing the composition.

A composition of FAOs can be represented using a directed acyclic graph (DAG) with exactly one node with no incoming edges (the start node) and exactly one node with no outgoing edges (the end node). We call such a representation an FAO DAG.

Figure 1: The FAO DAG for f(x) = Ax + Bx.

Each node in the FAO DAG stores the following attributes:

• An FAO Γ = (f, Φ_f, Φ_{f*}). Concretely, f is a symbol identifying the function, and Φ_f and Φ_{f*} are executable code.
• The data needed to evaluate Φ_f and Φ_{f*}.
• An input array.
• An output array.
• A list E_in of incoming edges.
• A list E_out of outgoing edges.

The input and output arrays can be the same when the FAO algorithms operate in place, i.e., write the output to the array storing the input. The total number of bytes needed to store an FAO DAG is dominated by the sum of the bytes of data on each node. When the same FAO occurs more than once in the FAO DAG, we can reduce the total bytes of data needed by sharing data across duplicate nodes.

As an example, figure 1 shows the FAO DAG for the composition f(x) = Ax + Bx, where A ∈ R^{m×n} and B ∈ R^{m×n} are dense matrices. The copy node duplicates the input x ∈ R^n into the multi-argument output (x, x) ∈ R^n × R^n. The A and B nodes multiply by A and B, respectively. The sum node sums two vectors together. The copy node is the start node, and the sum node is the end node. The FAO DAG requires O(mn) bytes to store, since the A and B nodes store the matrices A and B and their transposes.

Forward evaluation. To evaluate the composition f(x) = Ax + Bx using the FAO DAG in figure 1, we first evaluate the start node on the input x ∈ R^n, which copies x and sends it out on both outgoing edges. We evaluate the A and B nodes (serially or in parallel) on their incoming argument, and send the results (Ax and Bx) to the end node. Finally, we evaluate the end node on its incoming arguments to obtain the result Ax + Bx.

The general procedure for evaluating an FAO DAG is given in Algorithm 1. The algorithm evaluates the nodes in a topological order. The total flop count is the sum of the flops from evaluating the algorithm Φ_f on each node. If we allocate all scratch space needed by the FAO algorithms in advance, then no memory is allocated during the algorithm.

Algorithm 1 Evaluate an FAO DAG.

Input: G = (V, E) is an FAO DAG representing a function f. V is a list of nodes. E is a list of edges. I is a list of inputs to f. O is a list of outputs from f. Each element of I and O is represented as an array.

    Copy the elements of I onto the start node's input array.
    Create an empty queue Q for nodes that are ready to evaluate.
    Create an empty set S for nodes that have been evaluated.
    Add G's start node to Q.
    while Q is not empty do
        u ← pop the front node of Q.
        Evaluate u's algorithm Φ_f on u's input array, writing the result to u's output array.
        Add u to S.
        for each edge e = (u, v) in u's E_out do
            i ← the index of e in u's E_out.
            j ← the index of e in v's E_in.
            Copy the segment of u's output array holding output i onto the segment of v's input array holding input j.
            if for all edges (p, v) in v's E_in, p is in S then
                Add v to the end of Q.
    Copy the output array of G's end node onto the elements of O.

Postcondition: O contains the outputs of f applied to inputs I.
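A compact Python sketch of Algorithm 1 (ours, simplified: it passes lists of blocks between nodes rather than copying segments of preallocated arrays, and the node class and names are illustrative, not the paper's code):

    from collections import deque
    import numpy as np

    class FAONode:
        """Minimal DAG node: `algo` maps a list of input blocks to a list of
        output blocks, one per outgoing edge; `preds`/`succs` list neighbors."""
        def __init__(self, algo):
            self.algo, self.preds, self.succs = algo, [], []

    def evaluate(start, end, x):
        pending, queue, result = {start: [x]}, deque([start]), None
        while queue:
            u = queue.popleft()
            out = u.algo(pending.pop(u))              # run Phi_f on u's inputs
            if u is end:
                result = out
                continue
            for block, v in zip(out, u.succs):        # send output i along edge i
                pending.setdefault(v, []).append(block)
                if len(pending[v]) == len(v.preds):   # all inputs arrived: v is ready
                    queue.append(v)
        return result

    # The FAO DAG of figure 1: f(x) = A x + B x.
    A, B = np.random.randn(3, 4), np.random.randn(3, 4)
    copy = FAONode(lambda ins: [ins[0], ins[0]])
    mulA = FAONode(lambda ins: [A @ ins[0]])
    mulB = FAONode(lambda ins: [B @ ins[0]])
    add = FAONode(lambda ins: [ins[0] + ins[1]])
    copy.succs = [mulA, mulB]
    mulA.preds, mulA.succs = [copy], [add]
    mulB.preds, mulB.succs = [copy], [add]
    add.preds = [mulA, mulB]
    x = np.random.randn(4)
    assert np.allclose(evaluate(copy, add, x)[0], A @ x + B @ x)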
Adjoint evaluation. Given an FAO DAG G representing a function f, we can easily generate an FAO DAG G* representing the adjoint f*. We modify each node in G, replacing the node's FAO (f, Φ_f, Φ_{f*}) with the FAO (f*, Φ_{f*}, Φ_f) and swapping E_in and E_out. We also reverse the orientation of each edge in G. We can apply Algorithm 1 to the resulting graph G* to evaluate f*. Figure 2 shows the FAO DAG in figure 1 transformed into an FAO DAG for the adjoint.

Figure 2: The FAO DAG for the adjoint f*(u) = A^T u + B^T u.

Parallelism. Algorithm 1 can be easily parallelized, since the nodes in the ready queue Q can be evaluated in any order. A simple parallel implementation could use a thread pool with t threads to evaluate up to t nodes in the ready queue at a time. The extent to which parallelism speeds up evaluation of the composition graph depends on how many parallel paths there are in the graph, i.e., paths with no shared nodes. The evaluation of individual nodes can also be parallelized by replacing a node's algorithm Φ_f with a parallel variant. For example, the standard algorithms for dense and sparse matrix multiplication can be trivially parallelized.

3. Cone programs and solvers

3.1. Cone programs

A cone program is a convex optimization problem of the form

    minimize    c^T x
    subject to  Ax + b ∈ K,                                                   (6)

where x ∈ R^n is the optimization variable, K is a convex cone, and A ∈ R^{m×n}, c ∈ R^n, and b ∈ R^m are problem data. Cone programs are a broad class that include linear programs, second-order cone programs, and semidefinite programs as special cases [40, 8]. We call the cone program matrix-free if A is represented implicitly as an FAO, rather than explicitly as a dense or sparse matrix.

The convex cone K is typically a Cartesian product of simple convex cones from the following list:

• Zero cone: K_0 = {0}.
• Free cone: K_free = R.
• Nonnegative cone: K_+ = {x ∈ R | x ≥ 0}.
• Second-order cone: K_soc = {(x, t) ∈ R^{n+1} | x ∈ R^n, t ∈ R, ‖x‖₂ ≤ t}.
• Positive semidefinite cone: K_psd = {X ∈ S^n | z^T X z ≥ 0 for all z ∈ R^n}.
• Exponential cone: K_exp = {(x, y, z) ∈ R^3 | y > 0, y e^{x/y} ≤ z} ∪ {(x, 0, z) ∈ R^3 | x ≤ 0, z ≥ 0}.
• Power cone: K_pwr = {(x, y, z) ∈ R^3 | x^a y^{1−a} ≥ |z|, x, y ≥ 0}, where a ∈ [0, 1].
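As a concrete illustration (our rewriting; the text does not spell this out), the nonnegative deconvolution problem (1) can be put in the form (6) with variable (x, t) ∈ R^{n+1}, cost vector (0, …, 0, 1), and cone K = K_soc × K_+ × ··· × K_+ (n copies):

    minimize    t
    subject to  (Col(c)x − b, t) ∈ K_soc
                x_i ∈ K_+,    i = 1, …, n.

Here the role of A in (6) is played by the linear map (x, t) ↦ (Col(c)x, t, x); a matrix-free cone program represents this map by an FAO built from the convolution FAO of Section 2.2, rather than by an explicit Toeplitz matrix.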
These cones are useful in expressing common problems (via canonicalization), and can be handled by various solvers (as discussed below).

Cone programs that include only cones from certain subsets of the list above have special names. For example, if the only cones are zero, free, and nonnegative cones, the cone program is a linear program; if in addition it includes the second-order cone, it is called a second-order cone program. A well studied special case is so-called symmetric cone programs, which include the zero, free, nonnegative, second-order, and positive semidefinite cones. Semidefinite programs, where the cone constraint consists of a single positive semidefinite cone, are another common case.

3.2. Cone solvers

Many methods have been developed to solve cone programs, the most widely used being interior-point methods.

Interior-point. A large number of interior-point cone solvers have been implemented. Most support symmetric cone programs. For example, SDPT3 [46] and SeDuMi [44] are open-source solvers implemented in MATLAB; CVXOPT [2] is an open-source solver implemented in Python; MOSEK [39] is a commercial solver with interfaces to many languages. ECOS is an open-source cone solver written in library-free C that supports second-order cone programs [17]; Akle extended ECOS to support the exponential cone [1]. DSDP5 [6] and SDPA [23] are open-source solvers for semidefinite programs implemented in C and C++, respectively.

First-order. First-order methods are an alternative to interior-point methods that scale more easily to large cone programs, at the cost of lower accuracy. PDOS [12] is a first-order cone solver based on the alternating direction method of multipliers (ADMM) [7]. PDOS supports second-order cone programs. POGS [21] is an ADMM based solver that runs on a GPU, with a version that is similar to PDOS and targets second-order cone programs. SCS is another ADMM-based cone solver, which supports symmetric cone programs as well as the exponential and power cones [41]. Many other first-order algorithms can be applied to cone programs (e.g., [33, 9, 42]), but none have been implemented as a robust, general purpose cone solver.

Matrix-free. Matrix-free cone solvers are an area of active research, and a small number have been developed. PENNON is a matrix-free semidefinite program (SDP) solver [32]. PENNON solves a series of unconstrained optimization problems using Newton's method. The Newton step is computed using a preconditioned conjugate gradient method, rather than by factoring the Hessian directly [30]. Many other matrix-free algorithms for solving SDPs have been proposed; see, e.g., [24, 45, 50].

Several matrix-free solvers have been developed for quadratic programs (QPs), which are a superset of linear programs and a subset of second-order cone programs. Gondzio developed a matrix-free interior-point method for QPs that solves linear systems using a preconditioned conjugate gradient method [26]. PDCO is a matrix-free interior-point solver that can solve QPs [43], using LSMR to solve linear systems [19].

4. Matrix-free canonicalization

4.1. Canonicalization

Canonicalization is an algorithm that takes as input a data structure representing a general convex optimization problem and outputs a data structure representing an equivalent cone program. By solving the cone program, we recover the solution to the original optimization problem. This approach is used by convex optimization modeling systems such as YALMIP [38], CVX [28], CVXPY [16], and Convex.jl [47]. Current methods of canonicalization convert fast linear transforms in the original problem into multiplication by a dense or sparse matrix, which makes the final cone program far more costly to solve than the original problem.

The canonicalization algorithm can be modified, however, so that fast linear transforms are preserved. The key is to represent all linear functions arising during the canonicalization process as FAO DAGs instead of as sparse matrices. The FAO DAG representation of the final cone program can be used by a matrix-free cone solver to solve the cone program. The modified canonicalization algorithm never forms explicit matrix representations of linear functions. Hence we call the algorithm matrix-free canonicalization.

4.2. Informal overview

In this section we give an informal overview of the matrix-free canonicalization algorithm. A longer paper still in development contains the full details [15].

We are given an optimization problem

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,         i = 1, …, p                               (7)
                h_i(x) + d_i = 0,   i = 1, …, q,

where x ∈ R^n is the optimization variable, f_0 : R^n → R, …, f_p : R^n → R are convex functions, h_1 : R^n → R^{m_1}, …, h_q : R^n → R^{m_q} are linear functions, and d_1 ∈ R^{m_1}, …, d_q ∈ R^{m_q} are vector constants. Our goal is to convert the problem into an equivalent matrix-free cone program, so that we can solve it using a matrix-free cone solver.

We assume that the problem satisfies a set of requirements known as disciplined convex programming [27, 29]. The requirements ensure that each of the f_0, …, f_p can be represented as partial minimization over a cone program. Let each function f_i have the cone program representation

    f_i(x) = minimize    g_0^{(i)}(x, t^{(i)}) + e_0^{(i)}
             subject to  g_j^{(i)}(x, t^{(i)}) + e_j^{(i)} ∈ K_j^{(i)},       (8)

where the minimization is over the variable t^{(i)} ∈ R^{s^{(i)}}, the constraints are indexed over j = 1, …, r^{(i)}, g_0^{(i)}, …, g_{r^{(i)}}^{(i)} are linear functions, e_0^{(i)}, …, e_{r^{(i)}}^{(i)} are vector constants, and K_1^{(i)}, …, K_{r^{(i)}}^{(i)} are convex cones. For simplicity we assume here that all cone elements are real vectors.
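For example (an illustration we add here, not an excerpt from the paper), the function f(x) = ‖x‖₁ has the cone program representation

    f(x) = minimize    1^T t
           subject to  t − x ∈ K_+ × ··· × K_+
                       t + x ∈ K_+ × ··· × K_+,

with variable t ∈ R^n; in the notation of (8), g_0(x, t) = 1^T t, g_1(x, t) = t − x, g_2(x, t) = t + x, the constants e_0, e_1, e_2 are zero, and K_1 = K_2 is the product of n nonnegative cones.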
We rewrite problem (7) as the equivalent cone program

    minimize    g_0^{(0)}(x, t^{(0)}) + e_0^{(0)}
    subject to  −g_0^{(i)}(x, t^{(i)}) − e_0^{(i)} ∈ K_+
                g_j^{(i)}(x, t^{(i)}) + e_j^{(i)} ∈ K_j^{(i)}                 (9)
                h_i(x) + d_i ∈ K_0^{m_i},

where constraints indexed by i and j are over the obvious ranges. We convert problem (9) into the standard form for a matrix-free cone program given in (6) by representing g_0^{(0)} as the inner product with a vector c ∈ R^{n+s^{(0)}}, concatenating the d_i and e_j^{(i)} vectors into a single vector b, and representing the matrix A implicitly as the linear function that stacks the outputs of all the h_i and g_j^{(i)} (excluding the objective g_0^{(0)}) into a single vector.

5. Numerical results

5.1. Implementation

We have implemented the matrix-free canonicalization algorithm as an extension of CVXPY [16], available at

    https://github.com/SteveDiamond/cvxpy.

To solve the resulting matrix-free cone programs, we implemented modified versions of SCS [41] and POGS [21] that are truly matrix-free, available at

    https://github.com/SteveDiamond/scs,
    https://github.com/SteveDiamond/pogs.

(The details of these modifications will be described in future work.) Our implementations are still preliminary and can be improved in many ways. We also emphasize that the canonicalization is independent of the particular matrix-free cone solver used.

5.2. Nonnegative deconvolution

We applied our matrix-free convex optimization modeling system to the nonnegative deconvolution problem (1). We compare the performance with that of the current CVXPY modeling system, which represents the matrix A in a cone program as a sparse matrix and uses standard cone solvers.

Figure 3: Results for a problem instance with n = 1000.

The Python code below constructs and solves problem (1). The constants c and b and problem size n are defined elsewhere. The code is only a few lines, and it could be easily modified to add regularization on x or apply a different cost function to c ∗ x − b. The modeling system would automatically adapt to solve the modified problem.

    # Construct the optimization problem.
    x = Variable(n)
    cost = sum_squares(conv(c, x) - b)
    prob = Problem(Minimize(cost), [x >= 0])
    # Solve using matrix-free SCS.
    prob.solve(solver=MAT_FREE_SCS)

Problem instances. We used the following procedure to generate interesting (nontrivial) instances of problem (1). For all instances the vector c ∈ R^n was a Gaussian kernel with standard deviation n/10. All entries of c less than 10^{-6} were set to 10^{-6}, so that no entries were too close to zero. The vector b ∈ R^{2n−1} was generated by picking a solution x̃ with 5 entries randomly chosen to be nonzero. The values of the nonzero entries were chosen uniformly at random from the interval [0, n/10]. We set b = c ∗ x̃ + v, where the entries of the noise vector v ∈ R^{2n−1} were drawn from a normal distribution with mean zero and variance ‖c ∗ x̃‖₂/(400(2n − 1)). Our choice of v yielded a signal-to-noise ratio near 20.
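A NumPy sketch of this procedure (ours; the centering and normalization of the Gaussian kernel are assumptions, since the text does not specify them, and the exact code used for the experiments is not shown here):

    import numpy as np

    def generate_instance(n, seed=0):
        rng = np.random.RandomState(seed)
        # Gaussian kernel with standard deviation n/10, entries floored at 1e-6
        # (centered at n/2; the centering is an assumption).
        i = np.arange(n)
        c = np.exp(-((i - n / 2.0) ** 2) / (2 * (n / 10.0) ** 2))
        c = np.maximum(c, 1e-6)
        # Solution with 5 nonzero entries drawn uniformly from [0, n/10].
        x_true = np.zeros(n)
        idx = rng.choice(n, 5, replace=False)
        x_true[idx] = rng.uniform(0, n / 10.0, 5)
        # Noisy measurements b = c * x_true + v.
        clean = np.convolve(c, x_true)                    # length 2n - 1
        var = np.linalg.norm(clean) / (400.0 * (2 * n - 1))
        b = clean + rng.normal(0.0, np.sqrt(var), 2 * n - 1)
        return c, x_true, b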
While not relevant to solving the optimization problem, the solution of the nonnegative deconvolution problem often, but not always, (approximately) recovers the original vector x̃. Figure 3 shows the solution recovered by ECOS [17] for a problem instance with n = 1000. The ECOS solution x⋆ had a cluster of 3–5 adjacent nonzero entries around each spike in x̃. The sum of the entries was close to the value of the spike. The recovered x in figure 3 shows only the largest entry in each cluster, with value set to the sum of the cluster's entries.

Figure 4: Solve time in seconds T versus variable size n.

Results. Figure 4 compares the performance on problem (1) of the interior-point solver ECOS [17] and matrix-free versions of SCS and POGS as the size n of the optimization variable increases. We limited the solvers to 10^4 seconds. ECOS and matrix-free SCS were run serially on an Intel Xeon processor, while matrix-free POGS was run on a Titan X GPU.

For each variable size n we generated ten different problem instances and recorded the average solve time for each solver. ECOS and matrix-free SCS were run with an absolute and relative tolerance of 10^{-3} for the duality gap, ℓ₂ norm of the primal residual, and ℓ₂ norm of the dual residual. Matrix-free POGS was run with an absolute tolerance of 10^{-4} and a relative tolerance of 10^{-3}.

The slopes of the lines show how the solvers scale. The least-squares linear fit for the ECOS solve times has slope 3.1, which indicates that the solve time scales like n^3, as expected. The least-squares linear fit for the matrix-free SCS solve times has slope 1.3, which indicates that the solve time scales like the expected n log n. The least-squares linear fit for the matrix-free POGS solve times in the range n ∈ [10^5, 10^7] has slope 1.1, which indicates that the solve time scales like the expected n log n. For n < 10^5, the GPU overhead (launching kernels, etc.) dominates, and the solve time is nearly constant.

References

[1] S. Akle. Algorithms for unsymmetric cone optimization and an implementation for problems with the exponential cone. PhD thesis, Stanford University, 2015.
[2] M. Andersen, J. Dahl, and L. Vandenberghe. CVXOPT: Python software for convex optimization, version 1.1. http://cvxopt.org/, May 2015.
[3] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing, 18(11):2419–2434, Nov. 2009.
[4] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[5] S. Becker, E. Candès, and M. Grant. Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation, 3(3):165–218, 2011.
[6] S. Benson and Y. Ye. DSDP5: Software for semidefinite programming. Technical Report ANL/MCS-P1289-0905, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, Sept. 2005. Submitted to ACM Transactions on Mathematical Software.
[7] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3:1–122, 2011.
[8] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[9] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, May 2011.
[10] T. Chan, S. Esedoglu, and M. Nikolova. Algorithms for finding global minimizers of image segmentation and denoising models. SIAM Journal on Applied Mathematics, 66(5):1632–1648, 2006.
[11] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
[12] E. Chu, B. O'Donoghue, N. Parikh, and S. Boyd. A primal-dual operator splitting method for conic optimization. Preprint, 2013. http://stanford.edu/~boyd/papers/pdf/pdos.pdf.
[13] J. Cooley and J. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301, 1965.
[14] T. Davis. Direct Methods for Sparse Linear Systems (Fundamentals of Algorithms 2). SIAM, Philadelphia, PA, USA, 2006.
[15] S. Diamond and S. Boyd. Matrix-free convex optimization modeling. Preprint, 2015. http://arxiv.org/pdf/1506.00760v1.pdf.
[16] S. Diamond, E. Chu, and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization, version 0.2. http://cvxpy.org/, May 2014.
[17] A. Domahidi, E. Chu, and S. Boyd. ECOS: An SOCP solver for embedded systems. In Proceedings of the European Control Conference, pages 3071–3076, 2013.
[18] M. Figueiredo, R. Nowak, and S. Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing, 1(4):586–597, Dec. 2007.
[19] D. Fong and M. Saunders. LSMR: An iterative algorithm for sparse least-squares problems. SIAM Journal on Scientific Computing, 33(5):2950–2971, 2011.
[20] D. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice Hall Professional Technical Reference, 2002.
[21] C. Fougner and S. Boyd. Parameter selection and preconditioning for a graph form solver. Preprint, 2015. http://arxiv.org/pdf/1503.08366v1.pdf.
[22] K. Fountoulakis, J. Gondzio, and P. Zhlobich. Matrix-free interior point method for compressed sensing problems. Preprint, 2012. http://arxiv.org/pdf/1208.5435.pdf.
[23] K. Fujisawa, M. Fukuda, K. Kobayashi, M. Kojima, K. Nakata, M. Nakata, and M. Yamashita. SDPA (semidefinite programming algorithm) user's manual – version 7.0.5. Technical report, 2008.
[24] M. Fukuda, M. Kojima, and M. Shida. Lagrangian dual interior-point methods for semidefinite programs. SIAM Journal on Optimization, 12(4):1007–1031, 2002.
[25] T. Goldstein and S. Osher. The split Bregman method for ℓ1-regularized problems. SIAM Journal on Imaging Sciences, 2(2):323–343, 2009.
[26] J. Gondzio. Matrix-free interior point method. Computational Optimization and Applications, 51(2):457–480, 2012.
[27] M. Grant. Disciplined Convex Programming. PhD thesis, Stanford University, 2004.
[28] M. Grant and S. Boyd. CVX: MATLAB software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, Mar. 2014.
[29] M. Grant, S. Boyd, and Y. Ye. Disciplined convex programming. In L. Liberti and N. Maculan, editors, Global Optimization: From Theory to Implementation, Nonconvex Optimization and its Applications, pages 155–210. Springer, 2006.
[30] M. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. J. Res. N.B.S., 49(6):409–436, 1952.
[31] S.-J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky. An interior-point method for large-scale ℓ1-regularized least squares. IEEE Journal on Selected Topics in Signal Processing, 1(4):606–617, Dec. 2007.
[32] M. Koc̆vara and M. Stingl. On the solution of large-scale SDP problems by the modified barrier method using iterative solvers. Mathematical Programming, 120(1):285–287, 2009.
[33] G. Lan, Z. Lu, and R. Monteiro. Primal-dual first-order methods with O(1/ε) iteration-complexity for cone programming. Mathematical Programming, 126(1):1–29, 2011.
[34] E. Liberty. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 581–588, 2013.
[35] J. Lim. Two-dimensional Signal and Image Processing. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1990.
[36] Y. Lin, D. Lee, and L. Saul. Nonnegative deconvolution for time of arrival estimation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 377–380, May 2004.
[37] C. V. Loan. Computational Frameworks for the Fast Fourier Transform. SIAM, 1992.
[38] J. Lofberg. YALMIP: A toolbox for modeling and optimization in MATLAB. In Proceedings of the IEEE International Symposium on Computer Aided Control Systems Design, pages 294–289, Sept. 2004.
[39] MOSEK optimization software, version 7. https://mosek.com/, Jan. 2015.
[40] Y. Nesterov and A. Nemirovsky. Conic formulation of a convex programming problem and duality. Optimization Methods and Software, 1(2):95–115, 1992.
[41] B. O'Donoghue, E. Chu, N. Parikh, and S. Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding. Preprint, 2015. http://stanford.edu/~boyd/papers/pdf/scs.pdf.
[42] T. Pock and A. Chambolle. Diagonal preconditioning for first order primal-dual algorithms in convex optimization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1762–1769, 2011.
[43] M. Saunders, B. Kim, C. Maes, S. Akle, and M. Zahr. PDCO: Primal-dual interior method for convex objectives. http://web.stanford.edu/group/SOL/software/pdco/, Nov. 2013.
[44] J. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11(1-4):625–653, 1999.
[45] K.-C. Toh. Solving large scale semidefinite programs via an iterative solver on the augmented systems. SIAM Journal on Optimization, 14(3):670–698, 2004.
[46] K.-C. Toh, M. Todd, and R. Tütüncü. SDPT3 — a MATLAB software package for semidefinite programming, version 4.0. Optimization Methods and Software, 11:545–581, 1999.
[47] M. Udell, K. Mohan, D. Zeng, J. Hong, S. Diamond, and S. Boyd. Convex optimization in Julia. SC14 Workshop on High Performance Technical Computing in Dynamic Languages, 2014.
[48] E. van den Berg and M. Friedlander. Probing the Pareto frontier for basis pursuit solutions. SIAM Journal on Scientific Computing, 31(2):890–912, 2009.
[49] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-ℓ1 optical flow. In Pattern Recognition, volume 4713 of Lecture Notes in Computer Science, pages 214–223. Springer Berlin Heidelberg, 2007.
[50] X.-Y. Zhao, D. Sun, and K.-C. Toh. A Newton-CG augmented Lagrangian method for semidefinite programming. SIAM Journal on Optimization, 20(4):1737–1765, 2010.
