
Differentially Private
Machine Learning
Theory, Algorithms, and Applications

Kamalika Chaudhuri (UCSD)
Anand D. Sarwate (Rutgers)
Logistics and Goals
• Tutorial Time: 2 hr (15 min break after first hour)
• What this tutorial will do:
• Motivate and define differential privacy
• Provide an overview of common methods and tools to
design differentially private ML algorithms
• What this tutorial will not do:
• Provide detailed results on the state of the art in
differentially private versions of specific problems
Learning Outcomes
At the end of the tutorial, you should be able to:
• Explain the definition of differential privacy,
• Design basic differentially private machine learning
algorithms using standard tools,
• Try different approaches for introducing differential
privacy into optimization methods,
• Understand the basics of privacy risk accounting,
• Understand how these ideas blend together in more
complex systems.
Motivating
Differential Privacy
Sensitive Data

Medical Records

Genetic Data

Search Logs
AOL Violates Privacy
Netflix Violates Privacy [NS08]

[Figure: a sparse users-by-movies ratings matrix (User 1, User 2, User 3, …).]

Knowing 2-8 of Alice's movie ratings and their dates reveals:

Whether Alice is in the dataset or not
Alice's other movie ratings
High-dimensional Data is Unique

Example: UCSD Employee Salary Table

Position Gender Department Ethnicity Salary

Faculty Female CSE SE Asian -

One employee (Kamalika) fits description!


Simply anonymizing data is unsafe!
Disease Association Studies [WLWTZ09]

[Figure: SNP correlations (R² values) for the Cancer group and for the Healthy group.]

Given the published correlations (R² values) and Alice's DNA, an adversary can tell:

Whether Alice is in the Cancer set or the Healthy set
Simply anonymizing data is unsafe!
Statistics on small data sets is unsafe!

[Figure: three-way tension between Privacy, Data Size, and Accuracy.]
Privacy Definition
The Setting

[Figure: a private (sensitive, non-public) data set D passes through a sanitizer at the
privacy barrier; on the public side, the sanitizer can release summary statistics, a
privacy-preserving synthetic dataset, or an ML model.]
Property of Sanitizer

[Same figure: private data set D, sanitizer, privacy barrier, public outputs.]
Aggregate information computable


Individual information protected
(robust to side-information)
Differential Privacy

[Figure: (Data + a person) → Algorithm → Outcome, compared with
(Data without that person) → Algorithm → Outcome.]

Participation of a person does not change the outcome.

Since a person has agency, they can decide whether or not to participate in a dataset.
Adversary

Prior Knowledge:
  A's genetic profile
  A smokes

Case 1: The study's outcome lets the adversary conclude that A has cancer.
[ Study violates A's privacy ]

Case 2: The study shows that smoking causes cancer; combined with the prior knowledge
that A smokes, the adversary concludes that A probably has cancer.
[ Study does not violate privacy ]
Differential Privacy [DMNS06]

Participation of a person does not change the outcome: since a person has agency, they
can decide whether or not to participate in a dataset.
How to ensure this?

…through randomness:

A(Data + person) and A(Data without that person) are random variables with close
distributions.

Randomness: added by the randomized algorithm A

Closeness: likelihood ratio at every point bounded
Differential Privacy [DMNS06]

[Figure: the densities p[A(D) = t] and p[A(D') = t] plotted over outcomes t; they are
close at every point.]

For all D, D' that differ in one person's value, if A is an ε-differentially private
randomized algorithm, then:

$$\sup_t\ \log \frac{p(A(D) = t)}{p(A(D') = t)} \le \epsilon$$

i.e., the max-divergence of p(A(D)) and p(A(D')) is at most ε.
Approx. Differential Privacy [DKM+06]

[Figure: the densities p[A(D) = t] and p[A(D') = t] plotted over outcomes t.]

For all D, D' that differ in one person's value, if A is an (ε, δ)-differentially
private randomized algorithm, then:

$$\max_{S:\ \Pr(A(D) \in S) > \delta}\ \log \frac{\Pr(A(D) \in S) - \delta}{\Pr(A(D') \in S)} \le \epsilon$$
Properties of
Differential Privacy
Property 1: Post-processing Invariance

privacy barrier
legitimate user 1 (✏1 , 1)
eriv ative
d
data
differentially non-private data derivative (✏1 ,
Private private post-
legitimate user 2 1)
data set algorithm processing

D data der
ivative
adversary (✏1 ,
(✏1 , 1) 1)

Risk doesn’t increase if you don’t touch the data again


Property 2: Graceful Composition

[Figure: several stages run on the private data set D behind the privacy barrier:
Preprocessing (ε₁, δ₁), Training (ε₂, δ₂), Cross-validation (ε₃, δ₃), Testing (ε₄, δ₄);
together they compose to $\left(\sum_i \epsilon_i,\ \sum_i \delta_i\right)$.]

Total privacy loss is the sum of the privacy losses.

(Better composition possible — coming up later)
How to achieve
Differential Privacy?
Tools for Differentially Private
Algorithm Design

• Global Sensitivity Method [DMNS06]


• Exponential Mechanism [MT07]

Many others we will not cover [DL09, NRS07, …]


Global Sensitivity Method [DMNS06]

Problem:
Given function f, sensitive dataset D
Find a differentially private approximation to f(D)

Example: f(D) = mean of data points in D


The Global Sensitivity Method [DMNS06]

Given: A function f, sensitive dataset D

Define: dist(D, D') = # of records in which D and D' differ

Global Sensitivity of f:

$$S(f) = \max_{dist(D, D') = 1} |f(D) - f(D')|$$
Laplace Mechanism

Global Sensitivity of f is  $S(f) = \max_{dist(D, D') = 1} |f(D) - f(D')|$

Output  $A(D) = f(D) + Z$, where  $Z \sim \mathrm{Lap}\!\left(S(f)/\epsilon\right)$:  ε-differentially private

Laplace distribution:  $\mathrm{Lap}(\lambda)$ has density  $p(z) \propto e^{-|z|/\lambda}$
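For concreteness, here is a minimal NumPy sketch of the Laplace mechanism (an
illustration, not code from the tutorial); `f_value` is the already-computed f(D) and
`sensitivity` is S(f), both supplied by the caller.

```python
import numpy as np

def laplace_mechanism(f_value, sensitivity, epsilon, rng=np.random.default_rng()):
    """Release f(D) + Lap(S(f)/epsilon) noise: epsilon-differentially private."""
    scale = sensitivity / epsilon
    return f_value + rng.laplace(loc=0.0, scale=scale, size=np.shape(f_value))
```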
Gaussian Mechanism

Global Sensitivity of f is  $S(f) = \max_{dist(D, D') = 1} |f(D) - f(D')|$

Output  $A(D) = f(D) + Z$, where  $Z \sim \mathcal{N}\!\left(0,\ \frac{2 \ln(1.25/\delta)\, S(f)^2}{\epsilon^2}\right)$:  (ε, δ)-differentially private
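A matching sketch for the Gaussian mechanism (illustrative), assuming the standard
calibration σ = √(2 ln(1.25/δ))·S(f)/ε (valid for ε < 1) and an ℓ2 sensitivity supplied
by the caller.

```python
import numpy as np

def gaussian_mechanism(f_value, l2_sensitivity, epsilon, delta, rng=np.random.default_rng()):
    """Release f(D) + N(0, sigma^2) noise: (epsilon, delta)-differentially private."""
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
    return f_value + rng.normal(loc=0.0, scale=sigma, size=np.shape(f_value))
```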
Example 1: Mean

f(D) = Mean(D), where each record is a scalar in [0,1]

Global Sensitivity of f = 1/n

Laplace Mechanism:

Output  $\mathrm{Mean}(D) + Z$, where  $Z \sim \frac{1}{n\epsilon}\,\mathrm{Lap}(0, 1)$
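A self-contained sketch of this private mean (illustrative only; the clipping step just
enforces the assumed [0, 1] data bounds):

```python
import numpy as np

def private_mean(x, epsilon, rng=np.random.default_rng()):
    """epsilon-DP mean of scalar records in [0, 1]; the mean has sensitivity 1/n."""
    x = np.clip(np.asarray(x, dtype=float), 0.0, 1.0)
    n = len(x)
    return x.mean() + rng.laplace(scale=1.0 / (n * epsilon))

print(private_mean([0.2, 0.7, 0.9, 0.4], epsilon=0.5))
```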
How to get Differential Privacy?

• Global Sensitivity Method [DMNS06]


• Two variants: Laplace and Gaussian
• Exponential Mechanism [MT07]
Many others we will not cover [DL09, NRS07, …]
Exponential Mechanism [MT07]

Problem:
Given a function f(w, D) and sensitive dataset D,
find a differentially private approximation to

$$w^* = \arg\max_w f(w, D)$$

Example: f(w, D) = accuracy of classifier w on dataset D
The Exponential Mechanism [MT07]

Suppose for any w,

$$|f(w, D) - f(w, D')| \le S$$

when D and D' differ in 1 record. Sample w from:

$$p(w) \propto e^{\epsilon f(w, D)/2S}$$

for ε-differential privacy.

[Figure: the score function f(w, D) with its argmax, next to the induced sampling
distribution Pr(w).]
Example: Parameter Tuning

Given validation data D and k classifiers w1, .., wk, (privately) find the classifier
with the highest accuracy on D.

Here, f(w, D) = classification accuracy of w on D.

For any w, and any D and D' that differ by one record,

$$|f(w, D) - f(w, D')| \le \frac{1}{|D|}$$

So, the exponential mechanism outputs wi with probability:

$$\Pr(w_i) \propto e^{\epsilon |D| f(w_i, D)/2}$$

[CMS11, CV13]
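A minimal sketch of this selection step (illustrative), assuming the k validation
accuracies have already been computed; the exponent ε|D|·accuracy/2 follows from the
1/|D| sensitivity above.

```python
import numpy as np

def exp_mech_select(accuracies, n_validation, epsilon, rng=np.random.default_rng()):
    """Return index i with Pr(i) proportional to exp(epsilon * n * accuracy_i / 2)."""
    scores = epsilon * n_validation * np.asarray(accuracies) / 2.0
    probs = np.exp(scores - scores.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return rng.choice(len(accuracies), p=probs)

best = exp_mech_select(accuracies=[0.81, 0.86, 0.79], n_validation=1000, epsilon=0.1)
```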
Summary

— Motivation
— What is differential privacy?
— Basic differential privacy algorithm design tools:
— The Global Sensitivity Method
— Laplace Mechanism
— Gaussian Mechanism

— Exponential Mechanism
Differential privacy in
estimation and prediction
Estimation and prediction problems
[Figure: behind the privacy barrier, an (ε, δ)-DP private estimator computes a DP
estimate ŵ of argmin_w f(w, D) from the private data set D of sample size n; quality is
the excess risk E[f(ŵ, z)] - E[f(w*, z)] under the risk functional f(w, ·).]

Statistical estimation: estimate a parameter or predictor using private data that has
good expected performance on future data.
Goal: Good privacy-accuracy-sample size tradeoff
Privacy and accuracy make different
assumptions about the data
[Same figure: private data set D, (ε, δ)-DP private estimator ŵ, excess risk.]

Privacy – differential privacy makes no assumptions on the data distribution: privacy
holds unconditionally.

Accuracy – accuracy is measured w.r.t. a “true population distribution”: expected excess
statistical risk.
Statistical Learning as Risk Minimization

$$w^* = \arg\min_w\ \frac{1}{n} \sum_{i=1}^n \ell(w, (x_i, y_i)) + R(w)$$

• Empirical Risk Minimization (ERM) is a common paradigm for prediction problems.

• Produces a predictor w for a label/response y given a vector of features/covariates x.

• Typically use a convex loss function and regularizer to “prevent overfitting.”
Why is ERM not private?

[Figure: two training sets D and D' that differ in a single point produce visibly
different minimizers w* and w'*, so it is easy for an adversary to tell the difference
between D and D'.]

[CMS11, RBHT12]
Kernel learning: even worse

• Kernel-based methods produce a classifier that is a function of the data points:

$$w(x) = \sum_{i=1}^n \alpha_i k(x, x_i)$$

• Even an adversary with black-box access to w could potentially learn those points.

[CMS11]
Privacy is compatible with learning

• Good learning algorithms generalize to the population distribution, not individuals.

• Stable learning algorithms generalize [BE02].

• Differential privacy can be interpreted as a form of stability that also implies
generalization [DFH+15, BNS+16].

• Two parts of the same story:
  Privacy implies generalization asymptotically.
  Tradeoffs between privacy-accuracy-sample size for finite n.
Revisiting ERM

$$w^* = \arg\min_w\ \frac{1}{n} \sum_{i=1}^n \ell(w, (x_i, y_i)) + R(w)$$

• Learning using (convex) optimization uses three steps:

  1. read in the data             → input perturbation
  2. form the objective function  → objective perturbation
  3. perform the minimization     → output perturbation

• We can try to introduce privacy in each step!
Privacy in ERM: options
input
perturbation
non-private
input preprocessing
perturbation

input
private
private sanitized
perturbation
database
database non-private
sanitized dataset non-non-
input
DPDP
optimization
database
optimization
preprocessing

private
private
objective
perturbation algorithm
learning
D input
perturbation
(",(",)w
)

noise addition
input
perturbation
or
randomization
ŵoutput
input
privacy (", ) privacy
privacy
(", ) barrier barrier
barrier
Local Privacy

[Figure: each data contributor applies input perturbation to their own record before it
reaches the sanitized database; a non-private algorithm runs on the collected data, and
the (ε, δ) privacy barrier sits at the contributors.]
• Local privacy: data contributors sanitize data before collection.

• Classical technique: randomized response [W65].

• Interactive variant can be minimax optimal [DJW13].
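A sketch of the randomized response mentioned above, in its standard ε-DP form (an
illustration, not the tutorial's code): each contributor reports their true bit with
probability e^ε/(1 + e^ε) and the flipped bit otherwise, and the analyst debiases the
aggregate.

```python
import numpy as np

def randomized_response(bit, epsilon, rng=np.random.default_rng()):
    """Report the true bit w.p. e^eps/(1+e^eps), else the flipped bit (locally eps-DP)."""
    p_truth = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit

def debiased_mean(reports, epsilon):
    """Unbiased estimate of the true proportion from the noisy reports."""
    p = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return (np.mean(reports) - (1.0 - p)) / (2.0 * p - 1.0)
```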


Input Perturbation

[Figure: a DP preprocessing step with an (ε, δ) guarantee turns the private database D
into a sanitized dataset; non-private learning then produces ŵ on the public side of the
privacy barrier.]

• Input perturbation: add noise to the input data.

• Advantages: easy to implement, results in a reusable sanitized data set.

[DJW13, FTS17]
Output Perturbation

[Figure: non-private preprocessing and optimization run on the private database D to get
w*; noise addition or randomization is applied to w*, and only the noisy ŵ crosses the
(ε, δ) privacy barrier.]

• Compute the minimizer and add noise.

• Does not require re-engineering baseline algorithms.

• Noise depends on the sensitivity of the argmin.

[CMS11, RBHT12]
Objective Perturbation

[Figure: DP optimization with an (ε, δ) guarantee runs directly on the private database
D; the perturbed-objective solution ŵ crosses the privacy barrier.]
Objective Perturbation

$$J(w) = \frac{1}{n} \sum_{i=1}^n \ell(w, (x_i, y_i)) + R(w)$$

A. Add a random term to the objective:

$$\hat{w}_{priv} = \arg\min_w\ J(w) + w^\top b$$

B. Do a randomized approximation of the objective:

$$\hat{w}_{priv} = \arg\min_w\ \hat{J}(w)$$

Randomness depends on the sensitivity properties of J(w).

[CMS11, ZZX+12]
Sensitivity of the argmin

$$w^* = \arg\min_w\ J(w) \qquad\qquad \hat{w}_{priv} = \arg\min_w\ J(w) + w^\top b$$

• Non-private optimization solves  $\nabla J(w) = 0 \Rightarrow w^*$

• Generate a vector analogue of Laplace noise:  $b \sim p(z) \propto e^{-(\epsilon/2)\|z\|}$

• Private optimization solves  $\nabla J(w) = -b \Rightarrow w_{priv}$

• Have to bound the sensitivity of the gradient.
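A minimal sketch (illustrative, not the tutorial's code) of drawing b from the density
p(z) ∝ exp(-(ε/2)‖z‖₂) in d dimensions: the norm follows a Gamma(d, 2/ε) distribution
and the direction is uniform on the sphere.

```python
import numpy as np

def sample_objective_noise(d, epsilon, rng=np.random.default_rng()):
    """Draw b with density proportional to exp(-(epsilon/2) * ||b||_2) in R^d."""
    norm = rng.gamma(shape=d, scale=2.0 / epsilon)   # radial part: Gamma(d, 2/eps)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)           # uniform direction on the sphere
    return norm * direction
```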


Theoretical bounds on excess risk

The same important parameters appear in all of the bounds:

• privacy parameters (ε, δ)
• data dimension d
• data bounds ‖xᵢ‖ ≤ B
• analytical properties of the loss (Lipschitz, smoothness)
• regularization parameter

Output and objective perturbation achieve excess risk
$\tilde{O}\!\left(\sqrt{d \log(1/\delta)}\,/\,(n\epsilon)\right)$ [CMS11, BST14]; input
perturbation achieves a comparable $\tilde{O}(\cdot)$ rate for quadratic loss [FTS17].
Typical empirical results

[Plots from [CMS11] and [JT14] omitted.]

In general:

• Objective perturbation empirically outperforms output perturbation.

• Gaussian mechanisms with (ε, δ) guarantees outperform Laplace-like mechanisms with
ε-guarantees.

• Loss vs. non-private methods is very dataset-dependent.
Gaps between theory and practice

• Theoretical analysis is for fixed privacy parameters – how should we choose them in
practice?

• Given a data set, can I tell what the privacy-utility-sample-size tradeoff is?

• What about more general optimization problems/algorithms?

• What about scaling (computationally) to large data sets?
Summary

• Training does not on its own guarantee privacy.

• There are many ways to incorporate DP into prediction and learning using ERM, with
different privacy-accuracy-sample size tradeoffs.

• Good DP algorithms should generalize since they learn about populations, not
individuals.

• Theory and experiment show that (ε, δ)-DP algorithms have better accuracy than ε-DP
algorithms at the cost of a weaker privacy guarantee.
Break
Differential privacy and
optimization algorithms
Scaling up private optimization

• Large data sets are challenging for optimization: batch methods are not feasible.

• Using more data can help our tradeoffs look better: better privacy and accuracy.

• Online learning involves multiple releases: potential for more privacy loss.

Goal: guarantee privacy using the optimization algorithm.


Stochastic Gradient Descent

• Stochastic gradient descent (SGD) is a moderately popular method for optimization.

• Stochastic gradients are random: already noisy, so already private?

• Optimization is iterative: intermediate results leak information.
Non-private SGD

$$J(w) = \frac{1}{n} \sum_{i=1}^n \ell(w, (x_i, y_i)) + R(w)$$

$w_0 = 0$
For $t = 1, 2, \ldots, T$:
  • select a random data point:  $i_t \sim \mathrm{Unif}\{1, 2, \ldots, n\}$
  • take a gradient step:
      $g_t = \nabla \ell(w_{t-1}, (x_{i_t}, y_{i_t})) + \nabla R(w_{t-1})$
      $w_t = \Pi_W(w_{t-1} - \eta_t g_t)$
$\hat{w} = w_T$
Private SGD with noise

$$J(w) = \frac{1}{n} \sum_{i=1}^n \ell(w, (x_i, y_i)) + R(w)$$

$w_0 = 0$
For $t = 1, 2, \ldots, T$:
  • select a random data point:  $i_t \sim \mathrm{Unif}\{1, 2, \ldots, n\}$
  • add noise to the gradient:
      $z_t \sim p_{(\epsilon, \delta)}(z)$
      $\hat{g}_t = z_t + \nabla \ell(w_{t-1}, (x_{i_t}, y_{i_t})) + \nabla R(w_{t-1})$
      $w_t = \Pi_W(w_{t-1} - \eta_t \hat{g}_t)$
$\hat{w} = w_T$

[SCS15]
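A minimal NumPy sketch of this noisy-gradient loop (illustrative only): it assumes
gradients are clipped to norm L so the per-step sensitivity is bounded, and that the
caller supplies `grad_loss`, `grad_reg`, and a noise scale `sigma` calibrated as on the
next slide; the projection onto W is omitted.

```python
import numpy as np

def private_sgd(data, grad_loss, grad_reg, d, T, eta, L, sigma, rng=np.random.default_rng()):
    """Noisy SGD sketch: clip each sample gradient to norm L, add Gaussian noise, step."""
    w = np.zeros(d)
    for t in range(T):
        x_i = data[rng.integers(len(data))]              # select a random data point
        g = grad_loss(w, x_i) + grad_reg(w)
        g *= min(1.0, L / (np.linalg.norm(g) + 1e-12))   # clip: bounds the sensitivity
        g_hat = g + rng.normal(scale=sigma, size=d)      # add calibrated Gaussian noise
        w = w - eta * g_hat
    return w
```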
Choosing a noise distribution

“Laplace” mechanism:  $p(z) \propto e^{-(\epsilon/2)\|z\|}$  gives ε-DP.

Gaussian mechanism:  $p(z) \sim \mathcal{N}\!\left(0,\ \tilde{O}\!\left(\frac{\log(1/\delta)}{\epsilon^2}\right)\right)$ i.i.d. per coordinate, gives (ε, δ)-DP.

• Have to choose the noise according to the sensitivity of the gradient:

$$\max_{D, D'}\ \max_w\ \|\nabla J(w; D) - \nabla J(w; D')\|$$

• Sensitivity depends on the data distribution, the Lipschitz parameter of the loss, etc.
Private SGD with randomized selection

$$J(w) = \frac{1}{n} \sum_{i=1}^n \ell(w, (x_i, y_i)) + R(w)$$

$w_0 = 0$
For $t = 1, 2, \ldots, T$:
  • select a random data point:  $i_t \sim \mathrm{Unif}\{1, 2, \ldots, n\}$
      $g_t = \nabla \ell(w_{t-1}, (x_{i_t}, y_{i_t})) + \nabla R(w_{t-1})$
  • randomly select an unbiased gradient estimate:
      $\hat{g}_t \sim p_{(\epsilon, \delta), g}(z)$
      $w_t = \Pi_W(w_{t-1} - \eta_t \hat{g}_t)$
$\hat{w} = w_T$

[DJW13]
Randomized directions [DJW13]

$$z_t = \pm\frac{L}{\|g\|}\, g, \quad \text{with probability } \frac{1}{2} + \frac{\|g\|}{2L} \text{ for the + sign}$$

Select the hemisphere in the direction of $z_t$ with probability $\frac{e^\epsilon}{1 + e^\epsilon}$,
or the opposite hemisphere with probability $\frac{1}{1 + e^\epsilon}$; pick uniformly from the
chosen hemisphere (scaled by a constant B) and take a step.

• Need to have control of the gradient norms: $\|g\| \le L$

• Keep some probability of going in the wrong direction.

Why does DP-SGD work?

Noisy Gradient:  $\hat{g} = g + z_t$.
Choose the noise distribution using the sensitivity of the gradient.

Random Gradient:  $z_t = \pm\frac{L}{\|g\|} g$ with probability $\frac{1}{2} + \frac{\|g\|}{2L}$ for the + sign;
$\hat{g} = B \cdot$ (a point from the hemisphere toward $z_t$ w.p. $\frac{e^\epsilon}{1 + e^\epsilon}$, from the opposite hemisphere w.p. $\frac{1}{1 + e^\epsilon}$).
Randomly select a direction biased towards the true gradient.

Both methods:

• Guarantee DP at each iteration.

• Ensure an unbiased estimate of g to guarantee convergence.
Making DP-SGD more practical

“SGD is robust to noise”

• True up to a point — for small epsilon (more privacy), the gradients can become too noisy.

• Solution 1: more iterations ([BST14]: need $O(n^2)$)

• Solution 2: use standard tricks: mini-batching, etc. [SCS13]

• Solution 3: use better analysis to show the privacy loss is not so bad [BST14], [ACG+16]
Randomly sampling data can amplify
privacy

[Figure: behind the privacy barrier, a uniformly random subsample of γn records from the
private data set D is fed to an ε-differentially private algorithm; the overall release
is 2γε-DP.]

• Suppose we have an algorithm A which is ε-differentially private for ε ≤ 1.

• Sample γn entries of D uniformly at random and run A on those.

• The randomized method guarantees 2γε-differential privacy.

[BBKN14, BST14]
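For intuition, a tiny worked example (illustrative only) combining this amplification
with basic composition over T minibatch iterations:

```python
# Per-iteration mechanism: eps_step-DP; sampling rate gamma = batch_size / n.
eps_step, n, batch_size, T = 0.5, 100_000, 100, 1_000
gamma = batch_size / n                  # 0.001
eps_amplified = 2 * gamma * eps_step    # 0.001 per iteration after amplification
eps_total = T * eps_amplified           # 1.0 under basic (additive) composition
print(eps_amplified, eps_total)
```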
Summary

• Stochastic gradient descent can be made differentially private in several ways by
randomizing the gradient.

• Keeping gradient estimates unbiased will help ensure convergence.

• Standard approaches for variance reduction/stability (such as minibatching) can help
with performance.

• Random subsampling of the data can amplify privacy guarantees.
Accounting for total
privacy risk
Measuring total privacy loss

[Figure: as in the post-processing slide, an (ε₁, δ₁) differentially private algorithm
releases data derivatives to legitimate users and adversaries; all downstream
non-private post-processing still carries only the (ε₁, δ₁) guarantee.]

Post-processing invariance: risk doesn't increase if you don't touch the data again.

• more complex algorithms have multiple stages: all stages have to guarantee DP

• need a way to do privacy accounting: what is lost over time / multiple queries?
A simple example

[Figure: the private data set D is used by four stages behind the privacy barrier:
Preprocessing (ε₁, δ₁), Training (ε₂, δ₂), Cross-validation (ε₃, δ₃), Testing (ε₄, δ₄).]
Composition property of
differential privacy

Basic composition: privacy loss is additive:

• Apply R algorithms with $(\epsilon_i, \delta_i),\ i = 1, 2, \ldots, R$

• Total privacy loss:  $\left( \sum_{i=1}^R \epsilon_i,\ \sum_{i=1}^R \delta_i \right)$

• Worst-case analysis: each result exposes the worst privacy risk.
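A toy accountant for this additive rule (an illustrative sketch, not a production
accountant):

```python
class BasicAccountant:
    """Track cumulative (epsilon, delta) under basic composition."""
    def __init__(self):
        self.eps, self.delta = 0.0, 0.0

    def spend(self, eps, delta=0.0):
        self.eps += eps          # privacy losses simply add up
        self.delta += delta
        return self.eps, self.delta

acct = BasicAccountant()
for stage_eps in [0.25, 0.5, 0.125, 0.125]:   # preprocessing, training, CV, testing
    acct.spend(stage_eps, delta=1e-6)
print(acct.eps, acct.delta)                   # 1.0, about 4e-06
```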
What composition says about
multi-stage methods

[Figure: the same four-stage pipeline: Preprocessing (ε₁, δ₁), Training (ε₂, δ₂),
Cross-validation (ε₃, δ₃), Testing (ε₄, δ₄).]

Total privacy loss is the sum of the privacy losses…


An open question: privacy
allocation across stages

Composition means we have a privacy budget.

How should we allocate privacy risk across the different stages of a pipeline?

• Noisy features + accurate training?

• Clean features + sloppy training?

It's application dependent! Still an open question…
A closer look at ε and δ

Gaussian noise of a given variance produces a spectrum of (ε, δ) guarantees:

[Figure: the densities p(τ|D) and p(τ|D') under Gaussian noise, with the likelihood
ratio bounded at two points:  $\frac{p(\tau_1|D)}{p(\tau_1|D')} \le \exp(\epsilon_1)$  and
$\frac{p(\tau_2|D)}{p(\tau_2|D')} \le \exp(\epsilon_2)$.]
Privacy loss as a random
variable

The spectrum of (ε, δ) guarantees means we can trade off ε and δ when analyzing a
particular mechanism.

The actual privacy loss is a random variable that depends on D:

$$Z_{D,D'} = \log \frac{p(A(D) = t)}{p(A(D') = t)} \quad \text{w.p. } p(A(D) = t)$$
Random privacy loss

$$Z_{D,D'} = \log \frac{p(A(D) = t)}{p(A(D') = t)} \quad \text{w.p. } p(A(D) = t)$$

• Even after maximizing over (D, D'), the loss is still a random variable.

• Sequentially computing functions on private data is like sequentially sampling
independent privacy losses.

• Concentration of measure shows that the total loss is much closer to its expectation.
Strong composition bounds

$$\underbrace{(\epsilon, \delta), (\epsilon, \delta), \ldots, (\epsilon, \delta)}_{k \text{ times}}$$

• Given only the (ε, δ) guarantees for k algorithms operating on the data.

• Composition again gives a family of (ε, δ) tradeoffs: we can quantify the privacy loss
by choosing any valid pair

$$\left( (k - 2i)\epsilon,\ 1 - (1 - \delta)^k (1 - \delta_i) \right), \qquad
\delta_i = \frac{\sum_{\ell=0}^{i-1} \binom{k}{\ell} \left( e^{(k-\ell)\epsilon} - e^{(k-2i+\ell)\epsilon} \right)}{(1 + e^\epsilon)^k}$$

[DRV10, KOV15]
Moments accountant

[Figure: the same pipeline of stages A1, A2, ... (Preprocessing, Training,
Cross-validation, Testing) composed into a single mechanism A with overall (ε, δ).]

Basic Idea: Directly calculate the parameters (ε, δ) from composing a sequence of
mechanisms.

More efficient than the composition theorems.

[ACG+16]
How to Compose Directly?

Given datasets D and D' with one different record and a mechanism A, define the privacy
loss random variable as:

$$Z_{D,D'} = \log \frac{p(A(D) = t)}{p(A(D') = t)}, \quad \text{w.p. } p(A(D) = t)$$

- Properties of $Z_{D,D'}$ are related to the privacy loss of A.

- If the max absolute value of $Z_{D,D'}$ over all D, D' is ε, then A is
(ε, 0)-differentially private.
How to Compose Directly?

$$Z_{D,D'} = \log \frac{p(A(D) = t)}{p(A(D') = t)}, \quad \text{w.p. } p(A(D) = t)$$

Challenge: to reason about the worst case over all D, D'.

Key idea in [ACG+16]: use moment generating functions.
Accounting for Moments…

[Figure: the pipeline of mechanisms A1, A2, ... composed into A with overall (ε, δ).]

Three Steps:
1. Calculate moment generating functions for A1, A2, ..
2. Compose
3. Calculate final privacy parameters

[ACG+16]
1. Stepwise Moments

[Figure: the pipeline of mechanisms A1, A2, ... applied to the private data set D.]

Define: the stepwise moment of At at any s:

$$\alpha_{A_t}(s) = \sup_{D, D'} \log \mathbb{E}\!\left[ e^{s Z_{D,D'}} \right]
\qquad \text{(D and D' differ by one record)}$$

[ACG+16]
2. Compose

[Figure: the pipeline of mechanisms A1, A2, ... applied to the private data set D.]

Theorem: Suppose A = (A1, …, AT). For any s:

$$\alpha_A(s) \le \sum_{t=1}^T \alpha_{A_t}(s)$$

[ACG+16]
3. Final Calculation

[Figure: the pipeline of mechanisms composed into A with overall (ε, δ).]

Theorem: For any ε, mechanism A is (ε, δ)-DP for

$$\delta = \min_s \exp\!\big(\alpha_A(s) - s\epsilon\big)$$

Use the theorem to find the best ε for a given δ, either from a closed form or by
searching over s1, s2, .., sk.

[ACG+16]
Example: composing the Gaussian mechanism

[Figure: the pipeline of mechanisms composed into A with overall (ε, δ).]

Suppose each At answers a query with global sensitivity 1 by adding N(0, 1) noise.
1. Stepwise Moments

[Figure: the pipeline of mechanisms A1, A2, ... applied to the private data set D.]

Suppose each At answers a query with global sensitivity 1 by adding N(0, 1) noise.

Simple algebra gives, for any s:

$$\alpha_{A_t}(s) = \frac{s(s + 1)}{2}$$
2. Compose

[Figure: the pipeline of mechanisms A1, A2, ... applied to the private data set D.]

Suppose each At answers a query with global sensitivity 1 by adding N(0, 1) noise. Then:

$$\alpha_A(s) \le \sum_{t=1}^T \alpha_{A_t}(s) = \frac{T s(s + 1)}{2}$$
3. Final Calculation

[Figure: the pipeline of mechanisms composed into A with overall (ε, δ).]

Find the lowest δ for a given ε (or vice versa) by solving:

$$\delta = \min_s \exp\!\big(T s(s + 1)/2 - s\epsilon\big)$$

In this case, the solution can be found in closed form.
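A small numeric sketch of this final step (illustrative): instead of the closed form, it
simply minimizes the bound over a grid of s values.

```python
import numpy as np

def delta_for_epsilon(T, eps, s_grid=np.linspace(0.0, 100.0, 100_001)):
    """delta = min_s exp(T*s*(s+1)/2 - s*eps) for T compositions of the N(0,1) example."""
    log_delta = T * s_grid * (s_grid + 1) / 2.0 - s_grid * eps
    return float(np.exp(log_delta.min()))     # exponentiate only the minimum: no overflow

print(delta_for_epsilon(T=10, eps=20.0))      # about 1.3e-05
```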
How does it compare?

[Plot: epsilon vs. number of rounds of composition; the strong composition bound of
[DRV10] grows better than linearly, and the Moments Accountant grows more slowly still.]

[ACG+16]
How does it compare
on real data?

[Plot: EM for a mixture of Gaussians (MOG) with the Gaussian mechanism, δ = 10⁻⁴.]

[PFCW17]
Summary

• Practical machine learning looks at the data many times.

• Post-processing invariance means we just have to track the cumulative privacy loss.

• Good composition methods use the fact that the actual privacy loss may behave much
better than the worst-case bound.

• The Moments Accountant method tracks the actual privacy loss more accurately: better
analysis for better privacy guarantees.
Applications to modern
machine learning
When is differential privacy practical?

Differential privacy is best suited for understanding population-level statistics and
structure:

• Inferences about the population should not depend strongly on individuals.

• Large sample sizes usually mean lower sensitivity and less noise.

To build and analyze systems we have to leverage the post-processing invariance and
composition properties.
Differential privacy in practice

Google: RAPPOR for tracking statistics in Chrome.

Apple: various iPhone usage statistics.

Census: the 2020 US Census will use differential privacy.

Mostly focused on count and average statistics.


Challenges for machine learning
applications

Differentially private ML is complicated because real ML algorithms are complicated:

• Multi-stage pipelines, parameter tuning, etc.

• Need to “play around” with the data before committing to a particular
pipeline/algorithm.

• “Modern” ML approaches (= deep learning) have many parameters and less theoretical
guidance.
Some selected examples

[Figure: two preview diagrams: the DP-SGD training loop for deep networks (select a
random lot, compute gradients, clip and add noise, update parameters, track loss with
the moments accountant) and original vs. processed posteriors for Bayesian inference,
both behind an (ε, δ) privacy barrier.]

For today, we will describe some recent examples:

1. Differentially private deep learning [ACG+16]
2. Differential privacy and Bayesian inference
Differential privacy and deep learning

Main idea: train a deep network using differentially private SGD and use the moments
accountant to track privacy loss.

Additional components: gradient clipping, minibatching, data augmentation, etc.

[ACG+16]
Overview of the algorithm

[Figure: in each round, select a random lot L1, L2, ... from the private data set D,
compute per-example gradients, clip them and add noise, update the network parameters
θ, and feed the noise scale into the moments accountant, which reports the overall
(ε, δ) across the privacy barrier.]

[ACG+16]
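A minimal NumPy sketch of the noisy lot-gradient step in this figure (an illustration of
the recipe, not the reference implementation); it assumes per-example gradients are
already available as rows of `grads`, and the clipping norm C and noise multiplier σ are
user-chosen hyperparameters.

```python
import numpy as np

def noisy_lot_gradient(grads, clip_norm, noise_multiplier, rng=np.random.default_rng()):
    """Clip each per-example gradient to L2 norm C, sum, add N(0, (sigma*C)^2) noise, average."""
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    clipped = grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
    noisy_sum = clipped.sum(axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=grads.shape[1])
    return noisy_sum / len(grads)    # averaged noisy gradient for the parameter update
```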
Effectiveness of DP deep learning

Empirical results on MNIST and CIFAR:

• Training and test error come close to baseline non-private deep learning methods.

• To get a moderate loss in performance, epsilon and delta are not “negligible”.

[ACG+16]
Moving forward in deep learning

This is a good proof of concept for differential privacy for deep neural networks.
There are lots of interesting ways to expand this:

• Just one NN model was used: what about other architectures? RNNs? GANs?

• Can regularization methods for deep learning (e.g. dropout) help with privacy?

• What are good rules of thumb for lot/batch size, learning rate, # of hidden units, etc.?

[ACG+16]
Differentially Private
Bayesian Inference

Data X = { x1, x2, … } and model class Θ, related through the likelihood p(x|θ).

Prior π(θ) + Data X = Posterior p(θ|X)

Goal: find a differentially private approximation to the posterior.


Differentially Private
Bayesian Inference

• General methods for private posterior approximation
• A Special Case: Exponential Families
• Variational Inference
How to make posterior private?
Option 1: Direct posterior sampling [DMNR14]

Not differentially private except under restrictive conditions: likelihood ratios may be
unbounded!

[GSC17] The answer changes under a new relaxation: Rényi differential privacy [M17].
How to make posterior private?

Option 2: One Posterior Sample (OPS) Method [WFS15]

[Figure: original posteriors vs. processed posteriors]

1. Truncate the posterior so that the likelihood ratio is bounded in the truncated
region.

2. Raise the truncated posterior to a higher temperature.

How to make posterior private?
Option 2: One Posterior Sample (OPS) Method:

Advantage: General

Pitfalls:
— Intractable: only the exact distribution is private
— Low statistical efficiency even for large n
How to make posterior private?
Option 3: Approximate the OPS distribution via
Stochastic Gradient MCMC [WFS15]

[Figure: original posteriors vs. processed posteriors]

Advantage: Noise added during stochastic gradient MCMC contributes to privacy.

Disadvantage: Statistical efficiency is lower than exact OPS.


Differentially Private
Bayesian Inference

• General methods for private posterior approximation
• A Special Case: Exponential Families
• Variational Inference
Exponential Family Posteriors

The (non-private) posterior comes from an exponential family:

$$p(\theta|x) \propto e^{\eta(\theta)^\top \left(\sum_i T(x_i)\right) - B(\theta)}$$

given data x1, x2, …; the posterior depends on the data only through the sufficient
statistic T.

Private Sampling:

1. If T is bounded, add noise to $\sum_i T(x_i)$ to get a private version T'.

2. Sample from the perturbed posterior:

$$p(\theta|x) \propto e^{\eta(\theta)^\top T' - B(\theta)}$$

[ZRD16, FGWC16]
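A toy instance of this recipe (illustrative, with assumed names): a Beta-Bernoulli
model, where the sufficient statistic is the count of ones (sensitivity 1), perturbed
with Laplace noise before sampling from the resulting Beta posterior.

```python
import numpy as np

def private_beta_posterior_sample(x, epsilon, a0=1.0, b0=1.0, rng=np.random.default_rng()):
    """Beta-Bernoulli: perturb the sufficient statistic sum(x) with Lap(1/epsilon) noise."""
    n = len(x)
    t = float(np.sum(x))                                             # sufficient statistic T
    t_priv = np.clip(t + rng.laplace(scale=1.0 / epsilon), 0.0, n)   # keep counts feasible
    return rng.beta(a0 + t_priv, b0 + n - t_priv)                    # perturbed posterior

theta = private_beta_posterior_sample(x=[1, 0, 1, 1, 0, 1], epsilon=1.0)
```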
How well does it work?

[Plot: test-set log-likelihood (×10⁵) vs. total epsilon (10⁻¹ to 10¹) for a non-private
HMM, non-private naive Bayes, the Laplace mechanism HMM, and the OPS HMM (truncation
multiplier = 100).]

Statistically efficient.
Performance is worse than non-private, but better than OPS.
Can do inference in relatively complex systems by building on this method — e.g., time
series clustering in HMMs.
Differentially Private
Bayesian Inference

• General methods for private posterior approximation
• A Special Case: Exponential Families
• Variational Inference
Variational Inference

Key Idea: Start with a stochastic variational inference method, and make each step
private by adding Laplace noise. Use the moments accountant and subsampling to track
the privacy loss.

[JDH16, PFCW16]
Summary

• Two examples of differentially private complex


machine learning algorithms

• Deep learning

• Bayesian inference
Summary, conclusions,
and where to look next
Summary

1. Differential privacy: basic definitions and mechanisms

2. Differential privacy and statistical learning: ERM and SGD

3. Composition and tracking privacy loss

4. Applications of differential privacy in ML: deep learning and Bayesian methods
Things we didn’t cover…

• Synthetic data generation.

• Interactive data analysis.

• Statistical/estimation theory and fundamental limits

• Feature learning and dimensionality reduction

• Systems questions for large-scale deployment

• … and many others…


Where to learn more

Several video lectures and other more technical introductions are available from the
Simons Institute for the Theory of Computing:

https://simons.berkeley.edu/workshops/bigdata2013-4

Monograph by Dwork and Roth:

http://www.nowpublishers.com/article/Details/TCS-042
Final Takeaways

• Differential privacy measures the risk incurred by algorithms operating on private
data.

• Commonly-used tools in machine learning can be made differentially private.

• Accounting for total privacy loss can enable more complex private algorithms.

• Still lots of work to be done in both theory and practice.


Thanks!
This work was supported by

National Science Foundation (IIS-1253942, CCF-1453432, SaTC-1617849)

National Institutes of Health (1R01DA040487-01A1)

Office of Naval Research (N00014-16-1-2616)

DARPA and the US Navy (N66001-15-C-4070)

Google Faculty Research Award
