
Differentially Private
Machine Learning
Theory, Algorithms, and Applications

Kamalika Chaudhuri (UCSD)
Anand D. Sarwate (Rutgers)
Logistics and Goals
• Tutorial Time: 2 hr (15 min break after first hour)
• What this tutorial will do:
• Motivate and define differential privacy
• Provide an overview of common methods and tools to
design differentially private ML algorithms
• What this tutorial will not do:
• Provide detailed results on the state of the art in
differentially private versions of specific problems
Learning Outcomes
At the end of the tutorial, you should be able to:
• Explain the definition of differential privacy,
• Design basic differentially private machine learning
algorithms using standard tools,
• Try different approaches for introducing differential
privacy into optimization methods,
• Understand the basics of privacy risk accounting,
• Understand how these ideas blend together in more
complex systems.
Motivating
Differential Privacy
Sensitive Data

Medical Records

Genetic Data

Search Logs
AOL Violates Privacy
Netflix Violates Privacy [NS08]

[Figure: a sparse users-by-movies ratings matrix (User 1, User 2, User 3, …).]

Knowing 2-8 of Alice's movie ratings and their dates reveals:

Whether Alice is in the dataset or not
Alice's other movie ratings
High-dimensional Data is Unique

Example: UCSD Employee Salary Table

Position Gender Department Ethnicity Salary

Faculty Female CSE SE Asian -

One employee (Kamalika) fits description!


Simply anonymizing data is unsafe!
Disease Association Studies [WLWTZ09]

[Figure: SNP correlations (R² values) for the Cancer group and for the Healthy group.]

Given the published correlations (R² values) and Alice's DNA, an adversary can tell:

Whether Alice is in the Cancer set or the Healthy set
Simply anonymizing data is unsafe!
Statistics on small data sets is unsafe!

[Figure: three-way tension between Privacy, Data Size, and Accuracy.]
Privacy Definition
The Setting

[Figure: a private (sensitive, non-public) data set D passes through a sanitizer at the
privacy barrier; on the public side, the sanitizer can release summary statistics, a
privacy-preserving synthetic dataset, or an ML model.]
Property of Sanitizer

[Same figure: private data set D, sanitizer, privacy barrier, public outputs.]
Aggregate information computable


Individual information protected
(robust to side-information)
Differential Privacy

[Figure: (Data + a person) → Algorithm → Outcome, compared with
(Data without that person) → Algorithm → Outcome.]

Participation of a person does not change the outcome.

Since a person has agency, they can decide whether or not to participate in a dataset.
Adversary

Prior Knowledge:
  A's genetic profile
  A smokes

Case 1: The study's outcome lets the adversary conclude that A has cancer.
[ Study violates A's privacy ]

Case 2: The study shows that smoking causes cancer; combined with the prior knowledge
that A smokes, the adversary concludes that A probably has cancer.
[ Study does not violate privacy ]
Differential Privacy [DMNS06]

Participation of a person does not change the outcome: since a person has agency, they
can decide whether or not to participate in a dataset.
How to ensure this?

…through randomness:

A(Data + person) and A(Data without that person) are random variables with close
distributions.

Randomness: added by the randomized algorithm A

Closeness: likelihood ratio at every point bounded
Differential Privacy [DMNS06]

[Figure: the densities p[A(D) = t] and p[A(D') = t] plotted over outcomes t; they are
close at every point.]

For all D, D' that differ in one person's value, if A is an ε-differentially private
randomized algorithm, then:

$$\sup_t\ \log \frac{p(A(D) = t)}{p(A(D') = t)} \le \epsilon$$

i.e., the max-divergence of p(A(D)) and p(A(D')) is at most ε.
Approx. Differential Privacy [DKM+06]

[Figure: the densities p[A(D) = t] and p[A(D') = t] plotted over outcomes t.]

For all D, D' that differ in one person's value, if A is an (ε, δ)-differentially
private randomized algorithm, then:

$$\max_{S:\ \Pr(A(D) \in S) > \delta}\ \log \frac{\Pr(A(D) \in S) - \delta}{\Pr(A(D') \in S)} \le \epsilon$$
Properties of
Differential Privacy
Property 1: Post-processing Invariance

privacy barrier
legitimate user 1 (✏1 , 1)
eriv ative
d
data
differentially non-private data derivative (✏1 ,
Private private post-
legitimate user 2 1)
data set algorithm processing

D data der
ivative
adversary (✏1 ,
(✏1 , 1) 1)

Risk doesn’t increase if you don’t touch the data again


Property 2: Graceful Composition

[Figure: several stages run on the private data set D behind the privacy barrier:
Preprocessing (ε₁, δ₁), Training (ε₂, δ₂), Cross-validation (ε₃, δ₃), Testing (ε₄, δ₄);
together they compose to $\left(\sum_i \epsilon_i,\ \sum_i \delta_i\right)$.]

Total privacy loss is the sum of the privacy losses.

(Better composition possible — coming up later)
How to achieve
Differential Privacy?
Tools for Differentially Private
Algorithm Design

• Global Sensitivity Method [DMNS06]


• Exponential Mechanism [MT07]

Many others we will not cover [DL09, NRS07, …]


Global Sensitivity Method [DMNS06]

Problem:
Given function f, sensitive dataset D
Find a differentially private approximation to f(D)

Example: f(D) = mean of data points in D


The Global Sensitivity Method [DMNS06]

Given: A function f, sensitive dataset D

Define: dist(D, D') = # of records in which D and D' differ

Global Sensitivity of f:

$$S(f) = \max_{dist(D, D') = 1} |f(D) - f(D')|$$
Laplace Mechanism

Global Sensitivity of f is  $S(f) = \max_{dist(D, D') = 1} |f(D) - f(D')|$

Output  $A(D) = f(D) + Z$, where  $Z \sim \mathrm{Lap}\!\left(S(f)/\epsilon\right)$:  ε-differentially private

Laplace distribution:  $\mathrm{Lap}(\lambda)$ has density  $p(z) \propto e^{-|z|/\lambda}$
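For concreteness, here is a minimal NumPy sketch of the Laplace mechanism (an
illustration, not code from the tutorial); `f_value` is the already-computed f(D) and
`sensitivity` is S(f), both supplied by the caller.

```python
import numpy as np

def laplace_mechanism(f_value, sensitivity, epsilon, rng=np.random.default_rng()):
    """Release f(D) + Lap(S(f)/epsilon) noise: epsilon-differentially private."""
    scale = sensitivity / epsilon
    return f_value + rng.laplace(loc=0.0, scale=scale, size=np.shape(f_value))
```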
Gaussian Mechanism

Global Sensitivity of f is  $S(f) = \max_{dist(D, D') = 1} |f(D) - f(D')|$

Output  $A(D) = f(D) + Z$, where  $Z \sim \mathcal{N}\!\left(0,\ \frac{2 \ln(1.25/\delta)\, S(f)^2}{\epsilon^2}\right)$:  (ε, δ)-differentially private
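A matching sketch for the Gaussian mechanism (illustrative), assuming the standard
calibration σ = √(2 ln(1.25/δ))·S(f)/ε (valid for ε < 1) and an ℓ2 sensitivity supplied
by the caller.

```python
import numpy as np

def gaussian_mechanism(f_value, l2_sensitivity, epsilon, delta, rng=np.random.default_rng()):
    """Release f(D) + N(0, sigma^2) noise: (epsilon, delta)-differentially private."""
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
    return f_value + rng.normal(loc=0.0, scale=sigma, size=np.shape(f_value))
```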
Example 1: Mean

f(D) = Mean(D), where each record is a scalar in [0,1]

Global Sensitivity of f = 1/n

Laplace Mechanism:

Output  $\mathrm{Mean}(D) + Z$, where  $Z \sim \frac{1}{n\epsilon}\,\mathrm{Lap}(0, 1)$
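A self-contained sketch of this private mean (illustrative only; the clipping step just
enforces the assumed [0, 1] data bounds):

```python
import numpy as np

def private_mean(x, epsilon, rng=np.random.default_rng()):
    """epsilon-DP mean of scalar records in [0, 1]; the mean has sensitivity 1/n."""
    x = np.clip(np.asarray(x, dtype=float), 0.0, 1.0)
    n = len(x)
    return x.mean() + rng.laplace(scale=1.0 / (n * epsilon))

print(private_mean([0.2, 0.7, 0.9, 0.4], epsilon=0.5))
```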
How to get Differential Privacy?

• Global Sensitivity Method [DMNS06]


• Two variants: Laplace and Gaussian
• Exponential Mechanism [MT07]
Many others we will not cover [DL09, NRS07, …]
Exponential Mechanism [MT07]

Problem:
Given a function f(w, D) and sensitive dataset D,
find a differentially private approximation to

$$w^* = \arg\max_w f(w, D)$$

Example: f(w, D) = accuracy of classifier w on dataset D
The Exponential Mechanism [MT07]

Suppose for any w,

$$|f(w, D) - f(w, D')| \le S$$

when D and D' differ in 1 record. Sample w from:

$$p(w) \propto e^{\epsilon f(w, D)/2S}$$

for ε-differential privacy.

[Figure: the score function f(w, D) with its argmax, next to the induced sampling
distribution Pr(w).]
Example: Parameter Tuning

Given validation data D and k classifiers w1, .., wk, (privately) find the classifier
with the highest accuracy on D.

Here, f(w, D) = classification accuracy of w on D.

For any w, and any D and D' that differ by one record,

$$|f(w, D) - f(w, D')| \le \frac{1}{|D|}$$

So, the exponential mechanism outputs wi with probability:

$$\Pr(w_i) \propto e^{\epsilon |D| f(w_i, D)/2}$$

[CMS11, CV13]
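A minimal sketch of this selection step (illustrative), assuming the k validation
accuracies have already been computed; the exponent ε|D|·accuracy/2 follows from the
1/|D| sensitivity above.

```python
import numpy as np

def exp_mech_select(accuracies, n_validation, epsilon, rng=np.random.default_rng()):
    """Return index i with Pr(i) proportional to exp(epsilon * n * accuracy_i / 2)."""
    scores = epsilon * n_validation * np.asarray(accuracies) / 2.0
    probs = np.exp(scores - scores.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return rng.choice(len(accuracies), p=probs)

best = exp_mech_select(accuracies=[0.81, 0.86, 0.79], n_validation=1000, epsilon=0.1)
```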
Summary

— Motivation
— What is differential privacy?
— Basic differential privacy algorithm design tools:
— The Global Sensitivity Method
— Laplace Mechanism
— Gaussian Mechanism

— Exponential Mechanism
Differential privacy in
estimation and prediction
Estimation and prediction problems
[Figure: behind the privacy barrier, an (ε, δ)-DP private estimator computes a DP
estimate ŵ of argmin_w f(w, D) from the private data set D of sample size n; quality is
the excess risk E[f(ŵ, z)] - E[f(w*, z)] under the risk functional f(w, ·).]

Statistical estimation: estimate a parameter or predictor using private data that has
good expected performance on future data.
Goal: Good privacy-accuracy-sample size tradeoff
Privacy and accuracy make different
assumptions about the data
[Same figure: private data set D, (ε, δ)-DP private estimator ŵ, excess risk.]

Privacy – differential privacy makes no assumptions on the data distribution: privacy
holds unconditionally.

Accuracy – accuracy is measured w.r.t. a “true population distribution”: expected excess
statistical risk.
Statistical Learning as Risk Minimization

$$w^* = \arg\min_w\ \frac{1}{n} \sum_{i=1}^n \ell(w, (x_i, y_i)) + R(w)$$

• Empirical Risk Minimization (ERM) is a common paradigm for prediction problems.

• Produces a predictor w for a label/response y given a vector of features/covariates x.

• Typically use a convex loss function and regularizer to “prevent overfitting.”
Why is ERM not private?

[Figure: two training sets D and D' that differ in a single point produce visibly
different minimizers w* and w'*, so it is easy for an adversary to tell the difference
between D and D'.]

[CMS11, RBHT12]
Kernel learning: even worse

• Kernel-based methods produce a classifier that is a function of the data points:

$$w(x) = \sum_{i=1}^n \alpha_i k(x, x_i)$$

• Even an adversary with black-box access to w could potentially learn those points.

[CMS11]
Privacy is compatible with learning

• Good learning algorithms generalize to the population distribution, not individuals.

• Stable learning algorithms generalize [BE02].

• Differential privacy can be interpreted as a form of stability that also implies
generalization [DFH+15, BNS+16].

• Two parts of the same story:
  Privacy implies generalization asymptotically.
  Tradeoffs between privacy-accuracy-sample size for finite n.
Revisiting ERM

$$w^* = \arg\min_w\ \frac{1}{n} \sum_{i=1}^n \ell(w, (x_i, y_i)) + R(w)$$

• Learning using (convex) optimization uses three steps:

  1. read in the data             → input perturbation
  2. form the objective function  → objective perturbation
  3. perform the minimization     → output perturbation

• We can try to introduce privacy in each step!
Privacy in ERM: options
input
perturbation
non-private
input preprocessing
perturbation

input
private
private sanitized
perturbation
database
database non-private
sanitized dataset non-non-
input
DPDP
optimization
database
optimization
preprocessing

private
private
objective
perturbation algorithm
learning
D input
perturbation
(",(",)w
)

noise addition
input
perturbation
or
randomization
ŵoutput
input
privacy (", ) privacy
privacy
(", ) barrier barrier
barrier
Local Privacy

[Figure: each data contributor applies input perturbation to their own record before it
reaches the sanitized database; a non-private algorithm runs on the collected data, and
the (ε, δ) privacy barrier sits at the contributors.]
• Local privacy: data contributors sanitize data before collection.

• Classical technique: randomized response [W65].

• Interactive variant can be minimax optimal [DJW13].
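A sketch of the randomized response mentioned above, in its standard ε-DP form (an
illustration, not the tutorial's code): each contributor reports their true bit with
probability e^ε/(1 + e^ε) and the flipped bit otherwise, and the analyst debiases the
aggregate.

```python
import numpy as np

def randomized_response(bit, epsilon, rng=np.random.default_rng()):
    """Report the true bit w.p. e^eps/(1+e^eps), else the flipped bit (locally eps-DP)."""
    p_truth = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit

def debiased_mean(reports, epsilon):
    """Unbiased estimate of the true proportion from the noisy reports."""
    p = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return (np.mean(reports) - (1.0 - p)) / (2.0 * p - 1.0)
```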


Input Perturbation

[Figure: a DP preprocessing step with an (ε, δ) guarantee turns the private database D
into a sanitized dataset; non-private learning then produces ŵ on the public side of the
privacy barrier.]

• Input perturbation: add noise to the input data.

• Advantages: easy to implement, results in a reusable sanitized data set.

[DJW13, FTS17]
Output Perturbation

[Figure: non-private preprocessing and optimization run on the private database D to get
w*; noise addition or randomization is applied to w*, and only the noisy ŵ crosses the
(ε, δ) privacy barrier.]

• Compute the minimizer and add noise.

• Does not require re-engineering baseline algorithms.

• Noise depends on the sensitivity of the argmin.

[CMS11, RBHT12]
Objective Perturbation

[Figure: DP optimization with an (ε, δ) guarantee runs directly on the private database
D; the perturbed-objective solution ŵ crosses the privacy barrier.]
Objective Perturbation

$$J(w) = \frac{1}{n} \sum_{i=1}^n \ell(w, (x_i, y_i)) + R(w)$$

A. Add a random term to the objective:

$$\hat{w}_{priv} = \arg\min_w\ J(w) + w^\top b$$

B. Do a randomized approximation of the objective:

$$\hat{w}_{priv} = \arg\min_w\ \hat{J}(w)$$

Randomness depends on the sensitivity properties of J(w).

[CMS11, ZZX+12]
Sensitivity of the argmin

$$w^* = \arg\min_w\ J(w) \qquad\qquad \hat{w}_{priv} = \arg\min_w\ J(w) + w^\top b$$

• Non-private optimization solves  $\nabla J(w) = 0 \Rightarrow w^*$

• Generate a vector analogue of Laplace noise:  $b \sim p(z) \propto e^{-(\epsilon/2)\|z\|}$

• Private optimization solves  $\nabla J(w) = -b \Rightarrow w_{priv}$

• Have to bound the sensitivity of the gradient.
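A minimal sketch (illustrative, not the tutorial's code) of drawing b from the density
p(z) ∝ exp(-(ε/2)‖z‖₂) in d dimensions: the norm follows a Gamma(d, 2/ε) distribution
and the direction is uniform on the sphere.

```python
import numpy as np

def sample_objective_noise(d, epsilon, rng=np.random.default_rng()):
    """Draw b with density proportional to exp(-(epsilon/2) * ||b||_2) in R^d."""
    norm = rng.gamma(shape=d, scale=2.0 / epsilon)   # radial part: Gamma(d, 2/eps)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)           # uniform direction on the sphere
    return norm * direction
```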


Theoretical bounds on excess risk

The same important parameters appear in all of the bounds:

• privacy parameters (ε, δ)
• data dimension d
• data bounds ‖xᵢ‖ ≤ B
• analytical properties of the loss (Lipschitz, smoothness)
• regularization parameter

Output and objective perturbation achieve excess risk
$\tilde{O}\!\left(\sqrt{d \log(1/\delta)}\,/\,(n\epsilon)\right)$ [CMS11, BST14]; input
perturbation achieves a comparable $\tilde{O}(\cdot)$ rate for quadratic loss [FTS17].
Typical empirical results

[Plots from [CMS11] and [JT14] omitted.]

In general:

• Objective perturbation empirically outperforms output perturbation.

• Gaussian mechanisms with (ε, δ) guarantees outperform Laplace-like mechanisms with
ε-guarantees.

• Loss vs. non-private methods is very dataset-dependent.
Gaps between theory and practice

• Theoretical analysis is for fixed privacy parameters – how should we choose them in
practice?

• Given a data set, can I tell what the privacy-utility-sample-size tradeoff is?

• What about more general optimization problems/algorithms?

• What about scaling (computationally) to large data sets?
Summary

• Training does not on its own guarantee privacy.

• There are many ways to incorporate DP into prediction and learning using ERM, with
different privacy-accuracy-sample size tradeoffs.

• Good DP algorithms should generalize since they learn about populations, not
individuals.

• Theory and experiment show that (ε, δ)-DP algorithms have better accuracy than ε-DP
algorithms at the cost of a weaker privacy guarantee.
Break
Differential privacy and
optimization algorithms
Scaling up private optimization

• Large data sets are challenging for optimization: batch methods are not feasible.

• Using more data can help our tradeoffs look better: better privacy and accuracy.

• Online learning involves multiple releases: potential for more privacy loss.

Goal: guarantee privacy using the optimization algorithm.


Stochastic Gradient Descent

• Stochastic gradient descent (SGD) is a moderately popular method for optimization.

• Stochastic gradients are random: already noisy, so already private?

• Optimization is iterative: intermediate results leak information.
Non-private SGD

$$J(w) = \frac{1}{n} \sum_{i=1}^n \ell(w, (x_i, y_i)) + R(w)$$

$w_0 = 0$
For $t = 1, 2, \ldots, T$:
  • select a random data point:  $i_t \sim \mathrm{Unif}\{1, 2, \ldots, n\}$
  • take a gradient step:
      $g_t = \nabla \ell(w_{t-1}, (x_{i_t}, y_{i_t})) + \nabla R(w_{t-1})$
      $w_t = \Pi_W(w_{t-1} - \eta_t g_t)$
$\hat{w} = w_T$
Private SGD with noise

$$J(w) = \frac{1}{n} \sum_{i=1}^n \ell(w, (x_i, y_i)) + R(w)$$

$w_0 = 0$
For $t = 1, 2, \ldots, T$:
  • select a random data point:  $i_t \sim \mathrm{Unif}\{1, 2, \ldots, n\}$
  • add noise to the gradient:
      $z_t \sim p_{(\epsilon, \delta)}(z)$
      $\hat{g}_t = z_t + \nabla \ell(w_{t-1}, (x_{i_t}, y_{i_t})) + \nabla R(w_{t-1})$
      $w_t = \Pi_W(w_{t-1} - \eta_t \hat{g}_t)$
$\hat{w} = w_T$

[SCS15]
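A minimal NumPy sketch of this noisy-gradient loop (illustrative only): it assumes
gradients are clipped to norm L so the per-step sensitivity is bounded, and that the
caller supplies `grad_loss`, `grad_reg`, and a noise scale `sigma` calibrated as on the
next slide; the projection onto W is omitted.

```python
import numpy as np

def private_sgd(data, grad_loss, grad_reg, d, T, eta, L, sigma, rng=np.random.default_rng()):
    """Noisy SGD sketch: clip each sample gradient to norm L, add Gaussian noise, step."""
    w = np.zeros(d)
    for t in range(T):
        x_i = data[rng.integers(len(data))]              # select a random data point
        g = grad_loss(w, x_i) + grad_reg(w)
        g *= min(1.0, L / (np.linalg.norm(g) + 1e-12))   # clip: bounds the sensitivity
        g_hat = g + rng.normal(scale=sigma, size=d)      # add calibrated Gaussian noise
        w = w - eta * g_hat
    return w
```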
Choosing a noise distribution

“Laplace” mechanism:  $p(z) \propto e^{-(\epsilon/2)\|z\|}$  gives ε-DP.

Gaussian mechanism:  $p(z) \sim \mathcal{N}\!\left(0,\ \tilde{O}\!\left(\frac{\log(1/\delta)}{\epsilon^2}\right)\right)$ i.i.d. per coordinate, gives (ε, δ)-DP.

• Have to choose the noise according to the sensitivity of the gradient:

$$\max_{D, D'}\ \max_w\ \|\nabla J(w; D) - \nabla J(w; D')\|$$

• Sensitivity depends on the data distribution, the Lipschitz parameter of the loss, etc.
Private SGD with randomized selection

$$J(w) = \frac{1}{n} \sum_{i=1}^n \ell(w, (x_i, y_i)) + R(w)$$

$w_0 = 0$
For $t = 1, 2, \ldots, T$:
  • select a random data point:  $i_t \sim \mathrm{Unif}\{1, 2, \ldots, n\}$
      $g_t = \nabla \ell(w_{t-1}, (x_{i_t}, y_{i_t})) + \nabla R(w_{t-1})$
  • randomly select an unbiased gradient estimate:
      $\hat{g}_t \sim p_{(\epsilon, \delta), g}(z)$
      $w_t = \Pi_W(w_{t-1} - \eta_t \hat{g}_t)$
$\hat{w} = w_T$

[DJW13]
Randomized directions [DJW13]

$$z_t = \pm\frac{L}{\|g\|}\, g, \quad \text{with probability } \frac{1}{2} + \frac{\|g\|}{2L} \text{ for the + sign}$$

Select the hemisphere in the direction of $z_t$ with probability $\frac{e^\epsilon}{1 + e^\epsilon}$,
or the opposite hemisphere with probability $\frac{1}{1 + e^\epsilon}$; pick uniformly from the
chosen hemisphere (scaled by a constant B) and take a step.

• Need to have control of the gradient norms: $\|g\| \le L$

• Keep some probability of going in the wrong direction.

Why does DP-SGD work?

Noisy Gradient:  $\hat{g} = g + z_t$.
Choose the noise distribution using the sensitivity of the gradient.

Random Gradient:  $z_t = \pm\frac{L}{\|g\|} g$ with probability $\frac{1}{2} + \frac{\|g\|}{2L}$ for the + sign;
$\hat{g} = B \cdot$ (a point from the hemisphere toward $z_t$ w.p. $\frac{e^\epsilon}{1 + e^\epsilon}$, from the opposite hemisphere w.p. $\frac{1}{1 + e^\epsilon}$).
Randomly select a direction biased towards the true gradient.

Both methods:

• Guarantee DP at each iteration.

• Ensure an unbiased estimate of g to guarantee convergence.
Making DP-SGD more practical

“SGD is robust to noise”

• True up to a point — for small epsilon (more privacy), the gradients can become too noisy.

• Solution 1: more iterations ([BST14]: need $O(n^2)$)

• Solution 2: use standard tricks: mini-batching, etc. [SCS13]

• Solution 3: use better analysis to show the privacy loss is not so bad [BST14], [ACG+16]
Randomly sampling data can amplify
privacy

[Figure: behind the privacy barrier, a uniformly random subsample of γn records from the
private data set D is fed to an ε-differentially private algorithm; the overall release
is 2γε-DP.]

• Suppose we have an algorithm A which is ε-differentially private for ε ≤ 1.

• Sample γn entries of D uniformly at random and run A on those.

• The randomized method guarantees 2γε-differential privacy.

[BBKN14, BST14]
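For intuition, a tiny worked example (illustrative only) combining this amplification
with basic composition over T minibatch iterations:

```python
# Per-iteration mechanism: eps_step-DP; sampling rate gamma = batch_size / n.
eps_step, n, batch_size, T = 0.5, 100_000, 100, 1_000
gamma = batch_size / n                  # 0.001
eps_amplified = 2 * gamma * eps_step    # 0.001 per iteration after amplification
eps_total = T * eps_amplified           # 1.0 under basic (additive) composition
print(eps_amplified, eps_total)
```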
Summary

• Stochastic gradient descent can be made differentially private in several ways by
randomizing the gradient.

• Keeping gradient estimates unbiased will help ensure convergence.

• Standard approaches for variance reduction/stability (such as minibatching) can help
with performance.

• Random subsampling of the data can amplify privacy guarantees.
Accounting for total
privacy risk
Measuring total privacy loss

[Figure: as in the post-processing slide, an (ε₁, δ₁) differentially private algorithm
releases data derivatives to legitimate users and adversaries; all downstream
non-private post-processing still carries only the (ε₁, δ₁) guarantee.]

Post-processing invariance: risk doesn't increase if you don't touch the data again.

• more complex algorithms have multiple stages: all stages have to guarantee DP

• need a way to do privacy accounting: what is lost over time / multiple queries?
A simple example

[Figure: the private data set D is used by four stages behind the privacy barrier:
Preprocessing (ε₁, δ₁), Training (ε₂, δ₂), Cross-validation (ε₃, δ₃), Testing (ε₄, δ₄).]
Composition property of
differential privacy

Basic composition: privacy loss is additive:

• Apply R algorithms with $(\epsilon_i, \delta_i),\ i = 1, 2, \ldots, R$

• Total privacy loss:  $\left( \sum_{i=1}^R \epsilon_i,\ \sum_{i=1}^R \delta_i \right)$

• Worst-case analysis: each result exposes the worst privacy risk.
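A toy accountant for this additive rule (an illustrative sketch, not a production
accountant):

```python
class BasicAccountant:
    """Track cumulative (epsilon, delta) under basic composition."""
    def __init__(self):
        self.eps, self.delta = 0.0, 0.0

    def spend(self, eps, delta=0.0):
        self.eps += eps          # privacy losses simply add up
        self.delta += delta
        return self.eps, self.delta

acct = BasicAccountant()
for stage_eps in [0.25, 0.5, 0.125, 0.125]:   # preprocessing, training, CV, testing
    acct.spend(stage_eps, delta=1e-6)
print(acct.eps, acct.delta)                   # 1.0, about 4e-06
```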
What composition says about
multi-stage methods

[Figure: the same four-stage pipeline: Preprocessing (ε₁, δ₁), Training (ε₂, δ₂),
Cross-validation (ε₃, δ₃), Testing (ε₄, δ₄).]

Total privacy loss is the sum of the privacy losses…


An open question: privacy
allocation across stages

Composition means we have a privacy budget.

How should we allocate privacy risk across the different stages of a pipeline?

• Noisy features + accurate training?

• Clean features + sloppy training?

It's application dependent! Still an open question…
A closer look at ε and δ

Gaussian noise of a given variance produces a spectrum of (ε, δ) guarantees:

[Figure: the densities p(τ|D) and p(τ|D') under Gaussian noise, with the likelihood
ratio bounded at two points:  $\frac{p(\tau_1|D)}{p(\tau_1|D')} \le \exp(\epsilon_1)$  and
$\frac{p(\tau_2|D)}{p(\tau_2|D')} \le \exp(\epsilon_2)$.]
Privacy loss as a random
variable

The spectrum of (ε, δ) guarantees means we can trade off ε and δ when analyzing a
particular mechanism.

The actual privacy loss is a random variable that depends on D:

$$Z_{D,D'} = \log \frac{p(A(D) = t)}{p(A(D') = t)} \quad \text{w.p. } p(A(D) = t)$$
Random privacy loss

$$Z_{D,D'} = \log \frac{p(A(D) = t)}{p(A(D') = t)} \quad \text{w.p. } p(A(D) = t)$$

• Even after maximizing over (D, D'), the loss is still a random variable.

• Sequentially computing functions on private data is like sequentially sampling
independent privacy losses.

• Concentration of measure shows that the total loss is much closer to its expectation.
Strong composition bounds

$$\underbrace{(\epsilon, \delta), (\epsilon, \delta), \ldots, (\epsilon, \delta)}_{k \text{ times}}$$

• Given only the (ε, δ) guarantees for k algorithms operating on the data.

• Composition again gives a family of (ε, δ) tradeoffs: we can quantify the privacy loss
by choosing any valid pair

$$\left( (k - 2i)\epsilon,\ 1 - (1 - \delta)^k (1 - \delta_i) \right), \qquad
\delta_i = \frac{\sum_{\ell=0}^{i-1} \binom{k}{\ell} \left( e^{(k-\ell)\epsilon} - e^{(k-2i+\ell)\epsilon} \right)}{(1 + e^\epsilon)^k}$$

[DRV10, KOV15]
Moments accountant

[Figure: the same pipeline of stages A1, A2, ... (Preprocessing, Training,
Cross-validation, Testing) composed into a single mechanism A with overall (ε, δ).]

Basic Idea: Directly calculate the parameters (ε, δ) from composing a sequence of
mechanisms.

More efficient than the composition theorems.

[ACG+16]
How to Compose Directly?

Given datasets D and D' with one different record and a mechanism A, define the privacy
loss random variable as:

$$Z_{D,D'} = \log \frac{p(A(D) = t)}{p(A(D') = t)}, \quad \text{w.p. } p(A(D) = t)$$

- Properties of $Z_{D,D'}$ are related to the privacy loss of A.

- If the max absolute value of $Z_{D,D'}$ over all D, D' is ε, then A is
(ε, 0)-differentially private.
How to Compose Directly?

$$Z_{D,D'} = \log \frac{p(A(D) = t)}{p(A(D') = t)}, \quad \text{w.p. } p(A(D) = t)$$

Challenge: to reason about the worst case over all D, D'.

Key idea in [ACG+16]: use moment generating functions.
Accounting for Moments…

[Figure: the pipeline of mechanisms A1, A2, ... composed into A with overall (ε, δ).]

Three Steps:
1. Calculate moment generating functions for A1, A2, ..
2. Compose
3. Calculate final privacy parameters

[ACG+16]
1. Stepwise Moments

[Figure: the pipeline of mechanisms A1, A2, ... applied to the private data set D.]

Define: the stepwise moment of At at any s:

$$\alpha_{A_t}(s) = \sup_{D, D'} \log \mathbb{E}\!\left[ e^{s Z_{D,D'}} \right]
\qquad \text{(D and D' differ by one record)}$$

[ACG+16]
2. Compose

[Figure: the pipeline of mechanisms A1, A2, ... applied to the private data set D.]

Theorem: Suppose A = (A1, …, AT). For any s:

$$\alpha_A(s) \le \sum_{t=1}^T \alpha_{A_t}(s)$$

[ACG+16]
3. Final Calculation

[Figure: the pipeline of mechanisms composed into A with overall (ε, δ).]

Theorem: For any ε, mechanism A is (ε, δ)-DP for

$$\delta = \min_s \exp\!\big(\alpha_A(s) - s\epsilon\big)$$

Use the theorem to find the best ε for a given δ, either from a closed form or by
searching over s1, s2, .., sk.

[ACG+16]
Example: composing the Gaussian mechanism

[Figure: the pipeline of mechanisms composed into A with overall (ε, δ).]

Suppose each At answers a query with global sensitivity 1 by adding N(0, 1) noise.
1. Stepwise Moments

[Figure: the pipeline of mechanisms A1, A2, ... applied to the private data set D.]

Suppose each At answers a query with global sensitivity 1 by adding N(0, 1) noise.

Simple algebra gives, for any s:

$$\alpha_{A_t}(s) = \frac{s(s + 1)}{2}$$
2. Compose

[Figure: the pipeline of mechanisms A1, A2, ... applied to the private data set D.]

Suppose each At answers a query with global sensitivity 1 by adding N(0, 1) noise. Then:

$$\alpha_A(s) \le \sum_{t=1}^T \alpha_{A_t}(s) = \frac{T s(s + 1)}{2}$$
3. Final Calculation

[Figure: the pipeline of mechanisms composed into A with overall (ε, δ).]

Find the lowest δ for a given ε (or vice versa) by solving:

$$\delta = \min_s \exp\!\big(T s(s + 1)/2 - s\epsilon\big)$$

In this case, the solution can be found in closed form.
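A small numeric sketch of this final step (illustrative): instead of the closed form, it
simply minimizes the bound over a grid of s values.

```python
import numpy as np

def delta_for_epsilon(T, eps, s_grid=np.linspace(0.0, 100.0, 100_001)):
    """delta = min_s exp(T*s*(s+1)/2 - s*eps) for T compositions of the N(0,1) example."""
    log_delta = T * s_grid * (s_grid + 1) / 2.0 - s_grid * eps
    return float(np.exp(log_delta.min()))     # exponentiate only the minimum: no overflow

print(delta_for_epsilon(T=10, eps=20.0))      # about 1.3e-05
```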
How does it compare?

[Plot: epsilon vs. number of rounds of composition; the strong composition bound of
[DRV10] grows better than linearly, and the Moments Accountant grows more slowly still.]

[ACG+16]
How does it compare
on real data?

[Plot: EM for a mixture of Gaussians (MOG) with the Gaussian mechanism, δ = 10⁻⁴.]

[PFCW17]
Summary

• Practical machine learning looks at the data many times.

• Post-processing invariance means we just have to track the cumulative privacy loss.

• Good composition methods use the fact that the actual privacy loss may behave much
better than the worst-case bound.

• The Moments Accountant method tracks the actual privacy loss more accurately: better
analysis for better privacy guarantees.
Applications to modern
machine learning
When is differential privacy practical?

Differential privacy is best suited for understanding population-level statistics and
structure:

• Inferences about the population should not depend strongly on individuals.

• Large sample sizes usually mean lower sensitivity and less noise.

To build and analyze systems we have to leverage the post-processing invariance and
composition properties.
Differential privacy in practice

Google: RAPPOR for tracking statistics in Chrome.

Apple: various iPhone usage statistics.

Census: the 2020 US Census will use differential privacy.

Mostly focused on count and average statistics.


Challenges for machine learning
applications

Differentially private ML is complicated because real ML algorithms are complicated:

• Multi-stage pipelines, parameter tuning, etc.

• Need to “play around” with the data before committing to a particular
pipeline/algorithm.

• “Modern” ML approaches (= deep learning) have many parameters and less theoretical
guidance.
Some selected examples

[Figure: two preview diagrams: the DP-SGD training loop for deep networks (select a
random lot, compute gradients, clip and add noise, update parameters, track loss with
the moments accountant) and original vs. processed posteriors for Bayesian inference,
both behind an (ε, δ) privacy barrier.]

For today, we will describe some recent examples:

1. Differentially private deep learning [ACG+16]
2. Differential privacy and Bayesian inference
Differential privacy and deep learning

Main idea: train a deep network using differentially private SGD and use the moments
accountant to track privacy loss.

Additional components: gradient clipping, minibatching, data augmentation, etc.

[ACG+16]
Overview of the algorithm

[Figure: in each round, select a random lot L1, L2, ... from the private data set D,
compute per-example gradients, clip them and add noise, update the network parameters
θ, and feed the noise scale into the moments accountant, which reports the overall
(ε, δ) across the privacy barrier.]

[ACG+16]
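A minimal NumPy sketch of the noisy lot-gradient step in this figure (an illustration of
the recipe, not the reference implementation); it assumes per-example gradients are
already available as rows of `grads`, and the clipping norm C and noise multiplier σ are
user-chosen hyperparameters.

```python
import numpy as np

def noisy_lot_gradient(grads, clip_norm, noise_multiplier, rng=np.random.default_rng()):
    """Clip each per-example gradient to L2 norm C, sum, add N(0, (sigma*C)^2) noise, average."""
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    clipped = grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
    noisy_sum = clipped.sum(axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=grads.shape[1])
    return noisy_sum / len(grads)    # averaged noisy gradient for the parameter update
```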
Effectiveness of DP deep learning

Empirical results on MNIST and CIFAR:

• Training and test error come close to baseline non-private deep learning methods.

• To get a moderate loss in performance, epsilon and delta are not “negligible”.

[ACG+16]
Moving forward in deep learning

This is a good proof of concept for differential privacy for deep neural networks.
There are lots of interesting ways to expand this:

• Just one NN model was used: what about other architectures? RNNs? GANs?

• Can regularization methods for deep learning (e.g. dropout) help with privacy?

• What are good rules of thumb for lot/batch size, learning rate, # of hidden units, etc.?

[ACG+16]
Differentially Private
Bayesian Inference

Data X = { x1, x2, … } and model class Θ, related through the likelihood p(x|θ).

Prior π(θ) + Data X = Posterior p(θ|X)

Goal: find a differentially private approximation to the posterior.


Differentially Private
Bayesian Inference

• General methods for private posterior approximation
• A Special Case: Exponential Families
• Variational Inference
How to make posterior private?
Option 1: Direct posterior sampling [DMNR14]

Not differentially private except under restrictive conditions: likelihood ratios may be
unbounded!

[GSC17] The answer changes under a new relaxation: Rényi differential privacy [M17].
How to make posterior private?

Option 2: One Posterior Sample (OPS) Method [WFS15]

[Figure: original posteriors vs. processed posteriors]

1. Truncate the posterior so that the likelihood ratio is bounded in the truncated
region.

2. Raise the truncated posterior to a higher temperature.

How to make posterior private?
Option 2: One Posterior Sample (OPS) Method:

Advantage: General

Pitfalls:
— Intractable: only the exact distribution is private
— Low statistical efficiency even for large n
How to make posterior private?
Option 3: Approximate the OPS distribution via
Stochastic Gradient MCMC [WFS15]

[Figure: original posteriors vs. processed posteriors]

Advantage: Noise added during stochastic gradient MCMC contributes to privacy.

Disadvantage: Statistical efficiency is lower than exact OPS.


Differentially Private
Bayesian Inference

• General methods for private posterior approximation
• A Special Case: Exponential Families
• Variational Inference
Exponential Family Posteriors

The (non-private) posterior comes from an exponential family:

$$p(\theta|x) \propto e^{\eta(\theta)^\top \left(\sum_i T(x_i)\right) - B(\theta)}$$

given data x1, x2, …; the posterior depends on the data only through the sufficient
statistic T.

Private Sampling:

1. If T is bounded, add noise to $\sum_i T(x_i)$ to get a private version T'.

2. Sample from the perturbed posterior:

$$p(\theta|x) \propto e^{\eta(\theta)^\top T' - B(\theta)}$$

[ZRD16, FGWC16]
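A toy instance of this recipe (illustrative, with assumed names): a Beta-Bernoulli
model, where the sufficient statistic is the count of ones (sensitivity 1), perturbed
with Laplace noise before sampling from the resulting Beta posterior.

```python
import numpy as np

def private_beta_posterior_sample(x, epsilon, a0=1.0, b0=1.0, rng=np.random.default_rng()):
    """Beta-Bernoulli: perturb the sufficient statistic sum(x) with Lap(1/epsilon) noise."""
    n = len(x)
    t = float(np.sum(x))                                             # sufficient statistic T
    t_priv = np.clip(t + rng.laplace(scale=1.0 / epsilon), 0.0, n)   # keep counts feasible
    return rng.beta(a0 + t_priv, b0 + n - t_priv)                    # perturbed posterior

theta = private_beta_posterior_sample(x=[1, 0, 1, 1, 0, 1], epsilon=1.0)
```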
How well does it work?

[Plot: test-set log-likelihood (×10⁵) vs. total epsilon (10⁻¹ to 10¹) for a non-private
HMM, non-private naive Bayes, the Laplace mechanism HMM, and the OPS HMM (truncation
multiplier = 100).]

Statistically efficient.
Performance is worse than non-private, but better than OPS.
Can do inference in relatively complex systems by building on this method — e.g., time
series clustering in HMMs.
Differentially Private
Bayesian Inference

• General methods for private posterior approximation
• A Special Case: Exponential Families
• Variational Inference
Variational Inference

Key Idea: Start with a stochastic variational inference method, and make each step
private by adding Laplace noise. Use the moments accountant and subsampling to track
the privacy loss.

[JDH16, PFCW16]
Summary

• Two examples of differentially private complex


machine learning algorithms

• Deep learning

• Bayesian inference
Summary, conclusions,
and where to look next
Summary

1. Differential privacy: basic definitions and mechanisms

2. Differential privacy and statistical learning: ERM and SGD

3. Composition and tracking privacy loss

4. Applications of differential privacy in ML: deep learning and Bayesian methods
Things we didn’t cover…

• Synthetic data generation.

• Interactive data analysis.

• Statistical/estimation theory and fundamental limits

• Feature learning and dimensionality reduction

• Systems questions for large-scale deployment

• … and many others…


Where to learn more

Several video lectures and other more technical introductions are available from the
Simons Institute for the Theory of Computing:

https://simons.berkeley.edu/workshops/bigdata2013-4

Monograph by Dwork and Roth:

http://www.nowpublishers.com/article/Details/TCS-042
Final Takeaways

• Differential privacy measures the risk incurred by algorithms operating on private
data.

• Commonly-used tools in machine learning can be made differentially private.

• Accounting for total privacy loss can enable more complex private algorithms.

• Still lots of work to be done in both theory and practice.


Thanks!
This work was supported by

National Science Foundation (IIS-1253942, CCF-1453432, SaTC-1617849)

National Institutes of Health (1R01DA040487-01A1)

Office of Naval Research (N00014-16-1-2616)

DARPA and the US Navy (N66001-15-C-4070)

Google Faculty Research Award
