
ELEN 6820

Speech and audio signal processing


Instructor: Nima Mesgarani (nm2764)
3 credits
Office hours: Wednesday 4-5pm
TA: Yi Luo
Pattern Classification
Classification
Goal: To classify objects (or patterns) into categories (or classes)

Observation s → Feature Extraction → Feature Vector x → Classifier → Class ωi

Types of Problems:
1. Supervised: Classes are known beforehand, and data samples of each class are available
2. Unsupervised: Classes (and/or number of classes) are not known beforehand, and must be inferred from data

What is machine learning?
• Can you learn any one-to-one mapping given enough training data?
• Supervised vs. Unsupervised
• Deterministic vs. Stochastic
• Generative vs. Discriminative
Topics
• Bayes Theorem

• Dimensionality reduction with PCA and LDA

• K-Means unsupervised clustering

• Gaussian density models

• (Support Vector Machines)

• Neural Network Models: linear neuron, logistic neuron, back propagation
Support Vector Machines (SVM)

• Can be made nonlinear using kernels

• Depends only on training points near the decision boundary, aka Support Vectors

• Unique, optimal solution
Neural Networks
• Inspired by biological neural networks

• Not just for pattern recognition anymore; NNs are used for signal processing, representation, separation, and compression

• Used for acoustic modeling (features to phonemes), language modeling (words to text), and end-to-end models
A typical cortical neuron
• Cell body, many inputs (synapses), one output (axon)

• When a neuron receives enough input, it generates a spike

• Spikes have fixed amplitude and shape; only their frequency (spikes per second) varies

• Synapses are the inputs of neurons. They can have a positive or negative effect, and can change efficacy (adapt, learn, etc.)

• The exact learning rule is not very clear
A shallow look at deep neural networks: biologically inspired

A model neuron computes a weighted sum of its inputs (Input 1 × Weight 1 + Input 2 × Weight 2 + ...) and passes it through an activation function.

Activation Functions: ReLU (Rectified Linear Unit)
- Computes f(x) = max(0, x)
- Does not saturate
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice
- Actually more biologically plausible than sigmoid

(Li, Johnson, Yeung, 2018)
Neural network model (Rosenblatt, 1958)

Multilayer perceptron (MLP)

Input layer → Output layer

Adding a hidden layer: universal approximator (Backpropagation, 1986)

Input layer → Hidden layer → Output layer
Deep (multilayer) neural network

Input layer → (W1) Hidden layer 1 → (W2) Hidden layer 2 → (W3) Hidden layer 3 → (W4) Output layer

• Learn synaptic weights to minimize prediction error

• Can be highly regularized and optimized
History of neural networks
1943 McCulloch and Pitts proposed the McCulloch-Pitts neuron model
1949 Hebb published his book The Organization of Behaviour
1958 Rosenblatt introduced the simple single-layer networks called Perceptrons
1969 Minsky and Papert's book Perceptrons demonstrated the limitations of single-layer perceptrons
1980 Grossberg introduced his Adaptive Resonance Theory (ART)
1982 Hopfield published a series of papers on Hopfield networks
1982 Kohonen developed the Self-Organizing Feature Maps
1986 The back-propagation learning algorithm for multi-layer perceptrons was rediscovered, and the whole field took off again
1990s ART-variant networks were developed
1990s Radial Basis Functions were developed
2010s Deep neural networks, Convolutional neural networks, etc.
Model neurons
• Many ways to model a neuron; one example is the logistic neuron (smooth, differentiable):

y = (1 + e^{-x})^{-1}
dy/dx = -(1 + e^{-x})^{-2} (-e^{-x}) = y(1 - y)

(A numerical check of this derivative appears below.)

• Enables a probabilistic interpretation (compare with the Gaussian density (1/√(2π)) e^{-(x-µ)²/2})

Activation functions (figure; Li, Johnson & Yeung)
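As a quick check of the derivative above, the sketch below (a minimal NumPy example, not from the slides) compares y(1 - y) against a finite-difference estimate of dy/dx.

```python
import numpy as np

def sigmoid(x):
    # y = (1 + e^{-x})^{-1}
    return 1.0 / (1.0 + np.exp(-x))

x = 0.7
y = sigmoid(x)
analytic = y * (1 - y)                               # dy/dx = y(1 - y)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(analytic, numeric)                             # the two values agree closely
```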


Gradient Descent

• Learning rate: step size

• Too large, unstable

• Too small, slow convergence


Learning with a linear neuron

y = Σ_i w_i x_i = W^T X

For one training sample n, E^n = (1/2)(t^n - y^n)^2; summed over the training set:

E = (1/2) Σ_n (t^n - y^n)^2

∂E/∂w_i = (1/2) Σ_n (∂y^n/∂w_i)(dE^n/dy^n) = -Σ_n x_i^n (t^n - y^n)

Δw_i = -ε ∂E/∂w_i = Σ_n ε x_i^n (t^n - y^n)

(A minimal sketch of this update rule appears below.)
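A minimal NumPy sketch of the delta rule above, using toy data (the feature values and "true" weights are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))       # 100 samples, 3 features x_i^n
w_true = np.array([0.5, -1.0, 2.0])     # used only to generate targets
t = X @ w_true                          # targets t^n
w = np.zeros(3)
eps = 0.01                              # learning rate

for _ in range(100):
    y = X @ w                           # y^n = sum_i w_i x_i^n
    w += eps * X.T @ (t - y)            # delta w_i = eps * sum_n x_i^n (t^n - y^n)
print(w)                                # converges toward w_true
```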
Training with gradient descent

• Correlation among features

• Online vs. batch training

• Learning rate
Δw_i = -ε ∂E/∂w_i = Σ_n ε x_i^n (t^n - y^n)

Learning with a logistic neuron

z = b + Σ_i x_i w_i
y = 1/(1 + e^{-z})

∂z/∂w_i = x_i        ∂z/∂x_i = w_i

dy/dz = y(1 - y)

∂y/∂w_i = (∂z/∂w_i)(dy/dz) = x_i y(1 - y)

Back propagation:

∂E/∂w_i = Σ_n (∂y^n/∂w_i)(∂E/∂y^n) = -Σ_n x_i^n y^n (1 - y^n)(t^n - y^n)

(A numerical check of this gradient appears below.)
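A minimal numerical check of this gradient for a single logistic neuron and one training sample (squared error; the input values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)              # inputs x_i
w = rng.standard_normal(4)              # weights w_i
b, t = 0.1, 1.0                         # bias and target

def forward(w):
    z = b + x @ w                       # z = b + sum_i x_i w_i
    return 1.0 / (1.0 + np.exp(-z))     # y = 1 / (1 + e^{-z})

y = forward(w)
grad = -x * y * (1 - y) * (t - y)       # dE/dw_i = -x_i y (1 - y)(t - y)

# finite-difference estimate for one weight
i, h = 2, 1e-6
w2 = w.copy(); w2[i] += h
numeric = (0.5 * (t - forward(w2))**2 - 0.5 * (t - forward(w))**2) / h
print(grad[i], numeric)                 # should agree closely
```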
Multilayer network and back propagation

• Multiple layers of nonlinear transformation
• Can show that "any" function can be modeled
• Training using random perturbation (e.g. particle filtering): can randomly perturb the weights, or the activations
• Inefficient! → back propagation

Notation (see figure): neuron i in layer I has output y_i and connects through weight w_ij to neuron j in layer J, which receives total input z_j and produces output y_j.
Back propagation:

E = (1/2) Σ_j (t_j - y_j)^2

Error derivative w.r.t. output:

∂E/∂y_j = -(t_j - y_j)

Error derivative w.r.t. the total input received by neuron j:

∂E/∂z_j = (dy_j/dz_j)(∂E/∂y_j) = y_j (1 - y_j) ∂E/∂y_j

Error derivative w.r.t. hidden outputs:

∂E/∂y_i = Σ_j (∂z_j/∂y_i)(∂E/∂z_j) = Σ_j w_ij ∂E/∂z_j

Error derivative w.r.t. a weight connecting neurons i & j:

∂E/∂w_ij = (∂z_j/∂w_ij)(∂E/∂z_j) = y_i ∂E/∂z_j
Algorithm
• Initialize the weights

• For each training sample, do the following until convergence:

• Perform the forward pass, calculate the output of each neuron

• Calculate the output error derivative

• For each weight, calculate the error derivative with respect to that weight

• Change the weights accordingly

• Calculate the error derivatives with respect to the hidden layer outputs, and do the same for the layers below

(A minimal sketch of these steps appears after this list.)
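A minimal NumPy sketch of this algorithm for a two-layer network of sigmoid units, following the derivatives above; the layer sizes, learning rate, and random toy data are arbitrary choices for illustration, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 20))            # 16 toy samples, 20 inputs
T = np.eye(3)[rng.integers(0, 3, 16)]        # one-hot targets, 3 classes
W1 = 0.1 * rng.standard_normal((20, 10))     # input -> hidden weights
W2 = 0.1 * rng.standard_normal((10, 3))      # hidden -> output weights
eps = 0.1                                    # learning rate

for step in range(500):
    # forward pass: output of each neuron
    h = sigmoid(X @ W1)                      # hidden outputs y_i
    y = sigmoid(h @ W2)                      # network outputs y_j
    # output error derivative: dE/dy_j = -(t_j - y_j)
    dE_dy = -(T - y)
    # dE/dz_j = y_j (1 - y_j) dE/dy_j
    dE_dz2 = y * (1 - y) * dE_dy
    # dE/dw_ij = y_i dE/dz_j
    dW2 = h.T @ dE_dz2
    # error derivative w.r.t. hidden outputs: dE/dy_i = sum_j w_ij dE/dz_j
    dE_dh = dE_dz2 @ W2.T
    dE_dz1 = h * (1 - h) * dE_dh
    dW1 = X.T @ dE_dz1
    # change the weights accordingly
    W1 -= eps * dW1
    W2 -= eps * dW2
```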
Issues: optimization
• How often to update the weights?
• Online: after each training sample
• Batch: once for the full training set
• Mini-batch: once for a small number of training samples
• How much to update?
• Fixed or variable learning rate
• Adaptive learning rate, globally or for each neuron separately
• Adam, SGD, RMSProp, SWA, AdaTune, ...

A minimal sketch of these update schedules appears below.
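The sketch below covers the three update schedules with a fixed learning rate; it assumes a hypothetical helper grad(W, X, T) that returns dE/dW for the given samples. Setting batch_size to 1 gives online updates, and setting it to len(X) gives full-batch updates.

```python
import numpy as np

def train(W, X, T, grad, eps=0.01, batch_size=32, epochs=10):
    # grad(W, X, T) is an assumed helper returning dE/dW for the given samples
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)                # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            W = W - eps * grad(W, X[batch], T[batch])   # fixed learning rate eps
    return W
```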
Issues: generalization
• Training data contains noise
• Unreliable target values
• Sampling error: accidental regularities; try to sample the space as much as possible
• The model can learn the signal, but also the particular noise
• Preventing overfitting: weight decay (sparse weights), weight sharing, early stopping, model averaging, dropout, pre-training
Example: a fully connected layer for acoustic modeling

Input: 21 frequency bands × 11 samples (110 ms) = 231 time × frequency values
Output: 40 phonemes
Wx, with W of size 40 × 231
Convolutional neural network (CNN)

Preserving the spatial organization of features, e.g., time and frequency.

A multiplicative filter is applied to the time × frequency input feature; the filter is convolved with the input.
Convolution layer with N filters

Each of the N filters is convolved with the time × frequency input, producing N output feature maps. A sketch of this operation follows.
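A minimal NumPy sketch of a convolution (strictly, cross-correlation) layer with N filters over a frequency × time input; the sizes below are toy values, not from the slides:

```python
import numpy as np

def conv_layer(x, filters):
    """Valid 2D cross-correlation of one input map with N filters.

    x:       (F, T) frequency x time input
    filters: (N, kF, kT) bank of N filters
    returns: (N, F - kF + 1, T - kT + 1) feature maps
    """
    N, kF, kT = filters.shape
    F, T = x.shape
    out = np.zeros((N, F - kF + 1, T - kT + 1))
    for n in range(N):
        for i in range(F - kF + 1):
            for j in range(T - kT + 1):
                out[n, i, j] = np.sum(x[i:i+kF, j:j+kT] * filters[n])
    return out

# toy example: 21 frequency bands x 11 time frames, 8 filters of size 3x3
x = np.random.randn(21, 11)
filters = np.random.randn(8, 3, 3)
maps = conv_layer(x, filters)
print(maps.shape)   # (8, 19, 9)
```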
A ConvNet is a series of convolution and activation layers:

Input → Conv + f(.) → Conv + f(.) → Conv + f(.) → ...
What is a 1x1 convolution layer?

A 1x1xN filter combines the N input feature maps at each time-frequency point; applying M such filters maps N channels to M channels.
Pooling

Reducing the size of the feature space: each of the N feature maps is downsampled along the time and frequency axes.
Max pooling

Single depth slice, max pool with 2x2 filters and stride 2:

1 1 2 4
5 6 7 8        6 8
3 2 1 0   →    3 4
1 2 3 4

(Li, Johnson & Yeung, 2018)

A small sketch reproducing this example follows.
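The sketch below is a minimal NumPy implementation (not from the slides) that reproduces the 2x2, stride-2 max pooling example above:

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max pooling over a single 2D slice."""
    H, W = x.shape
    out = np.zeros((H // stride, W // stride))
    for i in range(0, H - size + 1, stride):
        for j in range(0, W - size + 1, stride):
            out[i // stride, j // stride] = x[i:i+size, j:j+size].max()
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool(x))   # [[6. 8.]
                     #  [3. 4.]]
```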
Beyond CNNs: the need for multi-scale speech processing networks

Speech carries structure at multiple time scales, e.g., pitch, phonemes, and words.
Recurrent neural networks (RNN)

• Extension of feed-forward neural networks

• Output feeds back as input with a time delay
• Activation of the hidden layer depends on the input and on the past
• Recurrent layer activations encode "states"
• Became very popular in speech recognition (Robinson et al. 1993)
Simple RNN

Simple RNN architecture in two alternative representations: (a) the RNN, and (b) the RNN unrolled in time.

input:  x_t (unrolled: ..., x_{t-1}, x_t, ...)
hidden: h_t, connected to the input by W_hx and to the previous hidden state h_{t-1} by W_hh
output: y_t, connected to the hidden layer by W_yh

RNN hidden and output layer activations:

h_t = σ(W_hx x_t + W_hh h_{t-1} + b_h)
y_t = σ(W_yh h_t + b_y)

(A minimal forward-pass sketch appears below.)
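A minimal forward-pass sketch of the two equations above, using tanh for the hidden activation and a sigmoid for the output; the choice of nonlinearities and the toy dimensions are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D, H, O = 3, 5, 2                        # input, hidden, output sizes
rng = np.random.default_rng(0)
W_hx = rng.standard_normal((H, D)) * 0.1
W_hh = rng.standard_normal((H, H)) * 0.1
W_yh = rng.standard_normal((O, H)) * 0.1
b_h, b_y = np.zeros(H), np.zeros(O)

h = np.zeros(H)                                  # h_0
for x_t in rng.standard_normal((4, D)):          # 4 time steps
    h = np.tanh(W_hx @ x_t + W_hh @ h + b_h)     # h_t
    y = sigmoid(W_yh @ h + b_y)                  # y_t
    print(y)
```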
Training an RNN
• Backpropagate through layers and time

• The problem of exploding or vanishing gradients:

• As we back propagate through many layers, the gradient either shrinks or explodes exponentially

• Not a big problem for DNNs, since the number of layers is limited

• Several ways to get around this, e.g. the LSTM, which is similar to adding skip connections
Long Short-Term Memory
(LSTM, Hochreiter & Schmidhuber 97)

• Make an RNN out of modules that can remember values for a very long time
• Solves the vanishing gradient problem
• Enables an RNN to remember very long dependencies (hundreds of time steps)
• Uses a circuit similar to analog memory cells
LSTM is made of "memory cells"

• A memory cell with write, keep, and read gates

• To preserve information for a long time in the activities of an RNN, we use a circuit that implements an analog memory cell:
– A linear unit that has a self-link with a weight of 1 will maintain its state.
– Information is stored in the cell by activating its write gate.
– Information is retrieved by activating the read gate.
– We can backpropagate through this circuit because logistic units have nice derivatives.

• Information is stored when the write gate is on, kept as long as the keep gate is on, and output when the read gate is on

• All gates are sigmoid functions, i.e. we can use backpropagation to train them
LSTM

Each gate looks at (h_{t-1}, x_t), and the cell produces the new hidden state h_t:

• Input gate: scales input to the cell (write)
• Output gate: scales output from the cell (read)
• Forget gate: scales old cell values (reset)

The gradient can flow through the cell across many time steps. A minimal sketch of one LSTM step follows.
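A minimal sketch of one LSTM step in the standard formulation (input, forget, and output gates, each a sigmoid of (h_{t-1}, x_t)); the stacked weight layout and toy dimensions are assumptions for illustration and not necessarily identical to the variant drawn in the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W: (4H, D + H) stacked gate weights, b: (4H,).
    Gate order assumed here: input, forget, output, candidate."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[0:H])          # input (write) gate
    f = sigmoid(z[H:2*H])        # forget (keep/reset) gate
    o = sigmoid(z[2*H:3*H])      # output (read) gate
    g = np.tanh(z[3*H:4*H])      # candidate cell update
    c_t = f * c_prev + i * g     # cell state: kept old value + written input
    h_t = o * np.tanh(c_t)       # read out through the output gate
    return h_t, c_t

# toy dimensions: 3-dim input, 4-dim hidden state
D, H = 3, 4
rng = np.random.default_rng(0)
W, b = rng.standard_normal((4*H, D + H)) * 0.1, np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.standard_normal((5, D)):      # 5 time steps
    h, c = lstm_step(x_t, h, c, W, b)
print(h)
```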
The need for bi-directional recurrence

Example: language translation

The translation of one word depends on both past and future words.
Bidirectional RNN (e.g. BLSTM)

• The forward pass is the same as for an LSTM

• The backward pass looks at the reversed input sequence
• The output layer is not updated until both hidden layers have processed the entire input
RNN difficulties
• Inefficient: sequential processing means they cannot take advantage of parallel computing as much as CNNs
• Computationally intensive, since all intermediate steps need to be stored to calculate the output
• Transfer learning is very difficult (doesn't really work)
• Two solutions:
• Dilated CNNs
• Transformers (Vaswani et al. 2017)
Why can't we just use CNNs?

Input: the raw waveform (1 ms shown).

For a convolution kernel of size k, the length of the input seen by a node in layer m is m*(k-1) + k.
The growth is linear with layers (doubling the receptive field requires doubling the number of layers).
Dilated convolution

Regular convolution: the kernel is applied to consecutive entries of the input.

Dilated convolution by a factor l: the same kernel, but applied to every lth entry of the input.

Solution: dilated convolution. The growth is exponential with layers (doubling the receptive field requires adding only one more layer), e.g. Google WaveNet. A small sketch comparing the two growth rates follows.
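The sketch below compares the two growth rates, using the slide's formula for regular convolution and assuming the dilation factor doubles at each additional layer (1, 2, 4, ..., as in WaveNet):

```python
def receptive_field_regular(k, m):
    # input length seen by a node in layer m (slide formula): m*(k-1) + k
    return m * (k - 1) + k

def receptive_field_dilated(k, m):
    # first layer sees k; each additional layer i uses dilation 2^i and adds (k-1)*2^i
    r = k
    for layer in range(1, m + 1):
        r += (k - 1) * (2 ** layer)
    return r

for m in (1, 2, 4, 8):
    print(m, receptive_field_regular(2, m), receptive_field_dilated(2, m))
# regular growth is linear in m; dilated growth is exponential in m
```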
Transformer (general idea)
• The input is processed all at once. Each output is a weighted sum of "all" inputs.
• The weights (attention) determine which part of the input is relevant for each output
• "Positional information" is added to encode sequential order
• Positional encoding in the transformer is similar to the Fourier series
• Why use both sine and cosine? (See the sketch after this list.)
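A minimal sketch of the sinusoidal positional encoding from Vaswani et al. (2017): PE[t, 2i] = sin(t / 10000^(2i/d)) and PE[t, 2i+1] = cos(t / 10000^(2i/d)). Using both sine and cosine at each frequency means the encoding at position t + k is a fixed linear transformation (a rotation) of the encoding at position t, which makes relative offsets easy to represent.

```python
import numpy as np

def positional_encoding(T, d_model):
    """Sinusoidal positional encoding matrix of shape (T, d_model)."""
    t = np.arange(T)[:, None]                 # positions 0..T-1
    i = np.arange(0, d_model, 2)[None, :]     # even embedding dimensions
    angles = t / (10000 ** (i / d_model))
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)              # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)              # cosine on odd dimensions
    return pe

pe = positional_encoding(T=100, d_model=64)
print(pe.shape)   # (100, 64)
```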
Neural nets in speech processing
• Acoustic models (waveform to phonemes)

• Language models (NLP)

• Audio source separation

• Speaker, emotion, etc. recognition

• Speech synthesis

• Audio compression
Topics
• Bayes Theorem

• Dimensionality reduction with PCA and LDA

• K-Means unsupervised clustering

• Gaussian density models

• Neural Network Models: linear neuron, logistic neuron, back propagation
