
ELEN 6820

Speech and audio signal processing


Instructor: Nima Mesgarani (nm2764)
3 credits
Office hours: Wednesday 4-5pm
TA: Yi Luo
Pattern Classification
Classification
Goal: To classify objects (or patterns) into categories (or classes)

Observation s → Feature Extraction → Feature Vector x → Classifier → Class ωi

Types of Problems:
1. Supervised: Classes are known beforehand, and data samples of each class are available
2. Unsupervised: Classes (and/or number of classes) are not known beforehand, and must be inferred from data

What is machine learning?
• Can you learn any one-to-one mapping given enough training data?
• Supervised vs. Unsupervised
• Deterministic vs. Stochastic
• Generative vs. Discriminative
Topics
• Bayes Theorem

• Dimensionality reduction with PCA and LDA

• K-Means unsupervised clustering

• Gaussian density models

• (Support Vector Machines)

• Neural Network Models: linear neuron, logistic neuron, back propagation
Support Vector Machines (SVM)

• Can be made nonlinear using kernels

• Depends only on training points near the decision boundary, aka Support Vectors

• Unique, optimal solution
Neural Networks
• Inspired by biological neural networks

• Not just for pattern recognition anymore; NNs are used for signal processing, representation, separation, and compression

• Used for acoustic modeling (features to phonemes), language modeling (words to text), and end-to-end models
A typical cortical neuron
• Cell body, many inputs (synapses), one output (axon)

• When a neuron receives enough input, it generates a spike

• Spikes have fixed amplitude and shape; only their frequency (spikes per second) varies

• Synapses are the inputs of neurons. They can have a positive or negative effect, and can change efficacy (adapt, learn, etc.)

• The exact learning rule is not very clear
A shallow look at deep neural networks: biologically inspired

A model neuron computes a weighted sum of its inputs (Input 1 × Weight 1 + Input 2 × Weight 2 + ...) and passes it through an activation function.

Activation Functions: ReLU (Rectified Linear Unit)
- Computes f(x) = max(0, x)
- Does not saturate
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice
- Actually more biologically plausible than sigmoid

(Li, Johnson, Yeung, 2018)
Neural network model (Rosenblatt, 1958)

Multilayer perceptron (MLP)

Input layer → Output layer

Adding a hidden layer: universal approximator (Backpropagation, 1986)

Input layer → Hidden layer → Output layer
Deep (multilayer) neural network

Input layer → (W1) Hidden layer 1 → (W2) Hidden layer 2 → (W3) Hidden layer 3 → (W4) Output layer

• Learn synaptic weights to minimize prediction error

• Can be highly regularized and optimized
History of neural networks
1943 McCulloch and Pitts proposed the McCulloch-Pitts neuron model
1949 Hebb published his book The Organization of Behaviour
1958 Rosenblatt introduced the simple single-layer networks called Perceptrons
1969 Minsky and Papert's book Perceptrons demonstrated the limitations of single-layer perceptrons
1980 Grossberg introduced his Adaptive Resonance Theory (ART)
1982 Hopfield published a series of papers on Hopfield networks
1982 Kohonen developed the Self-Organizing Feature Maps
1986 The back-propagation learning algorithm for multi-layer perceptrons was rediscovered, and the whole field took off again
1990s ART-variant networks were developed
1990s Radial Basis Functions were developed
2010s Deep neural networks, Convolutional neural networks, etc.
Model neurons
• Many ways to model a neuron; one example is the logistic neuron (smooth, differentiable):

y = (1 + e^{-x})^{-1}
dy/dx = -(1 + e^{-x})^{-2} (-e^{-x}) = y(1 - y)

(A numerical check of this derivative appears below.)

• Enables a probabilistic interpretation (compare with the Gaussian density (1/√(2π)) e^{-(x-µ)²/2})

Activation functions (figure; Li, Johnson & Yeung)
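As a quick check of the derivative above, the sketch below (a minimal NumPy example, not from the slides) compares y(1 - y) against a finite-difference estimate of dy/dx.

```python
import numpy as np

def sigmoid(x):
    # y = (1 + e^{-x})^{-1}
    return 1.0 / (1.0 + np.exp(-x))

x = 0.7
y = sigmoid(x)
analytic = y * (1 - y)                               # dy/dx = y(1 - y)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(analytic, numeric)                             # the two values agree closely
```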


Gradient Descent

• Learning rate: step size

• Too large, unstable

• Too small, slow convergence


Learning with a linear neuron

y = Σ_i w_i x_i = W^T X

For one training sample n, E^n = (1/2)(t^n - y^n)^2; summed over the training set:

E = (1/2) Σ_n (t^n - y^n)^2

∂E/∂w_i = (1/2) Σ_n (∂y^n/∂w_i)(dE^n/dy^n) = -Σ_n x_i^n (t^n - y^n)

Δw_i = -ε ∂E/∂w_i = Σ_n ε x_i^n (t^n - y^n)

(A minimal sketch of this update rule appears below.)
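A minimal NumPy sketch of the delta rule above, using toy data (the feature values and "true" weights are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))       # 100 samples, 3 features x_i^n
w_true = np.array([0.5, -1.0, 2.0])     # used only to generate targets
t = X @ w_true                          # targets t^n
w = np.zeros(3)
eps = 0.01                              # learning rate

for _ in range(100):
    y = X @ w                           # y^n = sum_i w_i x_i^n
    w += eps * X.T @ (t - y)            # delta w_i = eps * sum_n x_i^n (t^n - y^n)
print(w)                                # converges toward w_true
```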
Training with gradient descent

• Correlation among features

• Online vs. batch training

• Learning rate
Δw_i = -ε ∂E/∂w_i = Σ_n ε x_i^n (t^n - y^n)

Learning with a logistic neuron

z = b + Σ_i x_i w_i
y = 1/(1 + e^{-z})

∂z/∂w_i = x_i        ∂z/∂x_i = w_i

dy/dz = y(1 - y)

∂y/∂w_i = (∂z/∂w_i)(dy/dz) = x_i y(1 - y)

Back propagation:

∂E/∂w_i = Σ_n (∂y^n/∂w_i)(∂E/∂y^n) = -Σ_n x_i^n y^n (1 - y^n)(t^n - y^n)

(A numerical check of this gradient appears below.)
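A minimal numerical check of this gradient for a single logistic neuron and one training sample (squared error; the input values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)              # inputs x_i
w = rng.standard_normal(4)              # weights w_i
b, t = 0.1, 1.0                         # bias and target

def forward(w):
    z = b + x @ w                       # z = b + sum_i x_i w_i
    return 1.0 / (1.0 + np.exp(-z))     # y = 1 / (1 + e^{-z})

y = forward(w)
grad = -x * y * (1 - y) * (t - y)       # dE/dw_i = -x_i y (1 - y)(t - y)

# finite-difference estimate for one weight
i, h = 2, 1e-6
w2 = w.copy(); w2[i] += h
numeric = (0.5 * (t - forward(w2))**2 - 0.5 * (t - forward(w))**2) / h
print(grad[i], numeric)                 # should agree closely
```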
Multilayer network and back propagation

• Multiple layers of nonlinear transformation
• Can show that "any" function can be modeled
• Training using random perturbation (e.g. particle filtering): can randomly perturb the weights, or the activations
• Inefficient! → back propagation

Notation (see figure): neuron i in layer I has output y_i and connects through weight w_ij to neuron j in layer J, which receives total input z_j and produces output y_j.
Back propagation:

E = (1/2) Σ_j (t_j - y_j)^2

Error derivative w.r.t. output:

∂E/∂y_j = -(t_j - y_j)

Error derivative w.r.t. the total input received by neuron j:

∂E/∂z_j = (dy_j/dz_j)(∂E/∂y_j) = y_j (1 - y_j) ∂E/∂y_j

Error derivative w.r.t. hidden outputs:

∂E/∂y_i = Σ_j (∂z_j/∂y_i)(∂E/∂z_j) = Σ_j w_ij ∂E/∂z_j

Error derivative w.r.t. a weight connecting neurons i & j:

∂E/∂w_ij = (∂z_j/∂w_ij)(∂E/∂z_j) = y_i ∂E/∂z_j
Algorithm
• Initialize the weights

• For each training sample, do the following until convergence:

• Perform the forward pass, calculate the output of each neuron

• Calculate the output error derivative

• For each weight, calculate the error derivative with respect to that weight

• Change the weights accordingly

• Calculate the error derivatives with respect to the hidden layer outputs, and do the same for the layers below

(A minimal sketch of these steps appears after this list.)
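A minimal NumPy sketch of this algorithm for a two-layer network of sigmoid units, following the derivatives above; the layer sizes, learning rate, and random toy data are arbitrary choices for illustration, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 20))            # 16 toy samples, 20 inputs
T = np.eye(3)[rng.integers(0, 3, 16)]        # one-hot targets, 3 classes
W1 = 0.1 * rng.standard_normal((20, 10))     # input -> hidden weights
W2 = 0.1 * rng.standard_normal((10, 3))      # hidden -> output weights
eps = 0.1                                    # learning rate

for step in range(500):
    # forward pass: output of each neuron
    h = sigmoid(X @ W1)                      # hidden outputs y_i
    y = sigmoid(h @ W2)                      # network outputs y_j
    # output error derivative: dE/dy_j = -(t_j - y_j)
    dE_dy = -(T - y)
    # dE/dz_j = y_j (1 - y_j) dE/dy_j
    dE_dz2 = y * (1 - y) * dE_dy
    # dE/dw_ij = y_i dE/dz_j
    dW2 = h.T @ dE_dz2
    # error derivative w.r.t. hidden outputs: dE/dy_i = sum_j w_ij dE/dz_j
    dE_dh = dE_dz2 @ W2.T
    dE_dz1 = h * (1 - h) * dE_dh
    dW1 = X.T @ dE_dz1
    # change the weights accordingly
    W1 -= eps * dW1
    W2 -= eps * dW2
```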
Issues: optimization
• How often to update the weights?
• Online: after each training sample
• Batch: once for the full training set
• Mini-batch: once for a small number of training samples
• How much to update?
• Fixed or variable learning rate
• Adaptive learning rate, globally or for each neuron separately
• Adam, SGD, RMSProp, SWA, AdaTune, ...

A minimal sketch of these update schedules appears below.
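The sketch below covers the three update schedules with a fixed learning rate; it assumes a hypothetical helper grad(W, X, T) that returns dE/dW for the given samples. Setting batch_size to 1 gives online updates, and setting it to len(X) gives full-batch updates.

```python
import numpy as np

def train(W, X, T, grad, eps=0.01, batch_size=32, epochs=10):
    # grad(W, X, T) is an assumed helper returning dE/dW for the given samples
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)                # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            W = W - eps * grad(W, X[batch], T[batch])   # fixed learning rate eps
    return W
```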
Issues: generalization
• Training data contains noise
• Unreliable target values
• Sampling error: accidental regularities; try to sample the space as much as possible
• The model can learn the signal, but also the particular noise
• Preventing overfitting: weight decay (sparse weights), weight sharing, early stopping, model averaging, dropout, pre-training
Example: a fully connected layer for acoustic modeling

Input: 21 frequency bands × 11 samples (110 ms) = 231 time × frequency values
Output: 40 phonemes
Wx, with W of size 40 × 231
Convolutional neural network (CNN)

Preserving the spatial organization of features, e.g., time and frequency.

A multiplicative filter is applied to the time × frequency input feature; the filter is convolved with the input.
Convolution layer with N filters

Each of the N filters is convolved with the time × frequency input, producing N output feature maps. A sketch of this operation follows.
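A minimal NumPy sketch of a convolution (strictly, cross-correlation) layer with N filters over a frequency × time input; the sizes below are toy values, not from the slides:

```python
import numpy as np

def conv_layer(x, filters):
    """Valid 2D cross-correlation of one input map with N filters.

    x:       (F, T) frequency x time input
    filters: (N, kF, kT) bank of N filters
    returns: (N, F - kF + 1, T - kT + 1) feature maps
    """
    N, kF, kT = filters.shape
    F, T = x.shape
    out = np.zeros((N, F - kF + 1, T - kT + 1))
    for n in range(N):
        for i in range(F - kF + 1):
            for j in range(T - kT + 1):
                out[n, i, j] = np.sum(x[i:i+kF, j:j+kT] * filters[n])
    return out

# toy example: 21 frequency bands x 11 time frames, 8 filters of size 3x3
x = np.random.randn(21, 11)
filters = np.random.randn(8, 3, 3)
maps = conv_layer(x, filters)
print(maps.shape)   # (8, 19, 9)
```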
A ConvNet is a series of convolution and activation layers:

Input → Conv + f(.) → Conv + f(.) → Conv + f(.) → ...
What is a 1x1 convolution layer?

A 1x1xN filter combines the N input feature maps at each time-frequency point; applying M such filters maps N channels to M channels.
Pooling

Reducing the size of the feature space: each of the N feature maps is downsampled along the time and frequency axes.
Max pooling

Single depth slice, max pool with 2x2 filters and stride 2:

1 1 2 4
5 6 7 8        6 8
3 2 1 0   →    3 4
1 2 3 4

(Li, Johnson & Yeung, 2018)

A small sketch reproducing this example follows.
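The sketch below is a minimal NumPy implementation (not from the slides) that reproduces the 2x2, stride-2 max pooling example above:

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max pooling over a single 2D slice."""
    H, W = x.shape
    out = np.zeros((H // stride, W // stride))
    for i in range(0, H - size + 1, stride):
        for j in range(0, W - size + 1, stride):
            out[i // stride, j // stride] = x[i:i+size, j:j+size].max()
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool(x))   # [[6. 8.]
                     #  [3. 4.]]
```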
Beyond CNNs: the need for multi-scale speech processing networks

Speech carries structure at multiple time scales, e.g., pitch, phonemes, and words.
Recurrent neural networks (RNN)

• Extension of feed-forward neural networks

• Output feeds back as input with a time delay
• Activation of the hidden layer depends on the input and on the past
• Recurrent layer activations encode "states"
• Became very popular in speech recognition (Robinson et al. 1993)
Simple RNN

Simple RNN architecture in two alternative representations: (a) the RNN, and (b) the RNN unrolled in time.

input:  x_t (unrolled: ..., x_{t-1}, x_t, ...)
hidden: h_t, connected to the input by W_hx and to the previous hidden state h_{t-1} by W_hh
output: y_t, connected to the hidden layer by W_yh

RNN hidden and output layer activations:

h_t = σ(W_hx x_t + W_hh h_{t-1} + b_h)
y_t = σ(W_yh h_t + b_y)

(A minimal forward-pass sketch appears below.)
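A minimal forward-pass sketch of the two equations above, using tanh for the hidden activation and a sigmoid for the output; the choice of nonlinearities and the toy dimensions are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D, H, O = 3, 5, 2                        # input, hidden, output sizes
rng = np.random.default_rng(0)
W_hx = rng.standard_normal((H, D)) * 0.1
W_hh = rng.standard_normal((H, H)) * 0.1
W_yh = rng.standard_normal((O, H)) * 0.1
b_h, b_y = np.zeros(H), np.zeros(O)

h = np.zeros(H)                                  # h_0
for x_t in rng.standard_normal((4, D)):          # 4 time steps
    h = np.tanh(W_hx @ x_t + W_hh @ h + b_h)     # h_t
    y = sigmoid(W_yh @ h + b_y)                  # y_t
    print(y)
```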
Training an RNN
• Backpropagate through layers and time

• The problem of exploding or vanishing gradients:

• As we back propagate through many layers, the gradient either shrinks or explodes exponentially

• Not a big problem for DNNs, since the number of layers is limited

• Several ways to get around this, e.g. the LSTM, which is similar to adding skip connections
Long Short-Term Memory
(LSTM, Hochreiter & Schmidhuber 97)

• Make an RNN out of modules that can remember values for a very long time
• Solves the vanishing gradient problem
• Enables an RNN to remember very long dependencies (hundreds of time steps)
• Uses a circuit similar to analog memory cells
LSTM is made of "memory cells"

• A memory cell with write, keep, and read gates

• To preserve information for a long time in the activities of an RNN, we use a circuit that implements an analog memory cell:
– A linear unit that has a self-link with a weight of 1 will maintain its state.
– Information is stored in the cell by activating its write gate.
– Information is retrieved by activating the read gate.
– We can backpropagate through this circuit because logistic units have nice derivatives.

• Information is stored when the write gate is on, kept as long as the keep gate is on, and output when the read gate is on

• All gates are sigmoid functions, i.e. we can use backpropagation to train them
LSTM

Each gate looks at (h_{t-1}, x_t), and the cell produces the new hidden state h_t:

• Input gate: scales input to the cell (write)
• Output gate: scales output from the cell (read)
• Forget gate: scales old cell values (reset)

The gradient can flow through the cell across many time steps. A minimal sketch of one LSTM step follows.
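A minimal sketch of one LSTM step in the standard formulation (input, forget, and output gates, each a sigmoid of (h_{t-1}, x_t)); the stacked weight layout and toy dimensions are assumptions for illustration and not necessarily identical to the variant drawn in the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W: (4H, D + H) stacked gate weights, b: (4H,).
    Gate order assumed here: input, forget, output, candidate."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[0:H])          # input (write) gate
    f = sigmoid(z[H:2*H])        # forget (keep/reset) gate
    o = sigmoid(z[2*H:3*H])      # output (read) gate
    g = np.tanh(z[3*H:4*H])      # candidate cell update
    c_t = f * c_prev + i * g     # cell state: kept old value + written input
    h_t = o * np.tanh(c_t)       # read out through the output gate
    return h_t, c_t

# toy dimensions: 3-dim input, 4-dim hidden state
D, H = 3, 4
rng = np.random.default_rng(0)
W, b = rng.standard_normal((4*H, D + H)) * 0.1, np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.standard_normal((5, D)):      # 5 time steps
    h, c = lstm_step(x_t, h, c, W, b)
print(h)
```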
The need for bi-directional recurrence

Example: language translation

The translation of one word depends on both past and future words.
Bidirectional RNN (e.g. BLSTM)

• The forward pass is the same as for an LSTM

• The backward pass looks at the reversed input sequence
• The output layer is not updated until both hidden layers have processed the entire input
RNN difficulties
• Inefficient: sequential processing means they cannot take advantage of parallel computing as much as CNNs
• Computationally intensive, since all intermediate steps need to be stored to calculate the output
• Transfer learning is very difficult (doesn't really work)
• Two solutions:
• Dilated CNNs
• Transformers (Vaswani et al. 2017)
Why can't we just use CNNs?

Input: the raw waveform (1 ms shown).

For a convolution kernel of size k, the length of the input seen by a node in layer m is m*(k-1) + k.
The growth is linear with layers (doubling the receptive field requires doubling the number of layers).
Dilated convolution

Regular convolution: the kernel is applied to consecutive entries of the input.

Dilated convolution by a factor l: the same kernel, but applied to every lth entry of the input.

Solution: dilated convolution. The growth is exponential with layers (doubling the receptive field requires adding only one more layer), e.g. Google WaveNet. A small sketch comparing the two growth rates follows.
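The sketch below compares the two growth rates, using the slide's formula for regular convolution and assuming the dilation factor doubles at each additional layer (1, 2, 4, ..., as in WaveNet):

```python
def receptive_field_regular(k, m):
    # input length seen by a node in layer m (slide formula): m*(k-1) + k
    return m * (k - 1) + k

def receptive_field_dilated(k, m):
    # first layer sees k; each additional layer i uses dilation 2^i and adds (k-1)*2^i
    r = k
    for layer in range(1, m + 1):
        r += (k - 1) * (2 ** layer)
    return r

for m in (1, 2, 4, 8):
    print(m, receptive_field_regular(2, m), receptive_field_dilated(2, m))
# regular growth is linear in m; dilated growth is exponential in m
```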
Transformer (general idea)
• The input is processed all at once. Each output is a weighted sum of "all" inputs.
• The weights (attention) determine which part of the input is relevant for each output
• "Positional information" is added to encode sequential order
• Positional encoding in the transformer is similar to the Fourier series
• Why use both sine and cosine? (See the sketch after this list.)
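A minimal sketch of the sinusoidal positional encoding from Vaswani et al. (2017): PE[t, 2i] = sin(t / 10000^(2i/d)) and PE[t, 2i+1] = cos(t / 10000^(2i/d)). Using both sine and cosine at each frequency means the encoding at position t + k is a fixed linear transformation (a rotation) of the encoding at position t, which makes relative offsets easy to represent.

```python
import numpy as np

def positional_encoding(T, d_model):
    """Sinusoidal positional encoding matrix of shape (T, d_model)."""
    t = np.arange(T)[:, None]                 # positions 0..T-1
    i = np.arange(0, d_model, 2)[None, :]     # even embedding dimensions
    angles = t / (10000 ** (i / d_model))
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)              # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)              # cosine on odd dimensions
    return pe

pe = positional_encoding(T=100, d_model=64)
print(pe.shape)   # (100, 64)
```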
Neural nets in speech processing
• Acoustic models (waveform to phonemes)

• Language models (NLP)

• Audio source separation

• Speaker, emotion, etc. recognition

• Speech synthesis

• Audio compression
Topics
• Bayes Theorem

• Dimensionality reduction with PCA and LDA

• K-Means unsupervised clustering

• Gaussian density models

• Neural Network Models: linear neuron, logistic neuron, back propagation
