
IAML: Dimensionality Reduction

Victor Lavrenko and Nigel Goddard
School of Informatics

Semester 1
Overview
• Curse of dimensionality
• Different ways to reduce dimensionality
• Principal Components Analysis (PCA)
• Example: Eigenfaces
• PCA for classification
• Witten & Frank, section 7.3
  – only the PCA section is required

Copyright © 2014 Victor Lavrenko
True vs. observed dimensionality
• Get a population, predict some property
  – instances represented as {urefu, height} pairs ("urefu" means "height" in Swahili)
  – what is the dimensionality of this data?
• Data points from different geographic areas over time:
  • X1: # of skidding accidents
  • X2: # of burst water pipes
  • X3: snow-plow expenditures
  • X4: # of school closures
  • X5: # of patients with heat stroke
  – what is the true dimensionality here? (temperature?)

Copyright © 2014 Victor Lavrenko
Curse of dimensionality
• Datasets are typically high-dimensional
  – vision: 10^4 pixels, text: 10^6 words
  • that is just the way we observe / record them
  – the true dimensionality is often much lower
  • a manifold (sheet) in a high-d space
• Example: handwritten digits
  – 20 x 20 bitmap: {0,1}^400 possible events
  • we will never see most of these events
  • actual digits: a tiny fraction of the events
  – true dimensionality:
  • the possible variations of the pen-stroke

Copyright © 2014 Victor Lavrenko
Curse of dimensionality (2)
• Machine learning methods are statistical by nature
  – count observations in various regions of some space
  – use the counts to construct the predictor f(x)
  – e.g. decision trees: p+/p- in {o = rain, w = strong, T > 28°}
  – text: # of documents in {"hp" and "3d" and not "$" and ...}
• As dimensionality grows: fewer observations per region (see the sketch after this slide)
  – 1d: 3 regions, 2d: 3^2 regions, 1000d: hopeless
  – statistics need repetition
  • flip a coin once → heads
  • P(heads) = 100%?

Copyright © 2014 Victor Lavrenko
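A tiny back-of-the-envelope sketch of this effect; the 3 bins per dimension and the sample of 1,000 points are just illustrative assumptions:

```python
# Number of regions grows exponentially with dimensionality, so a fixed
# dataset leaves almost every region empty.
n_points = 1_000
bins_per_dim = 3

for d in (1, 2, 5, 10, 100):
    n_regions = bins_per_dim ** d
    print(f"d={d}: {n_regions} regions, "
          f"{n_points / n_regions:.2e} points per region on average")
```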
Dealing with high dimensionality
• Use domain knowledge
  – feature engineering: SIFT, MFCC
• Make assumptions about the dimensions
  – independence: count along each dimension separately
  – smoothness: propagate class counts to neighboring regions
  – symmetry: e.g. invariance to the order of dimensions: x1 ↔ x2
• Reduce the dimensionality of the data
  – create a new set of dimensions (variables)

Copyright © 2014 Victor Lavrenko
Dimensionality reduction
• Goal: represent instances with fewer variables
  – try to preserve as much structure in the data as possible
  – discriminative: keep only the structure that affects class separability
• Feature selection
  – pick a subset of the original dimensions X1, X2, X3, ..., X_{d-1}, X_d
  – discriminative: pick good class "predictors" (e.g. by information gain)
• Feature extraction
  – construct a new set of dimensions: E_i = f(X1, ..., Xd)
  – (linear) combinations of the originals

Copyright © 2014 Victor Lavrenko
Principal Components Analysis
• Defines a set of principal components
  – 1st: the direction of greatest variability in the data
  – 2nd: perpendicular to the 1st, greatest variability of what's left
  – ... and so on, up to d (the original dimensionality)
• The first m << d components become the m new dimensions
  – change the coordinates of every data point to these dimensions

Copyright © 2014 Victor Lavrenko
Why greatest variability?
• Example: reduce 2-dimensional data to 1-d
  – {x1, x2} → e' (along a new axis e)
• Pick e to maximize variability
• This reduces the cases where two points are close in e-space but very far apart in (x,y)-space
• It minimizes the distances between the original points and their projections

Copyright © 2014 Victor Lavrenko
Principal components
• "Center" the data at zero: x_{i,a} = x_{i,a} − μ_a
  – subtract the mean from each attribute
• Compute the covariance matrix Σ
  – variance of a dimension, and covariance of two dimensions:

  $\operatorname{var}(a) = \frac{1}{n}\sum_{i=1}^{n} x_{ia}^{2} \qquad \operatorname{cov}(b,a) = \frac{1}{n}\sum_{i=1}^{n} x_{ib}\, x_{ia}$

  • do x1 and x2 tend to increase together?
  • or does x2 decrease as x1 increases?
• Multiply a vector by Σ, again and again:

  $\begin{pmatrix} 2.0 & 0.8 \\ 0.8 & 0.6 \end{pmatrix}\begin{pmatrix} -1 \\ +1 \end{pmatrix} \to \begin{pmatrix} -1.2 \\ -0.2 \end{pmatrix} \to \begin{pmatrix} -2.5 \\ -1.0 \end{pmatrix} \to \begin{pmatrix} -6.0 \\ -2.7 \end{pmatrix} \to \begin{pmatrix} -14.1 \\ -6.4 \end{pmatrix} \to \begin{pmatrix} -33.3 \\ -15.1 \end{pmatrix}$

  – the vector turns toward the direction of greatest variance
  – slopes of the last four vectors: 0.400, 0.450, 0.454, 0.454
• Want vectors e which aren't turned: Σ e = λ e
  – e ... eigenvectors of Σ, λ ... the corresponding eigenvalues
  – principal components = the eigenvectors with the largest eigenvalues

Copyright © 2014 Victor Lavrenko
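The "multiply by Σ again and again" trick above is essentially power iteration; a minimal NumPy sketch reproducing the numbers on the slide (the matrix and starting vector are the ones shown above):

```python
import numpy as np

# Covariance matrix from the slide and the starting vector (-1, +1).
sigma = np.array([[2.0, 0.8],
                  [0.8, 0.6]])
v = np.array([-1.0, 1.0])

# Repeated multiplication by sigma turns v toward the direction of
# greatest variance; the slope x2/x1 converges to ~0.454.
for step in range(6):
    v = sigma @ v
    print(step + 1, v.round(1), "slope:", round(v[1] / v[0], 3))
```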
Finding Principal Components
1. Find the eigenvalues by solving det(Σ − λI) = 0:

   $\det\begin{pmatrix} 2.0-\lambda & 0.8 \\ 0.8 & 0.6-\lambda \end{pmatrix} = (2.0-\lambda)(0.6-\lambda) - (0.8)(0.8) = \lambda^{2} - 2.6\lambda + 0.56 = 0$

   $\{\lambda_1, \lambda_2\} = \tfrac{1}{2}\left(2.6 \pm \sqrt{2.6^{2} - 4 \cdot 0.56}\right) = \{2.36,\ 0.23\}$

2. Find the i-th eigenvector by solving Σ e_i = λ_i e_i:

   $\begin{pmatrix} 2.0 & 0.8 \\ 0.8 & 0.6 \end{pmatrix}\begin{pmatrix} e_{1,1} \\ e_{1,2} \end{pmatrix} = 2.36\begin{pmatrix} e_{1,1} \\ e_{1,2} \end{pmatrix} \;\Rightarrow\; \begin{cases} 2.0\,e_{1,1} + 0.8\,e_{1,2} = 2.36\,e_{1,1} \\ 0.8\,e_{1,1} + 0.6\,e_{1,2} = 2.36\,e_{1,2} \end{cases} \;\Rightarrow\; e_{1,1} = 2.2\,e_{1,2},\ \text{i.e. } e_1 \sim \begin{pmatrix} 2.2 \\ 1 \end{pmatrix}$

   $\begin{pmatrix} 2.0 & 0.8 \\ 0.8 & 0.6 \end{pmatrix}\begin{pmatrix} e_{2,1} \\ e_{2,2} \end{pmatrix} = 0.23\begin{pmatrix} e_{2,1} \\ e_{2,2} \end{pmatrix} \;\Rightarrow\; e_2 = \begin{pmatrix} -0.41 \\ 0.91 \end{pmatrix}$

3. Normalize so that ||e_1|| = 1.  1st PC: $e_1 = \begin{pmatrix} 0.91 \\ 0.41 \end{pmatrix}$ (slope 0.454),  2nd PC: $e_2 = \begin{pmatrix} -0.41 \\ 0.91 \end{pmatrix}$

Copyright © 2014 Victor Lavrenko
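As a cross-check, the same eigenvalues and first principal component can be obtained with NumPy (eigenvector signs are arbitrary, so the result may come out negated):

```python
import numpy as np

# Same 2x2 covariance matrix as in the worked example.
sigma = np.array([[2.0, 0.8],
                  [0.8, 0.6]])

# np.linalg.eigh returns eigenvalues in ascending order for symmetric matrices.
eigenvalues, eigenvectors = np.linalg.eigh(sigma)

# Reorder so the largest eigenvalue (first principal component) comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvalues.round(2))         # approx [2.36 0.24]
print(eigenvectors[:, 0].round(2))  # approx [0.91 0.41] (up to sign)
```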

Projecting to new dimensions
• e_1 ... e_m are the new dimension vectors
• Have an instance x = {x_1 ... x_d} in the original coordinates
• Want its new coordinates x' = {x'_1 ... x'_m}:
  1. "center" the instance (subtract the mean): x − μ
  2. "project" onto each new dimension: x'_j = (x − μ)^T e_j for j = 1 ... m

  $(x - \mu) = \left[\, (x_1 - \mu_1) \;\; (x_2 - \mu_2) \;\cdots\; (x_d - \mu_d) \,\right]$

  $\begin{pmatrix} x'_1 \\ x'_2 \\ \vdots \\ x'_m \end{pmatrix} = \begin{pmatrix} (x-\mu)^T e_1 \\ (x-\mu)^T e_2 \\ \vdots \\ (x-\mu)^T e_m \end{pmatrix} = \begin{pmatrix} (x_1-\mu_1)e_{1,1} + (x_2-\mu_2)e_{1,2} + \cdots + (x_d-\mu_d)e_{1,d} \\ (x_1-\mu_1)e_{2,1} + (x_2-\mu_2)e_{2,2} + \cdots + (x_d-\mu_d)e_{2,d} \\ \vdots \\ (x_1-\mu_1)e_{m,1} + (x_2-\mu_2)e_{m,2} + \cdots + (x_d-\mu_d)e_{m,d} \end{pmatrix}$

Copyright © 2014 Victor Lavrenko
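A minimal sketch of the centering-and-projection step in NumPy, using synthetic data and m = 2; all variable names here are illustrative:

```python
import numpy as np

# A synthetic dataset X (n instances x d attributes); the eigenvectors E
# (d x m) come from the covariance matrix as on the previous slides.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

mu = X.mean(axis=0)                   # per-attribute means
cov = np.cov(X - mu, rowvar=False)    # d x d covariance matrix
w, E = np.linalg.eigh(cov)
E = E[:, np.argsort(w)[::-1]][:, :2]  # keep the m = 2 top eigenvectors

# Project: center each instance, then take dot products with the eigenvectors.
X_new = (X - mu) @ E                  # n x m matrix of new coordinates
print(X_new.shape)                    # (100, 2)
```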
Direction of greatest variability
• Select the dimension e which maximizes the variance of the projections
• Points x_i "projected" onto a vector e:  $x_i^T e = \sum_{j=1}^{d} x_{ij} e_j$
• Variance of the projections (data already centered):  $V = \frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{j=1}^{d} x_{ij} e_j\Big)^{2}$
• Maximize V
  – want unit length: ||e|| = 1
  – add a Lagrange multiplier, i.e. maximize $V - \lambda(e^T e - 1)$, and set each partial derivative to zero:

  $\frac{\partial V}{\partial e_a} = \frac{2}{n}\sum_{i=1}^{n}\Big(\sum_{j=1}^{d} x_{ij} e_j\Big) x_{ia} - 2\lambda e_a = 0 \;\Rightarrow\; \sum_{j=1}^{d}\underbrace{\Big(\frac{1}{n}\sum_{i=1}^{n} x_{ia} x_{ij}\Big)}_{\operatorname{cov}(a,j)} e_j = \lambda e_a$

  – this must hold for every a = 1..d, i.e. Σ e = λ e:
    e must be an eigenvector of the covariance matrix

Copyright © 2014 Victor Lavrenko

Variance along an eigenvector
• Mean of the projected points $x_i^T e$ is zero, because the data is centered:

  $\mu' = \frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{j=1}^{d} x_{ij} e_j\Big) = \sum_{j=1}^{d}\Big(\frac{1}{n}\sum_{i=1}^{n} x_{ij}\Big) e_j = 0$

• Variance of the projected points:

  $V = \frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{j=1}^{d} x_{ij} e_j\Big)\Big(\sum_{a=1}^{d} x_{ia} e_a\Big) = \sum_{a=1}^{d}\sum_{j=1}^{d}\Big(\frac{1}{n}\sum_{i=1}^{n} x_{ia} x_{ij}\Big) e_j e_a = \sum_{a=1}^{d}\Big(\sum_{j=1}^{d} \operatorname{cov}(a,j)\, e_j\Big) e_a = \sum_{a=1}^{d} (\lambda e_a)\, e_a = \lambda \lVert e \rVert^{2} = \lambda$

  – using $\operatorname{cov}(a,j) = \frac{1}{n}\sum_{i=1}^{n} x_{ia} x_{ij}$ and $\sum_{j} \operatorname{cov}(a,j)\, e_j = \lambda e_a$, which holds because e is an eigenvector of the covariance matrix

Copyright © 2014 Victor Lavrenko
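This identity is easy to verify numerically; a small sketch on synthetic data drawn with the covariance matrix used earlier (values chosen for illustration):

```python
import numpy as np

# Empirical check: the variance of the data projected onto an eigenvector of
# the covariance matrix equals the corresponding eigenvalue.
rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 0.8], [0.8, 0.6]], size=50_000)
X = X - X.mean(axis=0)                    # center the data

cov = (X.T @ X) / len(X)                  # 1/n convention, as on the slides
eigenvalues, eigenvectors = np.linalg.eigh(cov)

for lam, e in zip(eigenvalues, eigenvectors.T):
    projections = X @ e
    # The two numbers agree up to sampling noise.
    print(round(lam, 3), round(projections.var(), 3))
```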


How many dimensions?
• Have: eigenvectors e_1 ... e_d; want: m << d
• Proved: eigenvalue λ_i = variance along e_i
• Pick the e_i that "explain" the most variance (see the sketch after this slide)
  – sort the eigenvectors so that λ1 ≥ λ2 ≥ ... ≥ λd
  – pick the first m eigenvectors which explain 90% of the total variance
  • typical threshold values: 0.9 or 0.95
• Or use a scree plot (eigenvalue vs. dimension): look for the "elbow" and pick the first few PCs visually
  – like in K-means

Copyright © 2014 Victor Lavrenko
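A short sketch of the 90% rule, assuming the eigenvalues have already been computed (the values below are made up for illustration):

```python
import numpy as np

# Pick m by cumulative explained variance; "eigenvalues" would come from the
# eigendecomposition of the covariance matrix.
eigenvalues = np.array([4.2, 2.1, 1.3, 0.4, 0.2, 0.1])

eigenvalues = np.sort(eigenvalues)[::-1]        # lambda_1 >= ... >= lambda_d
explained = np.cumsum(eigenvalues) / eigenvalues.sum()

m = int(np.searchsorted(explained, 0.90)) + 1   # smallest m reaching 90%
print(explained.round(3))                       # cumulative fraction of variance
print("keep m =", m, "components")
```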
PCA in a nutshell
1. Start with correlated high-d data, e.g. {urefu, height} pairs ("urefu" means "height" in Swahili)
2. Center the points
3. Compute the covariance matrix
4. Compute its eigenvectors and eigenvalues: eig(cov(data))
5. Pick the m < d eigenvectors with the highest eigenvalues
6. Project the data points onto those eigenvectors
7. The result: uncorrelated low-d data
(a code sketch of steps 1-7 follows below)

Copyright © 2014 Victor Lavrenko
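A minimal end-to-end sketch of steps 1-7 in NumPy, on synthetic {urefu, height}-style data (the data and the choice m = 1 are assumptions for illustration):

```python
import numpy as np

# End-to-end PCA on synthetic 2-d correlated data (steps 1-7 above).
rng = np.random.default_rng(42)
heights_cm = rng.normal(170, 10, size=200)
urefu_cm = heights_cm + rng.normal(0, 1, size=200)   # near-duplicate attribute
X = np.column_stack([heights_cm, urefu_cm])          # 1. correlated hi-d data

X_centered = X - X.mean(axis=0)                      # 2. center the points
cov = np.cov(X_centered, rowvar=False)               # 3. covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)      # 4. eig(cov(data))

order = np.argsort(eigenvalues)[::-1]
m = 1
E = eigenvectors[:, order[:m]]                       # 5. top m eigenvectors

X_low = X_centered @ E                               # 6. project the points
print(X_low.shape)                                   # 7. uncorrelated low-d data: (200, 1)
```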
PCA example: Eigenfaces
• Input: a dataset of N face images, each face a K x K bitmap of pixels
• "Unfold" each bitmap into a K²-dimensional vector
• Arrange the vectors in a K² x N matrix, each face = one column
• Run PCA → a set of m eigenvectors, each K²-dimensional (a K² x m matrix)
• "Fold" each eigenvector back into a K x K bitmap
  – the eigenvectors can be visualized: m "aspects" of prototypical facial features
• Matlab demo on the course webpage

Copyright © 2014 Victor Lavrenko
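A rough NumPy sketch of this pipeline on random stand-in bitmaps (no real face data; real use would follow the course's Matlab demo, and for large K one would typically use an SVD rather than the full K² x K² covariance matrix):

```python
import numpy as np

# Eigenfaces sketch: each K x K bitmap is unfolded into a K^2-dim vector.
rng = np.random.default_rng(0)
K, N, m = 20, 100, 8
faces = rng.random((N, K, K))                    # N fake K x K "face" bitmaps

X = faces.reshape(N, K * K)                      # unfold: N x K^2
mean_face = X.mean(axis=0)
X_centered = X - mean_face

# Eigenvectors of the K^2 x K^2 covariance matrix, largest eigenvalues first.
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
E = eigenvectors[:, np.argsort(eigenvalues)[::-1][:m]]   # K^2 x m

eigenfaces = E.T.reshape(m, K, K)                # fold each eigenvector back to K x K
coords = X_centered @ E                          # each face as m coordinates
print(eigenfaces.shape, coords.shape)            # (8, 20, 20) (100, 8)
```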
Eigenfaces: Projection
• Project a new face onto the space of eigenfaces
• Represent the face vector as the mean face plus a linear combination of the principal components:
  face ≈ mean + Σ_j x'_j e_j
• How many components do we need?

Copyright © 2014 Victor Lavrenko
(Eigen) Face Recognition
• Face similarity
  – computed in the reduced space
  – insensitive to lighting, expression, orientation
• Projecting new "faces"
  – everything is a face: any new image projected onto the eigenfaces comes out face-like

Copyright © 2014 Victor Lavrenko
PCA: practical issues
• Covariance is extremely sensitive to large values
  – multiply some dimension by 1000
  • it dominates the covariance
  • and becomes a principal component
  – fix: normalize each dimension to zero mean and unit variance (see the sketch after this slide):
    x' = (x − mean) / st.dev
• PCA assumes the underlying subspace is linear
  – 1d: a straight line, 2d: a flat sheet
  – transform the data to handle non-linear spaces (manifolds)

Copyright © 2014 Victor Lavrenko
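A small sketch of that normalization step, with one deliberately rescaled attribute (synthetic data, illustrative only):

```python
import numpy as np

# Standardize each dimension to zero mean and unit variance before PCA,
# so no attribute dominates the covariance just because of its scale.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
X[:, 0] *= 1000                      # one attribute on a much larger scale

X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X.std(axis=0).round(1))        # wildly different scales
print(X_std.std(axis=0).round(1))    # all 1.0 after standardization
```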
PCA and classification
• PCA is unsupervised
  – it maximizes the overall variance of the data along a small set of directions
  – it does not know anything about the class labels
  – it can pick a direction that makes it hard to separate the classes
• Discriminative approach
  – look for a dimension that makes it easy to separate the classes

Copyright © 2014 Victor Lavrenko
Linear Discriminant Analysis
• LDA: pick a new dimension that gives
  – maximum separation between the means of the projected classes
  – minimum variance within each projected class
• Solution: eigenvectors based on the between-class and within-class covariance matrices

Copyright © 2014 Victor Lavrenko
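A minimal two-class sketch of this idea on synthetic data: the LDA direction is the leading eigenvector of the within-class scatter inverse times the between-class scatter (the data and class means below are made up for illustration):

```python
import numpy as np

# Two-class Fisher LDA: maximize between-class separation relative to
# within-class variance; the direction is the top eigenvector of inv(S_w) @ S_b.
rng = np.random.default_rng(7)
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=200)
X1 = rng.multivariate_normal([2, 1], [[1.0, 0.8], [0.8, 1.0]], size=200)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)   # within-class scatter
diff = (mu1 - mu0).reshape(-1, 1)
S_b = diff @ diff.T                                           # between-class scatter

eigenvalues, eigenvectors = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
w = np.real(eigenvectors[:, np.argmax(np.real(eigenvalues))])  # LDA direction
print(w / np.linalg.norm(w))
```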
PCA vs. LDA
• LDA is not guaranteed to be better for classification
  – it assumes the classes are unimodal Gaussians
  – it fails when the discriminatory information is not in the means,
    but in the variances of the data
• There are examples where PCA gives a better projection than LDA

Copyright © 2014 Victor Lavrenko
Dimensionality reduction
• Pros
  – reflects our intuitions about the data
  – allows estimating probabilities in high-dimensional data
  • no need to assume independence, etc.
  – dramatic reduction in the size of the data
  • faster processing (as long as the reduction itself is fast), smaller storage
• Cons
  – too expensive for many applications (Twitter, web)
  – disastrous for tasks with fine-grained classes
  – need to understand the assumptions behind the methods (linearity, etc.)
  • there may be better ways to deal with sparseness

Copyright © 2014 Victor Lavrenko
Summary
• True dimensionality << observed dimensionality
• High dimensionality ⇒ sparse, unstable estimates
• Dealing with high dimensionality:
  – use domain knowledge
  – make an assumption: independence / smoothness / symmetry
  – dimensionality reduction: feature selection / feature extraction
• Principal Components Analysis (PCA)
  – picks the dimensions that maximize variability
  • eigenvectors of the covariance matrix
  – example: Eigenfaces
  – variant for classification: Linear Discriminant Analysis

Copyright © 2014 Victor Lavrenko
