21
pages
English
Documents
Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres
21
pages
English
Documents
Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres
T ex t Su mma riz a ti on
Yo un, K i m
C S 29 8
Sa n J o se St a t e Un i v e rsi t yO utli ne
• Int ro d uct i o n
• T h e o ry a nd C o nc e p t s
• Des i g n a nd Imp l e ment a t i o n
• Summa ry Ev a l ua t i o n
• C o nc l us i o nInt ro d uct i o n
• Mot i v a t i on
– Ne e d f o r an e f f ic i e n t t e x t s u m m ar iz e r d u e t o t he o v e r w he l m in g am o u n t o f
t e x t u al in f o r m at io n av ail ab l e o n t he We b .
– U s e f u l f o r a r e ad e r t o ha v e acce s s t o a co n ci s e s u m m ar y t ail o r e d t o his o r
he r in t e r e s t s t o q u ic k l y b r o w s e t hr o u g h a l ar g e n u m b e r o f b l o g s it e s .
– T he u t il iz at io n o f S Q L , w hich can b e e as il y co n v e r t e d in t o P ig L at in , an
S Q L -l ik e d at a t r ans f o r m at io n l ang u ag e d e v e l o p e d at Yaho o and al l ow s f or
m as s i ve pa r al l e l pr oce s s i n g of l ar ge da t a s e t s ac r os s c l us t e r s by
c om pili n g qu e r i e s i n t o M ap R e du c e j o b s an d e x e c ut i n g t he m i n H ad o o p.
• G oal
– C r e at e a t e x t s u m m ar iz e r t o p r o v id e co n d e n s e d v e r s io n s o f o r ig in al t e x t b y
id e n t if y in g t he b e s t ap p r o x im at io n o f o r ig in al t e x t .T h e o ry a nd C o nc e p t s
• Th e Lan c z os A lgo rit h m
– C an d e t e r m in e t he e ig e n v al u e s f o r a l ar g e s p ar s e m at r ix e f f ic ie n t l y t hr o u g h
t he e m p l o y m e n t o f t he L ancz o s r e cur s io n , w hich co n v e r t s t he o r ig in al m at r ix
A in t o t r id iag o n al m at r ix T t hr o u g h a f in it e n u m b e r o f o r t ho g o n al s im il ar it y
t r ans f o r m at io n s .
– T he e ig e n v al u e s f o r t r id iag o n al m at r ix T ar e ap p r o x im at e t o t ho s e o f t he
o r ig in al m at r ix , A.
– T he e ig e n v e ct o r s o f A can b e f o u n d t hr o u g h t he m u l t ip l ic at io n o f t he
e ig e n v e ct o r s o f T b y t he L ancz o s v e ct o r s acq u ir e d f r o m t he r e cur s io n .
– T he n u m b e r o f ar it hm e t ic al o p e r at io n s r e q u ir e d t o g e n e r at e a t r id iag o n al
m at r ix is p r o p o r t io n al t o t he n u m b e r o f n o n z e r o e n t r ie s o f A, w hich s av e s
r u n n in g t im e f o r a l ar g e s p ar s e m at r ix .T h e o ry a nd C o nc e p t s (co nt .. )
• SVD (s i ng ular v a l ue d e c o mp o si t i o n)
– T h e SVD th eo rem i s u s u a l l y p res ented a s :
TA U S Vnxp nxn nxp pxp
T T T TUU U U I V V VV Iw h ere a nd
– B a s ed o n th e l i nea r a l g eb ra th eo rem th a t a rec ta ng u l a r m a trix A c a n
b e d eco m p o s ed i nto th e p ro d u c t o f th ree m a trices ( a n o rth o g o na l
m a trix U , a d i a g o na l m a trix S, a nd th e tra ns p o s e o f a n o rth o g o na l
m a trix V) , a nd rec o ns tru c ted b y m u l tipl yi ng th e th ree m a trices
to g eth er.T h e o ry a nd C o nc e p t s (co nt … )
• SVD (co nt … )
T– C al c ul at in g SV D c o n sist s o f fin d in g e ig e n val ue s an d AA
T T TA A AAe ig e n ve c t o r s o f an d . Th e e ig e n ve c t o r s o f A A
make up t h e c o l umn s o f U an d t h e e ig e n ve c t o r s o f
make up t h e c o l umn s o f V . Th e sin g ul ar val ue s in S c o n t ain t h e
T Tsq uar e r o o t s o f t h e e ig e n val ue s fr o m o r an d ar eAA A A
p l ac e d al o n g t h e d iag o n al o f S in d e sc e n d in g o r d e r .T h e o ry a nd C o nc e p t s (co nt … )
• SVD (co nt … )
– I t i s a m eth o d f o r red u c i ng a h i g h - d i m ens i o na l s et o f d a ta to a
l o w er- d i m ens i o na l s et, w h i c h a l l o w s u s to i d enti f y w h i c h d a ta
exh i b i t th e m o s t va ria tion th ro u g h a n o rd eri ng o f th e d i m ens i o ns .
– I t g i ves u s th e b es t a p p ro xi m a tion o f th e o rigina l d a ta b y s i m p l y
i g no ring d i m ens i o ns b el o w c erta i n th res h o l d s , a nd i n d o i ng s o th i s
a p p ro a c h red u c es th e vo l u m e o f c o ntent, w h i l e m a i nta i ni ng th e
m a i n rel a tions h i p s th a t a re p res ent.T h e o ry a nd C o nc p e t s (co nt … )
• Eigenv a l ue s a nd Eigenv e c t o rs
– I f a no nzero vect o r s a tisf i es th e eq u a tion b el o w , vect o r v i s c a l l ed
a n ei g envecto r, a nd s c a l a r i s c a l l ed a n ei g enva l u e.
= w h ere A i s a s q u a re m a trix.A v v
– E i g envecto rs a nd ei g enva l u es a re i m p o rta nt i n m a ny a rea s o f
m a th em a tics a nd p h ys i c s .
– E i g envecto rs a nd ei g enva l u es c a n tel l u s s o m eth i ng i m p o rta nt
a b o u t th e m a trix s u c h a s u nd erl yi ng o r h i d d en s tru c tu re o f m a trix.T h e o ry a nd C o nc e p t s (co nt … )
• Fo r my p ro j e c t :
– Wo r k e d w it h a w o r d -b y -se n t e n c e m at r ix A, w he r e A r e p r e s e n t s t heij
f r e q u e n cy o f a p ar t ic u l ar w o r d ap p e ar in g in e ach s e n t e n ce .
T
– U s in g S V D, A can b e d e co m p o s e d in t o t hr e e m at r ic e s ( U , S , and V )
U– Each n u m b e r in d ic at e s ho w s t r o n g l y r e l at e d a w o r d is t o t he t o p ic o rij
co n ce p t r e p r e s e n t e d b y s e m ant ic d im e n s io n , w hil e e ach n u m b e r Vij
in d ic at e s ho w s t r o n g l y r e l at e d s e n t e n ce is t o t he t o p ic r e p r e s e n t e d b y
s e m ant ic d im e n s io n . Each n u m b e r o n t he d iag o n al o f S in d ic at e s t he
im p o r t ance o f t he co r r e s p o n d in g s e m ant ic d im e n s io n .T h e o ry a nd C o nc e p t s (co nt … )
• T o e x t r ac t s e nte nc e for e ac h t opi c for t h e s u m m ar y i n m y
pr oj e c t :
TE a ch e n t ry i n t h e m a t ri x g i v e s i n fo rm a t i o n a b o u t h o w clo s e ly t h e s e n t e n ce i s re lat e d t oV
t h e g i v e n co n ce p t . A h i g h e r v a lu e m e a n s t h a t t h e s e n t e n ce i s m o re clo s e ly re lat e d t h i s
co n ce p t . Thu s , t h e s e n t e n ce t h a t i s m o s t re lat e d t o e a ch co n ce p t i s ch o s e n fo r t h e
s u m m a ry u n t i l a p re d e fi n e d n u m b e r o f s e n t e n ce s i s e x t ra ct e d .
Sentence 1 Sentence 2 Sentence 3 Sentence 4
Topic 1 0.22 1.51 2.11 7.67
Topic 2 5.33 3.22 1.10 2.43
Topic 3 3.43 5.34 0.74 0.71
Topic 4 2.11 1.31 9.54 2.33