Language Modeling in IR
Tutorial, SIGIR 2003
Victor Lavrenko
Department of Computer Science,
University of Massachusetts, Amherst
lavrenko@cs.umass.edu
© Victor Lavrenko, Jul. 27, 2003
What I hope to Accomplish
• General Language Modeling framework
• Discuss different methods of estimation
• Discuss a few applications of LMs
Outline of the Tutorial
• Introduction to Language Models
– what is a language model?
– how can we use language models?
– what are the major issues in language modeling?
• Estimation of Language Models
– Basic Models,
– Translation Models,
– Aspect Models,
– Non-parametric Models
• Bayesian framework for estimation of LMs
• Case study: Relevance Models
What is a Language Model?
• Probability distribution over strings of text
– how likely is a given string (observation) in a given “language”
– for example, consider the probabilities of the following four strings:
p_1 = P(“a quick brown dog”)
p_2 = P(“dog quick a brown”)
p_3 = P(“быстрая brown dog”)
p_4 = P(“быстрая собака”)
– English: p_1 > p_2 > p_3 > p_4
• … depends on what “language” we are modeling
– in most of this tutorial we will have p_1 == p_2
– for some applications we will want p_3 to be highly probable
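A minimal sketch of why an order-insensitive model assigns the same probability to “a quick brown dog” and “dog quick a brown”. It assumes a maximum-likelihood unigram model, which the slides have not introduced at this point; the training text and the function names `train_unigram` and `string_probability` are hypothetical. Exact fractions are used so the equality is not blurred by floating-point rounding.

```python
import math
from collections import Counter
from fractions import Fraction

def train_unigram(text):
    """Maximum-likelihood unigram model: P(w) = count(w) / total tokens."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: Fraction(c, total) for w, c in counts.items()}

def string_probability(model, s):
    """P(s|M) under a unigram model: the product of per-token probabilities.
    Tokens missing from the model get probability 0 (no smoothing here)."""
    return math.prod(model.get(w, Fraction(0)) for w in s.lower().split())

# Hypothetical toy "English" training text, for illustration only.
english = train_unigram("a quick brown dog saw a quick brown fox")

p1 = string_probability(english, "a quick brown dog")
p2 = string_probability(english, "dog quick a brown")
p4 = string_probability(english, "быстрая собака")

print(p1, p2, p4)
```

Because the unigram product ignores word order, `p1` and `p2` are identical, while `p4` is zero since neither Russian word occurs in the toy training text. A model meant for cross-language use would need to give such strings non-zero mass.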
Language Modeling Notation
• Convenient to make explicit what we are modeling:
M … “language” we are trying to model
s … observation (string of tokens from some vocabulary)
P(s|M) … probability of observing “s” in language M
• M can be thought of as a “source” or a generator
– a mechanism that can spit out strings that are legal in the language
P(s|M) … probability of getting “s” during random sampling from M
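The "random sampling from M" view can be sketched directly: treat M as a generator, draw many strings, and count how often “s” comes out. The sketch below assumes M is a unigram distribution over a four-word vocabulary and that string length is fixed in advance; both are simplifying assumptions, and the probabilities in `M` are made up for illustration.

```python
import random

# Hypothetical source M: a unigram distribution over a tiny vocabulary.
M = {"tropical": 0.4, "storm": 0.3, "gulf": 0.2, "forecast": 0.1}

def sample_string(model, length, rng):
    """Generate one string of `length` tokens by independent draws from M."""
    words = list(model)
    weights = [model[w] for w in words]
    return " ".join(rng.choices(words, weights=weights, k=length))

def exact_probability(model, s):
    """P(s|M) for a unigram source: product of token probabilities."""
    p = 1.0
    for w in s.split():
        p *= model.get(w, 0.0)
    return p

def sampled_probability(model, s, n_samples, rng):
    """Estimate P(s|M) as the fraction of sampled strings equal to s."""
    length = len(s.split())
    hits = sum(sample_string(model, length, rng) == s for _ in range(n_samples))
    return hits / n_samples

rng = random.Random(0)  # fixed seed so the run is repeatable
s = "tropical storm"
exact = exact_probability(M, s)  # 0.4 * 0.3
estimate = sampled_probability(M, s, 100_000, rng)
print(exact, estimate)
```

With enough samples the counted frequency approaches the analytic value, which is the sense in which P(s|M) is "the probability of getting s during random sampling from M".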
How can we use LMs in IR?
• Task: given a query, retrieve relevant documents
• Use LMs to model the process of query generation:
– user thinks of some relevant document
– picks some keywords to use as the query
[Figure: a relevant document (“Forecasters are watching two tropical storms that could pose hurricane threats to the southern United States. One is a downgraded …”) shown with query keywords: wind, weather, hurricane, tropical, forecast, gulf, storm, tropical storms]
Language Modeling for IR
• Every document in a collection defines a “language”
– consider all possible sentences (strings) that author could have
written down when creating some given document
– some are perhaps more likely to occur than others
• subject to topic, writing style, language …
– P(s|M_D) … probability that author would write down string “s”
• think of writing a billion variations of a document and counting how many times we get “s”
• Now suppose “q” is the user’s query
– what is the probability that author would write down “q” ?
• Rank documents D in the collection by P(q|M_D) [1]
– probability of observing “q” during random sampling from the
language model of document D
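The ranking rule above can be sketched as follows. This is an illustration under stated assumptions, not the tutorial's own estimator: each document model is taken to be a unigram model, and it is smoothed by linear interpolation with collection-wide word frequencies (weight `lam`, a hypothetical parameter) so that a query word missing from a document does not force the score to zero. The two documents are toy data.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def query_log_likelihood(query, doc_tokens, coll_counts, coll_len, lam=0.5):
    """log P(q|M_D) under a smoothed unigram model of document D."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for w in tokenize(query):
        p_doc = doc_counts[w] / doc_len        # estimate from the document
        p_coll = coll_counts[w] / coll_len     # estimate from the collection
        p = lam * p_doc + (1 - lam) * p_coll   # interpolation (assumed smoothing)
        if p == 0.0:
            return float("-inf")               # word unseen in the whole collection
        score += math.log(p)
    return score

def rank(query, docs, lam=0.5):
    """Return document names sorted by decreasing log P(q|M_D)."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    coll_counts = Counter(w for toks in tokenized.values() for w in toks)
    coll_len = sum(coll_counts.values())
    scores = {
        name: query_log_likelihood(query, toks, coll_counts, coll_len, lam)
        for name, toks in tokenized.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy collection, for illustration only.
docs = {
    "storms": "forecasters are watching two tropical storms that could pose hurricane threats",
    "markets": "investors are watching two technology stocks that could pose risks",
}
ranking = rank("tropical hurricane", docs)
print(ranking)
```

Log probabilities are summed rather than multiplying raw probabilities, which avoids numerical underflow on longer queries without changing the ranking.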
Other applications: same idea
• Topic Detection and Tracking
– query “q” can be a topic description, or an on-topic story
– documents with high P(q|M_D) probably discuss the same topic
• Classification / Filtering
– query can be a set of training documents for a particular class
– or testing docs can reflect observations from model of training set
• Cross-language Retrieval
– query can be in a different language from document collection
– author could have written a document in a different language
• Multi-media Retrieval
– languages don’t have to be textual (e.g. spoken or handwritten docs)
– extends to images, sounds, video, preferences, hyperlinks, …
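As one illustration of the "same idea" applied to classification, the sketch below treats a test document as an observation and asks which class model, estimated from that class's training documents, makes it most likely. It assumes unigram class models with add-one smoothing over a shared vocabulary; the labels, training texts, and function names are hypothetical.

```python
import math
from collections import Counter

def train_class_model(training_docs, vocab):
    """Unigram class model with add-one smoothing over a shared vocabulary."""
    counts = Counter(w for doc in training_docs for w in doc.lower().split())
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def log_likelihood(model, text):
    """log P(text|M_class); out-of-vocabulary words are skipped (an assumption)."""
    return sum(math.log(model[w]) for w in text.lower().split() if w in model)

# Toy training data, for illustration only.
training = {
    "weather": ["tropical storm warning issued", "hurricane winds hit the gulf"],
    "finance": ["stock markets fell sharply", "investors sold technology stocks"],
}
vocab = {w for docs in training.values() for doc in docs for w in doc.lower().split()}
models = {label: train_class_model(docs, vocab) for label, docs in training.items()}

def classify(text):
    """Pick the class whose model gives the test document the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(models[label], text))

label = classify("storm winds over the gulf")
print(label)
```

The structure mirrors retrieval: the class model plays the role of the document model, and the test document plays the role of the query.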
Is _____ a LM technique?
• How do we determine if a given model is a LM?
• LM is generative
– at some level, a language model can be used to generate text
– explicitly computes probability of observing a string of text
– Ex: probability of observing a query string from a document model
probability of observing an answer from a question model
– model an entire population
• Discriminative approaches
– model just the decision boundary
– Ex: is this document relevant?
does it belong to class X or Y
– have a lot of advantages, but these are not generative approaches
Language Modeling: pros & cons
• Pros:
– formal mathematical model
– simple, well-understood framework
– integrates both indexing and retrieval models
– natural use of collection statistics, no heuristics
– avoids tricky issues of “relevance”, “aboutness”, etc.
• Cons:
– difficult to incorporate notions of “relevance”, user preferences
– relevance feedback / query expansion not straightforward
– can’t accommodate phrases, passages, Boolean operators
• Extensions of LM overcome some issues
– Probabilistic LSI, Relevance Models, etc…