Machine Learning for NLP:
New Developments and Challenges
These slides are still incomplete
A more complete version will be posted at a later
date at:
Dan Klein
Computer Science Division
University of California at Berkeley
What is NLP?
Fundamental goal: deep understanding of broad language
End systems that we want to build:
Ambitious: speech recognition, machine translation, information extraction, dialog interfaces, question answering…
Modest: spelling correction, text categorization…
Sometimes we’re also doing computational linguistics

Speech Systems
Automatic Speech Recognition (ASR)
Audio in, text out
SOTA: 0.3% for digit strings, 5% dictation, 50%+ TV
“Speech Lab”
Text to Speech (TTS)
Text in, audio out
SOTA: totally intelligible (if sometimes unnatural)
Machine Translation
Translation systems encode:
Something about fluent language
Something about how two languages correspond
SOTA: for easy language pairs, better than nothing, but more an understanding aid than a replacement for human translators

Information Extraction
Information Extraction (IE)
Unstructured text to database entries
New York Times Co. named Russell T. Lewis, 45, president and general
manager of its flagship New York Times newspaper, responsible for all
business-side activities. He was executive vice president and deputy
general manager. He succeeds Lance R. Primis, who in September was
named president and chief operating officer of the parent.
Person            Company                   Post                           State
Russell T. Lewis  New York Times newspaper  president and general manager  start
Russell T. Lewis  New York Times            executive vice president       end
Lance R. Primis   New York Times Co.        president and COO              start
SOTA: perhaps 70% accuracy for multi-sentence templates, 90%+ for single easy fields
Question Answering
Question Answering:
More than search
Ask general questions of a document collection
Can be really easy: “What’s the capital of …”
Can be harder: “How many US states’ capitals are also their largest cities?”
Can be open ended: “What are the main issues in the global warming debate?”
SOTA: Can do factoids, even when text isn’t a perfect match

Goals of this Tutorial
Introduce some of the core NLP tasks
Present the basic statistical models
Highlight recent advances
Highlight recurring constraints on use of ML techniques
Highlight ways this audience could really help out
Recurring Issues in NLP Models
Inference on the training set is slow enough that discriminative methods can be prohibitive
Need to scale to millions of features
Indeed, we tend to have more features than data points, and it all works out ok
Kernelization is almost always too expensive, so everything’s done with primal methods
Need to gracefully handle unseen configurations and words at test time
Severe non-stationarity when systems are deployed in practice
Pipelined systems, so we need relatively calibrated probabilities, also errors often cascade

Outline
Language Modeling
Syntactic / Semantic Parsing
Machine Translation
Information Extraction
Unsupervised Learning
Speech in a Slide
Frequency gives pitch; amplitude gives volume
s p ee ch l a b
Frequencies at each time slice processed into observation vectors
… a_12 a_13 a_12 a_14 a_14 …

The Noisy-Channel Model
We want to predict a sentence given acoustics:
The noisy channel approach:
Acoustic model: HMMs over word positions with mixtures of Gaussians as emissions
Language model: Distributions over sequences of words (sentences)
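The noisy-channel decomposition described above is usually written as below. This is the standard textbook form; the symbols $w$ (a word sequence) and $a$ (the acoustic observation vectors) are notation assumed here, not taken from the slides.

```latex
w^{*} = \arg\max_{w} P(w \mid a)
      = \arg\max_{w} \frac{P(a \mid w)\, P(w)}{P(a)}
      = \arg\max_{w} \underbrace{P(a \mid w)}_{\text{acoustic model}} \; \underbrace{P(w)}_{\text{language model}}
```

The denominator $P(a)$ drops out because it does not depend on $w$.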
Language Models
In general, we want to place a distribution over sentences
Classic solution: n-gram models
N-gram models are (weighted) regular languages
Natural language is not regular
Many linguistic arguments
Long-distance effects:
“The computer which I had just put into the machine room on the fifth floor crashed.”
N-gram models often work well anyway (esp. with large n)

Language Model Samples
Unigram:
[fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter]
[that, or, limited, the]
[after, any, on, consistently, hospital, lake, of, of, other, and, factors, raised, analyst,
too, allowed, mexico, never, consider, fall, bungled, davison, that, obtain, price, lines,
the, to, sass, the, the, further, board, a, details, machinists, …… , nasdaq]
Bigram:
[outside, new, car, parking, lot, of, the, agreement, reached]
[although, common, shares, rose, forty, six, point, four, hundred, dollars, from, thirty,
seconds, at, the, greatest, play, disingenuous, to, be, reset, annually, the, buy, out, of,
american, brands, vying, for, mr., womack, currently, share, data, incorporated, believe,
chemical, prices, undoubtedly, will, be, as, much, is, scheduled, to, conscientious, teaching]
[this, would, be, a, record, november]
PCFG (later):
[This, quarter, ‘s, surprisingly, independent, attack, paid, off, the, risk, involving, IRS, leaders, and, transportation, prices, .]
[It, could, be, announced, sometime, .]
[Mr., Toseland, believes, the, average, defense, economy, is, drafted, from, slightly, more, than, 12, stocks, .]
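Samples like the bigram ones above can be drawn by repeatedly picking the next word from P(w | previous word) until an end-of-sentence symbol comes up. The following is a minimal sketch of that procedure, not the code used to produce the slides; the boundary symbols, function names, and two-sentence toy corpus are all assumptions made for illustration.

```python
import random
from collections import defaultdict

START, STOP = "<s>", "</s>"  # boundary symbols (illustrative choice)

def train_bigram(sentences):
    """Count bigram transitions, including sentence boundaries."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        tokens = [START] + list(sent) + [STOP]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts

def sample_sentence(counts, rng, max_len=50):
    """Draw words one at a time from P(w | previous word) until STOP."""
    prev, out = START, []
    while len(out) < max_len:
        words = list(counts[prev])
        weights = [counts[prev][w] for w in words]
        word = rng.choices(words, weights=weights, k=1)[0]
        if word == STOP:
            break
        out.append(word)
        prev = word
    return out

# Toy corpus, invented for this sketch.
toy_corpus = [
    ["this", "would", "be", "a", "record", "november"],
    ["this", "could", "be", "announced", "sometime"],
]
bigram_counts = train_bigram(toy_corpus)
sample = sample_sentence(bigram_counts, random.Random(0))
print(sample)
```

Every adjacent word pair in a sample was seen in training, which is why bigram samples look locally fluent but drift globally.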
Smoothing
Dealing with sparsity well: smoothing / shrinkage
For most histories P(w | h), relatively few observations
Very intricately explored for the speech n-gram case
Easy to do badly
P(w | denied the), observed counts:
3 allegations
2 reports
1 claims
1 request
7 total
P(w | denied the), smoothed counts:
2.5 allegations
1.5 reports
0.5 claims
0.5 request
2 other
7 total
[Figure: Unigrams, Bigrams, and Rules curves; x-axis: Number of Words, 0 to 1,000,000; y-axis ticks 0 to 1]

Interpolation / Dirichlet Priors
Problem: the estimate of P(w | h) is supported by few counts
Solution: share counts with related histories, e.g.:
Despite classic mixture formulation, can be viewed as a hierarchical Dirichlet prior [MacKay and Peto, 94]
Each level’s distribution drawn from prior centered on back-off
Strength of prior related to mixing weights
Problem: this kind of smoothing doesn’t work well empirically
All the details you could ever want: [Chen and Goodman, 98]
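Sharing counts with related histories is commonly done by linear interpolation: mix the trigram, bigram, and unigram relative frequencies with fixed weights. The sketch below shows that classic mixture under stated assumptions: the weights (0.6, 0.3, 0.1), the function names, and the toy corpus are illustrative and not from the slides. In practice the weights would be tuned on held-out data.

```python
from collections import Counter

def train(tokens):
    """Collect n-gram counts and the matching history counts."""
    return {
        "uni": Counter(tokens),
        "bi": Counter(zip(tokens, tokens[1:])),
        "tri": Counter(zip(tokens, tokens[1:], tokens[2:])),
        "bi_hist": Counter(tokens[:-1]),                       # v followed by some word
        "tri_hist": Counter(zip(tokens[:-2], tokens[1:-1])),   # (u, v) followed by some word
        "n": len(tokens),
    }

def mle(count, total):
    """Relative-frequency estimate; 0 when the history was never seen."""
    return count / total if total > 0 else 0.0

def interpolated_prob(w, u, v, c, lambdas=(0.6, 0.3, 0.1)):
    """P(w | u v) as a fixed mixture of trigram, bigram, and unigram estimates."""
    l3, l2, l1 = lambdas
    p3 = mle(c["tri"][(u, v, w)], c["tri_hist"][(u, v)])
    p2 = mle(c["bi"][(v, w)], c["bi_hist"][v])
    p1 = mle(c["uni"][w], c["n"])
    return l3 * p3 + l2 * p2 + l1 * p1

# Toy corpus built around the slide's "denied the" history (invented text).
tokens = ("he denied the allegations she denied the reports "
          "they denied the claims he denied the request").split()
counts = train(tokens)
p = interpolated_prob("allegations", "denied", "the", counts)
p_unseen = interpolated_prob("he", "denied", "the", counts)
print(p, p_unseen)
```

The unseen continuation "he" gets a small but nonzero probability from the unigram term, while the distribution over the vocabulary still sums to one for this history.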
Kneser-Ney: Discounting
N-grams occur more in training than they will later:
Count in 22M Words  Avg in Next 22M  Good-Turing c*
1                   0.448            0.446
2                   1.25             1.26
3                   2.24             2.24
4                   3.23             3.24
Absolute Discounting
Save ourselves some time and just subtract 0.75 (or some d)
Maybe have a separate value of d for very low counts

Kneser-Ney: Details
Kneser-Ney smoothing combines several ideas
Absolute discounting
Lower order models take a special form
KN smoothing repeatedly proven effective
But we’ve never been quite sure why
And therefore never known how to make it better
[Teh, 2006] shows KN smoothing is a kind of approximate inference in a hierarchical Pitman-Yor process (and better approximations are superior to basic KN)
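The two ingredients named above can be sketched for the bigram case. This follows the common textbook form of interpolated Kneser-Ney, which may differ in detail from the formulas on the original slide: subtract a fixed discount d from every seen bigram count, and give the freed mass to a lower-order "continuation" distribution that counts how many distinct words precede w rather than how often w occurs. The function name and toy corpus are assumptions; d = 0.75 is the value mentioned above.

```python
from collections import Counter, defaultdict

def train_kn_bigram(tokens, d=0.75):
    """Return prob(w, v) = P_KN(w | v) for an interpolated Kneser-Ney bigram model."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    hist_count = Counter()          # c(v): how often v occurs as a history
    followers = defaultdict(set)    # distinct words seen after v
    preceders = defaultdict(set)    # distinct words seen before w
    for (v, w), c in bigrams.items():
        hist_count[v] += c
        followers[v].add(w)
        preceders[w].add(v)
    n_types = len(bigrams)          # number of distinct bigram types

    def prob(w, v):
        # Continuation probability: fraction of bigram types that end in w.
        p_cont = len(preceders.get(w, ())) / n_types
        if hist_count[v] == 0:
            return p_cont           # unseen history: fall back entirely
        discounted = max(bigrams[(v, w)] - d, 0.0) / hist_count[v]
        backoff_mass = d * len(followers[v]) / hist_count[v]
        return discounted + backoff_mass * p_cont

    return prob

# Toy corpus, invented for this sketch.
tokens = ("he denied the allegations she denied the reports "
          "they denied the claims he denied the request").split()
kn = train_kn_bigram(tokens)
vocab = set(tokens)
print(kn("allegations", "the"), kn("denied", "the"), kn("the", "the"))
```

In this corpus "denied" and "the" are equally frequent, but "denied" follows three distinct words and "the" follows only one, so the continuation distribution gives "denied" more of the backed-off mass.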
[Figure: word labels allegations, reports, claims, request, outcome; axis label “Fraction Seen”]
Data >> Method?
Having more data is always better…
[Figure: Katz vs. KN smoothing at training sizes of 100,000, 1,000,000, 10,000,000, and all words; x-axis: n-gram order (1 to 10, 20); y-axis ticks 5.5 to 10]
… but so is using a better model
Another issue: N > 3 has huge costs in speech recognizers

Beyond N-Gram LMs
Lots of ideas we won’t have time to discuss:
Caching models: recent words more likely to appear again
Trigger models: recent words trigger other words
Topic models
A few recent ideas I’d like to highlight
Syntactic models: use tree models to capture long-distance syntactic effects [Chelba and Jelinek, 98]
Discriminative models: set n-gram weights to improve final task accuracy rather than fit training set density [Roark, 05, for ASR; Liang et al., 06, for MT]
Structural zeros: some n-grams are syntactically forbidden, keep estimates at zero [Mohri and Roark, 06]
Outline
Language Modeling
Syntactic / Semantic Parsing
Machine Translation

Phrase Structure Parsing
Phrase structure parsing organizes syntax into constituents or brackets
In general, this involves nested trees
Linguists can, and do, argue about what the
gold structu