LING 406 Intro to Computational Linguistics: Further Issues in Finite-State Methods

Slides: 46
Provided by: serrano5


1
LING 406: Intro to Computational Linguistics
Further Issues in Finite-State Methods
  • Richard Sproat
  • URL: http://catarina.ai.uiuc.edu/L406_08/

2
This Lecture
  • Some further issues in finite-state morphology
  • Text normalization
  • Finite state syntax

3
Circumfixation in Arabic
4
Long-distance dependencies
  • You want only certain prefix/suffix combinations.
    For example, -iina can only occur if the prefix
    ta- has also occurred.
  • None of the affixes on its own encodes a
    particular feature set; it is only the
    prefix/suffix combination that encodes the
    feature set.
  • The only way to do this in a purely finite-state
    system is to have multiple instances of the stem,
    each of whose purposes is to remember that a
    given prefix has been seen.
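
The cost of remembering the prefix inside the automaton can be seen in a minimal Python sketch. It works at the morpheme level, and only ta- and -iina come from the slide: the other affixes, the stems, and the helper names (`build_acceptor`, `accepts`, `count_stem_arcs`) are illustrative assumptions, not part of any toolkit.

```python
# Toy circumfix acceptor with one copy of the stem lexicon per prefix.
# Only "ta" and "iina" come from the slide; everything else is illustrative.
ALLOWED = {"ta": ["iina", "u"], "ya": ["uuna", "u"]}  # prefix -> licensed suffixes
STEMS = ["ktub", "drus"]


def build_acceptor(allowed, stems):
    """Return arcs as {(state, morpheme): next_state}.

    States after the prefix carry the prefix in their name, so the stem
    arcs have to be repeated for every prefix.
    """
    arcs = {}
    for prefix, suffixes in allowed.items():
        arcs[("start", prefix)] = ("stem", prefix)
        for stem in stems:
            arcs[(("stem", prefix), stem)] = ("suffix", prefix)
        for suffix in suffixes:
            arcs[(("suffix", prefix), suffix)] = "final"
    return arcs


def accepts(arcs, morphemes):
    """Run the acceptor over a list of morphemes."""
    state = "start"
    for morpheme in morphemes:
        state = arcs.get((state, morpheme))
        if state is None:
            return False
    return state == "final"


def count_stem_arcs(arcs, stems):
    """Number of arcs labelled with a stem."""
    return sum(1 for (_, label) in arcs if label in stems)


ARCS = build_acceptor(ALLOWED, STEMS)
```

With two prefixes and two stems the acceptor already needs four stem arcs; in general the stem lexicon is copied once per prefix, which is the wastefulness the slide describes.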

5
(No Transcript)
6
Alternative approach
  • Build finite-state grammar that allows any
    combination of prefixes and suffixes.
  • Have an auxiliary mechanism that uses features
    encoded on affixes to restrict combinations of
    prefixes and suffixes.

7
(No Transcript)
8
(No Transcript)
9
But
  • All of the preceding can also be handled by
    purely finite-state methods:
  • Encode features as output labels in morphological
    analysis.
  • Implement finite-state filters that disallow
    illicit combinations of features.
  • Intersect these constraints with the output of
    morphological analysis.
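
The three steps above can be sketched in Python with finite sets standing in for automata (for finite languages, restricting a set to a filter language mirrors automaton intersection). The feature labels such as `[P:ta]`, the stem, and the affixes other than ta- and -iina are hypothetical; the only constraint encoded is the one stated earlier, that -iina requires ta-.

```python
from itertools import product

# Affixes mapped to hypothetical feature labels written on the output side.
PREFIXES = {"ta": "[P:ta]", "ya": "[P:ya]"}
SUFFIXES = {"iina": "[S:iina]", "uuna": "[S:uuna]"}
STEMS = ["ktub"]


def overgenerate():
    """Step 1: any prefix with any suffix, features emitted as output labels."""
    return {
        (p + stem + s, (PREFIXES[p], SUFFIXES[s]))
        for p, stem, s in product(PREFIXES, STEMS, SUFFIXES)
    }


def licit(features):
    """Step 2: the filter. -iina is only allowed together with ta-."""
    return not ("[S:iina]" in features and "[P:ta]" not in features)


# The filter language: every feature combination that passes the constraint.
FILTER = {f for f in product(PREFIXES.values(), SUFFIXES.values()) if licit(f)}


def analyse():
    """Step 3: keep only analyses whose output features are in the filter."""
    return {(word, feats) for word, feats in overgenerate() if feats in FILTER}


WORDS = {word for word, _ in analyse()}
```

Here the affix grammar stays small (no duplicated stems); the restriction lives entirely in the filter.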

10
Implementation using lextools
  • See http://catarina.ai.uiuc.edu/L406_08/Lectures/demos08.tar

11
(No Transcript)
12
(No Transcript)
13
(No Transcript)
14
  • Note: precomposing the morphology (from the
    arclist grammar) with the filter will produce an
    FST with the same wasteful properties that we were
    complaining about before. There is no free lunch.
  • But you can also compose the filter on the fly
    during recognition.
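
One way to picture composing on the fly is to simulate the product machine only along the path the input actually takes, instead of building every state pair in advance. The sketch below does this for two small deterministic acceptors; it is not how lextools does it, and the machines, state names, and the helper `run_lazy` are all assumptions made for illustration.

```python
def run_lazy(machine_a, machine_b, symbols):
    """Simulate the intersection of two deterministic acceptors lazily.

    Each machine is (start, finals, transitions) with
    transitions[(state, symbol)] -> next state.
    Returns (accepted, visited), where visited lists the state pairs
    that were actually constructed for this input.
    """
    (start_a, finals_a, trans_a) = machine_a
    (start_b, finals_b, trans_b) = machine_b
    pair = (start_a, start_b)
    visited = [pair]
    for symbol in symbols:
        next_a = trans_a.get((pair[0], symbol))
        next_b = trans_b.get((pair[1], symbol))
        if next_a is None or next_b is None:
            return False, visited
        pair = (next_a, next_b)
        visited.append(pair)
    return (pair[0] in finals_a and pair[1] in finals_b), visited


# Overgenerating morphology: any prefix, the stem, any suffix.
MORPH = (0, {3}, {
    (0, "ta"): 1, (0, "ya"): 1,
    (1, "ktub"): 2,
    (2, "iina"): 3, (2, "uuna"): 3,
})

# Filter that remembers the prefix and blocks ya- ... -iina.
FILTER = ("q0", {"ok"}, {
    ("q0", "ta"): "saw_ta", ("q0", "ya"): "saw_ya",
    ("saw_ta", "ktub"): "saw_ta", ("saw_ya", "ktub"): "saw_ya",
    ("saw_ta", "iina"): "ok", ("saw_ta", "uuna"): "ok",
    ("saw_ya", "uuna"): "ok",
})
```

Only the state pairs on the path of the current input are ever created, so the memory cost of the full product is avoided at the price of doing the pairing work during recognition.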

15
(No Transcript)
16
Text Normalization
  • Conversion of text that includes non-standard
    words (numbers, abbreviations, misspellings, ...)
    into normal words.
  • Abbreviation expansion (including novel
    abbreviations)
  • Expansion of numbers into number names
  • Correction of misspellings
  • Disambiguation in cases where there is ambiguity

17
Where is normalization needed?
  • Very little in cases like this:

Alice was beginning to get very tired of sitting
by her sister on the bank, and of having nothing
to do: once or twice she had peeped into the book
her sister was reading, but it had no pictures or
conversations in it, "and what is the use of a
book," thought Alice "without pictures or
conversation?" So she was considering in her own
mind (as well as she could, for the hot day made
her feel very sleepy and stupid), whether the
pleasure of making a daisy-chain would be worth
the trouble of getting up and picking the
daisies, when suddenly a White Rabbit with pink
eyes ran close by her.
18
Where is normalization needed?
  • A lot in cases like this

19
Humans are pretty good at this: can you read this?
f u cn rd ths thn u r dng btr thn ny autmtc txt
nrmlztion prgrm cn do.
20
How about this?
Aoccdrnig to a rscheearch at Cmabrigde
Uinervtisy, it deosnt mttaer in what oredr the
ltteers in a wrod are, the olny iprmoetnt tihng
is taht the frist and lsat ltteer be at the rghit
pclae. The rset can be a total mses and you can
sitll raed it wouthit porbelm. Tihs is bcuseae
the huamn mnid deos not raed ervey lteter by
istlef, but the wrod as a wlohe.
21
Or this?
Goccdrnia to a hscheearcr at Emabrigdc
Yinervtisu, it teosnd rttaem in tahw rredo the
stteerl in a drow are, the ylno tprmoetni gihnt
is taht the trisf and tsal rtteel be at the tghir
eclap. The tser can be a lotat ssem and you can
litls daer it touthiw morbelp. Siht is ecuseab
the nuamh dnim seod not daer yrvee rtetel
by fstlei, but the drow as a elohw.
22
Two components of text normalization
  • Given a string of characters in a text, what is
    the (reasonable) set of possible actual words (or
    word sequences) that might correspond to it?
  • Which of those is right for the particular
    context?

23
An illustration
He has 123 goats
I live at 123 King Avenue.
Lotus 123 for Windows
24
Two components of text normalization
  • A component that gives you the set of
    possibilities:
    • 123 → one hundred (and) twenty three
    • 123 → one twenty three
    • 123 → one two three
  • A component that tells you which one(s) are
    appropriate to a particular context.
  • We'll have more to say about the latter component
    later in the course, since this fits into more
    general issues in sense disambiguation.
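
The split into two components can be sketched as follows. `candidates` plays the first role and produces the three readings listed above for a three-digit string. `choose` plays the second role with a few hand-written context rules; those rules, the `STREET_WORDS` set, and the function names are hypothetical placeholders, since the real disambiguation methods come later in the course. The test sentences place 123 into the three contexts of the illustration.

```python
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()
STREET_WORDS = {"avenue", "street", "road"}  # hypothetical cue words


def under_100(n):
    """Number name for 0..99."""
    if n < 20:
        return ONES[n]
    tens, unit = divmod(n, 10)
    return TENS[tens] + (" " + ONES[unit] if unit else "")


def cardinal(n):
    """Number name for 0..999 (without 'and')."""
    hundreds, rest = divmod(n, 100)
    parts = []
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not hundreds:
        parts.append(under_100(rest))
    return " ".join(parts)


def candidates(digits):
    """Component 1: the set of possible readings of a 3-digit string."""
    return {
        "cardinal": cardinal(int(digits)),
        "paired": ONES[int(digits[0])] + " " + under_100(int(digits[1:])),
        "digits": " ".join(ONES[int(d)] for d in digits),
    }


def choose(tokens, i):
    """Component 2: pick a reading for tokens[i] from toy context rules."""
    readings = candidates(tokens[i])
    if i > 0 and tokens[i - 1].lower() == "lotus":
        return readings["digits"]
    following = [w.strip(".,").lower() for w in tokens[i + 1:i + 3]]
    if any(w in STREET_WORDS for w in following):
        return readings["paired"]
    return readings["cardinal"]
```

For example, `choose("He has 123 goats".split(), 2)` falls through to the cardinal reading, while a following street word selects the paired reading.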

25
A concrete example of finite-state methods in
text normalization: digit to number name
translation
  • Factor digit string
    • 123 → 1 × 10^2 + 2 × 10^1 + 3
  • Translate factors into number names
    • 10^2 → hundred
    • 2 × 10^1 → twenty
    • 1 × 10^1 + 3 → thirteen
  • Languages vary on how extensive these lexicons
    are. Some (e.g. Chinese) have very regular (hence
    very simple) number name systems; others (e.g.
    Urdu/Hindi) have a large set of number names with
    a name for almost every number from 1 to 100.
  • Each of these steps can be accomplished with FSTs.
26
Urdu (Hindi) Number Names
27
Digit string factoring transducer (fragment)
28
Germanic decade flop
24 → vier und zwanzig
29
70s
30
Digit-string to number name translation: German
  • Factor digit string
    • 123 → 1 × 10^2 + 2 × 10^1 + 3
  • Flip decades and units
    • 2 × 10^1 + 3 → 3 + 2 × 10^1
  • Translate factors into number names
    • 10^2 → hundert
    • 2 × 10^1 → zwanzig
    • 1 × 10^1 + 3 → dreizehn
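
The German pipeline differs from the English one only by the extra flip step. A minimal sketch for up to three digits follows; the function names and the choice to write the result as one word (standard German spelling) are assumptions, and list rewriting again stands in for transducers.

```python
UNITS = {1: "ein", 2: "zwei", 3: "drei", 4: "vier", 5: "fünf",
         6: "sechs", 7: "sieben", 8: "acht", 9: "neun"}
TEENS = {1: "elf", 2: "zwölf", 3: "dreizehn", 4: "vierzehn", 5: "fünfzehn",
         6: "sechzehn", 7: "siebzehn", 8: "achtzehn", 9: "neunzehn"}
TENS = {2: "zwanzig", 3: "dreißig", 4: "vierzig", 5: "fünfzig",
        6: "sechzig", 7: "siebzig", 8: "achtzig", 9: "neunzig"}


def factor(digits):
    """'123' -> [(1, 2), (2, 1), (3, 0)], i.e. digit x 10^power."""
    if not digits.isdigit() or len(digits) > 3:
        raise ValueError("this sketch handles 1 to 3 digits only")
    n = len(digits)
    return [(int(d), n - 1 - i) for i, d in enumerate(digits) if d != "0"]


def flip(factors):
    """Swap decade and unit when the decade is 20 or more."""
    out = list(factors)
    for i in range(len(out) - 1):
        (d, p), (_, q) = out[i], out[i + 1]
        if p == 1 and q == 0 and d >= 2:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out


def translate(factors):
    """Map flipped factors to a German number name."""
    words, i = [], 0
    while i < len(factors):
        digit, power = factors[i]
        nxt = factors[i + 1] if i + 1 < len(factors) else None
        if power == 2:
            words.append(UNITS[digit] + "hundert")
        elif power == 0 and nxt is not None and nxt[1] == 1:
            words.append(UNITS[digit] + "und" + TENS[nxt[0]])  # unit first
            i += 1
        elif power == 1 and digit == 1:
            if nxt is not None and nxt[1] == 0:
                words.append(TEENS[nxt[0]])
                i += 1
            else:
                words.append("zehn")
        elif power == 1:
            words.append(TENS[digit])
        else:
            words.append("eins" if digit == 1 else UNITS[digit])
        i += 1
    return "".join(words) or "null"


def german(digits):
    return translate(flip(factor(digits)))
```

Teens are not flipped because German, like English, has single lexical items for them.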

31
German number grammar (fragment)
32
Concrete example from English
Consider a machine that maps between digit
strings and their reading as number names in
English. 30,294,005,179,018,903.56 → thirty
quadrillion, two hundred and ninety four
trillion, five billion, one hundred seventy nine
million, eighteen thousand, nine hundred three,
point five six
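
A plain Python function can reproduce this mapping by splitting the integer part into groups of three digits and attaching a scale word to each group. It is a sketch, not the machine on the slides: the name `read_number` is an assumption, the output never inserts "and", and only scales up to quadrillion are covered.

```python
ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()
SCALES = ["", "thousand", "million", "billion", "trillion", "quadrillion"]


def group_name(n):
    """Number name for one three-digit group, 1..999."""
    words = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        words += [ONES[hundreds], "hundred"]
    if rest >= 20:
        words.append(TENS[rest // 10])
        if rest % 10:
            words.append(ONES[rest % 10])
    elif rest:
        words.append(ONES[rest])
    return " ".join(words)


def read_number(text):
    """Read a digit string with optional commas and decimal part."""
    integer, _, fraction = text.replace(",", "").partition(".")
    groups = []  # three-digit groups, least significant first
    while integer:
        groups.append(int(integer[-3:]))
        integer = integer[:-3]
    if len(groups) > len(SCALES):
        raise ValueError("number too large for this sketch")
    parts = [
        (group_name(g) + " " + SCALES[i]).strip()
        for i, g in reversed(list(enumerate(groups)))
        if g
    ]
    if not parts:
        parts = ["zero"]
    if fraction:
        parts.append("point " + " ".join(ONES[int(d)] for d in fraction))
    return ", ".join(parts)
```

Digits after the decimal point are read one by one, matching "point five six" in the example.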
33
566 states and 1492 arcs
34
Some notes on finite-state syntax
  • Finite-state techniques have also been applied in
    the description of natural language syntax.
  • One approach is to use finite-state acceptors to
    recognize local patterns in sentences (e.g. Local
    Grammars: Mohri, Mehryar. 1994. Syntactic
    Analysis by Local Grammars Automata. COMPLEX
    '94, Budapest).
  • E.g.
    • Correct sequence of auxiliaries in English
      • John may have been eating
      • Mary would not have been running away
      • Joe could have been being run over by a truck
    • Structure of small noun phrase
      • the dog
      • the large cat
      • my three elephants
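
A local pattern like the auxiliary sequence can be written as a small acceptor. The sketch below is a deliberately narrow fragment that only covers sequences starting with a modal (modal, optional not, have, been, being); the state numbering, the transition table, and the helper names are assumptions, and words that are not auxiliaries are simply skipped.

```python
MODALS = {"may", "might", "can", "could", "will", "would",
          "shall", "should", "must"}

# Transition table of the acceptor: (state, label) -> next state.
TRANSITIONS = {
    (0, "MODAL"): 1,
    (1, "not"): 2,
    (1, "have"): 3, (2, "have"): 3,
    (3, "been"): 4,
    (4, "being"): 5,
}
FINALS = {1, 2, 3, 4, 5}


def aux_class(word):
    """Map a word to its arc label, or None if it is not an auxiliary."""
    w = word.lower()
    if w in MODALS:
        return "MODAL"
    return w if w in {"not", "have", "been", "being"} else None


def accepts_auxiliaries(sentence):
    """True if the auxiliaries in the sentence occur in a licensed order."""
    state = 0
    for word in sentence.split():
        label = aux_class(word)
        if label is None:
            continue  # ignore everything outside the local pattern
        state = TRANSITIONS.get((state, label))
        if state is None:
            return False
    return state in FINALS
```

The three example sentences above are accepted, while a scrambled order such as "John have may been eating" is rejected.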

35
  • Another is to use intersection grammars
    (Voutilainen, Atro. 1994. Designing a Parsing
    Grammar. Technical Report 22. University of
    Helsinki). Schematically:
    • Use a lexicon to build a lattice of possible
      lexical analyses for a sentence:
      • The cans hold tuna
      • cans could be a plural noun or a third-person
        singular verb
    • Intersect the lattice with automata implementing
      syntactic constraints, e.g.,
      • a verb can't follow a determiner, so the/dt
        cans/vb is disallowed.
  • Yet another is to use cascades of transducers.
    • Steven Abney. 1996. Partial Parsing via
      Finite-State Cascades. Journal of Natural
      Language Engineering, 2(4): 337-344.
      http://www.vinartus.net/spa/97a.pdf
    • Roche, Emmanuel. 1996. Finite-State Transducers:
      Parsing Free and Frozen Sentences. Proceedings of
      the ECAI-96 Workshop on Extended Finite State
      Models of Language. Budapest.
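
The intersection-grammar idea can be sketched with sets of tag sequences standing in for the lattice and the constraint automata. The tiny lexicon, the reduced tag set (dt, nn, vb), and the second constraint `has_verb` are assumptions added for illustration; only the determiner/verb constraint comes from the slide.

```python
from itertools import product

# Hypothetical lexicon: each word lists its possible tags.
LEXICON = {
    "the": ["dt"],
    "cans": ["nn", "vb"],
    "hold": ["nn", "vb"],
    "tuna": ["nn"],
}


def lattice(words):
    """All tag sequences the lexicon allows for the sentence."""
    return set(product(*(LEXICON[w.lower()] for w in words)))


def no_verb_after_determiner(tags):
    """Constraint from the slide: dt may not be followed directly by vb."""
    return all(not (a == "dt" and b == "vb") for a, b in zip(tags, tags[1:]))


def has_verb(tags):
    """Extra, hypothetical constraint: a sentence needs at least one verb."""
    return "vb" in tags


def analyse(words, constraints):
    """Intersect the lattice with the language of each constraint."""
    paths = lattice(words)
    for constraint in constraints:
        paths &= {p for p in lattice(words) if constraint(p)}
    return paths
```

Each constraint removes paths independently, so constraints can be written and tested one at a time and then intersected in any order.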

36
Parsing with cascaded transducers: a simplified example
37
Parsing with cascaded transducers: a simplified example
38
Parsing with cascaded transducers: a simplified example
39
Parsing with cascaded transducers: a simplified example
40
Parsing with cascaded transducers: a simplified example
41
Parsing with cascaded transducers: a simplified example
42
Parsing with cascaded transducers: a simplified example
43
Parsing with cascaded transducers: a simplified example
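
Since the slide figures are not transcribed, here is a rough stand-in for what a cascade does: each level rewrites the output of the previous level, replacing a pattern of symbols with a phrase label. The levels, patterns, and tag names below are assumptions in the spirit of Abney's cascades, implemented with regular-expression substitution rather than real transducers.

```python
import re

# Each level: (label, pattern over space-terminated symbols).
CASCADE = [
    ("NP", r"\b(?:dt )?(?:jj )*(?:nn )+"),   # noun phrase
    ("PP", r"\bin NP "),                     # preposition + NP
    ("VP", r"\bvb (?:NP )?(?:PP )*"),        # verb + complements
    ("S", r"\bNP VP "),                      # clause
]


def parse(tags):
    """Apply the levels in order; return the symbol string after each level."""
    layer = " ".join(tags) + " "  # every symbol is followed by one space
    history = [layer.strip()]
    for label, pattern in CASCADE:
        layer = re.sub(pattern, label + " ", layer)
        history.append(layer.strip())
    return history


# Tags for a hypothetical sentence such as
# "the dog saw the large cat in the garden".
LEVELS = parse("dt nn vb dt jj nn in dt nn".split())
```

Each level sees only the output of the level below it, so the parse is built bottom-up in a fixed number of passes with no recursion.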
44
A review of what we've covered in this section
  • Introduced regular languages, finite-state
    acceptors, regular relations and finite-state
    transducers.
  • Talked a little bit about some of the algorithmic
    issues, such as determinization, relevant to
    finite-state machines.
  • Discussed phonology and finite-state phonology.
  • Presented an algorithm for context-dependent
    rewrite rule compilation.
  • Presented some techniques for finite-state
    morphological analysis.
  • Demonstrated the use of a particular finite-state
    toolkit, lextools, in describing some linguistic
    phenomena.
  • Took a brief look at other issues such as text
    normalization and finite-state approaches to
    syntax.

45
Reading
  • J&M 8.1