Introduction to the xfst Interface - PowerPoint PPT Presentation

Slides: 43
Provided by: anneschi
1
Introduction to the xfst Interface
  • Review
  • Introduction to Morphology
  • Relations and Transducers
  • Introduction to xfst

2
Basic Formal-Language Review
  • What is a Symbol?
  • What is an Alphabet?
  • What is a string (word)?
  • What is a Language?
  • What basic operations can be performed on Sets?
  • What basic operations can be performed on
    Languages?

3
Formal Languages and Natural Languages
  • Any set of strings is a formal language
  • L1 = { a, aa, aaa, aaaa, aaaaa, ... }
  • L2 = { zzmy, niwhiuhew, sjehuiwheu }
  • L3 = { dog, cat, elephant }
  • The systems that we write will accept or map
    words in a formal language.
  • In practical natural-language processing, we try
    to make these formal languages as close as
    possible to a natural language, e.g. Swahili.
    I.e. we try to model a natural language, as
    perfectly as possible, in our grammars.
  • We write our grammars using xfst and lexc.

4
Concatenation can form Real Words
  • Root Language: { work, talk, walk }
  • Suffix Language: { ing, ed, s }
  • The concatenation of the Suffix language after
    the Root language: { working, worked, works,
    talking, talked, talks, walking, walked, walks }

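The concatenation of two finite languages can be sketched in a few lines of Python. This is only an illustration of the set operation, not xfst itself; the function name `concatenate` is an assumption made for this example.

```python
def concatenate(first, second):
    """Return every string made of a string from `first` followed by one from `second`."""
    return {x + y for x in first for y in second}


roots = {"work", "talk", "walk"}
suffixes = {"ing", "ed", "s"}

# Suffix language concatenated after the Root language.
words = concatenate(roots, suffixes)
print(sorted(words))
```

Every root combines with every suffix, so three roots and three suffixes give nine words.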
5
Concatenation can also form Bad Words
  • Root Language: { try, plot, wiggle }
  • Suffix Language: { ing, ed, s }
  • Raw Concatenation Result/Level/Language:
    { trys, tryed, trying, plots, ploted, ploting,
    wiggles, wiggleed, wiggleing }
  • Desired Final Result/Level/Language:
    { tries, tried, trying, plots, plotted, plotting,
    wiggles, wiggled, wiggling }

6
Inuktitut
  • Parismutnngaujumaniraqlauqsimanngitjunga
  • Pari mu nngau juma nira lauq si ma nngit tunga
  • Paris = Paris
  • mut = terminalis-case
  • nngau = direction-to
  • juma = want
  • niraq = declare that
  • lauq = past
  • si = perfective
  • ma = resulting state
  • nngit = negative
  • junga = 1P pres. indic.
  • Translation: "I never said that I wanted to go to
    Paris"

7
Concatenative-Agglutinative (Aymara)
  • Lexical: utamana-kapxaraki-iwa
  • Surface: uta ma n ka p xa rak i wa
  • uta = house (noun stem)
  • ma = 2nd person possessive (your)
  • na = in (case suffix)
  • -ka = locative (also verbalizes)
  • p = plural
  • xa = perfect aspect
  • raki = also
  • -i = 3rd person present tense
  • wa = topic (primary emphasis)
  • Translation: "also, they are in your house"

8
Morphology
  • In most languages, morphemes are just
    concatenations of symbols from the alphabet of
    the language.
  • In most languages, words are just concatenations
    of morphemes.
  • But raw concatenation often gives us abstract,
    morphophonemic, not-yet-correct words.
  • There are alternations between the raw
    concatenations and the desired final words.
  • There are two challenges in natural-language
    morphology:
  • Morphotactics: describes word formation
  • Alternation: describes the mappings between raw
    concatenations and final forms
  • Both can be modeled and computed using
    finite-state methods

9
Transducers
  • Recall that finite-state transducers can map
    from one string of symbols to a different string
    of symbols.

  • Upper side: c a n t a r +Verb +PInd +1P +Sg
  • Lower side: c a n t ε ε ε o ε ε
    (ε marks the empty string)
  • We can also use transducers to map between
    abstract, not-yet-correct forms (usually built by
    simple concatenation) and correct forms.
  • Upper side: w i g g l e i n g
  • Lower side: w i g g l ε i n g

10
Regular Relations
  • A Regular Language is a set of strings, e.g.
    { cat, fly, big }.
  • An ordered pair of strings, notated <upper,
    lower>, relates two strings, e.g. <wiggleing,
    wiggling>.
  • A Regular Relation is a set of ordered pairs of
    strings, e.g.
  • { <cat+N, cat>, <fly+N, fly>, <fly+V, fly>,
    <big+A, big> }
  • Or
  • { <cat, cats>, <zebra, zebras>, <deer, deer>,
    <ox, oxen>, <child, children> }
  • The set of upper-side strings in a relation is a
    Regular Language.
  • The set of lower-side strings in a relation is a
    Regular Language.
  • A Regular Relation is a mapping between two
    Regular Languages. Each string in one of the
    languages is related to one or more strings of
    the other language.
  • A Regular Relation can be encoded in a
    Finite-State Transducer (FST).
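A finite relation can be sketched in Python as a set of (upper, lower) pairs. This is a toy model rather than a real transducer; the helper names `apply_down` and `apply_up` are chosen here to echo the xfst commands, and the `+N`, `+V`, `+A` tag spelling is an assumption about the original notation.

```python
# Each pair is (upper-side string, lower-side string).
relation = {
    ("cat+N", "cat"),
    ("fly+N", "fly"),
    ("fly+V", "fly"),
    ("big+A", "big"),
}

# Projecting either side of the relation gives a language (a set of strings).
upper_language = {upper for upper, _ in relation}
lower_language = {lower for _, lower in relation}


def apply_down(rel, upper):
    """Generation: all lower-side strings related to `upper`."""
    return {lo for up, lo in rel if up == upper}


def apply_up(rel, lower):
    """Analysis: all upper-side strings related to `lower`."""
    return {up for up, lo in rel if lo == lower}


print(sorted(lower_language))
print(sorted(apply_up(relation, "fly")))
```

Looking up "fly" returns two upper-side strings, which shows that one string can be related to more than one string on the other side.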

11
Relations, Analysis and Generation
  • Given a transducer (relation), and a string, we
    can see the mappings of the relation via Analysis
    and Generation

  • Apply the transducer in a downward direction to
    the upper-side string to perform Generation.
  • Upper-side string: c a n t a r +Verb +PInd +1P +Sg
  • Apply the transducer in an upward direction to
    the lower-side string to perform Analysis.
  • Lower-side string: c a n t o

12
Transducers encode Finite-State Relations
  • Let a Relation X include the ordered string pairs
  • <cantar+Verb+PInd+1P+Sg, canto>,
  • <canto+Noun+Masc+Sg, canto>
  • What is the upper-side Language of this Relation?
  • What is the lower-side Language of this Relation?
  • How can such a relation be encoded?
  • What do you get when you analyze the string
    canto?
  • What do you get when you generate from the string
    cantar+Verb+PInd+1P+Sg?

13
Rules and Infinite Relations
  • One or both of the Languages related by a
    Relation can be infinite, e.g. the relation that
    relates lower-case words to their upper-case
    versions:
  • { <a, A>, <aa, AA>, <dog, DOG>, ... }
  • The network has the arcs a:A, b:B, c:C, d:D, etc.
    (assume arcs for all other symbols in the
    alphabet).
  • Apply this network in a downward direction to the
    input string cad. What is the output?

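An infinite relation cannot be listed pair by pair, but it can be computed symbol by symbol. The Python sketch below models the one-state network as a dictionary of arcs. It assumes an alphabet of the 26 ASCII lower-case letters, and the function names are hypothetical.

```python
import string

# One arc per letter: a:A, b:B, c:C, ... (upper side : lower side).
ARCS = {letter: letter.upper() for letter in string.ascii_lowercase}
INVERSE_ARCS = {lower: upper for upper, lower in ARCS.items()}


def apply_down(word):
    """Follow one arc per input symbol; return None if a symbol has no arc."""
    if any(symbol not in ARCS for symbol in word):
        return None
    return "".join(ARCS[symbol] for symbol in word)


def apply_up(word):
    """Run the same arcs in the opposite direction."""
    if any(symbol not in INVERSE_ARCS for symbol in word):
        return None
    return "".join(INVERSE_ARCS[symbol] for symbol in word)


print(apply_down("cad"))
print(apply_up("DOG"))
```

Because the network loops on a single state, it accepts input strings of any length, which is what makes the relation infinite.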
14
Alternation Rules
  • We will write finite-state rules to describe
    alternations between abstract morphophonemic
    words and well-formed surface words.
  • These rules compile into finite-state transducers
    (relations) that can be used to compute these
    mappings.
  • Typically the upper language of a rule FST is the
    Universal Language, the set of all possible
    strings.
  • Typically the lower language is like the upper
    language, except for the alternations controlled
    by the rule.
  • Strings that don't match the rule are mapped
    unchanged.

15
Rule Application Composition
  • Composition is an operation that merges two
    transducers vertically. Let X be a transducer
    that contains the single ordered pair <dog,
    chien>. Let Y be a transducer that contains
    the single ordered pair <chien, Hund>. The
    composition of X over Y, notated X .o. Y, is the
    relation that contains the ordered pair <dog,
    Hund>.
  • Composition merges any two transducers. If the
    shared middle level has a non-empty intersection,
    then the result will be a non-empty relation.
  • Rule application is done via composition.
  • Composition is a difficult topic that we will
    return to many times. Read pp 28-34 and do
    exercise 1.10.3 on page 37.
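For finite relations, composition can be sketched in Python as a join on the shared middle level. This illustrates the definition only and is not how xfst composes networks internally; `compose` is a name chosen for this example.

```python
def compose(x, y):
    """Relate the upper string of `x` to the lower string of `y`
    whenever the lower side of `x` matches the upper side of `y`."""
    return {
        (upper, lower)
        for upper, middle_x in x
        for middle_y, lower in y
        if middle_x == middle_y
    }


X = {("dog", "chien")}
Y = {("chien", "Hund")}

print(compose(X, Y))  # X .o. Y
print(compose(Y, X))  # Y .o. X: the middle levels share nothing
```

Swapping the arguments gives an empty relation here, which is why composition counts as an ordered operation.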

16
Review Basic Concepts
  • Language: a set of strings/words
  • Regular Language: a set of strings/words that can
    be generated using concatenation, union,
    iteration and similar operations
  • Simple Finite-State Automaton (Acceptor): a
    finite-state machine that accepts/recognizes a
    regular language
  • Regular Relation: a mapping between two regular
    languages
  • Finite-State Transducer (FST): a two-level
    finite-state automaton that maps between two
    regular languages (performs look-up and
    generation)

17
Regular Expressions
  • A compact formula for describing a regular
    language or regular relation.
  • The regular-expression language is a
    metalanguage.
  • Think of regular expressions as the programming
    language of xfst.
  • Each implementation of regular expressions is
    slightly different (Python, Perl, emacs, ...).
  • We will have to learn the Xerox flavor of regular
    expressions as used in xfst.

18
Regular Expressions Denoting a Language
  • A Regular Expression describes a Regular Language.
  • A Regular Expression compiles into a Finite-State
    Automaton (acceptor).
  • The Finite-State Automaton accepts/recognizes the
    Regular Language.

19
Regular Expression Denoting a Relation
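  • A Regular Expression describes a Regular Relation.
  • A Regular Expression compiles into a Finite-State
    Transducer.
  • The Finite-State Transducer maps between the
    strings of the Regular Relation.
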
20
Introduction to xfst
  • xfst is an interface giving access to the
    finite-state operations (algorithms such as
    union, concatenation, iteration, intersection).
  • xfst includes a powerful and efficient
    regular-expression compiler.
  • xfst includes the lookup operation (apply up)
    and the generation operation (apply down) so
    that we can test our networks. For small
    examples, we can also print out all the words in
    the language using the command print words.
  • We have to learn the Xerox regular-expression
    metalanguage.

21
Xerox Regular-Expression Operators I
  • a   a simple symbol
  • c a t   a CONCATENATION of three symbols
  • [c a t]   grouping brackets
  • ?   denotes any single symbol
  • %+Noun or "+Noun"
  • %+Verb or "+Verb"
  • %+Adj or "+Adj"
  • single symbols with multicharacter print names
    (aka "multicharacter symbols")
  • cat   Beware: this will be compiled by xfst as a
    single multicharacter symbol
  • {cat}   "explosion brackets"; equivalent to
    [c a t]

22
Xerox Regular Expression Operators II
  • 0 and []   two ways to denote the empty
    (zero-length) string
  • Now, where A and B are arbitrarily complex
    regular expressions:
  • [A]   bracketing; equivalent to A
  • A | B   union
  • (A)   optional; equivalent to [A | 0]
  • A & B   intersection
  • A B   concatenation (N.B. the space between A and
    B)
  • A - B   subtraction

23
Xerox Regular-Expression Operators III
  • A*   Kleene star; zero or more iterations
  • A+   Kleene plus; one or more iterations
  • ?*   The Universal Language
  • ~A   The complement of language A; equivalent to
    [?* - A]
  • ~[?*]   The empty language (i.e. it contains no
    strings at all, not even the zero-length
    string)
  • %+   the literal plus-sign symbol
  • %*   the literal asterisk symbol
  • and similarly for %?, %(, %), and the other
    special characters.

24
Denoting Relations
  • A .x. B   the cross-product; relates every
    string in A to every string in B, and vice
    versa, e.g.
  • [g o] .x. [w e n t]   relates "go" and "went"
  • a:b   shorthand for [a .x. b]
  • %+Pl:s   shorthand for [%+Pl .x. s]
  • %+Past:{ed}   shorthand for [%+Past .x. [e d]]
  • %+Prog:{ing}   shorthand for [%+Prog .x. [i n g]]

25
Useful Abbreviations
  • $A   denotes the language of all strings that
    contain A; equivalent to [?* A ?*], e.g.
  • $b   denotes the language of all strings that
    contain a b anywhere
  • A/B   denotes the language of all strings in A,
    ignoring any strings from B, e.g.
  • a*/b   contains "a", "aa", "aaa", "ba", "ab",
    "aba", ...
  • \A   any single symbol, minus strings in A; i.e.
    [? - A], e.g.
  • \b   denotes any single symbol, except a b
  • Beware: \A is NOT to be confused with
  • ~A   the complement of A; i.e. [?* - A]

26
Basic xfst interface commands
  • UnixPrompt xfst
  • xfst> help
  • xfst> help union net
  • xfst> exit
  • xfst> read regex [d o g | c a t];
  • xfst> read regex < myfile.regex
  • xfst> apply up dog
  • xfst> apply down dog
  • xfst> pop stack
  • xfst> clear stack
  • xfst> save stack myfile.fsm

27
xfst saves networks in a LIFO stack
  • xfst> read regex [d o g | c a t];
  • or
  • xfst> read regex < myfile.regex
  • causes the compiled network to be pushed onto
    the stack. When you type
  • xfst> pop stack
  • the top network is popped off the stack and
    discarded. When you type
  • xfst> apply up dog
  • the top network on the stack is applied in an
    upward direction (lookup) on the string dog,
    and the related string or strings are printed.
    When you type
  • xfst> clear stack
  • the entire stack is popped and left empty. When
    you type
  • xfst> save stack myfile.fsm
  • the contents of the stack are written in binary
    (compiled) form to the indicated file.

28
Setting Variables
  • xfst> define Myvar
  • pops the top network off of the stack and saves
    it as the value of Myvar, which can be used in
    subsequent regular expressions
  • xfst> define Myvar2 [d o g | c a t];
  • assigns a value to Myvar2 without modifying the
    stack. It is equivalent to the two commands
  • xfst> read regex [d o g | c a t];
  • xfst> define Myvar2
  • xfst> undefine Myvar
  • undefines Myvar and recycles the memory

29
Using Variables in Regular Expressions
  • xfst> define var1 [b i r d | f r o g | d o g];
  • xfst> define var2 [d o g | c a t];
  • You can now use var1 and var2 in subsequent
    regular expressions
  • xfst> define var3 var1 | var2;
  • xfst> define var4 var1 var2;
  • xfst> define var5 var1 & var2;
  • xfst> define var6 var1 - var2;

30
Performing network operations on the stack
  • xfst> read regex [d o g | c a t];
  • xfst> read regex [m o u s e | r a t];
  • xfst> read regex [d e e r | s q u i r r e l];
  • xfst> union net
  • union net will pop its arguments off of the
    stack one at a time, perform the union operation,
    and push the result back onto the stack, leaving
    just one network on the stack. Enter the command
    "words" to see the resulting language.
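The stack behaviour described above can be imitated with a Python list, using plain sets of strings in place of compiled networks. This is a rough sketch; the functions `read_regex` and `union_net` are stand-ins named after the xfst commands, not real bindings.

```python
stack = []  # the end of the list is the top of the stack


def read_regex(words):
    """Stand-in for compiling a regular expression: push a finite language."""
    stack.append(set(words))


def union_net():
    """Pop every network off the stack, union them, and push the single result."""
    result = set()
    while stack:
        result |= stack.pop()
    stack.append(result)


read_regex({"dog", "cat"})
read_regex({"mouse", "rat"})
read_regex({"deer", "squirrel"})
union_net()

print(len(stack))
print(sorted(stack[-1]))
```

Union does not depend on argument order, so the order in which the three networks were pushed makes no difference to the result.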

31
The xfst Stack
  • Assume that two networks have already been pushed
    onto the stack.
  • If we then invoke a stack-based operation like
    union net, the xfst algorithm pops its first
    argument from the top of the stack, then the
    second argument:
  • NetA | NetB
  • Stack, from top to bottom: NetA, NetB

32
Remember that the stack is last-in, first-out
(LIFO)
  • Ordered operations like minus net and compose
    net are often difficult to get right. E.g.
    assume that we want to compute A - B on the
    stack. Try this:
  • xfst> define A [d o g | c a t | m o u s e |
    r a t];
  • xfst> define B [d o g | m o u s e |
    e l e p h a n t];
  • Now push the arguments onto the LIFO stack in the
    right order and invoke minus net. If you have
    a defined variable X, you can push its value onto
    the stack using
  • xfst> push X
  • or
  • xfst> read regex X;
  • What is your answer? Type "words" to see the
    language of the resulting network.

33
The xfst Stack and Ordered Operations
  • To perform NetA - NetB, the B net must be pushed
    onto the stack first, then the A net, so that
    they can be popped off in the reverse order.
  • When performing operations on the stack, try to
    visualize the stack itself.
  • NetA - NetB
  • Stack, from top to bottom: NetA, NetB

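The effect of push order on an ordered operation can be checked with a small Python model in which a list plays the stack and sets play the networks. The function `minus_net` is a hypothetical stand-in that follows the popping order described above: first argument from the top, then the second.

```python
def minus_net(stack):
    """Pop the first argument from the top, then the second, and push first - second."""
    first = stack.pop()
    second = stack.pop()
    stack.append(first - second)


A = {"dog", "cat", "mouse", "rat"}
B = {"dog", "mouse", "elephant"}

# B is pushed first and A last, so A sits on top and is popped first: A - B.
right_order = [B, A]
minus_net(right_order)
print(sorted(right_order[-1]))

# Pushing A first computes B - A instead.
wrong_order = [A, B]
minus_net(wrong_order)
print(sorted(wrong_order[-1]))
```

Comparing the two printed results shows how easily the wrong push order produces a different language without any error message.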
34
A little concatenation example
  • xfst> define Root [w a l k | t a l k |
    w o r k];
  • xfst> define Prefix [0 | r e];
  • xfst> define Suffix [0 | s | e d | i n g];
  • xfst> read regex Prefix Root Suffix;
  • xfst> words
  • xfst> apply up walking
  • Try to get the same result by starting with the
    same three definitions and then pushing them on
    the stack, invoking concatenate net to perform
    the concatenation. Remember that concatenation
    is an ordered operation.

35
xfst file types
  • Regex files contain only a regular expression,
    terminated with a semicolon and newline.
  • xfst> read regex [d o g | c a t];
  • xfst> read regex < myfile.regex
  • Binary files contain an already compiled network
    or networks, e.g.
  • xfst> save stack myfile.fsm
  • xfst> load stack myfile.fsm
  • Script files contain a list of xfst commands;
    run them with source
  • xfst> source myfile.script

36
The Simplest Replace Rules
  • Replace rules are a very powerful extension to
    the regular-expression metalanguage. Here is the
    simplest kind needed for the kaNpat and
    Portuguese-pronunciation exercises. The arrow ->
    is typed as a hyphen followed by a right
    angle-bracket. The || operator consists of two
    vertical bars typed together. The _ is the
    underscore.
  • Rule Schema: upper -> lower
  • upper -> lower || leftcontext _ rightcontext
  • E.g.
  • xfst> read regex s -> z || [a | e | i | o | u]
    _ [a | e | i | o | u];
  • xfst> apply down casa
  • What is this rule intended to do? What comes out?
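A context-restricted replacement of this kind can be approximated with Python's `re` module, using a lookbehind for the left context and a lookahead for the right context. This is an approximation under the assumption that the rule rewrites s as z between two vowels; it does not build a transducer and only covers the downward direction.

```python
import re

VOWELS = "aeiou"

# The lookbehind and lookahead test the contexts without consuming them,
# so only the s itself is rewritten.
RULE = re.compile(rf"(?<=[{VOWELS}])s(?=[{VOWELS}])")


def apply_down(word):
    """Rewrite every s that stands between two vowels as z."""
    return RULE.sub("z", word)


for word in ["casa", "sala", "casas", "pasta"]:
    print(word, "->", apply_down(word))
```

Words in which s is word-initial, word-final or next to a consonant pass through unchanged, mirroring the way a rule transducer maps non-matching strings to themselves.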

37
kaNpat example
  • Assume a language that joins morpheme kaN (with
    an underspecified nasal N) and morpheme pat into
    the underlying or morphophonemic form kaNpat.
    This language then has alternation rules that
    dictate that N, when followed by p, gets realized
    as m. And p, when preceded by m, gets realized
    as m. The derivation looks like
  • Underlying input: kaNpat
  • Rule1: N -> m || _ p
  • Output of Rule1: kampat
  • Rule2: p -> m || m _
  • Output of Rule2: kammat
  • The composition operation (.o.) reduces the
    derivational cascade of transducer networks into
    a single transducer network.
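The derivation above can be traced step by step in Python, treating each rule as a function from strings to strings and the cascade as function composition. The regular expressions are an approximation of the two rules with lookahead and lookbehind contexts, and the names `rule1`, `rule2` and `cascade` are chosen for this sketch.

```python
import re


def rule1(word):
    """N is realized as m when followed by p."""
    return re.sub(r"N(?=p)", "m", word)


def rule2(word):
    """p is realized as m when preceded by m."""
    return re.sub(r"(?<=m)p", "m", word)


def cascade(word):
    """Feed the output of Rule1 into Rule2, like Rule1 composed over Rule2."""
    return rule2(rule1(word))


print(rule1("kaNpat"))    # intermediate form
print(cascade("kaNpat"))  # final surface form
```

The single `cascade` function hides the intermediate form in the same way the composed network does.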

38
Your first cascade of rules
  • xfst> define Rule1 [N -> m || _ p];
  • xfst> define Rule2 [p -> m || m _];
  • xfst> read regex Rule1 .o. Rule2;
  • xfst> apply down kaNpat
  • What is the output?
  • Now restart (with clear stack), define the two
    Rules as shown above, push them on the stack in
    the right order, and perform the composition on
    the stack using compose net. What is your
    result? (Remember that the networks must be
    pushed in the right order.)

39
Rule Abbreviations
  • Multiple left-hand sides, separated by commas:
  • b -> p, d -> t, g -> k || _ .#.
  • Multiple right-hand sides, separated by commas:
  • e -> i || _ (s) .#. , .#. p _ r
  • Use .#. to refer to either the very beginning or
    the very end of a word.
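A rule with several left-hand sides sharing one word-boundary context can be sketched in Python with a lookup table and an end-of-string anchor. The sketch assumes the first abbreviation is the rule that rewrites b, d and g as p, t and k at the very end of a word; the table name `DEVOICE` is made up for this example.

```python
import re

# One entry per left-hand side of the combined rule.
DEVOICE = {"b": "p", "d": "t", "g": "k"}

# \Z anchors the match at the very end of the string (the word boundary).
FINAL_VOICED = re.compile(r"[bdg]\Z")


def apply_down(word):
    """Devoice a word-final b, d or g; leave every other position alone."""
    return FINAL_VOICED.sub(lambda match: DEVOICE[match.group()], word)


for word in ["hund", "tag", "baden", "dieb"]:
    print(word, "->", apply_down(word))
```

Only the last symbol is ever rewritten, because the anchor plays the role of the boundary symbol on the right of the underscore.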

40
Typing Accented Letters in xemacs
  • The COMPOSE key is to the right of the space bar.
  • COMPOSE a " yields ä
  • COMPOSE a ' yields á
  • COMPOSE a ` yields à
  • COMPOSE a ^ yields â
  • COMPOSE a ~ yields ã
  • COMPOSE c , yields ç

41
A Trick for Testing Multiple Words
  • The exercise will ask you to write a cascade of
    rules that map orthographical strings to
    something more like a phonemic notation.
  • Type the test words into a file, e.g. wordlist
  • Compile your rules, compose them, and put the
    result on the stack
  • Test all the words using the following syntax:
  • xfst[1]: apply down < wordlist

42
Assignment
  • Read Chapter 2 (The Systematic Introduction) when
    you can.
  • For hands-on practice right now start reading
    Chapter 3, doing the examples as you go along.
  • Do the kaNpat exercise in section 3.5.3 and the
    Southern Brazilian Portuguese exercise in 3.5.4
    (p. 134).
  • Progress to Bambona (p. 140) and Monish (p. 146)
    if you can.