Transforming the Representation of Lexical Knowledge

Slides: 68
Provided by: Mann5
Learn more at: https://nlp.stanford.edu
Transcript and Presenter's Notes

1
Transforming the Representation of Lexical
Knowledge
  • Christopher Manning
  • University of Sydney
  • http://www.sultry.arts.usyd.edu.au/cmanning/

2
Project Objectives
  • Aims of the project
  • examining the richness of lexical structure, in
    particular the connotational and figurative use
    of words
  • providing innovative ways for representing a
    dictionary, through creative use of the medium of
    computers
  • augmenting dictionaries from corpora
  • providing practical, educationally useful
    programs as a result (at low labor cost)
  • Main initial target: an interactive front end for
    exploring or using the Warlpiri dictionary.

3
Acknowledgements
  • Ken Hale, Mary Laughren, Jane Simpson, Robert
    Hoogenraad, David Nash, Kay Ross
  • Kevin Jansz, Nitin Indurkhya, Katrina Avila
  • Susan Poetsch, Miriam Corris, John Henderson
  • (and many others)

4
Talk Outline
  • The research agendas
  • Dictionary usability and usefulness
  • Kirrkirr: a Warlpiri dictionary browser
  • Underlying data
  • User interface and visualization
  • Corpus enrichment for terminology sets
  • User study

5
Research Program: Lexicon
  • A lexicon is not just words but a vast network of
    associations between words and within and across
    the concepts represented by words
  • The aim of this work is to provide people with a
    better understanding of this conceptual map.
  • E.g., patterns of figurative extension: in a song
    about a stockman driving a car, "glass" is used
    first for the windscreen, and then metaphorically
    for sexual attraction, using a systematic
    pattern of figuration between shining and sex

6
Lexicon (cont)
  • Traditional paper dictionaries offer very limited
    ways for making such networks visible
  • On a computer, one can imagine all sorts of ways
    of bringing out such relationships

7
Research: Computational Lexicography
  • Dictionaries on computers are now commonplace
  • But there has been little attempt to utilize the
    potential of the new medium
  • Goal: fun dictionary tools that are effective for
    language learning, browsing, and research
  • Special interest: dictionaries for minority
    languages. Here economic, motivational, and user
    support reasons all point to an important role
    for computers.

8
MRD Structure
  • The internal structures of current Machine
    Readable Dictionaries usually merely mimic the
    structure of the printed form (Boguraev 1990)
  • Some work, notably WordNet (Miller 1995), has
    involved a fundamental rethinking of dictionary
    content and organization (here, organization via
    synsets which are related via links of part,
    subkind, opposite)
  • But this research hasn't been taken to users.

9
Research Program: Education
  • Dictionary structure and usability are often
    dictated by professional linguists, while the
    needs of others (speakers, semi-speakers, young
    users, second language learners) are not met
  • Weiner (1994): The initial purpose of the OED
  • to create a record of vocabulary so that English
    literature could be understood by all. But
    English scholarship grew up and lexicography grew
    with it, inevitably parting company with the man
    in the street.
  • Challenge is to avoid this.

10
Dictionary usefulness and usability
  • Kegl (1995), Machine-Readable Dictionaries and
    Education:
  • "Originally, this paper was intended as a survey
    of educational applications using MRDs. As far as
    I have been able to determine, no such
    applications currently exist."
  • Standard dictionaries are reference works,
    ill-suited for use as learning tools
  • Studies of American dictionary skills training
    show that many tasks achieve little in the way of
    education (but do teach word lookup!)

11
Educational value of dictionaries
  • However, derived lexical information is useful!
  • Think of a high school foreign language textbook
  • terminology sets
  • pictures with parts named
  • vocabulary lists
  • word explications
  • Major issue:
  • Not many people sit around reading dictionaries;
    need something fun

12
Data on usability: evaluating a paper dictionary
  • Study of paper dictionary usability by Susan
    Poetsch, tested using Alawa dictionary (draft by
    Margaret Sharpe)
  • In the community, old people are very concerned
    to keep language strong, and help as volunteers
    in bilingual education. They are keen on a
    dictionary
  • However, they lack the literacy skills to use it
  • Susan worked with people aged 25 to 50
  • Since they were volunteers, they probably have
    better than average literacy skills for the
    community

13
Findings
  • Not very literate: a big dictionary is
    overwhelming to someone with emerging literacy
    skills
  • People knew words are ordered but could not use
    ordering effectively (restart or flick randomly)
  • Often around 3 minutes per word lookup
  • People lost their place on the page regularly
  • An overcrowding of information is confusing
  • One-word correspondences are easiest for users,
    but often unrealistic linguistically
  • Subentries were confusing; part of speech was
    puzzling

14
Findings (2)
  • Regular dictionary users (especially, compilers!)
    grossly underestimate the time they have spent
    becoming familiar with dictionary structure
  • If a dictionary is going to be made for a speech
    community, then the people in that community need
    to feel confident in using it.
  • Teachers felt that the draft dictionary is too
    long and detailed for school use
  • Conclusion: these people need a different
    dictionary ("My First Alawa")
  • Would probably be used by adults as well as kids

15
Initial focus: Kirrkirr, a Warlpiri browser
  • Warlpiri is an Australian Aboriginal language
    spoken in the Tanami desert (NW of Alice)
  • A computer interface for browsing the Warlpiri
    dictionary.
  • Rich lexical materials have been collected by
    linguists over decades (Hale's fieldwork from
    1959 on, MIT Lexicon Project in the 1980s)
  • The results still haven't been produced in a
    format usable by the community (only printouts)
  • Previous computer projects have faltered

16
Past Problems
  • At least 15 years have passed during which the
    Warlpiri dictionary could have been tested,
    people trained in dictionary use, and the
    dictionary improved with user input, but all that
    has been produced is one badly formatted raw
    paper printout
  • Huge amounts of human labor have been expended
  • Information systems 101: need to deliver, and
    provide the kind of process automation to make
    production and revisions easy

17
Our educational goals
  • Aim at school kids
  • Information seeking is a complex process which
    is often not attended to in K-12 education
    (Wallace et al. 1998)
  • Provide learner supports for getting started with
    dictionaries
  • Adaptable interface can cater to different needs
  • Support for active reading by allowing note
    taking
  • An interface where you can see words, but are not
    required to know words

18
Target user community
19
Kirrkirr: A Warlpiri dictionary browser
  • (Jansz 1998; Jansz, Manning and Indurkhya 1999)
  • An environment for the interactive exploration of
    dictionaries.
  • The design is general, but our current work has
    just been with Warlpiri (Arrernte coming soon!)
  • Attempts to more fully utilize graphical
    interfaces, hypertext, multimedia, and different
    ways of indexing and accessing information
  • Written in Java, it can either be run over the
    web (high bandwidth) or run locally (here Java's
    main advantage is cross-platform support).

20
Specific goals
  • An interactive environment that encouraged
    exploration: easy and fun to use
  • Reduction of the dependence on alphabetical order
  • Catering to the needs of different user groups
    (kids, teachers, professionals)
  • Flexible enough to display appropriate
    information in appropriate ways depending on user
    level

21
Overview
  • Kirrkirr provides various modules:
  • Graph layout of word relationships
  • Formatted dictionary entries
  • Semantic domain browsing
  • A notes facility for jotting in the margin
  • Multimedia: audio, pictures
  • Advanced searching interfaces
  • Others in planning: colors, figuration patterns
  • These attempt to cater to users with different
    competence levels

22
(No Transcript)
23
The lexical database
  • Existing materials are stored in an ad hoc format
    of markup using backslash codes with some (rather
    odd) nesting of structural tags
  • These were converted to XML using an
    error-correcting stack-based parser (written in
    Perl).
  • The inconsistency and flexibility of dictionary
    entries actually made this a surprisingly
    difficult task.
  • But the parser tries to impose data integrity
  • Use of XML gives a clear structure to the data,
    and makes available many (free) tools

24
XML
  • XML: a descendant of SGML for structured markup
    of text
  • XML separates the structure of the data from its
    presentation
  • Much of the recent enthusiasm for XML has
    centered around representing simple and rigid
    structures such as database records
  • The rich hierarchical and variable structure of
    dictionary entries is really more what something
    like XML excels at!
  • Result remains a portable, tangible text file

25
Alternative: a database
  • The obvious thing for storing a lot of data
  • Has clear advantages: structure, indexing, query
    language, relationships, integrity.
  • Many people have suggested using a database for
    lexical data and some have actually done it
    (IITLEX, Austin and Nathan)
  • But in general lexicographers oppose the
    rigidity, and, in practice, standard relational
    databases are quite ill-suited to dictionaries

26
Problems
  • Dictionary entries vary enormously
  • word cross-reference
  • word POS gloss example translation
  • word dialects sense-1 POS1 definition gloss
    example translation example translation sense-2
    POS2 dialect definition example translation
    subentry-word gloss synonym etymology
  • Data is fragmented
  • Same element can appear at many levels (dialect,
    cross-reference, ...)
  • Dictionaries are only loosely structured
  • Database model is inflexible to extending the
    dictionary structure
  • Lessens portability
  • Answer: an object database

27
XML indexing
  • XML is a middle ground between the structure,
    indexing, etc. of a database, and the freedom of
    a word processor.
  • To improve speed, an ad hoc index to the XML file
    is built. It can be used for rapid headword and
    gloss lookup, and for indexing which parts of the
    XML file to process.
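
One way such an index could work is sketched below, under stated assumptions: the element names (DICTIONARY, ENTRY, HW, GL) are hypothetical, and the real Kirrkirr index format is not described in these slides. The idea is to scan the file once, record the byte offset and length of each entry keyed by headword, and later seek straight to one entry and parse only that slice.

```python
import io
import re
import xml.etree.ElementTree as ET

ENTRY_RE = re.compile(rb"<ENTRY>.*?</ENTRY>", re.DOTALL)
HW_RE = re.compile(rb"<HW>(.*?)</HW>")

def build_index(data):
    """Map each headword to the (offset, length) of its entry in the file."""
    index = {}
    for match in ENTRY_RE.finditer(data):
        headword = HW_RE.search(match.group(0))
        if headword:
            key = headword.group(1).decode("utf-8")
            index[key] = (match.start(), match.end() - match.start())
    return index

def lookup(stream, index, headword):
    """Seek to one entry and parse only that slice of the file."""
    offset, length = index[headword]
    stream.seek(offset)
    return ET.fromstring(stream.read(length))

# Stand-in for the dictionary file, using two words glossed in this talk.
DATA = (b"<DICTIONARY>"
        b"<ENTRY><HW>karli</HW><GL>boomerang</GL></ENTRY>"
        b"<ENTRY><HW>jarntu</HW><GL>pet dog</GL></ENTRY>"
        b"</DICTIONARY>")
index = build_index(DATA)
found = lookup(io.BytesIO(DATA), index, "jarntu")
```

A gloss index could be built the same way by keying on the GL element instead of HW.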

28
Visualization of dictionary information
  • For applications with simple textual content
    behind them, there is little that can be done but
    an on-line reflection of a printed page
  • But we want more than just definitions of words:
    we want to know their relationships to other
    words, and the patterning in these relationships
  • In a computational approach, the interface can
    mediate between the lexical data and the user
  • The interface can select from and choose how to
    present information (according to the user's
    preferences) in many different ways

29
Previous work
  • Current systems present the search-dominated
    interface of classic Information Retrieval
    systems: you type a word in a search box
  • Results try to mimic, but are generally inferior
    to, the printed version of the dictionary
  • Good feature: rapid searching
  • These systems do little to utilize the
    captivating qualities of computers:
    interactivity, user control and adaptability
    (Brown 1985).

30
Previous work (2)
  • Only effective when the user has a clearly
    specified information need; even here, we are
    ignoring the distinction between information
    gained and knowledge sought (Sharpe 1995)
  • Lack browsing, and chances for incidental or
    curiosity-driven learning
  • Lack tangibility and situatedness of paper:
    ineffective for getting an idea of a collection
  • We wish to exploit the essence of hypertext,
    which is click-to-explore browsing

31
Previous work (3)
  • Little research work (in corpus linguistics,
    visualization etc.) on dictionary visualization
  • WordNet built a rich network of relationships,
    which fundamentally departed from the paper
    dictionary tradition, and has been used in many
    computational projects
  • However, very little has been done in the way of
    interfaces that make these relationships visible
    and intelligible to users.
  • Graphical representations seem particularly
    important given our target users.

32
MRD Interfaces: WordNet
33
Graph-based visualization
  • There is a little previous work on graphical
    representations of dictionaries
  • For instance, the Visual Thesaurus by Plumb
    Design, derived from WordNet
  • But it is also a good demonstration of how
    chaotic and confusing graphical interfaces can
    become.

34
Perils of visualization
35
Graph-based visualization
  • (Jansz 1998; Jansz, Manning and Indurkhya 1999)
  • Classic graph layout problem
  • Adapts work by Eades et al. (1998) and Huang et
    al. (1998) on visualization and navigation of WWW
    document linkages
  • Uses the spring algorithm. Big advantage is that
    it is an iterative updating algorithm, and so
    gives easy interactivity
  • it wiggles and people can play with it.
  • Clarity and simplicity of graph: software
    maintains a set of focus nodes to prevent
    overcrowding
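
To make the "iterative updating" point concrete, here is a small Python sketch of one update step of a spring layout. It is a simplification and not the Kirrkirr code: it uses linear springs on linked words and an inverse-square push between every pair, and the constants, the circular starting positions and the example links are all assumptions made for illustration. Because each call moves the nodes only a little, an interface can redraw after every step, which is what makes the graph wiggle and respond to dragging.

```python
import math

def spring_step(pos, edges, rest=1.0, spring=0.1, repulse=0.1, step=0.5):
    """Return new node positions after one iteration of a spring layout."""
    force = {n: [0.0, 0.0] for n in pos}
    nodes = list(pos)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            d = math.hypot(dx, dy) or 1e-9  # avoid division by zero
            # Every pair pushes apart; linked pairs are also pulled
            # toward the rest length of their spring.
            pull = -repulse / d ** 2
            if (a, b) in edges or (b, a) in edges:
                pull += spring * (d - rest)
            fx, fy = pull * dx / d, pull * dy / d
            force[a][0] += fx
            force[a][1] += fy
            force[b][0] -= fx
            force[b][1] -= fy
    return {n: (pos[n][0] + step * force[n][0],
                pos[n][1] + step * force[n][1]) for n in pos}

def circle_start(nodes):
    """Deterministic starting layout: nodes evenly spaced on a unit circle."""
    k = len(nodes)
    return {n: (math.cos(2 * math.pi * i / k), math.sin(2 * math.pi * i / k))
            for i, n in enumerate(nodes)}

def distance(pos, a, b):
    return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])

# Illustrative links only: karli (boomerang) joined to two related verbs.
words = ["kijirni", "karli", "jarnti"]
links = {("karli", "kijirni"), ("karli", "jarnti")}
layout = circle_start(words)
for _ in range(300):  # an interactive display would redraw after each step
    layout = spring_step(layout, links)
```

After the loop, the two unlinked words end up further from each other than either is from the word they share a link with.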

36
Educational advantages
  • Alphabetical order is important, but
  • A web of words offers other effective
    opportunities for learning
  • A student can opportunistically explore words
    that are related in various ways
  • Important semantic relationships can be
    understood

37
Kirrkirr network display
38
Kirrkirr network display
39
Formatted dictionary entries
  • Are produced automatically from the XML by using
    XSL (a style language)
  • XSL allows easy modeling of some user
    preferences.
  • Most trivially, one can leave out information
    such as part of speech, or detailed definitions
  • This is useful as many users find information
    overload quite confusing and demotivating
  • Can produce bilingual or monolingual dictionary
  • Opportunities for various output styles, and
    formats such as RTF or TeX for printing.
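
The effect of a user preference on formatting can be illustrated with a short sketch. Note the substitution: the project uses XSL stylesheets, whereas this example uses plain Python with the standard library's ElementTree, purely to show the idea of hiding fields per user level. The element names (HW, POS, GL) and the part-of-speech label are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry markup; the gloss is the one given in this talk.
ENTRY_XML = "<ENTRY><HW>karli</HW><POS>N</POS><GL>boomerang</GL></ENTRY>"

def format_entry(entry, hide=frozenset()):
    """Render one entry as text, skipping any fields the user has hidden."""
    def field(tag):
        return None if tag in hide else entry.findtext(tag)

    text = field("HW") or ""
    pos = field("POS")
    if pos:
        text += f" ({pos})"
    gloss = field("GL")
    if gloss:
        text += f": {gloss}"
    return text

entry = ET.fromstring(ENTRY_XML)
full_view = format_entry(entry)                    # for experienced users
simple_view = format_entry(entry, hide={"POS"})    # for beginning readers
```

The same entry data serves both views; only the presentation rule changes.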

40
Formatted dictionary entries
41
Rich typology of link types
  • The semantically rich types of linkage present
    in a dictionary (synonym, antonym, hyponym,
    subheadword, variant, coverbs, ...) solve one of
    the major problems of the web: we have many link
    types with a clear semantic interpretation
  • Use consistent color-coded text and edges to show
    these link types
  • Gives a richer browsing experience
  • Can tell where you are going before clicking

42
Browsing
  • Work (at PARC and elsewhere; Pirolli et al. 1996)
    has stressed the role of browsing as well as
    searching in information access
  • It provides a context for learning
  • We provide browsing in several ways:
  • conventional hypertext
  • but with rich semantically-interpreted links
  • their color-coding matches network edges
  • network-based display of words
  • browsing through semantic domains

43
Semantic Domains
  • Alphabetical order is one indexing strategy, but
    there are many others
  • Most requested is the ability to find things by
    semantic domain, e.g., food, manufactured items.
  • Essentially the noun structure of WordNet, or
    the classical KR ISA hierarchy
  • We can exploit the domain info in the dictionary

44
Semantic Domains (Katrina Avila)
45
Other components
  • Multimedia (currently pictures and audio)
  • Can hear pronunciations / see objects
  • I'm keen to put in videos of Warlpiri sign
    language
  • Advanced search page
  • search various fields, regular expressions, etc.
  • Notes: one can annotate dictionary entries (to
    correct or personalize)

46
Simple features
  • Show the alphabet
  • The list on the left gives concreteness, and
    tangibility
  • people can start with one of those words
  • One can just type a few letters and then look at
    the list: the traditional benefit of a paper
    dictionary
  • English lookup can be helpful when Warlpiri
    spelling fails

47
Fuzzy spelling
  • We expected problems with spelling
  • Literacy skills based mainly in English, which
    doesn't transfer well
  • Different sounds in Warlpiri
  • Software employs fuzzy spelling which allows
    generous matching, ignoring many distinctions
  • done on the fly with FSMs, rather than using the
    SOUNDEX strategy
  • Still not enough: e.g., one kid wrote "wanapy"
    when wanting "warnapari" (dingo), the end part of
    which knocked us out.
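
A rough sketch of on-the-fly fuzzy matching follows, using Python regular expressions as the finite-state machinery. The confusion classes below are illustrative guesses about which distinctions an English-trained speller might miss (voicing, retroflex "r", vowel length); they are assumptions, not the classes Kirrkirr actually uses. Each query is compiled into a pattern in which every letter also accepts its confusable spellings, and the pattern is then run over the headword list.

```python
import re

# Hypothetical confusion classes: spellings treated as interchangeable.
CLASSES = [
    ("p", "b"),
    ("t", "d", "rt", "rd"),
    ("k", "g"),
    ("n", "rn"),
    ("l", "rl"),
    ("r", "rr"),
    ("a", "aa"),
    ("i", "ii", "e"),
    ("u", "uu", "o"),
]
UNIT_TO_CLASS = {unit: cls for cls in CLASSES for unit in cls}
# Try two-letter units before single letters when reading the query.
UNITS = sorted(UNIT_TO_CLASS, key=len, reverse=True)

def fuzzy_pattern(query):
    """Compile a query into a regex that accepts all confusable spellings."""
    query = query.lower()
    parts, i = [], 0
    while i < len(query):
        for unit in UNITS:
            if query.startswith(unit, i):
                alts = sorted(UNIT_TO_CLASS[unit], key=len, reverse=True)
                parts.append("(?:" + "|".join(alts) + ")")
                i += len(unit)
                break
        else:
            parts.append(re.escape(query[i]))  # letter with no class
            i += 1
    return re.compile("".join(parts))

def fuzzy_lookup(query, headwords):
    pattern = fuzzy_pattern(query)
    return [h for h in headwords if pattern.fullmatch(h)]

# Headwords taken from examples elsewhere in this talk.
HEADWORDS = ["karli", "jarntu", "jarnti", "warnapari", "kurdu"]
close_miss = fuzzy_lookup("wanapari", HEADWORDS)   # retroflex "rn" written "n"
too_far = fuzzy_lookup("wanapy", HEADWORDS)        # the kid's spelling above
```

In this sketch "wanapari" finds "warnapari", but "wanapy" finds nothing, which mirrors the limitation noted above: letter-level leniency does not recover a missing ending.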

48
Adding more links: Terminology sets
  • Related words often aren't in the same domain
  • Rather, words associated with a topic
  • E.g., a dance has an associated set of words:
    clearing the ground, decorating with ochres,
    leaves, and feathers, singing, dancing
  • A concept useful to native speakers and learners
  • Such cultural information is hard to learn, and
    not normally in dictionaries or thesauri
  • Question: can terminology sets be derived
    automatically from appropriate corpora?

49
Terminology sets
  • Approach: terminology will be determined as
    medium-range collocations
  • Corpus: collection of Warlpiri stories, letters,
    books, fieldwork notes, etc. I have slightly
    over 1/4 million words of online Warlpiri
  • This is a large proportion of the Warlpiri
    available in textual form: the difference between
    fieldwork corpora and StatNLP corpora.
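
Counting medium-range co-occurrences might look like the sketch below. The window size is an assumption (the talk only says "medium range"), and the tokens are placeholders rather than real corpus text. Each unordered pair of distinct words is counted once for every time the two occur within the window of each other.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=10):
    """Count unordered pairs of distinct words within `window` tokens."""
    pairs = Counter()
    for i, word in enumerate(tokens):
        # Look only forward, so each co-occurrence is counted once.
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                pairs[frozenset((word, other))] += 1
    return pairs

# Placeholder tokens standing in for stemmed corpus words.
tokens = "w1 w2 w3 w1 w2".split()
counts = cooccurrence_counts(tokens, window=2)
```

These pair counts, together with the individual word counts, are the raw material for an association measure.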

50
Building and assessing
  • I stemmed words (to maximize fuel); Warlpiri
    also has clitics that attach to words
  • Using a Kay/Röscheisen-style approximate
    morphological analysis, plus vowel harmony
  • Collocational bonds were assessed using Dunning's
    (1993) method of log-likelihood ratios
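
For reference, the log-likelihood ratio statistic for a 2x2 contingency table can be computed as in the sketch below. The function and argument names are my own; the formula is the standard G-squared form, 2 * sum(O * ln(O / E)), where E is the count expected if the two words were independent. Here k11 counts windows containing both words, k12 and k21 windows containing only one of them, and k22 windows containing neither.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """G-squared statistic for a 2x2 table of co-occurrence counts."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2

independent = log_likelihood_ratio(10, 10, 10, 10)   # no association
associated = log_likelihood_ratio(10, 0, 0, 10)      # perfect association
```

Higher scores mean the pair co-occurs more (or less) often than chance would predict, so candidate terminology sets can be read off by ranking the neighbors of a word by this score.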

51
Results
  • For some topics (including dances,
    unfortunately), one couldn't get much out (too
    little data). Cf. Church and Hanks on "strong
    tea".
  • But for others, it works well. E.g., karli
    "boomerang":
  • ngurrjumani make, fix, repair
  • jarnti carve, trim
  • kijirni throw
  • karaly(pa) smooth
  • kurduju shield
  • maparni paint

52
Results cont.
  • warrirni seek, search, try to find
  • kurdu child, baby, young, youth
  • kurrupurda boomerang (a specific type)
  • jarntu pet dog
  • As often, the evaluation criteria are unclear,
    and susceptible to just-so stories. (Do people
    tend to sit around with their dog while carving
    their boomerang? I'm not sure.)

53
Another e.g.
  • pangurnu digging scoop
  • pangurnu
  • pili small coolamon/digging scoop
  • rdaku hole in the ground
  • kaninjarra downwards
  • pangirni dig, produce cavity
  • mulju soak in soft earth (dig for water)
  • karlaja foot end of sleeping area
  • pirrkirni scrape
  • yirrarni put down

54
User study problems
  • Since at present there is no dictionary available
    except the printed out database complete with
    markup codes, it was hard for many people to
    judge the use of the interface, since there was
    no point of comparison.
  • First impressions only: it would have been good
    to let people try it out at their leisure, but
    unfortunately this was not possible (NT Ed is all
    Macintosh, and MRJ 2.1 shipping deadlines slipped
    past our study date)

55
User study
  • Mim Corris (Yuendumu, Willowra)
  • User testing with primary and (lower) secondary
    students
  • Comments from teachers, other adults etc.
  • Purely qualitative observational study of
    dictionary use
  • (Doing anything much else would be difficult.)

56
Teachers
  • Very enthusiastic
  • Role in encouraging kids to learn Warlpiri
  • See uses for it in classroom
  • Would teach dictionary skills and concepts
  • Would also help teachers learn Warlpiri
  • they'd browse in it and learn things
  • Liked spatial layout
  • Could use as a basis for classroom activities
    (better with some further development: games and
    puzzles)

57
Elementary school kids
  • One major benefit is that it was on computer.
  • It maintained interest
  • They were enthusiastic about the computer side
    of things and negotiated the interface's various
    windows easily
  • e.g., wanted back button
  • Sometimes, working on sense relations and
    definitions was of less interest than moving
    things
  • Word list was found helpful (can compensate for
    poor spelling)

58
Older children
  • More thoughtful; had dictionary experience
  • Still really liked whole word list
  • Could use and liked synonyms and antonyms
  • Promoting subentries to entries appeared very
    effective: people enjoyed exploring and
    explaining the relation of derived terms to the
    main word (even if sometimes folk etymologies?)
  • The semantically uninterpreted "cf" link was
    still confusing
  • High school girls wanted to spend time with it!

59
Adult literacy workers
  • Less interested in graphical interface
  • Mainly interested in looking at definitions
  • Started discussing and disagreeing with them
    immediately
  • although they had and used paper printout, first
    real chance to see what was there?
  • They wanted to, and were able to, annotate the
    definitions with notes

60
Room for improvement
  • More colorful!
  • Make more interactive: there's not enough that
    kids can create
  • Some cleaning up of the user interface: fewer
    steps for searches, etc.
  • Adding in more views to the dictionary
  • e.g., search by color

61
Conclusions
  • Kirrkirr is just a prototype of what one can do
    to visualize dictionaries
  • We'd like to go beyond that and start visualizing
    patterns of metaphor and sense extensions in
    dictionaries
  • But it does show how a lot can be done to provide
    much more in the way of a dictionary interface
    which mediates between well-structured data and
    users' needs for searching/browsing and
    presentation

62
(No Transcript)
63
  • High quality dictionaries without excessive
    manual labor.
  • Terminology sets
  • Richer hypertext and multimedia

64
  • Traditional dictionaries tend not to capture
    collocational knowledge.
  • For a somewhat largish window size, collocations
    seem a good way of getting at the notion of a
    terminology set.

65
Overview
  • A project to provide an engaging, interactive
    computer front end for the Warlpiri dictionary.
  • Research goals
  • Effective innovative use of computer medium
  • Especially by dictionary visualization
  • Augmenting dictionaries from corpora
  • Assessing educational use of dictionaries
  • Educational and practical goals
  • Deliver it: information systems 101
  • Incidental learning and regular lookup

66
  • Cf. also, Atkins and Varantola (1997) IJLexicog.

67
XML vs. Databases
  • XML
  • Flexible and hierarchical structure is easy
  • There are tools for parsing and querying XML, but
    much less developed
  • Out of the box, one is basically using grep,
    perhaps with structure sensitivity
  • Portable text file
  • Databases
  • Both flexible and hierarchical structure are
    difficult and can involve use of many tables
  • A standard query language makes information
    access straightforward
  • Databases provide a lot of technology for
    indexing to allow fast retrieval
  • Less portable/tangible