Creating annotated corpora at the Alaska Native Language Center

1
Creating annotated corpora at the Alaska Native
Language Center
Worst practices in linguistic annotation
Gary Holton, University of Alaska Fairbanks
Andrea Berez, Wayne State University
EthnoER, Melbourne, 15 Feb 2006
2
Overview
  • about ANLC
  • approaches to annotation at ANLC
  • an annotation case study
  • some desiderata for an annotatable corpus

3
Alaska Native Language Center
  • Established in 1972 by state legislation as a
    center for documentation and cultivation of the
    state's 20 Native languages
  • Staff includes
    • 62 linguists
    • 4 language instructors
    • 6 research assistants
    • (1 archivist?) 1 editor
    • 1 administrative asst

4
Alaska Native Languages
  • 11 Athabascan Languages
  • 4 Eskimoan Languages
  • Aleut
  • Eyak
  • Tlingit
  • Haida
  • Tsimshian

5
Languages are endangered
  • Numbers of speakers
    • Central Yup'ik 10,000
    • Inupiaq 3100
    • all 11 Athabascan <2000
    • Eyak 1
  • Age of youngest speakers
    • in many cases >80

6
ANLC mission
  • revitalization and maintenance
  • language teaching
  • documentation
  • publication
  • archiving

7
Language revitalization
  • on-campus classes
  • summer language institutes
  • teacher training
  • technology training

8
Language teaching
  • university classes
    • Central Yup'ik
    • Siberian Yupik
    • Inupiaq
    • Gwich'in
    • Koyukon
    • Tanacross
  • individual study
    • Alutiiq, Tlingit,

9
Language documentation
  • dictionaries
  • grammars
  • texts
  • pedagogical materials

10
Publications
  • Dictionaries
  • Grammars
  • Collections Of Stories
  • Map
  • Phrasebooks
  • Curriculum Guides
  • Textbooks
  • Beginning Readers

11
Language archiving
  • Primary linguistic data archive for the 20 Alaska
    Native languages
  • Comprehensive -- nearly everything written in or
    about Alaska Native languages
  • Focus on unpublished manuscripts, field notes and
    recordings
  • Items include
    • print (>15,000 items)
    • audio (>5000 tapes)
    • digital files (??)

12
Types of data
  • field notes
  • texts
  • manuscripts
  • recordings
  • pedagogical materials
  • lexical data
  • comparative wordlists
  • etymological wordlists
  • place names

13
Related data are disassociated
  • traditional focus on print materials
    • transcript treated as primary
    • related audio can be difficult to locate
  • transcripts treated as isolated events
    • not grouped within languages
    • not related across languages
  • some data not properly archived
    • e.g., photos
  • other data not archived at all
    • e.g., speaker biographies

14
Preservation
  • cataloging
  • re-foldering, boxing
  • digitization
  • long-term storage at Arctic Region Supercomputing
    Center

15
Access
  • primary users are indigenous community members
  • physical access in Fairbanks

16
Web portal: Qenaga.org
  • piloted for Dena'ina
  • currently being extended to Yup'ik

17
Annotation at ANLC
  • individual researchers using annotation in new
    research
  • pedagogical applications
  • new focus on linking archival text and audio
  • often viewed as a presentation format

18
Shoebox
\ref GW021
\t naatthi nachedahee ey sh nxedl s_h_i eedah de.
\mr naa-tthi nachedahee ey sh n-xedl s_h_i ee-dah de
\mg INTER-water>PUNCT game.warden that meat 2SG-sled in M-sit if
\f if that meat is sitting in your sled down there, the game warden,

\ref GW022
\t nachedahee ey shax xals_en ts nintahtel sixunteh.”
\mr nachedahee ey shax xals_en ts ni-n-t-ah-teÂ-e si-xu-n-teh.”
\mg game.warden that jail to TERM-2SG-FUT-M-H-send-NOM THM-AREA-M-be
\t that game warden will surely send you to jail

\ref GW023
\t xeyehnii toteey stsuu du tsanielteh ts
\mr xe-y-eh-nii toteey stsuu du tsa-niel-teh ts
\mg HUM.PL-4OBJ-H-say>IMPF even 1SG-grandma thus THM-THM-L-angry CONJ
\t they said that to her and then grandma became angry

\ref GW024
\t “xuteey,
\mr xuteey
\mg okay
\f “okay,
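
The backslash-coded records above can be read mechanically. The following is a minimal Python sketch, not the tooling used at ANLC: the function name `parse_shoebox` and the record layout it returns are assumptions made for illustration, and the sample is copied from records GW023 and GW024 with the quotation marks omitted. It assumes that a backslash only ever introduces a field marker.

```python
import re
from collections import defaultdict

# Two records copied from the slide (quotation marks omitted).
SAMPLE = r"""\ref GW023
\t xeyehnii toteey stsuu du tsanielteh ts
\mr xe-y-eh-nii toteey stsuu du tsa-niel-teh ts
\mg HUM.PL-4OBJ-H-say>IMPF even 1SG-grandma thus THM-THM-L-angry CONJ
\t they said that to her and then grandma became angry
\ref GW024
\t xuteey,
\mr xuteey
\mg okay
\f okay,
"""


def parse_shoebox(text, record_marker="ref"):
    """Group marker/value pairs into records (hypothetical helper).

    Splitting on the markers themselves, rather than on line breaks,
    also copes with text whose fields have been run together on one line.
    """
    parts = re.split(r"\\(\w+)", text)
    records = []
    # parts[0] is whatever precedes the first marker; after that the list
    # alternates marker, value, marker, value, ...
    for marker, value in zip(parts[1::2], parts[2::2]):
        value = " ".join(value.split())  # collapse line wraps and spacing
        if marker == record_marker:
            records.append({"id": value, "fields": defaultdict(list)})
        elif records:
            records[-1]["fields"][marker].append(value)
    return records


records = parse_shoebox(SAMPLE)
for record in records:
    print(record["id"], dict(record["fields"]))
```

Each field is stored as a list because a marker can occur more than once in a record: GW023 has two `\t` lines and no `\f` line, which is the kind of inconsistency a check over the parsed records would surface.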
19
Presentation annotation (1)
  • Johnson 2005
  • printed interlinear text
  • accompanying audio CD

20
Presentation annotation (2)
21
Simple HTML markup
<tr valign="top"><td width="100">
<OBJECT CLASSID="clsid:02BF25D5-8C17-4B23-BC80-D3488ABDDC6B" HEIGHT="15" WIDTH="100" CODEBASE="http://www.apple.com/qtactivex/qtplugin.cab">
<PARAM name="SRC" VALUE="audio/animal18-01.mp3">
<PARAM name="AUTOPLAY" VALUE="false">
<PARAM name="CONTROLLER" VALUE="true">
<EMBED SRC="audio/animal18-01.mp3" HEIGHT="15" WIDTH="100" AUTOPLAY="false" CONTROLLER="true" PLUGINSPAGE="http://www.apple.com/quicktime/download/">
</EMBED>
</OBJECT>
</td>
<td><span class="tc"> Jaan ch'e dleg. </span><br>
<span class="en"> This is a squirrel. </span></td>
</tr>
22
Funkier HTML markup
<object classid="clsid:02bf25d5-8c17-4b23-bc80-d3488abddc6b" width="360" height="256" codebase="http://www.apple.com/qtactivex/qtplugin.cab">
<param name="src" value="../s_under_video/xuns_uu.mov" />
<param name="autoplay" value="true" />
<param name="controller" value="true" />
<embed src="../s_under_video/xuns_uu.mov" width="360" height="256" autoplay="true" controller="true" pluginspage="http://www.apple.com/quicktime/download/">
</embed>
</object>
<br />
<center><img src="../s_under_images/xuns_uu.gif"></center>
<center><i>it (area, weather) is good, nice</i></center>

<tr>
<td align="center"> <a href="javascript:popup('s_under_text/xuns_uu.html')"> <img src="s_under_images/xuns_uu.gif" border="0"></a> </td>
<td align="center" class="gloss">it (area, weather) is good, nice</td>
</tr>
23
Problems
  • entirely ad-hoc process
  • each product created independently from scratch
  • no archival (aligned) version
  • no relations between different texts/languages

24
Toward less ephemeral annotation
  • Dena'ina (Athabascan) case study
  • Goal: link media and annotation in a way which
    • facilitates user-friendly access
    • follows best-practice (BP) recommendations for archiving

25
Existing documentation
  • typescript texts (Tenenbaum 1976)
  • original audio recordings

26
Chggagga Sukdu'a (chickadee story) by Katherine Trefon
27
Chulyin Sukdu'a (raven story) by Alexie Evan
28
Metadata
no relation between original recording and
individual story
29
Product Goals
  • Audio available in one-line increments and as a
    whole
  • Support for non-Unicode-enabled machines
  • Metadata for each story included
  • use only free, or very cheap, open-source
    software in the production of the CDs
  • use readily available software for the viewing of
    the CDs (web browsers, QuickTime, Flash Player)

30
Further Goals
  • design a workflow that can be easily mimicked by
    community members with little training
  • create archival version of the alignment in
    accordance with BP recommendations

31
What we did
  • assembled transcriptions and audio files
  • segmented audio using ELAN
  • edited .eaf file to enter text
  • built HTML using XSLT to transform .eaf files
  • chunked audio and converted to Flash
  • developed training procedure (in progress)

32
Project workflow
  • gather existing media
    • previously digitized Dena'ina recordings, some up
      to 40 years old
  • transcription and translation
    • WordPerfect file
    • hand-written transcription
  • images from various sources (Park Service, UA
    Archive, private)

33
Audio/Text alignment
  • time segmentations created in ELAN
  • but first learn to use ELAN
  • edit .eaf file and paste in existing Dena'ina and
    English text
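
To make the alignment step concrete, here is a minimal Python sketch of reading time-aligned segments back out of an .eaf file. It assumes the usual ELAN layout (a `TIME_ORDER` block of `TIME_SLOT` elements, and `ALIGNABLE_ANNOTATION` elements that refer to those slots); the tier name `text`, the function name `read_segments` and the placeholder annotation values are invented for illustration and are not taken from the project files.

```python
import xml.etree.ElementTree as ET

# A cut-down, hypothetical .eaf document with placeholder text.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="2350"/>
    <TIME_SLOT TIME_SLOT_ID="ts3" TIME_VALUE="5100"/>
  </TIME_ORDER>
  <TIER TIER_ID="text">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first line of the story</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
          TIME_SLOT_REF1="ts2" TIME_SLOT_REF2="ts3">
        <ANNOTATION_VALUE>second line of the story</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def read_segments(eaf_xml, tier_id):
    """Return (start_ms, end_ms, text) for each annotation on one tier."""
    root = ET.fromstring(eaf_xml)
    # Map each time slot ID to its offset in milliseconds.
    # Assumes every slot carries a TIME_VALUE, as in the sample.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
    }
    segments = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots[ann.get("TIME_SLOT_REF1")]
            end = slots[ann.get("TIME_SLOT_REF2")]
            text = ann.findtext("ANNOTATION_VALUE", default="").strip()
            segments.append((start, end, text))
    return segments


for start, end, text in read_segments(SAMPLE_EAF, "text"):
    print(start, end, text)
```

The start and end values are the same time codes that later steps need, so reading them programmatically would avoid copying them by hand.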

34
ELAN
35
Investigate presentation audio format
  • SMIL explored but not adopted
  • two methods employed
    • single QuickTime file with start/stop times
    • multiple Flash files, one for each line
  • Investigate possible presentation formats in
    order to determine the path to get there.
  • Learn all about SMIL, QuickTime, XSLT, a new
    LINGUIST List interface, etc...
  • Chose HTML built from XSLT, so...

36
Transform ELAN file
  • First, add more markup
    • <DENTITLE>, <ENGTITLE>, <SPEAKER>
  • Use XSLT to generate basic HTML structure
    • originally done awkwardly using ColdFusion server
  • edit HTML output to insert
    • timecodes (for QuickTime version)
    • .swf file names (for Flash version)
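
The project used XSLT for this step; the Python sketch below is a stand-in that only illustrates the idea of generating one HTML row per line and filling in the clip file name automatically instead of editing the output by hand. The function name `build_rows`, the `data-clip` attribute and the `clip-NN.swf` naming scheme are assumptions for illustration; the `tc` and `en` class names and the sample sentence are borrowed from the simple HTML markup slide.

```python
from html import escape


def build_rows(lines, clip_prefix="clip"):
    """Build one table row per (source text, English text) pair.

    The clip file name is derived from the line's position, so it never
    has to be typed into the HTML by hand (hypothetical naming scheme).
    """
    rows = []
    for number, (source_text, english_text) in enumerate(lines, start=1):
        clip = f"{clip_prefix}-{number:02d}.swf"
        rows.append(
            '<tr valign="top">'
            f'<td class="audio" data-clip="{escape(clip)}"></td>'
            f'<td><span class="tc">{escape(source_text)}</span><br>'
            f'<span class="en">{escape(english_text)}</span></td>'
            "</tr>"
        )
    return "\n".join(rows)


print(build_rows([("Jaan ch'e dleg.", "This is a squirrel.")]))
```

Escaping the text on the way in keeps characters such as apostrophes and angle brackets from breaking the generated markup.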

37
XSLT
38
QuickTime time codes
  • Advantages
    • no audio editing
  • Disadvantages
    • time codes had to be copied by hand from .eaf
      header
    • many machines must be hand-configured to use QT
      plugin
    • pages with many embedded QT files will crash
      browser

39
Audio chunking
  • originally done by hand using time codes in .eaf
    header
  • later automated with a clip extractor (using Tcl)
  • batch conversion of clips to .swf files
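
The clip extractor mentioned above was written in Tcl and is not reproduced here. As a rough illustration of the same idea, the following Python sketch cuts one clip out of a WAV recording given start and end times in milliseconds, using only the standard `wave` module. The function names are invented for this example, the input is a generated stretch of silence rather than a real recording, and the conversion to .swf is not covered.

```python
import io
import wave


def make_silence(seconds, rate=8000):
    """Create a mono 16-bit WAV of silence, used here as stand-in audio."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(b"\x00\x00" * (seconds * rate))
    return buffer.getvalue()


def extract_clip(wav_bytes, start_ms, end_ms):
    """Return a new WAV (as bytes) holding only start_ms..end_ms."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as source:
        params = source.getparams()
        rate = source.getframerate()
        total = source.getnframes()
        # Convert milliseconds to frame offsets, clamped to the file length.
        first = min(start_ms * rate // 1000, total)
        last = min(end_ms * rate // 1000, total)
        source.setpos(first)
        frames = source.readframes(max(last - first, 0))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as clip:
        clip.setparams(params)  # frame count in the header is fixed on close
        clip.writeframes(frames)
    return buffer.getvalue()


clip = extract_clip(make_silence(2), 500, 1500)
with wave.open(io.BytesIO(clip), "rb") as check:
    print(check.getnframes() / check.getframerate(), "seconds")
```

Looping this over the start and end times of every annotated line gives one clip per line, which is the input the batch conversion step expects.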

40
The product
41
Complications
  • no straightforward method to create presentation
    format from ELAN XML
  • chunking audio is cumbersome
  • conversion to .swf files is cumbersome
  • images have to be added later -- not linked to
    text/audio
  • procedure is complex -- results of training have
    been mixed (so far)
  • Still no archival format!

42
Future desiderata
  • single interface to search across multiple texts
    • i.e., corpus browser
  • direct linkage to lexical database (cf. Lachler's
    Haida)
  • ways to link non-linguistic data (e.g., photos)
  • richer participant metadata

43
Thanks to
  • National Endowment for the Humanities
  • National Science Foundation Office of Polar
    Programs
  • National Science Foundation Documenting
    Endangered Languages initiative
  • University of Alaska Foundation
  • Australian Research Council HCSNet

44
Qua
gary.holton_at_uaf.edu andrea_at_linguistlist.org