Title: Creating annotated corpora at the Alaska Native Language Center
1Creating annotated corpora at the Alaska Native
Language Center
Worst practices in linguistic annotation
Gary Holton, University of Alaska Fairbanks
Andrea Berez, Wayne State University
EthnoER, Melbourne, 15 Feb 2006
2Overview
- about ANLC
- approaches to annotation at ANLC
- an annotation case study
- some desiderata for an annotatable corpus
3Alaska Native Language Center
- Established in 1972 by state legislation as a
center for documentation and cultivation of the
state's 20 Native languages
- Staff includes
- 62 linguists
- 4 language instructors
- 6 research assistants
- (1 archivist?) 1 editor
- 1 administrative asst
4Alaska Native Languages
- 11 Athabascan Languages
- 4 Eskimoan Languages
- Aleut
- Eyak
- Tlingit
- Haida
- Tsimshian
5Languages are endangered
- Numbers of speakers
- Central Yup'ik 10,000
- Inupiaq 3100
- all 11 Athabascan <2000
- Eyak 1
- Age of youngest speakers
- in many cases >80
6ANLC mission
- revitalization and maintenance
- language teaching
- documentation
- publication
- archiving
7Language revitalization
- on-campus classes
- summer language institutes
- teacher training
- technology training
8Language teaching
- university classes
- Central Yup'ik
- Siberian Yupik
- Inupiaq
- Gwich'in
- Koyukon
- Tanacross
- individual study
- Alutiiq, Tlingit,
9Language documentation
- dictionaries
- grammars
- texts
- pedagogical materials
10Publications
- Dictionaries
- Grammars
- Collections Of Stories
- Map
- Phrasebooks
- Curriculum Guides
- Textbooks
- Beginning Readers
11Language archiving
- Primary linguistic data archive for the 20 Alaska
Native languages
- Comprehensive -- nearly everything written in or
about Alaska Native languages
- Focus on unpublished manuscripts, field notes and
recordings
- Items include
- print (>15,000 items)
- audio (>5000 tapes)
- digital files (??)
12Types of data
- field notes
- texts
- manuscripts
- recordings
- pedagogical materials
- lexical data
- comparative wordlists
- etymological wordlists
- place names
13Related data are disassociated
- traditional focus on print materials
- transcript treated as primary
- related audio can be difficult to locate
- transcripts treated as isolated events
- not grouped within languages
- not related across languages
- some data not properly archived
- e.g., photos
- other data not archived at all
- e.g., speaker biographies
14Preservation
- cataloging
- re-foldering, boxing
- digitization
- long-term storage at Arctic Region Supercomputing
Center
15Access
- primary users are indigenous community members
- physical access in Fairbanks
16Web portal Qenaga.org
- piloted for Dena'ina
- currently being extended to Yup'ik
17Annotation at ANLC
- individual researchers using annotation in new
research
- pedagogical applications
- new focus on linking archival text and audio
- often viewed as a presentation format
18Shoebox
\ref GW021
\t naatthi nachedahee ey sh nxedl s_h_i eedah de.
\mr naa-tthi nachedahee ey sh n-xedl s_h_i ee-dah de
\mg INTER-water>PUNCT game.warden that meat 2SG-sled in M-sit if
\f if that meat is sitting in your sled down there, the game warden,
\ref GW022
\t nachedahee ey shax xals_en ts nintahtel sixunteh.Ü
\mr nachedahee ey shax xals_en ts ni-n-t-ah-teÂ-e si-xu-n-teh.Ü
\mg game.warden that jail to TERM-2SG-FUT-M-H-send-NOM THM-AREA-M-be
\t that game warden will surely send you to jail
\ref GW023
\t xeyehnii toteey stsuu du tsanielteh ts
\mr xe-y-eh-nii toteey stsuu du tsa-niel-teh ts
\mg HUM.PL-4OBJ-H-say>IMPF even 1SG-grandma thus THM-THM-L-angry CONJ
\t they said that to her and then grandma became angry
\ref GW024
\t Ûxuteey,
\mr xuteey
\mg okay
\f Ûokay,
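Backslash-coded records like these are easy to read by machine, which is one reason they are a workable source for later conversion. The following is a minimal Python sketch, not part of the ANLC workflow, that splits such text into records. It assumes that `\ref` opens each record and that a field runs until the next backslash marker. The function name `parse_shoebox` and the sample (record GW024 with the stray leading characters dropped) are illustrative only.

```python
import re

def parse_shoebox(text, record_marker="ref"):
    """Split backslash-coded text into records.

    Each record is a list of (marker, value) pairs rather than a dict,
    because a marker can occur more than once in a record.
    """
    records = []
    current = None
    # A field is a backslash, a marker name, and everything up to the
    # next backslash (assumption: values never contain a backslash).
    for match in re.finditer(r"\\(\w+)\s*([^\\]*)", text):
        marker = match.group(1)
        value = " ".join(match.group(2).split())  # collapse whitespace
        if marker == record_marker:
            current = []
            records.append(current)
        if current is not None:
            current.append((marker, value))
    return records

# Illustrative sample only.
SAMPLE = r"""
\ref GW024
\t xuteey,
\mr xuteey
\mg okay
\f okay,
"""

if __name__ == "__main__":
    for record in parse_shoebox(SAMPLE):
        for marker, value in record:
            print(marker, value, sep="\t")
```

Keeping the fields as ordered pairs preserves the record exactly as typed, so inconsistent marker use stays visible instead of being silently overwritten.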
19Presentation annotation (1)
- Johnson 2005
- printed interlinear text
- accompanying audio CD
20Presentation annotation (2)
21Simple HTML markup
<tr valign="top"><td width="100">
<OBJECT CLASSID="clsid:02BF25D5-8C17-4B23-BC80-D3488ABDDC6B"
 HEIGHT="15" WIDTH="100"
 CODEBASE="http://www.apple.com/qtactivex/qtplugin.cab">
<PARAM name="SRC" VALUE="audio/animal18-01.mp3">
<PARAM name="AUTOPLAY" VALUE="false">
<PARAM name="CONTROLLER" VALUE="true">
<EMBED SRC="audio/animal18-01.mp3" HEIGHT="15" WIDTH="100"
 AUTOPLAY="false" CONTROLLER="true"
 PLUGINSPAGE="http://www.apple.com/quicktime/download/">
</EMBED>
</OBJECT>
</td>
<td><span class="tc">Jaan ch'e dleg.</span><br>
<span class="en">This is a squirrel.</span></td>
</tr>
22Funkier HTML markup
<object classid="clsid:02bf25d5-8c17-4b23-bc80-d3488abddc6b"
 width="360" height="256"
 codebase="http://www.apple.com/qtactivex/qtplugin.cab">
<param name="src" value="../s_under_video/xuns_uu.mov" />
<param name="autoplay" value="true" />
<param name="controller" value="true" />
<embed src="../s_under_video/xuns_uu.mov" width="360" height="256"
 autoplay="true" controller="true"
 pluginspage="http://www.apple.com/quicktime/download/">
</embed>
</object>
<br />
<center><img src="../s_under_images/xuns_uu.gif"></center>
<center><i>it (area, weather) is good, nice</i></center>
<tr><td align="center">
<a href="javascript:popup('s_under_text/xuns_uu.html')">
<img src="s_under_images/xuns_uu.gif" border="0"></a>
</td>
<td align="center" class="gloss">it (area, weather) is good, nice</td>
</tr>
23Problems
- entirely ad-hoc process
- each product created independently from scratch
- no archival (aligned) version
- no relations between different texts/languages
24Toward less ephemeral annotation
- Dena'ina (Athabascan) case study
- Goal: link media and annotation in a way which
- facilitates user-friendly access
- follows BP recommendations for archiving
25Existing documentation
- typescript texts (Tenenbaum 1976)
- original audio recordings
26Chggagga Sukdua (chickadee story) by
Katherine Trefon
27Chulyin Sukdua (raven story) by Alexie Evan
28Metadata
no relation between original recording and
individual story
29Product Goals
- Audio available in one-line increments and as a
whole
- Support for non-Unicode-enabled machines
- Metadata for each story included
- use only free, or very cheap, open-source
software in the production of the CDs
- use readily available software for the viewing of
the CDs (web browsers, QuickTime, Flash Player)
30Further Goals
- design a workflow that can be easily mimicked by
community members with little training
- create archival version of the alignment in
accordance with BP recommendations
31What we did
- assembled transcriptions and audio files
- segmented audio using ELAN
- edited .eaf file to enter text
- built HTML using XSLT to transform .eaf files
- chunked audio and converted to Flash
- developed training procedure (in progress)
32Project workflow
- gather existing media
- previously digitized Dena'ina recordings, some up
to 40 years old
- transcription and translation
- WordPerfect file
- Hand-written transcription
- images from various sources (Park Service, UA
Archive, private)
33Audio/Text alignment
- time segmentations created in ELAN
- but first learn to use ELAN
- edit .eaf file and paste in existing Dena'ina and
English text
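To illustrate what the segmented and hand-edited .eaf file contains, here is a minimal Python sketch that reads time-aligned annotations back out of ELAN XML. It relies on the standard EAF elements (`TIME_SLOT`, `TIER`, `ALIGNABLE_ANNOTATION`, `ANNOTATION_VALUE`). The tier names, time values and sample text below are made up for the example and are not taken from the project files.

```python
import xml.etree.ElementTree as ET

def read_alignable_annotations(eaf_xml):
    """Return (tier_id, start_ms, end_ms, text) for each aligned annotation."""
    root = ET.fromstring(eaf_xml)
    # The TIME_ORDER block maps time slot IDs to millisecond offsets.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    rows = []
    for tier in root.iter("TIER"):
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(ann.get("TIME_SLOT_REF1"))
            end = slots.get(ann.get("TIME_SLOT_REF2"))
            text = ann.findtext("ANNOTATION_VALUE", default="")
            rows.append((tier.get("TIER_ID"), start, end, text.strip()))
    return rows

# Hypothetical, heavily trimmed .eaf content for illustration only.
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="2150"/>
  </TIME_ORDER>
  <TIER TIER_ID="text">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>Jaan ch'e dleg.</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="english">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a2"
          TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>This is a squirrel.</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""

if __name__ == "__main__":
    for row in read_alignable_annotations(SAMPLE_EAF):
        print(row)
```

Reading the time slots programmatically is one way to avoid copying values out of the file by hand, at the cost of requiring someone on the project to maintain a script.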
34ELAN
35Investigate presentation audio format
- SMIL explored but not adopted
- two methods employed
- single QuickTime file with start/stop times
- multiple Flash files, one for each line
- Investigate possible presentation formats in
order to determine the path to get there.
- Learn all about SMIL, QuickTime, XSLT, a new
LINGUIST List interface, etc...
- Chose HTML built from XSLT, so...
36Transform ELAN file
- First, add more markup
- <DENTITLE>, <ENGTITLE>, <SPEAKER>
- Use XSLT to generate basic HTML structure
- originally done awkwardly using ColdFusion server
- edit HTML output to insert
- timecodes (for QuickTime version)
- .swf file names (for Flash version)
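The project did this step with XSLT. As a rough illustration of the same mapping (one annotation line in, one HTML table row out), here is a Python sketch that is a stand-in for the stylesheet, not a copy of it. The function names, the CSS class names `src` and `en`, the plain link used in place of the plugin markup, and the clip file name are all assumptions made for the example.

```python
from html import escape

def render_line(clip_file, source_text, english_text):
    """Build one table row: a link to the audio clip plus both text lines."""
    return (
        '<tr valign="top">'
        f'<td><a href="{escape(clip_file, quote=True)}">play</a></td>'
        f'<td><span class="src">{escape(source_text)}</span><br>'
        f'<span class="en">{escape(english_text)}</span></td>'
        "</tr>"
    )

def render_story(title, lines):
    """Wrap the rendered rows in a heading and a table."""
    rows = "\n".join(render_line(*line) for line in lines)
    return f"<h1>{escape(title)}</h1>\n<table>\n{rows}\n</table>"

if __name__ == "__main__":
    # Hypothetical input: (clip file, source-language line, English line).
    story = [("line01.swf", "Jaan ch'e dleg.", "This is a squirrel.")]
    print(render_story("Example story", story))
```

Generating the clip names and text in one pass would remove the hand-editing of the HTML output described above, since both come from the same list of lines.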
37XSLT
38QuickTime time codes
- Advantages
- no audio editing
- Disadvantages
- time codes had to be copied by hand from .eaf
header
- many machines must be hand-configured to use QT
plugin
- pages with many embedded QT files will crash
browser
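Copying time codes by hand is error-prone because ELAN stores each time slot as a millisecond offset, while a player typically wants a clock-style value. A small Python sketch of that conversion follows. The `HH:MM:SS.mmm` output format is an assumption chosen for readability, since the slides do not state the exact format the QuickTime plugin expected, and the function name is hypothetical.

```python
def ms_to_timecode(ms):
    """Format a millisecond offset as HH:MM:SS.mmm (assumed target format)."""
    if ms < 0:
        raise ValueError("offset must be non-negative")
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"

if __name__ == "__main__":
    print(ms_to_timecode(2150))       # 00:00:02.150
    print(ms_to_timecode(3_725_004))  # 01:02:05.004
```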
39Audio chunking
- originally done by hand using time codes in .eaf
header
- later automated with a clip extractor (using Tcl)
- batch conversion of clips to .swf files
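The clip extractor used in the project was written in Tcl and is not reproduced here. As a sketch of the same idea, the Python below cuts one clip out of a WAV file given start and stop times in milliseconds, using only the standard `wave` module. It assumes uncompressed PCM WAV input, and the function name `extract_clip` is hypothetical. Conversion of the clips to .swf is not covered.

```python
import io
import wave

def extract_clip(source, start_ms, end_ms):
    """Return WAV bytes for the span from start_ms up to end_ms.

    `source` may be a file name or a binary file-like object.
    """
    with wave.open(source, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        # Convert millisecond offsets to frame indexes, clamped to the file.
        first = min(start_ms * rate // 1000, total)
        last = min(end_ms * rate // 1000, total)
        src.setpos(first)
        frames = src.readframes(max(last - first, 0))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as dst:
        dst.setparams(params)    # same channels, sample width and rate
        dst.writeframes(frames)  # header frame count is corrected on close
    return buffer.getvalue()
```

Looping this over the (start, end) pairs from the alignment file and writing each result to its own file would give one clip per line of text, which is the unit the Flash version of the product needed.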
40The product
41Complications
- no straightforward method to create presentation
format from ELAN XML
- chunking audio is cumbersome
- conversion to .swf files is cumbersome
- images have to be added later -- not linked to
text/audio
- procedure is complex -- results of training have
been mixed (so far)
- Still no archival format!
42Future desiderata
- single interface to search across multiple texts
- i.e., corpus browser
- direct linkage to lexical database (cf. Lachler's
Haida)
- ways to link non-linguistic data (e.g., photos)
- richer participant metadata
43Thanks to
- National Endowment for the Humanities
- National Science Foundation Office of Polar
Programs
- National Science Foundation Documenting
Endangered Languages initiative
- University of Alaska Foundation
- Australian Research Council HCSNet
44Qua
gary.holton_at_uaf.edu
andrea_at_linguistlist.org