Developing Asian Language Corpora: standards and practice

1
Developing Asian Language Corpora: standards and practice
  • Richard Xiao
  • Tony McEnery
  • Paul Baker
  • Andrew Hardie
  • Lancaster University

2
An overview of the talk
  • Corpus development standards
  • The EMILLE (Enabling Minority Language
    Engineering) Corpus
  • The Lancaster Corpus of Mandarin Chinese (LCMC)
  • XML-aware, Unicode-compliant corpus exploration
    tools
  • Software demonstration

3
Corpus development standards (1)
  • Why is standardization important?
  • To be compliant with major international
    standards
  • To facilitate electronic data exchange
  • To foster cooperation and coordination between
    different centres and projects
  • To meet the requirements of corpus validation
  • The ALR Committee is working in the right
    direction

4
Corpus development standards (2)
  • Corpus constituents
  • Corpus manifest
  • Type (paper document, computer file, audio/video
    recording, etc.)
  • Carrier (computer file name and location,
    document title etc.)
  • Status (integral part of corpus, descriptive
    metadata, associated annotation, documentation,
    etc.)
  • Digital components and the storage format
    (character encoding, binary format, record
    structure, etc.)
  • Primary data: corpus files
  • Ancillary data: corpus documentation
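
The manifest fields listed above could be recorded per constituent roughly as follows. This is a minimal sketch: the class name `ManifestEntry`, its field names, and the sample file names are illustrative assumptions, not part of any standard named in the talk.

```python
from dataclasses import dataclass, asdict

@dataclass
class ManifestEntry:
    """One corpus constituent, described by the four manifest fields on the slide."""
    kind: str     # type: "computer file", "audio recording", ...
    carrier: str  # file name and location, or document title
    status: str   # "integral part of corpus", "documentation", ...
    storage: str  # character encoding, binary format, record structure

# Hypothetical entries: one primary data file and one ancillary document.
manifest = [
    ManifestEntry("computer file", "texts/sample_a01.xml",
                  "integral part of corpus", "XML, UTF-8"),
    ManifestEntry("computer file", "doc/manual.pdf",
                  "documentation", "PDF"),
]

# Select the primary data by status.
primary = [e.carrier for e in manifest if e.status == "integral part of corpus"]
print(primary)
print(asdict(manifest[1]))
```

Keeping the status in the manifest, rather than inferring it from directory names, lets a validator separate primary data from ancillary data without opening the files.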

5
Corpus development standards (3)
  • Data formats
  • Primary data
  • Text files: XML/SGML conforming to a standard or
    supplied DTD or schema
  • Audio: MP3 or WAV
  • Video: MPEG or QuickTime
  • Image files: PNG or JPG
  • Ancillary data
  • Documentation: PDF, HTML, or XML
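
A delivery could be screened against these formats with a simple extension check. The sketch below is only a first-pass filter under stated assumptions: the category keys and the extension spellings are my own, and a real validation would also parse the files against the DTD or schema.

```python
from pathlib import PurePosixPath

# Formats named on the slide; the extension spellings are assumptions.
ALLOWED = {
    "text": {".xml", ".sgml", ".sgm"},
    "audio": {".mp3", ".wav"},
    "video": {".mpeg", ".mpg", ".mov"},
    "image": {".png", ".jpg", ".jpeg"},
    "documentation": {".pdf", ".html", ".htm", ".xml"},
}

def conforms(path: str, category: str) -> bool:
    """True if the file extension is one of those allowed for the category."""
    return PurePosixPath(path).suffix.lower() in ALLOWED[category]

print(conforms("speech/rec01.WAV", "audio"))
print(conforms("doc/manual.doc", "documentation"))
```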

6
Corpus development standards (4)
  • File structure, markup and annotation
  • Corpus header
  • Providing metadata about the corpus file
  • TEI/CES-compliance
  • Corpus body
  • Containing the corpus data
  • TEI/CES-compliance
  • Markup for paragraphs and sentences
  • Preferably annotated with various levels of
    linguistic analysis (e.g. POS tagging)
  • Character encoding
  • Unicode-compliance (UTF-8/16)
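
The header/body layout with paragraph, sentence, and POS markup can be illustrated with a small XML file read through Python's standard library. The element and attribute names below (`text`, `header`, `body`, `p`, `s`, `w`, `POS`) are assumptions chosen for illustration, not the actual TEI/CES, EMILLE, or LCMC schema.

```python
import xml.etree.ElementTree as ET

# Illustrative corpus file: a header with metadata, a body with one
# paragraph, one sentence, and three POS-tagged words.
SAMPLE = """<text id="sample">
  <header><title>Sample file</title></header>
  <body>
    <p><s n="1"><w POS="r">我</w><w POS="v">爱</w><w POS="n">书</w></s></p>
  </body>
</text>"""

root = ET.fromstring(SAMPLE)

# Metadata comes from the header, data from the body.
title = root.findtext("header/title")
tagged = [(w.text, w.get("POS")) for w in root.iter("w")]

# Serialise back out as UTF-8 bytes, as the encoding bullet recommends.
utf8_bytes = ET.tostring(root, encoding="utf-8")

print(title)
print(tagged)
```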

7
The EMILLE project
  • The EMILLE project
  • Funded by the UK EPSRC (Grant references
    GR/N19106, GR/M70735, GR/N28542 and GR/R42429/01)
  • Research partners: Lancaster University,
    Sheffield University, and the Central Institute
    of Indian Languages (CIIL) in Mysore, India
  • Three main goals
  • To build corpora of South Asian languages
  • To extend the GATE (General Architecture for Text
    Engineering) LE architecture
  • To develop basic LE tools
  • Project site: http://www.emille.lancs.ac.uk/
  • GATE: http://gate.ac.uk/sale/tao/index.html#x1-550002.26

8
The EMILLE Corpus: An overview
  • Three components
  • Monolingual, Annotated, and Parallel
  • 14 South Asian languages
  • Spoken data for five languages
  • Monolingual corpora contain more than 96 million
    words
  • Spoken data: over 2.6 million words
  • The Urdu corpus is POS tagged
  • Part of the Hindi corpus is annotated for
    anaphora
  • Parallel corpus covers English and five South
    Asian languages
  • Corpus building tools: Uni-codify, Uni-viewer,
    Uni-editor

9
The EMILLE Monolingual Corpora

10
The EMILLE Annotated Corpora
  • POS tagging
  • The whole monolingual Urdu corpus
  • The Urdu component of the EMILLE Parallel Corpora
  • Anaphoric annotation
  • Around 100,000 words of news material (20
    excerpts from the Ranchi Express data) from the
    Hindi Monolingual Corpus

11
The EMILLE Parallel Corpus
  • 75 advice leaflets published by the UK government
  • Approximately 200,000 words of English originals
    with accompanying translations in five South
    Asian languages
  • Hindi, Bengali, Punjabi, Gujarati, and Urdu
  • Covering a range of term-rich domains

12
The EMILLE corpus building tools
  • Uni-codify
  • Allows users to convert 30 (or so) different
    8-bit encodings of South Asian scripts into
    16-bit little-endian Unicode format
  • Compiled program accompanied by documentation
  • Uni-Viewer
  • Allows users to view Unicode texts
  • Uni-Editor
  • Allows users to edit Unicode texts
  • Urdu POS tagger
  • POS tagging Unicode-encoded Urdu texts
  • Accompanied by the tagset and the user manual

13
The EMILLE Corpus: Availability
  • The full release of the EMILLE Corpus and tools
    is distributed free of charge for use in
    non-profit-making research
  • Digital sound files will also be released soon
  • Indexed version for use with Xara will be
    available soon
  • Corpus download site
  • http://www.ling.lancs.ac.uk/corplang/emille

14
The LCMC Corpus: Aims
  • Built for the ESRC project "Contrasting tense and
    aspect in English and Chinese" (Grant Ref.
    RES-000-220135)
  • A Chinese match for FLOB/Frown for BrE/AmE
  • A publicly available balanced corpus of Mandarin
    Chinese
  • Distributed free of charge for use in
    non-profit-making research

15
LCMC: Profile
  • One million words
  • 1990-1993
  • 15 text categories
  • 500 text samples
  • Major text provider: SSReader Digital Library in
    China
  • Unicode (UTF-8)
  • XML-conformant mark-up
  • Marked for paragraphs and sentences
  • POS-tagged (precision rate 98%)
  • Standard character and Romanized Pinyin versions

16
Major Chinese corpus resources

17
LCMC: Sampling frame

18
LCMC: Markup

19
LCMC: Annotation
  • Segmentation
  • POS tagging
  • Applying the Peking University tagset
  • 26 Level 1 POS tags
  • 50 Level 2 POS tags
  • ICTCLAS (Chinese Lexical Analysis System)
  • Developed by the Institute of Computing
    Technology, Chinese Academy of Sciences (Zhang &
    Liu 2002)
  • A frequency dictionary of 80,000 words
  • Based on a multi-layer hidden Markov model
  • Applying the n-shortest paths method
  • Automatic tagging with a precision rate of 97.16%
  • Post-editing improved the precision to over 98%
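
To give a feel for HMM-based tagging, here is a minimal sketch of Viterbi decoding over a single-layer, first-order hidden Markov model. It is a deliberate simplification and not ICTCLAS itself: it does not implement the multi-layer model or the n-shortest paths method, it assumes the text is already segmented, and every probability below is invented for the example.

```python
def viterbi(words, tags, start, trans, emit):
    """Return the most probable tag sequence for words under a first-order HMM."""
    # best[i][t] = (probability of the best path ending in tag t at word i,
    #               tag chosen for word i-1 on that path)
    best = [{t: (start[t] * emit[t].get(words[0], 0.0), None) for t in tags}]
    for w in words[1:]:
        prev = best[-1]
        col = {}
        for t in tags:
            p, back = max((prev[s][0] * trans[s][t], s) for s in tags)
            col[t] = (p * emit[t].get(w, 0.0), back)
        best.append(col)
    # Trace back from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for col in reversed(best[1:]):
        tag = col[tag][1]
        path.append(tag)
    return path[::-1]

# Toy model with three tags (r = pronoun, v = verb, n = noun).
# Emission tables list only the words used here; all numbers are made up.
TAGS = ["r", "v", "n"]
START = {"r": 0.6, "v": 0.1, "n": 0.3}
TRANS = {
    "r": {"r": 0.1, "v": 0.7, "n": 0.2},
    "v": {"r": 0.2, "v": 0.1, "n": 0.7},
    "n": {"r": 0.1, "v": 0.5, "n": 0.4},
}
EMIT = {
    "r": {"我": 0.9},
    "v": {"爱": 0.6, "书": 0.1},
    "n": {"书": 0.7, "爱": 0.2},
}

tags_out = viterbi(["我", "爱", "书"], TAGS, START, TRANS, EMIT)
print(tags_out)
```

A production tagger would work in log space to avoid underflow on long sentences and would smooth the emission probabilities for unseen words.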

20
LCMC: Potential use
  • Monolingual study
  • Studying Mandarin Chinese as a whole
  • Exploring variation across text categories
  • Contrastive study (in conjunction with
    FLOB/Frown)
  • Contrasting Chinese and BrE/AmE
  • Contrasting text categories in Chinese and English

21
LCMC: Availability
  • Distributed free of charge for use in
    non-profit-making research
  • Accompanied by the user manual
  • Online search available via WebConc
  • The LCMC website
  • http://www.ling.lancs.ac.uk/corplang/lcmc
  • The Chinese mirror site (Chinese Academy of
    Social Sciences)
  • http://www.cass.net.cn/chinese/s18_yys/dangdai/LCMC/LCMC.htm

22
Corpus exploration tools
  • XML-aware, Unicode-compliant corpus exploration
    tools
  • WordSmith Tools version 4
  • Presently under beta test
  • Beta version available
  • http://www.lexically.net/wordsmith/version4/index.htm
  • Xara (XML-aware SARA)
  • SARA: SGML-Aware Retrieval Application
  • For use with the British National Corpus (BNC)
  • For either local or remote access
  • Presently under beta test
  • Documentation available at
    http://www.oucs.ox.ac.uk/rts/xara/
  • A tutorial available at the LCMC website

23
Software demonstration
  • Using Xara for local access to LCMC
  • Query types: Quick query, word query (pattern),
    POS query, pattern query (regex), Query builder
    (e.g. a-n vs. a-de-n), etc.
  • Display mode: KWIC mode vs. sentence mode
  • Display format: Plain vs. XML
  • Status bar: Reference
  • Other useful features: distribution, sort,
    collocation, partition, user-defined stylesheets,
    etc.
  • Using Xara for local access to EMILLE
  • Using WebConc to access LCMC
  • http://www.ling.lancs.ac.uk/corplang/lcmc
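
The pattern (regex) query and KWIC display mentioned in the demonstration can be illustrated in a few lines of Python. This is a generic sketch of the idea only: it does not use Xara's or WebConc's query syntax or code, and the function name `kwic` and the sample sentence are assumptions.

```python
import re

def kwic(tokens, pattern, width=3):
    """Return (left context, keyword, right context) for each token
    that fully matches the regular expression."""
    rx = re.compile(pattern)
    lines = []
    for i, tok in enumerate(tokens):
        if rx.fullmatch(tok):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

tokens = "the corpus is tagged and the corpora are marked up in XML".split()
hits = kwic(tokens, r"corp(us|ora)", width=2)

# Print keyword-in-context lines with the keyword aligned in a column.
for left, key, right in hits:
    print(f"{left:>10} [{key}] {right}")
```

Sentence mode would return the whole enclosing sentence instead of a fixed-width window, which is straightforward when the corpus is already marked up for sentences.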

24
And
  • Thank you!
  • Richard Xiao
  • z.xiao_at_lancaster.ac.uk