1
Learning a token classification from a large
corpus(A case study in abbreviations)
  • Petya Osenova, Kiril Simov
  • BulTreeBank Project
  • (www.BulTreeBank.org)
  • Linguistic Modeling Laboratory, Bulgarian Academy
    of Sciences
  • petya@bultreebank.org, kivs@bultreebank.org
  • ESSLLI'2002 Workshop on
  • Machine Learning Approaches in Computational
    Linguistics
  • August 5 - 9, 2002

2
Plan of the talk
  • BulTreeBank Project
  • Text Archive
  • Token Processing Problem
  • Global Token Classification
  • Application to Abbreviations

3
BulTreeBank project
  • It is a joint project between the Linguistic
    Modeling Laboratory (LML), Bulgarian Academy of
    Sciences, and the Seminar für Sprachwissenschaft
    (SfS), Tübingen. It is funded by the Volkswagen
    Foundation, Germany.
  • Its main goal is the creation of a high-quality,
    HPSG-oriented syntactic treebank of Bulgarian.
  • It also aims at producing a parser and a partial
    grammar of Bulgarian.
  • Within the project an XML-based system for
    corpora development is being created.

4
BulTreeBank team
  • Principal researcher
  • Kiril Simov
  • Researchers
  • Petya Osenova, Milena Slavcheva, Sia Kolkovska
  • PhD student
  • Elisaveta Balabanova
  • Students
  • Alexander Simov, Milen Kouylekov,
  • Krasimira Ivanova, Dimitar Dojkov

5
BulTreeBank text archive
  • A collection of linguistically interpreted texts
    from different genres (target size 100 million
    words)
  • A linguistically interpreted text is a text in
    which all meaningful tokens (including numbers,
    special signs and others) are marked up with
    linguistic descriptions

6
The current state of the text archive
  • Nearly 90,000,000 running words: 15% fiction,
    78% newspapers, and 7% legal texts, government
    bulletins and other genres
  • About 70 million running words are converted into
    XML format with respect to TEI guidelines
  • 10 million running words are morphologically
    tagged
  • 500 000 running words are manually disambiguated

7
Pre-processing steps (1)
  • Morphosyntactic tagger
  • Assigning all appropriate morpho-syntactic
    features to each potential word
  • Part-of-speech disambiguator
  • Choosing the right morpho-syntactic features for
    each potential word in the context
  • Partial Grammar for non-word tokens

8
Pre-processing steps (2)
  • Partial grammars
  • Sentence boundaries grammar
  • Named Entity Recognition
  • Names of people, places, organizations etc.
  • Dates, currencies, numerical expressions
  • Abbreviations
  • Foreign tokens
  • Chunk grammar (Abney 1991, 1996)
  • Non-recursive constituents

9
Token processing problem
  • A token in a text receives its linguistic
    interpretation on the basis of two sources of
    information: (1) the language and (2) the context
    of use
  • Two problems
  • For less studied languages there are not enough
    language resources (a low level of linguistic
    interpretation)
  • Erroneous use in the context (wrong predictions)

10
Token classification
  • Symbol-based classification
  • The tokens are defined by their immanent
    graphical characteristics
  • General token classification
  • The tokens fall into several categories: common
    word, proper name, abbreviation, symbol,
    punctuation, error
  • Grammatical and semantic classification
  • The tokens are presented in several lexicons, in
    which their grammatical and semantic features are
    listed

11
General token classification
  • Our goal is to learn a corpus-based
    classification of the tokens with respect to the
    general token classification
  • We use this classification in two ways
  • For an initial classification of the tokens in
    the texts before consulting the dictionary, and
  • For processing linguistically the tokens from the
    different classes

12
Learning general token categories (1)
  • Token classes
  • Common words
  • typical: lowercased, and first capital letter in
    sentence-initial position; non-typical: all caps
  • Proper names
  • typical: first capital letter; non-typical:
    all caps; wrong: lowercased
  • Abbreviations
  • typical: all caps, mixed, or lowercased (with a
    period, a hyphen, or as a single letter)
  • (a shape-based sketch of these classes follows
    below)
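A minimal sketch of the graphical criterion, in Python over the Latin
alphabet for simplicity (the project itself works on Cyrillic). The
patterns and category names are illustrative assumptions, not the
authors' implementation:

  import re

  def graphical_categories(token):
      """Potential categories of a token, judged by its shape alone."""
      cats = set()
      if re.fullmatch(r"[a-z]+", token):
          cats.add("common-word")    # typical: lowercased
      if re.fullmatch(r"[A-Z][a-z]+", token):
          cats.add("common-word")    # typical in sentence-initial position
          cats.add("proper-name")    # typical: first capital letter
      if re.fullmatch(r"[A-Z]{2,}", token):
          # all caps is ambiguous between all three classes
          cats.update({"common-word", "proper-name", "abbreviation"})
      if re.fullmatch(r"(?:[A-Za-z]+\.)+|[A-Za-z]+(?:-[A-Za-z]+)+|[A-Za-z]", token):
          cats.add("abbreviation")   # period, hyphen, or a single letter
      return cats

  # graphical_categories("Ltd.") -> {"abbreviation"}
  # graphical_categories("NATO") -> {"common-word", "proper-name", "abbreviation"}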

13
Learning general token categories (2)
  • Some problems
  • Some tokens can belong to more than one class
    according to their graphical properties.
  • Spelling errors in a large set of texts could
    cause misclassification.

14
Learning general token categories (3)
  • Our classification is not boolean but gradual: a
    ranking of the tokens with respect to each of the
    above categories.
  • Our initial procedure included the following
    steps
  • We used some graphical criteria for assigning
    potential categories to the unknown tokens.
  • We used statistical methods to distinguish,
    within each category, the most frequent tokens of
    that category from rare tokens or tokens not in
    the category.

15
Learning general token categories (4)
  • Graphical criterion
  • It takes into account the graphical specificity
    of the tokens.
  • For each category a list of tokens potentially
    belonging to it was constructed
  • Well-known problems arise, such as
  • Common words written in capital letters
  • Abbreviations written in a wrong way
  • The graphical criterion is not sufficient

16
Learning general token categories (5)
  • Statistical criterion
  • For each category, every candidate token is
    ranked in order to get the maximal number of
    right predictions
  • In fact we classify normalized tokens
  • A normalized token is an abstraction over tokens
    that share the same sequence of letters from a
    given alphabet (see the sketch below)
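A minimal sketch of this normalized-token abstraction, with
lowercasing standing in for the project's actual letter-sequence
normalization:

  from collections import defaultdict

  def normalize(token):
      return token.lower()

  def group_by_normalized(tokens):
      """Group raw tokens under their normalized form."""
      groups = defaultdict(list)
      for tok in tokens:
          groups[normalize(tok)].append(tok)
      return groups

  # group_by_normalized(["Rank", "RANK", "rank"]) maps all three
  # variants to the normalized token "rank"; the ranking on the next
  # slide is computed over such groups rather than over raw tokens.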

17
Learning general token categories (6)
  • Ranking with a category (1)
  • The ranking formula is
  • Rank = TokPar × DocPar
  • where the two parameters are
  • TokPar = True / All
  • the number of true appearances of the token
    divided by the number of all appearances of the
    token
  • DocPar
  • the number of documents in which the correctly
    written token was found, if this number is less
    than 50; otherwise 50
  • (see the code sketch below)
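A code sketch of the formula, assuming per-token counts have already
been collected from the archive; the variable names are mine, the cap
of 50 is from the slide:

  def rank(true_count, all_count, doc_count, cap=50):
      tok_par = true_count / all_count if all_count else 0.0  # TokPar = True / All
      doc_par = min(doc_count, cap)                           # DocPar, capped at 50
      return tok_par * doc_par                                # Rank = TokPar * DocPar

  # A token written correctly in 480 of its 500 occurrences and found
  # in 120 documents gets rank (480 / 500) * 50 = 48.0.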

18
Learning general token categories (7)
  • Ranking with a category (2)
  • The first parameter does not distinguish between
    one occurrence and a hundred. On its own, the
    real scope of the distribution is lost.
  • The impact that the token has over the text
    archive is represented by the second parameter.
    The upper bound of 50 is used as a normalization
    parameter.
  • Thus the tokens that are rare or do not belong to
    the category receive a very small rank.

19
Learning general token categories (8)
  • Usefulness
  • The method favors the tokens with greater impact
    over the whole corpus
  • The tokens appearing in a small number of
    documents are processed by local-based approaches
    (document-centered)

20
General token categories and local-based
approaches
  • The local-based approaches can filter the general
    classification with respect to ambiguous or
    unusual usages of tokens
  • When the local-based approach is not applicable,
    the information is taken from the general token
    classification
  • The result of such a ranking is very useful for
    the other task mentioned above - the linguistic
    treatment of the unknown tokens

21
Abbreviations in the pre-processing
  • Abbreviations are special tokens in the text
  • They contribute to robust
  • tagging
  • disambiguation
  • shallow parsing

22
Extraction criteria
  • Three criteria
  • Graphical criterion (as above)
  • Statistical criterion (as above)
  • Context criterion - we tried to extract some
    abbreviations with their extensions, usually
    written in brackets. Thus the ambiguity is
    reduced. (see the pattern sketch below)
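A pattern sketch of the context criterion, matching an abbreviation
immediately followed by its extension in brackets. Latin capitals
stand in for Cyrillic; the pattern is an illustrative assumption, not
the authors' actual grammar:

  import re

  PAIR = re.compile(r"\b([A-Z]{2,})\s*\(([^()]{3,80})\)")

  def extract_pairs(text):
      """Yield (abbreviation, extension) pairs found in the text."""
      for m in PAIR.finditer(text):
          yield m.group(1), m.group(2).strip()

  # list(extract_pairs("... the LML (Linguistic Modeling Laboratory) ..."))
  # -> [('LML', 'Linguistic Modeling Laboratory')]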

23
Dealing with abbreviations
  • Our approach includes three steps
  • Typological classification - the existing
    classifications were refined with respect to the
    electronic treatment of abbreviations
  • Extraction - different criteria were proposed for
    the extraction of the most frequent abbreviations
    in the archive
  • Linguistic treatment - the abbreviations were
    extended and the relevant grammatical information
    was added

24
Typological classification
25
Linguistic treatment (1)
  • Encoding the linguistic information shared by
    all abbreviations
  • the head element represents the abbreviation
    itself
  • every abbreviation has a generalized type:
    acronym or word
  • every abbreviation has at least one extension
  • every extension element consists of a phrase

26
Linguistic treatment (2)
  • Encoding the linguistic information shared by
    some types of abbreviations
  • the non-lexicalized abbreviations were assigned
    grammatical information according to their
    syntactic head. Thus the element 'class' was
    introduced.
  • the partly lexicalized abbreviations were
    additionally assigned grammatical information
    according to their inflection. Thus the element
    'flex' was introduced.
  • the abbreviations of foreign origin usually have
    an additional head element, called headforeign
    (headf).

27
Examples (1)
  • type ACRONYM
  • <abbr><head>???</head><acronym/><expan><phrase>???
    ???? ?? ???????????? ?????</phrase><class>????</class></expan></abbr>
  • <abbr><head>??</head><acronym/>
  • <expan><phrase>???????? ???????????</phrase>
    <class>?????</class></expan>
  • <expan><phrase>????????????ka ??????</phrase>
    <class>????</class></expan></abbr>
  • <abbr><head>????</head><acronym/><expan><phrase>??
    ??? ?? ???????? ?? ??????????????
    ???????</phrase><class>????</class>
  • <flex>????-?,????-??,????-???</flex></expan></abbr>
  • <abbr><head>???</head><headf>FBI</headf><acronym/>
  • <expan><phrase>????????? ???? ??
    ???????????</phrase>
  • <class>?????</class></expan></abbr>

28
Examples (2)
  • type WORD
  • <abbr><head>?-??</head> <word/>
  • <expan><phrase>?????????</phrase></expan></abbr>
  • <abbr><head>??.</head><word/>
  • <expan><phrase>????</phrase></expan></abbr>
  • <abbr><head>?.</head><head>?-?</head><word/>
  • <expan><phrase>???????</phrase></expan></abbr>
  • <abbr><head>??.</head><word/>
  • <expan>?????</expan>
  • <expan>????</expan></abbr>
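Assuming well-formed entries of the shape shown above, a minimal
Python sketch for reading such a lexicon. The sample entry is a
Latin-alphabet stand-in, since the Cyrillic content of the original
examples (the runs of '?') was lost in transcription:

  import xml.etree.ElementTree as ET

  SAMPLE = "<abbr><head>Dr.</head><word/><expan><phrase>Doctor</phrase></expan></abbr>"

  def read_abbr(entry):
      root = ET.fromstring(entry)
      heads = [h.text for h in root.findall("head")]
      kind = "acronym" if root.find("acronym") is not None else "word"
      expansions = [
          (e.findtext("phrase") or (e.text or "").strip(),  # bare <expan>text</expan> also occurs
           e.findtext("class"),                             # syntactic class, if present
           e.findtext("flex"))                              # inflection info, if present
          for e in root.findall("expan")
      ]
      return heads, kind, expansions

  # read_abbr(SAMPLE) -> (['Dr.'], 'word', [('Doctor', None, None)])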

29
Evaluation
  • The method is hard to evaluate in absolute terms
    with respect to only one class of tokens
  • We apply only a relative evaluation with respect
    to a given rank
  • Only the precision measure is really applicable
  • The recall is practically equal to 100%
  • Precision: 98.7% for the first 557 candidates
    (Rank > 25)

30
Other applications
  • Classification and linguistic treatment of other
    classes of tokens: names, sentence boundary
    markers
  • (similar to abbreviations)
  • Determining the vocabulary of a dictionary for
    human use
  • The lexemes with the greatest impact on
    present-day texts will be chosen
  • Similar treatment of new words

31
Future work
  • Dealing with different ambiguities
  • Combination with other methods, such as
    document-centered approaches and morphological
    guessers
  • Using other stochastic methods