Title: Learning a token classification from a large corpus (A case study in abbreviations)
1. Learning a token classification from a large corpus (A case study in abbreviations)
- Petya Osenova, Kiril Simov
- BulTreeBank Project (www.BulTreeBank.org)
- Linguistic Modeling Laboratory, Bulgarian Academy of Sciences
- petya_at_bultreebank.org, kivs_at_bultreebank.org
- ESSLLI'2002 Workshop on Machine Learning Approaches in Computational Linguistics
- August 5-9, 2002
2. Plan of the talk
- BulTreeBank Project
- Text Archive
- Token Processing Problem
- Global Token Classification
- Application to Abbreviations
3. BulTreeBank project
- It is a joint project between the Linguistic Modeling Laboratory (LML), Bulgarian Academy of Sciences, and the Seminar fuer Sprachwissenschaft (SfS), Tuebingen. It is funded by the Volkswagen Foundation, Germany.
- Its main goal is the creation of a high-quality, HPSG-oriented syntactic treebank of Bulgarian.
- It also aims at producing a parser and a partial grammar of Bulgarian.
- Within the project an XML-based system for corpora development is being created.
4. BulTreeBank team
- Principal researcher
- Kiril Simov
- Researchers
- Petya Osenova, Milena Slavcheva, Sia Kolkovska
- PhD student
- Elisaveta Balabanova
- Students
- Alexander Simov, Milen Kouylekov,
- Krasimira Ivanova, Dimitar Dojkov
5. BulTreeBank text archive
- A collection of linguistically interpreted texts from different genres (target size: 100 million words)
- A linguistically interpreted text is a text in which all meaningful tokens (including numbers, special signs and others) are marked up with linguistic descriptions
6. The current state of the text archive
- Nearly 90 000 000 running words: 15% fiction, 78% newspapers, and 7% legal texts, government bulletins and other genres
- About 70 million running words are converted into XML format with respect to the TEI guidelines
- 10 million running words are morphologically tagged
- 500 000 running words are manually disambiguated
7. Pre-processing steps (1)
- Morphosyntactic tagger
- Assigning all appropriate morpho-syntactic features to each potential word
- Part-of-speech disambiguator
- Choosing the right morpho-syntactic features for each potential word in the context
- Partial grammar for non-word tokens
8. Pre-processing steps (2)
- Partial grammars
- Sentence boundaries grammar
- Named Entity Recognition
- Names of people, places, organizations etc.
- Dates, currencies, numerical expressions
- Abbreviations
- Foreign tokens
- Chunk grammar (Abney 1991, 1996)
- Non-recursive constituents
9. Token processing problem
- A token in a text receives its linguistic interpretation on the basis of two sources of information: (1) the language and (2) the context of use
- Two problems:
- For less studied languages there are not enough language resources (low level of linguistic interpretation)
- Erroneous use in the context (wrong prediction)
10. Token classification
- Symbol-based classification
- The tokens are defined by their immanent graphical characteristics
- General token classification
- The tokens fall into several categories: common word, proper name, abbreviation, symbol, punctuation, error
- Grammatical and semantic classification
- The tokens are presented in several lexicons, in which their grammatical and semantic features are listed
11. General token classification
- Our goal is to learn a corpus-based classification of the tokens with respect to the general token classification
- We use this classification in two ways:
- For an initial classification of the tokens in the texts before consulting the dictionary, and
- For the linguistic processing of the tokens from the different classes
12. Learning general token categories (1)
- Token classes:
- Common words
- typical: lowercased, and first capital letter in sentence-initial position; non-typical: all caps
- Proper names
- typical: first capital letter; non-typical: all caps; wrong: lowercased
- Abbreviations
- typical: all caps, mixed, or lowercased (with period, hyphen or a single letter)
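The graphical patterns above can be sketched as a small classifier. This is a minimal illustration, not the project's actual code: the function name, regular expressions, and category labels are assumptions, and a token may legitimately receive several candidate classes (the ambiguity that the later statistical ranking resolves).

```python
import re

def graphical_categories(token, sentence_initial=False):
    """Assign potential classes to a token from its graphical shape alone.

    Returns a list, since one shape can suggest several classes;
    the statistical ranking later resolves this ambiguity.
    """
    cats = []
    if re.fullmatch(r"[a-z]+", token):
        cats.append("common_word")              # typical common word
    if re.fullmatch(r"[A-Z][a-z]+", token):
        cats.append("proper_name")              # typical proper name
        if sentence_initial:
            cats.append("common_word")          # capital may only mark sentence start
    if re.fullmatch(r"[A-Z]{2,}", token):
        # all caps: typical abbreviation, non-typical name or common word
        cats.extend(["abbreviation", "proper_name", "common_word"])
    if re.fullmatch(r"[A-Za-z]+\.|([A-Za-z]\.)+|[A-Za-z]+-[A-Za-z]+|[A-Za-z]", token):
        cats.append("abbreviation")             # period, hyphen or a single letter
    return cats
```

For example, an all-caps token yields three candidate classes, matching the typical/non-typical distinctions listed above.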
13. Learning general token categories (2)
- Some problems:
- Some tokens can belong to more than one class according to their graphical properties.
- Spelling errors in a large set of texts could cause misclassification.
14. Learning general token categories (3)
- Our classification is not Boolean but gradual: a ranking of tokens with respect to each of the above categories.
- Our initial procedure included the following steps:
- We used some graphical criteria for assigning potential categories to the unknown tokens.
- We used statistical methods to distinguish, within each category, the most frequent tokens of the category from tokens not in the category or rare tokens.
15. Learning general token categories (4)
- Graphical criterion
- It takes into account the graphical specificity of the tokens.
- For each category a list of tokens potentially belonging to it was constructed.
- Well-known problems remain, such as:
- Common words written in capital letters
- Abbreviations written in a wrong way
- The graphical criterion alone is not sufficient.
16. Learning general token categories (5)
- Statistical criterion
- For each category, in order to get the maximal number of right predictions for candidate tokens, every candidate token is ranked.
- In fact we classify normalized tokens.
- A normalized token is an abstraction over tokens that share the same sequence of letters from a given alphabet.
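The normalization step can be sketched as follows. Lowercasing is an assumed normalization for illustration; the original abstraction may cover other graphical properties as well, and the function names are hypothetical.

```python
from collections import defaultdict

def normalize(token):
    """Map a token to its normalized form: surface variants that share
    the same letter sequence (e.g. USA, Usa, usa) get one key.
    Lowercasing is an assumed normalization for this sketch."""
    return token.lower()

def group_variants(tokens):
    """Group surface tokens under their normalized token, so that ranks
    are computed per normalized token rather than per surface variant."""
    groups = defaultdict(set)
    for tok in tokens:
        groups[normalize(tok)].add(tok)
    return groups
```

Ranking normalized tokens rather than raw tokens means that capitalization variants contribute to a single candidate's statistics.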
17. Learning general token categories (6)
- Ranking within a category (1)
- The ranking formula is: Rank = TokPar × DocPar
- where the two parameters are:
- TokPar = True/All: the number of true appearances of the token divided by the number of all appearances of the token
- DocPar: the number of documents in which the correctly written token was found, if this number is less than 50; otherwise the value is 50
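The formula above translates directly into code. This is a straightforward transcription of the two parameters as defined on this slide; the function and parameter names are illustrative.

```python
def rank(true_count, all_count, doc_count, doc_cap=50):
    """Rank = TokPar * DocPar.

    TokPar = True/All: true appearances of the token divided by
             all appearances of the token.
    DocPar = number of documents containing the correctly written
             token, capped at 50 (the normalization bound).
    """
    if all_count == 0:
        return 0.0
    tok_par = true_count / all_count
    doc_par = min(doc_count, doc_cap)
    return tok_par * doc_par
```

A token correct in 90 of 100 occurrences and spread over 120 documents thus gets rank 0.9 × 50 = 45, while a token seen correctly once in a single document gets rank 1.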
18. Learning general token categories (7)
- Ranking within a category (2)
- The first parameter does not distinguish between one and a hundred occurrences; thus the real scope of the distribution is lost.
- The impact that the token has over the text archive is represented by the second parameter. The upper bound of 50 is used as a normalization parameter.
- Thus the tokens that are rare or do not belong to the category receive a very small rank.
19. Learning general token categories (8)
- Usefulness
- The method favours the tokens with greater impact over the whole corpus.
- The tokens appearing in a small number of documents are processed by local-based (document-centered) approaches.
20. General token categories and local-based approaches
- The local-based approaches can filter the general classification with respect to ambiguous or unusual usage of tokens.
- When the local-based approach is inapplicable, the information is taken from the general token classification.
- The result of such a ranking is very useful for the other task mentioned above: the linguistic treatment of the unknown tokens.
21. Abbreviations in the pre-processing
- Abbreviations are special tokens in the text
- They contribute to robust:
- tagging
- disambiguation
- shallow parsing
22. Extraction criteria
- Three criteria:
- Graphical criterion (as above)
- Statistical criterion (as above)
- Context criterion: we tried to extract some abbreviations together with their extensions, usually written in brackets. Thus the ambiguity is reduced.
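A minimal sketch of such a context criterion, assuming two common bracket patterns: an acronym followed by its bracketed extension, or an extension followed by its bracketed acronym. The regular expressions, function name, and example strings are illustrative assumptions, not the project's actual patterns.

```python
import re

# Pattern 1: acronym followed by its bracketed extension,
# e.g. "BAS (Bulgarian Academy of Sciences)".
ABBR_THEN_EXPAN = re.compile(r"\b([A-Z]{2,})\s*\(([^()]+)\)")

# Pattern 2: capitalized extension followed by its bracketed acronym,
# e.g. "Linguistic Modeling Laboratory (LML)".
EXPAN_THEN_ABBR = re.compile(r"((?:[A-Z][a-z]+\s+){1,6})\(([A-Z]{2,})\)")

def extract_pairs(text):
    """Collect (abbreviation, extension) candidate pairs from raw text."""
    pairs = []
    for abbr, expan in ABBR_THEN_EXPAN.findall(text):
        pairs.append((abbr, expan.strip()))
    for expan, abbr in EXPAN_THEN_ABBR.findall(text):
        pairs.append((abbr, expan.strip()))
    return pairs
```

Because the extension appears next to the abbreviation, each extracted pair is unambiguous, unlike an abbreviation seen in isolation.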
23. Dealing with abbreviations
- Our approach includes three steps:
- Typological classification: the existing classifications were refined with respect to the electronic treatment of abbreviations
- Extraction: different criteria were proposed for the extraction of the most frequent abbreviations in the archive
- Linguistic treatment: the abbreviations were extended and the relevant grammatical information was added
24. Typological classification
25. Linguistic treatment (1)
- Encoding the linguistic information shared by all abbreviations:
- the head element presents the abbreviation itself
- every abbreviation has a generalized type: acronym or word
- every abbreviation has at least one extension
- every extension element consists of a phrase
26. Linguistic treatment (2)
- Encoding the linguistic information shared by some types of abbreviations:
- the non-lexicalized abbreviations were assigned grammatical information according to their syntactic head. Thus the element 'class' was introduced.
- the partly lexicalized abbreviations were additionally assigned grammatical information according to their inflection. Thus the element 'flex' was introduced.
- the abbreviations of foreign origin usually have an additional head element, called headforeign (headf).
27. Examples (1)
- type ACRONYM
- <abbr><head>???</head><acronym/><expan><phrase>??? ???? ?? ???????????? ?????</phrase><class>????</class></expan></abbr>
- <abbr><head>??</head><acronym/>
- <expan><phrase>???????? ???????????</phrase> <class>?????</class></expan>
- <expan><phrase>????????????ka ??????</phrase> <class>????</class></expan></abbr>
- <abbr><head>????</head><acronym/><expan><phrase>????? ?? ???????? ?? ?????????????? ???????</phrase><class>????</class>
- <flex>????-?,????-??,????-???</flex></expan></abbr>
- <abbr><head>???</head><headf>FBI</headf><acronym/>
- <expan><phrase>????????? ???? ?? ???????????</phrase>
- <class>?????</class></expan></abbr>
28. Examples (2)
- type WORD
- <abbr><head>?-??</head><word/>
- <expan><phrase>?????????</phrase></expan></abbr>
- <abbr><head>??.</head><word/>
- <expan><phrase>????</phrase></expan></abbr>
- <abbr><head>?.</head><head>?-?</head><word/>
- <expan><phrase>???????</phrase></expan></abbr>
- <abbr><head>??.</head><word/>
- <expan>?????</expan>
- <expan>????</expan></abbr>
29. Evaluation
- The method is hard to evaluate absolutely with respect to only one class of tokens
- We apply only a relative evaluation with respect to a given rank
- Only the precision measure is really applicable
- The recall is practically equal to 100%
- Precision is 98.7% for the first 557 candidates (Rank > 25)
30. Other applications
- Classification and linguistic treatment of other classes of tokens: names, sentence boundary markers (similar to abbreviations)
- Determination of the vocabulary of a dictionary for human use
- The lexemes with the greatest impact over present-day texts will be chosen
- Similar treatment of the new words
31. Future work
- Dealing with different ambiguities
- Combination with other methods, such as document-centered approaches and morphological guessers
- Using other stochastic methods