Title: Learning a token classification from a large corpus (A case study in abbreviations)
1. Learning a token classification from a large corpus (A case study in abbreviations)
- Petya Osenova, Kiril Simov
- BulTreeBank Project (www.BulTreeBank.org)
- Linguistic Modeling Laboratory, Bulgarian Academy of Sciences
- petya_at_bultreebank.org, kivs_at_bultreebank.org
- ESSLLI'2002 Workshop on Machine Learning Approaches in Computational Linguistics
- August 5-9, 2002
2. Plan of the talk
- BulTreeBank Project
- Text Archive
- Token Processing Problem
- Global Token Classification
- Application to Abbreviations
3. BulTreeBank project
- It is a joint project between the Linguistic Modeling Laboratory (LML), Bulgarian Academy of Sciences, and the Seminar fuer Sprachwissenschaft (SfS), Tuebingen. It is funded by the Volkswagen Foundation, Germany.
- Its main goal is the creation of a high-quality, HPSG-oriented syntactic treebank of Bulgarian.
- It also aims at producing a parser and a partial grammar of Bulgarian.
- Within the project an XML-based system for corpora development is being created.
4. BulTreeBank team
- Principal researcher
- Kiril Simov
- Researchers
- Petya Osenova, Milena Slavcheva, Sia Kolkovska
- PhD student
- Elisaveta Balabanova
- Students
- Alexander Simov, Milen Kouylekov,
- Krasimira Ivanova, Dimitar Dojkov
5. BulTreeBank text archive
- A collection of linguistically interpreted texts from different genres (target size: 100 million words)
- A linguistically interpreted text is a text in which all meaningful tokens (including numbers, special signs and others) are marked up with linguistic descriptions
6. The current state of the text archive
- Nearly 90 000 000 running words: 15% fiction, 78% newspapers, and 7% legal texts, government bulletins and other genres
- About 70 million running words are converted into XML format with respect to the TEI guidelines
- 10 million running words are morphologically tagged
- 500 000 running words are manually disambiguated
7. Pre-processing steps (1)
- Morphosyntactic tagger
- Assigning all appropriate morpho-syntactic features to each potential word
- Part-of-speech disambiguator
- Choosing the right morpho-syntactic features for each potential word in the context
- Partial grammar for non-word tokens
8. Pre-processing steps (2)
- Partial grammars
- Sentence boundaries grammar
- Named Entity Recognition
- Names of people, places, organizations etc.
- Dates, currencies, numerical expressions
- Abbreviations
- Foreign tokens
- Chunk grammar (Abney 1991, 1996)
- Non-recursive constituents
9. Token processing problem
- A token in a text receives its linguistic interpretation on the basis of two sources of information: (1) the language and (2) the context of use
- Two problems:
- For less studied languages there are not enough language resources (low level of linguistic interpretation)
- Erroneous use in the context (wrong prediction)
10. Token classification
- Symbol-based classification
- The tokens are defined by their immanent graphical characteristics
- General token classification
- The tokens fall into several categories: common word, proper name, abbreviation, symbol, punctuation, error
- Grammatical and semantic classification
- The tokens are presented in several lexicons, in which their grammatical and semantic features are listed
11. General token classification
- Our goal is to learn a corpus-based classification of the tokens with respect to the general token classification
- We use this classification in two ways:
- For an initial classification of the tokens in the texts before consulting the dictionary, and
- For the linguistic processing of the tokens from the different classes
12. Learning general token categories (1)
- Token classes:
- Common words
- typical: lowercased, and first capital letter in sentence-initial position; non-typical: all caps
- Proper names
- typical: first capital letter; non-typical: all caps; wrong: lowercased
- Abbreviations
- typical: all caps, mixed, or lowercased (with period, hyphen or a single letter)
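The graphical patterns above can be sketched as a small classifier. This is a minimal illustration, not the project's actual code: the function name, regular expressions, and category labels are assumptions, and a token may legitimately receive several candidate classes (the ambiguity that the later statistical ranking resolves).

```python
import re

def graphical_categories(token, sentence_initial=False):
    """Assign potential classes to a token from its graphical shape alone.

    Returns a list, since one shape can suggest several classes;
    the statistical ranking later resolves this ambiguity.
    """
    cats = []
    if re.fullmatch(r"[a-z]+", token):
        cats.append("common_word")              # typical common word
    if re.fullmatch(r"[A-Z][a-z]+", token):
        cats.append("proper_name")              # typical proper name
        if sentence_initial:
            cats.append("common_word")          # capital may only mark sentence start
    if re.fullmatch(r"[A-Z]{2,}", token):
        # all caps: typical abbreviation, non-typical name or common word
        cats.extend(["abbreviation", "proper_name", "common_word"])
    if re.fullmatch(r"[A-Za-z]+\.|([A-Za-z]\.)+|[A-Za-z]+-[A-Za-z]+|[A-Za-z]", token):
        cats.append("abbreviation")             # period, hyphen or a single letter
    return cats
```

For example, an all-caps token yields three candidate classes, matching the typical/non-typical distinctions listed above.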
13. Learning general token categories (2)
- Some problems:
- Some tokens can belong to more than one class according to their graphical properties.
- Spelling errors in a large set of texts could cause misclassification.
14. Learning general token categories (3)
- Our classification is not Boolean but gradual: a ranking of tokens with respect to each of the above categories.
- Our initial procedure included the following steps:
- We used some graphical criteria for assigning potential categories to the unknown tokens.
- We used statistical methods to distinguish, within each category, the most frequent tokens of the category from tokens not in the category or rare tokens.
15. Learning general token categories (4)
- Graphical criterion
- It takes into account the graphical specificity of the tokens.
- For each category a list of tokens potentially belonging to it was constructed.
- Well-known problems remain, such as:
- Common words written in capital letters
- Abbreviations written in a wrong way
- The graphical criterion alone is not sufficient.
16. Learning general token categories (5)
- Statistical criterion
- For each category, in order to get the maximal number of right predictions for candidate tokens, every candidate token is ranked.
- In fact we classify normalized tokens.
- A normalized token is an abstraction over tokens that share the same sequence of letters from a given alphabet.
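The normalization step can be sketched as follows. Lowercasing is an assumed normalization for illustration; the original abstraction may cover other graphical properties as well, and the function names are hypothetical.

```python
from collections import defaultdict

def normalize(token):
    """Map a token to its normalized form: surface variants that share
    the same letter sequence (e.g. USA, Usa, usa) get one key.
    Lowercasing is an assumed normalization for this sketch."""
    return token.lower()

def group_variants(tokens):
    """Group surface tokens under their normalized token, so that ranks
    are computed per normalized token rather than per surface variant."""
    groups = defaultdict(set)
    for tok in tokens:
        groups[normalize(tok)].add(tok)
    return groups
```

Ranking normalized tokens rather than raw tokens means that capitalization variants contribute to a single candidate's statistics.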
17. Learning general token categories (6)
- Ranking within a category (1)
- The ranking formula is: Rank = TokPar × DocPar
- where the two parameters are:
- TokPar = True/All: the number of true appearances of the token divided by the number of all appearances of the token
- DocPar: the number of documents in which the correctly written token was found, if this number is less than 50; otherwise the value is 50
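The formula above translates directly into code. This is a straightforward transcription of the two parameters as defined on this slide; the function and parameter names are illustrative.

```python
def rank(true_count, all_count, doc_count, doc_cap=50):
    """Rank = TokPar * DocPar.

    TokPar = True/All: true appearances of the token divided by
             all appearances of the token.
    DocPar = number of documents containing the correctly written
             token, capped at 50 (the normalization bound).
    """
    if all_count == 0:
        return 0.0
    tok_par = true_count / all_count
    doc_par = min(doc_count, doc_cap)
    return tok_par * doc_par
```

A token correct in 90 of 100 occurrences and spread over 120 documents thus gets rank 0.9 × 50 = 45, while a token seen correctly once in a single document gets rank 1.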
18. Learning general token categories (7)
- Ranking within a category (2)
- The first parameter does not distinguish between one and a hundred occurrences; thus the real scope of the distribution is lost.
- The impact that the token has over the text archive is represented by the second parameter. The upper bound of 50 is used as a normalization parameter.
- Thus the tokens that are rare or do not belong to the category receive a very small rank.
19. Learning general token categories (8)
- Usefulness
- The method favours the tokens with greater impact over the whole corpus.
- The tokens appearing in a small number of documents are processed by local-based (document-centered) approaches.
20. General token categories and local-based approaches
- The local-based approaches can filter the general classification with respect to ambiguous or unusual usage of tokens.
- When the local-based approach is inapplicable, the information is taken from the general token classification.
- The result of such a ranking is very useful for the other task mentioned above: the linguistic treatment of the unknown tokens.
21. Abbreviations in the pre-processing
- Abbreviations are special tokens in the text
- They contribute to robust:
- tagging
- disambiguation
- shallow parsing
22. Extraction criteria
- Three criteria:
- Graphical criterion (as above)
- Statistical criterion (as above)
- Context criterion: we tried to extract some abbreviations together with their extensions, usually written in brackets. Thus the ambiguity is reduced.
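A minimal sketch of such a context criterion, assuming two common bracket patterns: an acronym followed by its bracketed extension, or an extension followed by its bracketed acronym. The regular expressions, function name, and example strings are illustrative assumptions, not the project's actual patterns.

```python
import re

# Pattern 1: acronym followed by its bracketed extension,
# e.g. "BAS (Bulgarian Academy of Sciences)".
ABBR_THEN_EXPAN = re.compile(r"\b([A-Z]{2,})\s*\(([^()]+)\)")

# Pattern 2: capitalized extension followed by its bracketed acronym,
# e.g. "Linguistic Modeling Laboratory (LML)".
EXPAN_THEN_ABBR = re.compile(r"((?:[A-Z][a-z]+\s+){1,6})\(([A-Z]{2,})\)")

def extract_pairs(text):
    """Collect (abbreviation, extension) candidate pairs from raw text."""
    pairs = []
    for abbr, expan in ABBR_THEN_EXPAN.findall(text):
        pairs.append((abbr, expan.strip()))
    for expan, abbr in EXPAN_THEN_ABBR.findall(text):
        pairs.append((abbr, expan.strip()))
    return pairs
```

Because the extension appears next to the abbreviation, each extracted pair is unambiguous, unlike an abbreviation seen in isolation.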
23. Dealing with abbreviations
- Our approach includes three steps:
- Typological classification: the existing classifications were refined with respect to the electronic treatment of abbreviations
- Extraction: different criteria were proposed for the extraction of the most frequent abbreviations in the archive
- Linguistic treatment: the abbreviations were extended and the relevant grammatical information was added
24. Typological classification
25. Linguistic treatment (1)
- Encoding the linguistic information shared by all abbreviations:
- the head element presents the abbreviation itself
- every abbreviation has a generalized type: acronym or word
- every abbreviation has at least one extension
- every extension element consists of a phrase
26. Linguistic treatment (2)
- Encoding the linguistic information shared by some types of abbreviations:
- the non-lexicalized abbreviations were assigned grammatical information according to their syntactic head. Thus the element 'class' was introduced.
- the partly lexicalized abbreviations were additionally assigned grammatical information according to their inflection. Thus the element 'flex' was introduced.
- the abbreviations of foreign origin usually have an additional head element, called headforeign (headf).
27. Examples (1)
- type ACRONYM
- <abbr><head>???</head><acronym/><expan><phrase>??? ???? ?? ???????????? ?????</phrase><class>????</class></expan></abbr>
- <abbr><head>??</head><acronym/>
- <expan><phrase>???????? ???????????</phrase> <class>?????</class></expan>
- <expan><phrase>????????????ka ??????</phrase> <class>????</class></expan></abbr>
- <abbr><head>????</head><acronym/><expan><phrase>????? ?? ???????? ?? ?????????????? ???????</phrase><class>????</class>
- <flex>????-?,????-??,????-???</flex></expan></abbr>
- <abbr><head>???</head><headf>FBI</headf><acronym/>
- <expan><phrase>????????? ???? ?? ???????????</phrase>
- <class>?????</class></expan></abbr>
28. Examples (2)
- type WORD
- <abbr><head>?-??</head><word/>
- <expan><phrase>?????????</phrase></expan></abbr>
- <abbr><head>??.</head><word/>
- <expan><phrase>????</phrase></expan></abbr>
- <abbr><head>?.</head><head>?-?</head><word/>
- <expan><phrase>???????</phrase></expan></abbr>
- <abbr><head>??.</head><word/>
- <expan>?????</expan>
- <expan>????</expan></abbr>
29. Evaluation
- The method is hard to evaluate absolutely with respect to only one class of tokens
- We apply only a relative evaluation with respect to a given rank
- Only the precision measure is really applicable
- The recall is practically equal to 100%
- Precision is 98.7% for the first 557 candidates (Rank > 25)
30. Other applications
- Classification and linguistic treatment of other classes of tokens: names, sentence boundary markers (similar to abbreviations)
- Determination of the vocabulary of a dictionary for human use
- The lexemes with the greatest impact over present-day texts will be chosen
- Similar treatment of the new words
31. Future work
- Dealing with different ambiguities
- Combination with other methods, such as document-centered approaches and morphological guessers
- Using other stochastic methods