Proper Nouns in Czech Corpora

1
Proper Nouns in Czech Corpora
  • Magda Ševčíková
  • sevcikova_at_ufal.mff.cuni.cz
  • Institute of Formal and Applied Linguistics
  • Faculty of Mathematics and Physics
  • Charles University in Prague
  • Czech Republic

2
Outline
  • Introduction
  • Proper nouns in corpora of Czech: current state
  • Corpus SYN2000
  • Prague Dependency Treebank 2.0
  • Proposal of a complex proper noun annotation
    within the Prague Dependency Treebank 2.0
  • Final remarks

3
Introduction
  • proper nouns
  • lacking a generic meaning
  • denoting individuals, institutions etc.
  • identifying them as unique items
  • proper nouns in NLP
  • question answering
  • information extraction
  • machine translation
  • pan Zelený should not be translated into Mr Green
  • Frankfurt am Main or Frankfurt nad Mohanem, but
    not a combination of both (e.g., Frankfurt nad
    Main)
  • explicit annotation of proper nouns needed

4
Proper nouns in corpora of Czech: current state
  • two large corpora of Czech as sources of proper
    nouns
  • SYN2000
  • 100 million tokens
  • morphological annotation
  • morphological lemmas and positional tags
  • no explicit annotation of proper nouns
  • Prague Dependency Treebank 2.0 (PDT 2.0)
  • morphologically and syntactically annotated
  • very basic annotation of proper nouns
  • at the morphological layer
  • at the deep-syntactic (tectogrammatical) layer

5
Proper nouns in SYN2000 (http://ucnk.ff.cuni.cz)
  • proper nouns were not marked
  • other characteristics used for searching for
    proper nouns
  • capitalization
  • only proper nouns capitalized in Czech (in
    comparison, e.g., to German)
  • however, it is not a sufficiently distinctive
    feature (sentence beginnings)
  • context patterns
  • for instance, Mr Xxx / President Xxx
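
The two search heuristics above can be sketched roughly as follows. This is only an illustration, not the query mechanism of SYN2000: the function names, the regex tokenizer, and the small set of title words are assumptions made for the example.

```python
import re

# Hypothetical title words for the context pattern; the slide only names
# "Mr Xxx / President Xxx" as examples.
TITLE_WORDS = {"pan", "paní", "prezident"}

def tokenize(sentence):
    """Very rough tokenizer: runs of word characters only."""
    return re.findall(r"\w+", sentence)

def proper_noun_candidates(sentences):
    """Return (token, reason) pairs for tokens that look like proper nouns."""
    candidates = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, token in enumerate(tokens):
            if not token[0].isupper():
                continue
            if i == 0:
                # Sentence beginnings are capitalized anyway, so
                # capitalization is not distinctive here; skip the token.
                continue
            if tokens[i - 1].lower() in TITLE_WORDS:
                candidates.append((token, "context"))
            else:
                candidates.append((token, "capitalization"))
    return candidates

sentences = ["Včera přišel pan Zelený.", "Navštívil Frankfurt nad Mohanem."]
print(proper_noun_candidates(sentences))
```

Skipping every sentence-initial token loses proper nouns that happen to start a sentence, which is one reason the slide calls capitalization an insufficient feature.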

6
(No Transcript)
7
Proper nouns in PDT 2.0 (http://ufal.mff.cuni.cz/pdt2.0/)
  • basic annotation of proper nouns
  • at the morphological layer
  • each token was assigned a morphological lemma and
    a positional tag
  • lemma flag for marking of proper nouns
  • at the tectogrammatical layer
  • each sentence represented by a labeled dependency
    tree structure (consisting of nodes and edges)
  • special means for annotation of selected
    phenomena concerning proper nouns

8
PDT 2.0: Morphological layer
  • proper noun type indicated by a value of a
    special flag which was attached to lemmas of
    proper nouns by a separator
  • Jan_Y, Zelený_S
  • seven flag values
  • first names, surnames, inhabitant names,
    geographical names, institution names, product
    names, other names
  • convenient for annotation of one-word proper
    nouns
  • insufficient for more complex proper nouns
  • misinterpretations
  • Frankfurt_G nad Mohanem_G
  • Vysoký_K škola ekonomická (University of
    Economics)
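
A minimal sketch of reading such flags, using the simplified `lemma_FLAG` notation shown on this slide. The letter-to-type mapping below is inferred from the slide's four examples and is an assumption; the remaining flag letters are not given here and are left out.

```python
# Flag letters inferred from the slide's examples (Jan_Y, Zelený_S,
# Frankfurt_G, Vysoký_K); the slide does not spell out this mapping.
FLAG_TYPES = {
    "Y": "first name",
    "S": "surname",
    "G": "geographical name",
    "K": "institution name",
}

def split_lemma(lemma, separator="_"):
    """Split 'Jan_Y' into ('Jan', 'first name'); unflagged lemmas get None."""
    base, sep, flag = lemma.rpartition(separator)
    if sep and flag in FLAG_TYPES:
        return base, FLAG_TYPES[flag]
    return lemma, None

# Token-by-token flags cannot express that the three tokens form one name:
tokens = "Frankfurt_G nad Mohanem_G".split()
print([split_lemma(t) for t in tokens])
```

Read this way, the expression comes out as two unrelated geographical names separated by a preposition, which is the misinterpretation the slide points to.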

9
PDT 2.0: Tectogrammatical layer
  • no complex annotation of proper nouns
  • annotation means for selected phenomena only
  • person names
  • node attribute is_name_of_person
  • non-inflected street names, book titles etc.
    accompanied by a generic noun
  • functor ID
  • book titles etc. which have a form of a
    prepositional group and are not accompanied by a
    generic noun
  • an artificial node with lemma Idph
  • besides these individual cases, proper nouns were
    treated as common parts of a sentence

10
  • (a) person name Klára Nováková Malá
  • (b) V sobotu v poledne je hezký film (lit. On
    Saturday at Noon is a nice film)
  • (c) Šli jsme ulicí Spálená (We walked through the
    street.instr Spálená.nom)
  • (d) Šli jsme ulicí Spálenou (We walked through the
    street.instr Spálená.instr)
  • (e) Šli jsme Spálenou (We walked through
    Spálená.instr)
  • (instr for instrumental case, nom for nominative
    case)

11
Proposal of a complex proper noun annotation
within PDT 2.0
  • proper noun type defined at each proper noun
  • proper noun classification
  • annotation of one-word proper nouns as well as
    more complicated proper noun structures
  • four structure types to be annotated
  • the inner structure of more complex proper nouns
    described as a non-dependency relation
  • tectogrammatical layer

12
Proper noun classification for Czech
  • two-level classification
  • 1st level: five super-types of proper nouns
  • personal names, geographical names, institution
    names, artefact names, media names
  • (two more types: temporal expressions,
    numerical expressions occurring in postal
    addresses)
  • 2nd level: proper noun types
  • e.g., types of geographical names: street/square
    names, city/town names, state names etc.
  • underspecification allowed
  • each type encoded by a unique two-character tag
  • gs for street/square names, gu for city/town
    names
  • g_ for a geographical name of an unknown type
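
One way to work with such tags, including underspecified ones, is sketched below. Only the three tags named on the slide are listed; the function names and the `subsumes` relation are illustrative assumptions, not part of the proposal.

```python
UNDERSPECIFIED = "_"

# Only the tags named on the slide; the full tag set is not given there.
TAG_NAMES = {
    "gs": "street/square name",
    "gu": "city/town name",
    "g_": "geographical name of an unknown type",
}

def super_type(tag):
    """First character: the 1st-level super-type (g = geographical)."""
    return tag[0]

def is_underspecified(tag):
    """True if the 2nd-level type is left open."""
    return tag[1] == UNDERSPECIFIED

def subsumes(general, specific):
    """True if `general` equals `specific` or is its underspecified super-type tag."""
    if len(general) != 2 or len(specific) != 2:
        raise ValueError("tags must be exactly two characters")
    if general == specific:
        return True
    return is_underspecified(general) and super_type(general) == super_type(specific)

print(subsumes("g_", "gs"))  # an unknown geographical type covers street names
print(subsumes("gs", "gu"))  # two different specific types do not match
```

A relation of this kind would let an annotation tagged g_ be compared with a more specific tag such as gs without being counted as a disagreement.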

13
Structure types to be annotated
  • (i) one-word proper nouns
  • John
  • (ii) multi-word proper noun expressions
  • Vysoká škola ekonomická (University of Economics)
  • (iii) complex proper noun expressions
  • Frankfurt nad Mohanem
  • (iv) containers
  • Jan Zelený

14
(i) Annotation of one-word proper nouns
  • proper noun type indicated at each proper noun
  • new node attribute NE_roles
  • value set corresponds to all proper noun type
    tags (and container tags)
  • substitutes the current is_name_of_person
    attribute

15
(ii) Annotation of multi-word proper noun
expressions
  • every constituent of a multi-word proper noun
    expression has a node of its own
  • at all nodes, the same value of the NE_roles
    attribute occurs
  • edges in the sub-tree labeled with a new functor
    NEPART
  • syntactic function of the whole expression
    indicated by the functor of the governing node

Vyučuje na Vysoké škole ekonomické (He teaches
at the University of Economics)
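
The rules for multi-word expressions can be pictured with a small tree structure. This is a sketch under several assumptions: the `Node` class and its field names are invented for the example, the tag "i_" stands in for an institution tag that the slides do not give, the functor LOC for the whole expression is assumed, and škola is assumed to be the governing node.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    lemma: str
    functor: str
    ne_roles: str = ""  # empty for nodes outside any proper noun
    children: list = field(default_factory=list)

def subtree(node):
    """Yield the node itself, then all of its descendants."""
    yield node
    for child in node.children:
        yield from subtree(child)

def is_valid_multiword(governor):
    """Check the two rules: one shared NE_roles value, inner edges NEPART."""
    nodes = list(subtree(governor))
    same_type = all(n.ne_roles == governor.ne_roles for n in nodes)
    inner_edges = all(n.functor == "NEPART" for n in nodes[1:])
    return bool(governor.ne_roles) and same_type and inner_edges

# "i_" is a hypothetical institution tag; LOC is an assumed functor that
# carries the syntactic function of the whole expression.
school = Node("škola", "LOC", "i_", [
    Node("vysoký", "NEPART", "i_"),
    Node("ekonomický", "NEPART", "i_"),
])
print(is_valid_multiword(school))
```

Only the governing node keeps an ordinary functor, so the rest of the sentence tree still sees the expression as a single unit.
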
16
(iii) Annotation of complex proper noun
expressions
  • every constituent has a node of its own
  • a main part (Frankfurt) and an embedded part
    (Mohan)
  • type of the embedded part indicated by the value
    of the NE_roles attribute at the embedded part,
    type of the whole expression at the main part
  • relation between the main and the embedded part
    labeled with the NEPART functor

Navštívil Frankfurt nad Mohanem (He visited
Frankfurt am Main)
17
(iv) Annotation of containers
  • the Idph node as the governing node of the whole
    container
  • container type indicated by the value of the
    NE_roles attribute at the Idph node
  • proper noun types of the constituents defined by
    the values of their respective NE_roles attributes
  • relations between the Idph node and constituents
    labeled with the NEPART functor

Novým ředitelem je Jan Zelený (Jan Zelený is the
new director)
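
A container can be sketched as a nested structure whose governing node is Idph. The dictionary layout and the tags "P", "pf" and "ps" are hypothetical placeholders, since the slides do not list the container or personal-name tags.

```python
def make_container(container_tag, constituents):
    """Build an Idph-governed container from (lemma, tag) pairs."""
    return {
        "lemma": "Idph",
        "NE_roles": container_tag,  # type of the whole container
        "children": [
            {"lemma": lemma, "NE_roles": tag, "functor": "NEPART", "children": []}
            for lemma, tag in constituents
        ],
    }

def constituent_types(container):
    """Map each constituent lemma to its own proper noun type tag."""
    return {child["lemma"]: child["NE_roles"] for child in container["children"]}

# Hypothetical tags: "P" for a person-name container, "pf" for a first
# name, "ps" for a surname.
person = make_container("P", [("Jan", "pf"), ("Zelený", "ps")])
print(constituent_types(person))
```

Unlike the multi-word case, each constituent keeps its own type here, and the container type sits only on the Idph node.
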
18
Final remarks
  • annotation of proper nouns in corpora
  • linguistic research
  • NLP subtasks
  • complex proper noun annotation within PDT 2.0
  • tectogrammatical layer more convenient than the
    morphological one
  • annotation means and rules proposed
  • future work
  • further elaborate the proposed means and rules
  • manual annotation of sample data
  • development of automatic annotation tools