1
Supporting Annotation Layers for Natural Language
Processing
  • Archana Ganapathi, Preslav Nakov, Ariel
    Schwartz, and Marti Hearst
  • Computer Science Division and SIMS,
    University of California, Berkeley

2
Motivation
  • Most natural language processing (NLP) algorithms
    make use of the results of previous processing
    steps, e.g.
      • Tokenizer
      • Part-of-speech tagger
      • Phrase boundary recognizer
      • Syntactic parser
      • Semantic tagger
  • There is no standard way to represent, store and
    retrieve text annotations efficiently.
  • MEDLINE has close to 13 million abstracts, and
    full text is starting to become available as well.

3
Text Annotation Framework
  • Annotations are stored independently of text in
    an RDBMS
  • Declarative query language for annotation
    retrieval
  • Indexing structure designed for efficient query
    processing
  • Object-oriented API for annotation insertion,
    deletion and modification

4
Key Contributions
  • Support for hierarchical and overlapping layers
    of annotation
  • Querying multiple levels of annotations
    simultaneously
  • First to evaluate different physical database
    designs
  • Focused on scaling annotation-based queries to
    very large corpora with many layers of
    annotations
  • We propose a query language and demonstrate its
    power and the efficiency of the indexing
    architecture on a wide variety of query types
    that have been published in the NLP literature.

5
Outline
  • Related Work
  • Layered Query Language
  • Database Design
  • API
  • Evaluation
  • Conclusions

6
Related Work
  • Annotation graphs (AG): a directed acyclic graph;
    nodes can have time stamps or are constrained via
    paths to labeled parents and children (Bird and
    Liberman, 2001).
  • Emu system: sequential levels of annotations.
    Hierarchical relations may exist between
    different levels, but must be explicitly defined
    for each pair (Cassidy and Harrington, 2001).
  • The Q4M query language for MATE: directed graph;
    constraints and ordering of the annotated
    components. Stored in XML (McKelvie et al., 2001).
  • TIQL: queries consist of manipulating intervals
    of text, indicated by XML tags; supports set
    operations (Nenadic et al., 2002).

8-11
Layers of Annotations
[Figure: layers of annotations. Full parse, sentence and section layers are not shown.]

12
Layers of Annotation (cont.)
  • Each annotation represents an interval spanning a
    sequence of characters
      • absolute start and end positions
  • Each layer corresponds to a conceptually
    different kind of annotation
      • e.g., word, gene/protein, shallow parse
      • there can be several layers with the same semantics
  • Layers can be
      • sequential
      • overlapping
          • e.g., two multiple-word concepts sharing a word
      • hierarchical
          • spanning, when the intervals are nested as in a
            parse tree, or
          • ontological, when the token itself is derived
            from a hierarchical ontology

13
Layer Type Properties
  • One-to-one correspondence between the Word and
    the Part-of-speech (POS) layers.
  • The Word, POS and Shallow parse layers are
    sequential.
  • The Full parse layer is spanning hierarchical.
  • The Gene/protein layer assigns IDs from the
    LocusLink database of gene names
      • many-to-one in the case of multiple species
  • The Ontology layer assigns terms from the
    hierarchical medical ontology MeSH (Medical
    Subject Headings). It is
      • overlapping (annotations share the word cell), and
      • hierarchical, both
          • spanning, since blood cell (with MeSH ID
            D001773) spans cell (which is also in MeSH), and
          • ontological, since blood cell is a kind of cell
            and cell death (D016923) is a type of Biological
            Phenomena.
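
The interval relations described above (overlapping and spanning) can be sketched in a few lines of Python. This is only an illustration: the `Annotation` class, the helper names and the character offsets (chosen for the hypothetical text "blood cell death") are assumptions, not part of the framework presented in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    layer: str   # e.g. "word", "mesh"
    start: int   # absolute start character position (inclusive)
    end: int     # absolute end character position (inclusive)
    tag: str


def overlaps(a: Annotation, b: Annotation) -> bool:
    """True if the two intervals share at least one character."""
    return a.start <= b.end and b.start <= a.end


def spans(outer: Annotation, inner: Annotation) -> bool:
    """True if inner is nested inside outer (spanning hierarchy)."""
    return outer.start <= inner.start and inner.end <= outer.end


# Hypothetical offsets for the text "blood cell death".
blood_cell = Annotation("mesh", 0, 9, "blood cell")
cell = Annotation("mesh", 6, 9, "cell")
cell_death = Annotation("mesh", 6, 15, "cell death")

print(overlaps(blood_cell, cell_death))  # the two concepts share the word "cell"
print(spans(blood_cell, cell))           # "blood cell" spans "cell"
print(spans(blood_cell, cell_death))     # overlapping, but not nested
```

Because every annotation carries absolute positions, both relations reduce to integer comparisons and need no knowledge of the layer the annotation belongs to.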

14
Layered Query Language
  • Requirements for the query language on layers of
    annotations
      • Intuitive
      • Compact
      • Declarative
      • Expressive power for real world queries
      • Support for hierarchical and overlapping
        annotations
      • Compatible with SQL
  • LQL (Layered Query Language)
      • XML-like
      • Can be translated to SQL to run against an RDBMS
      • Tested on real world bioscience NLP applications

15
LQL by Example
[Example figure not reproduced; surviving labels: A01, A07, limb, vein, shoulder, artery.]

16
LQL Syntax
  • < > defines an arbitrary range over text.
  • A range is typically restricted to a specific
    layer type using <layer_name>.
  • All layers have a lex (the text spanned by the
    range) and a tag_type attribute.
  • Predicates on attribute values are enclosed in
    square brackets, i.e. <layer_name
    [attribute_name op value]>, where op is one of
    =, !=, >, >=, <, <=.
  • The language supports the boolean operators
    conjunction, disjunction, and negation (!).
  • By default tokens must follow each other
    immediately.
  • The ellipsis (...) indicates that tokens may
    intervene between the specified ranges.
  • A range is optionally followed by an action
    statement, enclosed in curly braces, which binds
    variables or specifies what should be printed,
    e.g. <gene_protein> {print tag_type}.
  • With no arguments, print outputs the value of the
    lex attribute.
  • The two special characters ^ and $ are used
    to match the range's beginning and end positions,
    respectively.
  • A wildcard can be used to descend
    an ontological hierarchy.

17
Additional LQL Features
  • For spanning hierarchical layers we can have
    hierarchical queries with several nested
    references to the same layer. The following query
    finds a PP of the form preposition+NP and prints
    that NP:
        <full_parse [tag_type="PP"]
            ^<pos [tag_type="prep"]> <full_parse [tag_type="NP"]> {print} $
        >
  • The keyword noorder allows an arbitrary order for
    the tokens within a range, e.g.
        <sentence noorder
            <gene_protein>
            <pos [tag_type="verb"]>
        > {print sentence}
  • The language allows for a combination of ordered
    and unordered constraints. For example:
        <sentence noorder
            <gene_protein>
            ( <pos [tag_type="verb"] <word [lex="binds"]>>
              <pos [tag_type="prep"] <word [lex="to"]>> )
        > {print sentence}
  • LQL currently does not support a range overlap
    operator.

18
LQL and SQL
  • LQL can be automatically translated into SQL
    (although this is not yet implemented), as
      • a user-defined function, or
      • a macro
  • The result of an LQL query is a relation
      • This allows the use of standard SQL syntax
        such as GROUP BY, COUNT, DISTINCT, ORDER BY,
        UNION etc.
  • An added advantage of LQL over SQL is that
    LQL queries do not need to be modified if the
    underlying logical design is changed.
  • LQL is still a work in progress
      • We plan to assess it via usability studies with
        computational linguistics researchers, modifying
        it as necessary.
      • However, we feel it is more intuitive and easier
        to use for text processing than the existing
        languages.

19
LQL Versus SQL

21
Database Design
  • We evaluated 5 different logical and physical
    database designs.
  • The basic model is similar to that of TIPSTER
    (Grishman, 1996). Each annotation is stored as a
    record in a relation.
  • Architecture 1 contains the following columns:
      • docid: document ID
      • section: title, abstract or body text
      • layer_id: a unique identifier of the annotation
        layer
      • start_char_pos: starting character position,
        relative to the particular section and docid
      • end_char_pos: end character position, relative to
        the particular section and docid
      • tag_type: a layer-specific unique token
        identifier
  • There is a separate table mapping token IDs to
    entities (the string in the case of a word, the MeSH
    label(s) in the case of a MeSH term, etc.)
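
As a rough sketch of what Architecture 1 might look like as a relation, the following uses Python's built-in sqlite3 module rather than the DB2 server used in the talk. The table names `annotation` and `token` and the column types are assumptions; the row values are borrowed from the "Kinase inhibits RAG-1." example relation, and the pairing of token IDs with word strings is inferred from the character positions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE annotation (          -- Architecture 1 columns
    docid          INTEGER NOT NULL,
    section        TEXT    NOT NULL,
    layer_id       INTEGER NOT NULL,
    start_char_pos INTEGER NOT NULL,
    end_char_pos   INTEGER NOT NULL,
    tag_type       INTEGER NOT NULL
);
CREATE TABLE token (               -- hypothetical name for the ID-to-entity table
    layer_id INTEGER NOT NULL,
    tag_type INTEGER NOT NULL,
    entity   TEXT    NOT NULL
);
""")

WORD = 0  # layer ID of the word layer in the example relation
words = [
    (34, 39, 59571, "Kinase"),
    (41, 48, 55608, "inhibits"),
    (50, 54, 89985, "RAG-1"),
]
for start, end, token_id, text in words:
    conn.execute(
        "INSERT INTO annotation VALUES (3345, 'b', ?, ?, ?, ?)",
        (WORD, start, end, token_id),
    )
    conn.execute("INSERT INTO token VALUES (?, ?, ?)", (WORD, token_id, text))

# Recover the words of the section by joining annotations to their entities.
rows = conn.execute(
    """
    SELECT t.entity
    FROM annotation a
    JOIN token t ON t.layer_id = a.layer_id AND t.tag_type = a.tag_type
    WHERE a.docid = 3345 AND a.section = 'b' AND a.layer_id = ?
    ORDER BY a.start_char_pos
    """,
    (WORD,),
).fetchall()
sentence = " ".join(entity for (entity,) in rows)
print(sentence)
```

Keeping the strings in a separate mapping table means the annotation relation itself holds only integers and a one-character section code, which keeps records small and uniform across layers.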

22
Database Design (cont.)
  • Architecture 2 introduces one additional column,
    sequence_pos, thus defining an ordering for each
    layer.
      • Simplifies some SQL queries, as there is no need
        for NOT EXISTS self joins, which are required
        under Architecture 1 in cases where tokens from
        the same layer must follow each other
        immediately.
  • Architecture 3 adds sentence_id, which is the
    number of the current sentence, and redefines
    sequence_pos as relative to both layer_id and
    sentence_id.
      • Simplifies most queries, since they are often
        limited to the same sentence.
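
The difference between the two adjacency tests can be sketched as follows. This is an illustration in SQLite via Python, not the SQL used in the evaluation: the table layout, the query text and the helper `adjacent` are assumptions, and the three word rows come from the "Kinase inhibits RAG-1." example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE annotation (
        docid INTEGER, section TEXT, layer_id INTEGER,
        start_char_pos INTEGER, end_char_pos INTEGER,
        tag_type INTEGER,
        sequence_pos INTEGER   -- added in Architecture 2
    )
""")
conn.executemany(
    "INSERT INTO annotation VALUES (3345, 'b', 0, ?, ?, ?, ?)",
    [(34, 39, 59571, 1), (41, 48, 55608, 2), (50, 54, 89985, 3)],
)

# Architecture 1 style: adjacency means "no same-layer annotation in between".
ARCH1_ADJACENT = """
    SELECT w1.start_char_pos, w2.end_char_pos
    FROM annotation w1
    JOIN annotation w2
      ON w2.docid = w1.docid AND w2.section = w1.section
     AND w2.layer_id = w1.layer_id
     AND w2.start_char_pos > w1.end_char_pos
    WHERE w1.layer_id = 0 AND w1.tag_type = :first AND w2.tag_type = :second
      AND NOT EXISTS (
          SELECT 1 FROM annotation m
          WHERE m.docid = w1.docid AND m.section = w1.section
            AND m.layer_id = w1.layer_id
            AND m.start_char_pos > w1.end_char_pos
            AND m.end_char_pos < w2.start_char_pos
      )
"""

# Architecture 2 style: adjacency is a simple arithmetic condition.
ARCH2_ADJACENT = """
    SELECT w1.start_char_pos, w2.end_char_pos
    FROM annotation w1
    JOIN annotation w2
      ON w2.docid = w1.docid AND w2.section = w1.section
     AND w2.layer_id = w1.layer_id
     AND w2.sequence_pos = w1.sequence_pos + 1
    WHERE w1.layer_id = 0 AND w1.tag_type = :first AND w2.tag_type = :second
"""


def adjacent(sql, first, second):
    """Run one of the adjacency queries for a pair of word token IDs."""
    return conn.execute(sql, {"first": first, "second": second}).fetchall()


# "Kinase" directly followed by "inhibits": both formulations agree.
print(adjacent(ARCH1_ADJACENT, 59571, 55608))
print(adjacent(ARCH2_ADJACENT, 59571, 55608))
```

Both queries return the same rows, but the second replaces a correlated subquery with a single equality on sequence_pos.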

23
Database Design (cont.)
  • Architecture 4 merges the word and POS layers,
    and adds word_id, assuming a one-to-one
    correspondence between them.
      • Reduces the number of stored annotations and the
        number of joins in queries with both word and POS
        constraints.
  • Architecture 5 replaces sequence_pos with
    first_word_pos and last_word_pos, which
    correspond to the sequence_pos of the first/last
    word covered by the annotation.
      • Requires all annotation boundaries to coincide
        with word boundaries.
      • Copes naturally with adjacency constraints
        between different layers.
      • Allows for a simpler indexing structure.
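
A cross-layer adjacency constraint under Architecture 5 might look like the sketch below. It assumes a simplified table with only the positional columns named above and uses SQLite via Python; layer IDs and tag types (1 = POS, 5 = gene, 27 = NN, 53 = VB, 39 = prt) are taken from the example relation, while the query itself is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE annotation (
        docid INTEGER, section TEXT, layer_id INTEGER, sentence_id INTEGER,
        first_word_pos INTEGER, last_word_pos INTEGER, tag_type INTEGER
    )
""")

POS, GENE = 1, 5           # layer IDs from the example relation
NN, VB, PRT = 27, 53, 39   # tag types from the example relation

conn.executemany(
    "INSERT INTO annotation VALUES (3345, 'b', ?, 2, ?, ?, ?)",
    [
        (POS, 1, 1, NN), (POS, 2, 2, VB), (POS, 3, 3, NN),
        (GENE, 1, 1, PRT), (GENE, 3, 3, PRT),
    ],
)

# A gene/protein annotation immediately followed by a verb in the same
# sentence: the layers differ, yet adjacency is still one equality test.
rows = conn.execute(
    """
    SELECT g.first_word_pos, v.first_word_pos
    FROM annotation g
    JOIN annotation v
      ON v.docid = g.docid AND v.section = g.section
     AND v.sentence_id = g.sentence_id
     AND v.first_word_pos = g.last_word_pos + 1
    WHERE g.layer_id = ? AND v.layer_id = ? AND v.tag_type = ?
    """,
    (GENE, POS, VB),
).fetchall()
print(rows)
```

Because every layer is expressed in word positions, the same `last_word_pos + 1` test works for any pair of layers, which is what makes this design convenient for mixed-layer queries.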

24
An Example Relation
Example: "Kinase inhibits RAG-1."

PMID  SECTION   LAYER_ID     START_CHAR_POS  END_CHAR_POS  TAG_TYPE  SEQUENCE_POS  SENTENCE  WORD_ID  FIRST_WORD_POS  LAST_WORD_POS
3345  b (body)  0 (word)     34              39            59571     1             2         59571    1               1
3345  b         0            41              48            55608     2             2         55608    2               2
3345  b         0            50              54            89985     3             2         89985    3               3
3345  b         1 (POS)      34              39            27 (NN)   1             2         59571    1               1
3345  b         1            41              48            53 (VB)   2             2         55608    2               2
3345  b         1            50              54            27        3             2         89985    3               3
3345  b         3 (s.parse)  34              39            31 (NP)   1             2         -        1               1
3345  b         3            41              48            59 (VP)   2             2         -        2               2
3345  b         3            50              54            31        3             2         -        3               3
3345  b         5 (gene)     34              39            39 (prt)  1             2         -        1               1
3345  b         5            50              54            39        2             2         -        3               3
3345  b         6 (mesh)     34              39            10770     1             2         -        1               1
3345  b         6            50              54            16654     2             2         -        3               3

Columns: PMID through TAG_TYPE form the basic architecture; SEQUENCE_POS is added in architecture 2, SENTENCE in architecture 3, WORD_ID in architecture 4, and FIRST_WORD_POS and LAST_WORD_POS in architecture 5.

25
Indexing Structure
  • Two types of composite indexes: forward and
    inverted.
  • An index lookup can be performed on any column
    combination that corresponds to an index prefix.
  • The forward indexes support lookup based on
    position in a given document.
  • The inverted indexes support lookup based on
    annotation values (i.e., tag type and word ID).
  • Most query plans involve both forward and
    inverted indexes
      • Join statistics would have been useful
  • Detailed statistics are essential.
      • Standard statistics in DB2 are insufficient.
  • Records are clustered on their primary key

26
Indexing Structure (cont.)
Type: F = forward index, I = inverted index

Architecture  Type  Columns
Arch 1-4      F     DOCID, SECTION, LAYER_ID, START_CHAR_POS, END_CHAR_POS, TAG_TYPE
Arch 1-4      I     LAYER_ID, TAG_TYPE, DOCID, SECTION, START_CHAR_POS, END_CHAR_POS
Arch 2        F     DOCID, SECTION, LAYER_ID, SEQUENCE_POS, TAG_TYPE, START_CHAR_POS, END_CHAR_POS
Arch 2        I     LAYER_ID, TAG_TYPE, DOCID, SECTION, SEQUENCE_POS, START_CHAR_POS, END_CHAR_POS
Arch 3-4      F     DOCID, SECTION, LAYER_ID, SENTENCE, SEQUENCE_POS, TAG_TYPE, START_CHAR_POS, END_CHAR_POS
Arch 3-4      I     LAYER_ID, TAG_TYPE, DOCID, SECTION, SENTENCE, SEQUENCE_POS, START_CHAR_POS, END_CHAR_POS
Arch 4        I     WORD_ID, LAYER_ID, TAG_TYPE, DOCID, SECTION, START_CHAR_POS, END_CHAR_POS, SENTENCE, SEQUENCE_POS
Arch 5        F     DOCID, SECTION, LAYER_ID, SENTENCE, FIRST_WORD_POS, LAST_WORD_POS, TAG_TYPE
Arch 5        I     LAYER_ID, TAG_TYPE, DOCID, SECTION, SENTENCE, FIRST_WORD_POS, LAST_WORD_POS
Arch 5        I     WORD_ID, LAYER_ID, TAG_TYPE, DOCID, SECTION, SENTENCE, FIRST_WORD_POS
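
The index-prefix behaviour can be observed with a small experiment. The sketch below is an assumption-laden stand-in: it uses SQLite (through Python's sqlite3 module) instead of DB2, the index names `idx_forward` and `idx_inverted` are invented, and the column lists follow the first forward and inverted rows of the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE annotation (
    docid INTEGER, section TEXT, layer_id INTEGER,
    start_char_pos INTEGER, end_char_pos INTEGER, tag_type INTEGER
);
-- Forward index: lookup by position within a document.
CREATE INDEX idx_forward ON annotation
    (docid, section, layer_id, start_char_pos, end_char_pos, tag_type);
-- Inverted index: lookup by annotation value.
CREATE INDEX idx_inverted ON annotation
    (layer_id, tag_type, docid, section, start_char_pos, end_char_pos);
""")


def plan(sql):
    """Return SQLite's query plan description for sql as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Constrains the prefix (docid, section) of the forward index.
by_position = plan(
    "SELECT tag_type FROM annotation WHERE docid = 3345 AND section = 'b'"
)
# Constrains the prefix (layer_id, tag_type) of the inverted index.
by_value = plan(
    "SELECT docid FROM annotation WHERE layer_id = 1 AND tag_type = 27"
)
print(by_position)
print(by_value)
```

Each query constrains the leading columns of exactly one of the two indexes, so the planner can search that index directly. The exact plan wording differs between SQLite versions, and DB2's optimizer will of course behave differently.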

28
API
  • Java-based API allows for simple insertion,
    deletion and modification of annotations.
      • Need to specify document ID, section, layer ID,
        and positional information.
  • Supports editing a collection of annotations and
    storing them back to the database.
  • We plan to develop a user interface for viewing,
    editing and querying annotations.
      • Not a trivial task, since there are many HCI
        issues around how to display annotations effectively.
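
The talk does not show the API itself, so the following is only a guess at its general shape, written in Python for consistency with the other sketches rather than in Java. The class `AnnotationStore`, its method names and the backing SQLite table are all hypothetical; the only detail taken from the slide is that an annotation is addressed by document ID, section, layer ID and position.

```python
import sqlite3


class AnnotationStore:
    """Hypothetical stand-in for the Java API; all names are illustrative."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE annotation (docid INTEGER, section TEXT,"
            " layer_id INTEGER, start_char_pos INTEGER,"
            " end_char_pos INTEGER, tag_type INTEGER)"
        )

    def insert(self, docid, section, layer_id, start, end, tag_type):
        self.conn.execute(
            "INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?)",
            (docid, section, layer_id, start, end, tag_type),
        )

    def modify(self, docid, section, layer_id, start, end, new_tag_type):
        self.conn.execute(
            "UPDATE annotation SET tag_type = ? WHERE docid = ? AND section = ?"
            " AND layer_id = ? AND start_char_pos = ? AND end_char_pos = ?",
            (new_tag_type, docid, section, layer_id, start, end),
        )

    def delete(self, docid, section, layer_id, start, end):
        self.conn.execute(
            "DELETE FROM annotation WHERE docid = ? AND section = ?"
            " AND layer_id = ? AND start_char_pos = ? AND end_char_pos = ?",
            (docid, section, layer_id, start, end),
        )

    def fetch(self, docid, section, layer_id):
        """Return (start, end, tag_type) tuples for one layer, in text order."""
        return self.conn.execute(
            "SELECT start_char_pos, end_char_pos, tag_type FROM annotation"
            " WHERE docid = ? AND section = ? AND layer_id = ?"
            " ORDER BY start_char_pos",
            (docid, section, layer_id),
        ).fetchall()


store = AnnotationStore()
store.insert(3345, "b", 1, 34, 39, 27)   # POS layer, tag NN
store.insert(3345, "b", 1, 41, 48, 27)   # deliberately wrong tag
store.modify(3345, "b", 1, 41, 48, 53)   # corrected to VB
after_modify = store.fetch(3345, "b", 1)
store.delete(3345, "b", 1, 34, 39)
after_delete = store.fetch(3345, "b", 1)
```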

30
Experimental Setup
  • Annotated 13,504 MEDLINE abstracts
      • Stanford Lexicalized Parser (Klein and Manning,
        2003) for sentence splitting, word tokenization,
        POS tagging and parsing.
      • We wrote a shallow parser and tools for gene and
        MeSH term recognition.
  • This resulted in 10,910,243 records stored in an
    IBM DB2 Universal Database Server.
  • Defined 4 workloads based on variants of queries
    (a-d).

Workload         (a)     (b)     (c)     (d)
Queries           54      11      50       1
Results/query  303.4    77.5     1.6  16,701
LQL lines          8       6       5       4

31
Results
Workload (a)   Arch 1  Arch 2  Arch 3  Arch 4  Arch 5
SQL lines          37      37      34      29      29
Joins               6       6       6       5       5
Time (sec)       3.98    4.35    3.59    1.69    1.94

Workload (b)   Arch 1  Arch 2  Arch 3  Arch 4  Arch 5
SQL lines          91      77      75      65      50
Joins              12      11      11       9       7
Time (sec)       3.88    5.68    5.41    3.85    3.55

Workload (c)   Arch 1  Arch 2  Arch 3  Arch 4  Arch 5
SQL lines          45      38      38      39      41
Joins               7       6       6       6       6
Time (sec)       17.9   23.42   21.49   30.07    4.06

Workload (d)   Arch 1  Arch 2  Arch 3  Arch 4  Arch 5
SQL lines          59      50      53      53      35
Joins               7       7       7       7       4
Time (sec)      1,879   1,700   2,182   1,682   1,582

Space (MB)     Arch 1   Arch 2   Arch 3   Arch 4  Arch 5
Data storage    168.5    168.5    168.5    132.5   136.5
Index storage   617.0  1,397.0  1,441.0  1,182.0   673.5
Total storage   785.5  1,565.5  1,609.5  1,314.5   810.0

32
Results (cont.)
  • Different architectures are optimized for
    different types of queries.
  • Architecture 5 performs well (if not best) on all
    query types, while the other architectures
    perform poorly on at least one query type.
  • Storage requirement of Architecture 5 is
    comparable to that of Architecture 1
  • Architecture 5 results in much simpler queries
  • We recommend Architecture 5 in most cases, or
    Architecture 1 if an atomic annotation layer
    cannot be defined.

33
Scalability Analysis
  • Combined workload of 3 query types
  • Varying buffer pool sizes

Buffer Pool Size (MB) Elapsed Time (ms) Buffer Read Time (ms)
1000 2300 1050
100 2900 1670
10 4600 3340
1 8300 6250
  • Suggests that query execution time grows
    sub-linearly as memory shrinks: cutting the buffer
    pool from 1000 MB to 1 MB increases elapsed time
    less than fourfold.
  • We believe a similar ratio will be observed when
    increasing the database size and keeping the
    memory size fixed.
  • Parallel query execution can be enabled after
    partitioning the annotations on document_id.
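
The sub-linear trend can be checked with a little arithmetic on the elapsed times in the table above. The power-law form time ~ memory^(-k) used here is an assumption made for illustration; the talk does not propose a specific model.

```python
import math

# (buffer pool size in MB, elapsed time in ms), from the table above
measurements = [(1000, 2300), (100, 2900), (10, 4600), (1, 8300)]

largest_mem, fastest = measurements[0]
smallest_mem, slowest = measurements[-1]

memory_ratio = largest_mem / smallest_mem   # 1000 times less memory
slowdown = slowest / fastest                # how many times slower

# Assumed model: time ~ memory**(-k), so k = log(slowdown) / log(memory_ratio).
k = math.log(slowdown) / math.log(memory_ratio)

print(f"memory reduced {memory_ratio:.0f}x, time increased {slowdown:.2f}x")
print(f"fitted exponent k = {k:.3f}")
```

An exponent well below 1 is what "sub-linear" means here: a linear relationship would correspond to k = 1, that is, a 1000-fold slowdown for 1000 times less memory.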

34
Conclusions
  • Provided a mechanism to effectively store and
    query layers of textual annotations.
  • Evaluated various structures for data storage and
    have arrived at an efficient and simple one.
  • Used variations of queries drawn from published
    research, to ensure the real-world applicability.
  • Presented a concise language (LQL) to express
    queries that span multiple levels of the
    annotation structure. It captures the user's
    intent better, as the syntax is more intuitive and
    closely resembles the annotation structure.

35
Future Work
  • Conduct a usability study to assess the query
    language.
  • Automate the LQL to SQL translation process.
  • Test the scalability of this approach on larger
    document collections.

36
References
  • Steven Bird and Mark Liberman. 2001. A formal
    framework for linguistic annotation. Speech
    Communication, 33(1-2):23-60.
  • Steve Cassidy and Jonathan Harrington. 2001.
    Speech annotation and corpus tools. Speech
    Communication, 33(1-2):61-77.
  • David McKelvie, Amy Isard, Andreas Mengel, Morten
    B. Moller, Michael Grosse and Marion Klein. 2001.
    Speech annotation and corpus tools. Speech
    Communication, 33(1-2):97-112.
  • Goran Nenadic, Hideki Mima, Irena Spasic, Sophia
    Ananiadou and Jun-ichi Tsujii. 2002.
    Terminology-Driven Literature Mining and
    Knowledge Acquisition in Biomedicine.
    International Journal of Medical Informatics,
    67:33-48.
  • Ralph Grishman. 1996. Building an Architecture: a
    CAWG Saga. Advances in Text Processing: Tipster
    Program Phase II, Morgan Kaufmann.
  • Steve Cassidy. 1999. Compiling Multi-tiered
    Speech Databases into the Relational Model:
    Experiments with the Emu System. 6th European
    Conference on Speech Communication and Technology
    (Eurospeech 99), 2127-2130, Budapest, Hungary.
  • Xiaoyi Ma, Haejoong Lee, Steven Bird and Kazuaki
    Maeda. 2002. Models and Tools for Collaborative
    Annotation. Third International Conference on
    Language Resources and Evaluation, 2066-2073.

37
Thank You
  • Questions and constructive comments are welcome
  • http://biotext.berkeley.edu