1
Mining Text Data
  • Antje Wolf
  • Seminar: Applications and Special Topics in Data Mining
  • 6 July 2005

2
Overview
  • Introduction
  • Architecture of Text Mining Systems
  • Tagging
  • Statistical Tagging
  • Semantic Tagging
  • Structural Tagging
  • Taxonomy Construction
  • Implementation Issues
  • Visualizations and Analytics for Text Mining
  • Summary
  • Example of a Text Mining Tool in Bioinformatics
    ProMiner

3
Introduction
  • 80% of digital data is unstructured
  • much information in textual form with little or
    no formatting
  • growing interest in text mining
  • different approaches:
  • use entire set of words in documents as input
  • use tags associated with the documents
  • information extraction
  • Workflow: Input → Preprocessing → Data mining algorithms

4
Architecture of Text Mining Systems
  • 3 major components
  • Business intelligence suite
  • Intelligent tagging
  • Information feeders

5
Architecture of Text Mining Systems: Intelligent
tagging component
  • Statistical Tagging: categorization and term
    extraction
  • Semantic Tagging: information extraction
  • Structural Tagging: extraction from the visual
    layout of documents
  • each tagger has a separate training module based
    on annotated examples

6
Statistical Tagging: Text Categorization
  • activity of labeling natural language texts
    with thematic categories from a predefined set
  • knowledge engineering approach
  • user manually defines a set of rules encoding
    expert knowledge on how to classify documents
    under given categories
  • typical rule of the CONSTRUE system (Hayes, 1992):
    if DNF (disjunction of conjunctive clauses)
    formula then category
  • example:

If ((wheat AND farm) OR (wheat AND commodity) OR
(bushels AND export) OR (wheat AND tonnes)) then
WHEAT else NOT WHEAT
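
A rule of this shape can be sketched in a few lines of Python. This is only an illustration, not the CONSTRUE implementation: the names `WHEAT_RULE`, `matches_dnf` and `categorize` are made up here, and matching on whitespace-separated lowercase words is an assumption.

```python
# Hypothetical CONSTRUE-style rule: a disjunction of conjunctive clauses.
# Each inner set is one conjunction; the rule fires if any clause is fully present.
WHEAT_RULE = [
    {"wheat", "farm"},
    {"wheat", "commodity"},
    {"bushels", "export"},
    {"wheat", "tonnes"},
]


def matches_dnf(text, clauses):
    """Return True if every word of at least one clause occurs in the text."""
    words = set(text.lower().split())
    return any(clause <= words for clause in clauses)


def categorize(text):
    """Assign WHEAT when the DNF formula holds, NOT WHEAT otherwise."""
    return "WHEAT" if matches_dnf(text, WHEAT_RULE) else "NOT WHEAT"
```

Every new category needs its own hand-written clause list, which is why this approach depends on expert knowledge.
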
7
Statistical Tagging: Text Categorization
  • machine learning approach
  • training set of documents that are pretagged
    using the predefined set of categories

8
Statistical Tagging: Text Categorization
  • Example-based classifiers: k nearest neighbor
    (KNN)
  • decision whether document d_j ∈ category c_i
    depends on the k training documents most similar
    to d_j
  • best value for k?
  • k = 20 (Larkey & Croft, 1996)
  • 29 < k < 46 (Yang & Chute, 1994)
  • distance-weighted version
  • a categorization status value (CSV) is computed
    for each category
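
A distance-weighted KNN score could look like the sketch below. It assumes one common form of the CSV, which is not taken from the slide: CSV_i(d_j) is the sum of sim(d_j, d_z) over those of the k nearest training documents d_z that belong to c_i. Cosine similarity over raw word counts and the names `cosine` and `knn_csv` are likewise assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_csv(doc, training, category, k=20):
    """Sum the similarities of the k nearest training documents tagged with `category`.

    `training` is a list of (text, set_of_categories) pairs.
    """
    vec = Counter(doc.lower().split())
    scored = [
        (cosine(vec, Counter(text.lower().split())), cats) for text, cats in training
    ]
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    return sum(sim for sim, cats in nearest if category in cats)
```

The document would then be assigned to c_i when the score exceeds a threshold, which has to be tuned separately.
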

9
Statistical Tagging: Text Categorization
  • Support Vector Machines
  • hyperplane that separates with maximum margin a
    set of positive examples from a set of negative
    examples
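
To make "maximum margin" concrete, the sketch below computes the geometric margin of one given hyperplane w·x + b = 0 over labeled examples. It is not an SVM trainer; the function name `geometric_margin` and the (vector, label) data layout are assumptions made for illustration.

```python
import math


def geometric_margin(w, b, examples):
    """Smallest signed distance of any labeled example to the hyperplane w.x + b = 0.

    `examples` is a list of (vector, label) pairs with label +1 or -1.
    A positive result means the hyperplane separates the two classes.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm for x, y in examples
    )
```

Among all separating hyperplanes, a support vector machine selects the one for which this value is largest.
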

10
Statistical Tagging: Term Extraction
  • labeling each document with a set of terms
    extracted from the document
  • Linguistic preprocessing
  • tokenization: identifying text structure at
    subparagraph level, that is, word boundaries,
    sentence boundaries, dates, abbreviations, etc.
  • part-of-speech tagging: associating
    morpho-syntactic categories such as noun,
    adjective, verb along with case, number, person
  • lemmatization: assignment of a lemma, i.e., the
    base form of an inflected word, to every word
    token
  • Term generation
  • Term filtering
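
The three stages can be sketched as a toy pipeline. Everything specific here is an assumption for illustration: the regex tokenizer, the crude plural-stripping `lemma` function standing in for real lemmatization, the tiny stopword list, and the frequency threshold used for filtering. Part-of-speech tagging is left out.

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "a", "and", "in", "is", "to"}  # illustrative list only


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lemma(token):
    """Crude stand-in for lemmatization: strip a plural 's' from longer words."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def extract_terms(text, min_count=2):
    """Return candidate terms that are frequent and contain no stopword."""
    tokens = [lemma(t) for t in tokenize(text)]
    # Term generation: single words and adjacent word pairs.
    candidates = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    counts = Counter(candidates)
    # Term filtering: keep frequent candidates without stopwords.
    return {
        term
        for term, n in counts.items()
        if n >= min_count and not STOPWORDS & set(term.split())
    }
```

A real system would replace `lemma` with a dictionary-based lemmatizer and filter candidates by part-of-speech patterns and a statistical relevance score.
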

11
Semantic Tagging
  • allows for mining of the actual information
    present within the text
  • requires trained developers, very laborious
  • extracted information is specific and precise

12
Semantic Tagging
  • DIAL (Declarative Information Analysis Language)
  • basic elements
  • predefined strings, e.g. "merger"
  • word class element, e.g. a list of countries
  • part-of-speech tag, like noun, adjective
  • scanner feature, e.g. Capital, HtmlTag
  • constraints
  • boolean checks for specific attributes
  • IE rule bases
  • logic program
  • example:

FMergerCCM(C1, C2) :-
    Company(Comp1) OptCompanyDetails "and"
    skip(Company(x), SkipFail, 10)
    Company(Comp2) OptCompanyDetails
    skip(WCMergerVerbs, SkipFailComp, 20) WCMergerVerbs
    skip(WCMerger, SkipFail, 20) WCMerger
    verify(WholeNotInPredicate(Comp1, @PersonName))
    verify(WholeNotInPredicate(Comp2, @PersonName))
    C1 = Comp1
    C2 = Comp2

13
Semantic Tagging
  • rulebooks of DIAL
  • Financial rulebook (11,500 rules)
  • can identify more than 50 entity types such as
    company names, people names, organizations,
    products, locations
  • can identify events such as mergers, joint
    ventures
  • Business intelligence rulebook (7,000 rules)
  • Intellectual property rulebook (100 rules)
  • can identify 30 different types of entities in
    patent files
  • Protein relationship rulebook (500 rules)
  • can identify 30 different types of entities,
    including proteins
  • can identify 10 different relationships,
    including phosphorylation

14
Structural Tagging
  • ignores content of words
  • focuses on superficial features, like size and
    position on the page
  • GIVEN
  • template document A
  • set of primitives in A (annotated fields),
    denoted PA
  • like "AUTHOR...TITLE..."
  • query document B
  • FIND
  • degree of similarity between A and B
  • set of primitives in B that corresponds to PA

15
Taxonomy Construction
  • tree with terms as leaves
  • enables construction of high-level association
    rules
  • rules between groups of terms rather than between
    individual terms
  • time-consuming task → need for semiautomatic
    construction
  • there exist many taxonomies for different
    domains, such as Gene Ontology which provides a
    controlled vocabulary to describe gene and gene
    product attributes in any organism

16
Implementation Issues: Soft Matching
  • problem: matching synonyms that refer to the
    same entity
  • examples: punctuation variations, spelling
    mistakes, abbreviations, formal vs. informal names
  • solutions
  • soundex algorithm can match words that have a
    similar phonetic pronunciation
  • lookup table for all abbreviations and nicknames
    of a given entity
  • coding name conversion rules
  • example: X Corporation and X Corp. are mapped to X
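
Two of these solutions can be sketched briefly. The `soundex` function follows the common American Soundex rules (first letter plus three digits); `canonical_name` is a hypothetical conversion rule with an illustrative suffix list, not one taken from a real rulebook.

```python
import re

# Consonant groups of American Soundex; vowels, y, h and w carry no code.
CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}

SUFFIXES = re.compile(r"\s+(corporation|corp\.?|inc\.?|ltd\.?)$", re.IGNORECASE)


def soundex(word):
    """Return the four-character Soundex code of a word."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not break a run of equal codes
            prev = code
    return (result + "000")[:4]


def canonical_name(name):
    """Strip a trailing legal-form suffix so name variants map to one entity."""
    return SUFFIXES.sub("", name.strip())
```

Soundex is tuned to English pronunciation, so a lookup table is still needed for nicknames and abbreviations that do not sound alike.
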

17
Implementation Issues: Temporal Resolution
  • "time-stamp" documents for temporal analysis
    (Trend Graphs)
  • problems
  • large variety of possible date formats
  • relative date formats ("yesterday", "last month")
  • fuzzy temporal phrases ("in the very near future")
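
A minimal sketch of date normalization, assuming each document has a known publication date to anchor relative expressions. The format list is an illustrative subset, `resolve_date` is a made-up name, and mapping "last month" to the first day of the previous month is a simplification.

```python
from datetime import date, datetime, timedelta

# Illustrative subset of absolute date formats.
DATE_FORMATS = ["%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y"]


def resolve_date(phrase, published):
    """Map a date expression to a calendar date, relative to the publication date."""
    text = phrase.strip().lower()
    if text == "today":
        return published
    if text == "yesterday":
        return published - timedelta(days=1)
    if text == "last month":
        # First day of the previous month, as a coarse anchor.
        first_of_month = published.replace(day=1)
        return (first_of_month - timedelta(days=1)).replace(day=1)
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(phrase.strip(), fmt).date()
        except ValueError:
            continue
    return None  # fuzzy phrases such as "in the very near future" stay unresolved
```

Formats that differ only in day and month order cannot be told apart this way and need a per-source convention.
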

18
Implementation Issues: Anaphora Resolution
  • resolve coreferences
  • pronominals ("he", "she", "we")
  • definite noun phrases ("the ruthless man")
  • solution
  • collect all accessible antecedents for each
    referring phrase
  • heuristics: prefer the candidate that appears ...
  • ... earlier in the current sentence.
  • ... earlier in the previous sentence.
  • ... later within other sentences.
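
One possible reading of these heuristics as a ranking function is sketched below. The ordering across the three cases (current sentence, then previous sentence, then older sentences) is an assumption, as are the tuple layout and the name `rank_antecedents`.

```python
def rank_antecedents(candidates, current_sentence):
    """Order antecedent candidates by the preference heuristics.

    Each candidate is a (phrase, sentence_index, position_in_sentence) tuple
    that precedes the referring phrase.
    """

    def key(candidate):
        _, sentence, position = candidate
        distance = current_sentence - sentence
        if distance <= 1:
            # Current, then previous sentence: earlier positions first.
            return (distance, position)
        # Older sentences: nearer sentences first, later positions first.
        return (distance, -position)

    return sorted(candidates, key=key)
```

The first element of the returned list is the preferred antecedent; agreement checks such as gender or number would filter the candidates beforehand.
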

19
Visualization and Analytics: Visualizations
  • Category Connection Maps
  • concise visual representation of connections
    between different categories (taxonomy nodes),
    for example between companies and technologies
  • user chooses a number of categories from the
    taxonomy
  • system finds all connections between terms in
    the chosen categories
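
The data behind such a map can be sketched as co-occurrence counts. This assumes a connection means that two terms occur in the same document; the slide does not define it. The function name and the example categories in the usage note are made up.

```python
from collections import Counter
from itertools import product


def connection_counts(documents, category_a, category_b):
    """Count, per term pair, the documents in which both terms occur.

    `category_a` and `category_b` are sets of lowercase terms from two taxonomy nodes.
    """
    counts = Counter()
    for doc in documents:
        words = set(doc.lower().split())
        for a, b in product(category_a & words, category_b & words):
            counts[(a, b)] += 1
    return counts
```

For example, calling it with a set of company names and a set of technology names yields the weighted edges that the map would draw between the two categories.
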

20
Visualization and Analytics: Visualizations
  • Relationship Maps
  • concise representation of the relationships
    between many terms in a given context
  • taxonomy category determines nodes of circle
    graph
  • optional context node determines type of
    connection

21
Visualization and Analytics: Visualizations
  • Relationship Maps
  • Spring graph: two-dimensional graph in which the
    distance of two elements should reflect the
    strength of their relationship

22
Visualization and Analytics: Analytics
  • Clustering
  • identify nodes that are strongly interrelated
  • find dense subgraphs in a given graph
  • Trend Graphs
  • view changes in relationships over time
  • → identify trends and patterns
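
The series behind a trend graph can be sketched as monthly co-occurrence counts. This assumes time-stamped documents and treats a relationship as two terms occurring in the same document; both the function name `trend` and the monthly granularity are choices made for illustration.

```python
from collections import Counter
from datetime import date


def trend(documents, term_a, term_b):
    """Count per month the documents that mention both terms.

    `documents` is a list of (datetime.date, text) pairs.
    Returns a chronologically sorted list of ((year, month), count) pairs.
    """
    series = Counter()
    for stamp, text in documents:
        words = set(text.lower().split())
        if term_a in words and term_b in words:
            series[(stamp.year, stamp.month)] += 1
    return sorted(series.items())


# Made-up example documents.
docs = [
    (date(2005, 5, 3), "Acme wireless launch"),
    (date(2005, 5, 20), "acme expands wireless network"),
    (date(2005, 6, 1), "Acme wireless deal"),
    (date(2005, 6, 2), "Acme hires"),
]
monthly = trend(docs, "acme", "wireless")
```

Plotting `monthly` over time shows whether the relationship is strengthening or fading.
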

23
Summary
  • due to abundance of available textual data,
    growing need for efficient text mining tools
  • textual data require preprocessing
  • information extraction has proven to be an
    efficient method for this task
  • analysis techniques like clustering and trend
    graphs in combination with visualization tools
    facilitate trend and pattern detection

24
ProMiner
  • aim: detection of protein and gene names in
    scientific articles
  • nomenclature is highly variable and ambiguous
  • mostly composed entries
  • phenotypical descriptions as protein names
  • definition of gene aliases as convenient
    abbreviations of corresponding protein names
  • parallel naming of genes and proteins
  • ProMiner consists of three parts:
  • dictionary generation,
  • occurrence detection, and
  • filtering of matches

25
ProMiner: Implementation
  • Dictionary generation: construction and curation
  • gene names from HUGO, protein names from
    SWISSPROT and TREMBL
  • definition of token classes for curation of
    dictionary and matching procedure
  • curation: expansion and pruning phases
  • tagging each token in dictionary with
    corresponding class
  • after curation
  • 38,200 entries with 151,700 synonyms

26
ProMiner: Implementation
  • Occurrence detection
  • processing one token at a time and keeping a set
    of candidate solutions for the present position
  • two scoring measures
  • boundary score: increased on a token mismatch; if
    it rises above a threshold, the candidate is
    pruned from the candidate set and checked for
    reporting
  • acceptance score: determines whether the
    candidate is reported as a match; a linear
    combination of token-class-specific match and
    mismatch terms, with weights for match terms set
    to a small value for non-descriptive tokens and
    a high one for modifier tokens
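
The idea of an acceptance score as a linear combination can be sketched as follows. This is not ProMiner's code: the weight values are invented, the "standard" token class is a hypothetical catch-all, and using the same weight for a match and a mismatch is a simplification.

```python
# Hypothetical weights per token class; ProMiner fits its weights, these are made up.
WEIGHTS = {"modifier": 2.0, "standard": 1.0, "non_descriptive": 0.25}


def acceptance_score(synonym_tokens, text_tokens):
    """Add a class-specific weight per matched token and subtract it per mismatch.

    `synonym_tokens` is a list of (token, token_class) pairs from the dictionary.
    """
    present = set(text_tokens)
    score = 0.0
    for token, token_class in synonym_tokens:
        if token in present:
            score += WEIGHTS[token_class]
        else:
            score -= WEIGHTS[token_class]
    return score
```

With these weights a missing modifier such as "2" costs far more than a missing non-descriptive token such as "receptor", which mirrors the weighting described above.
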

27
ProMiner: Implementation
  • Filtering of matches: match disambiguation
  • set of synonyms
  • overlapping matches: the match with the higher
    acceptance score, the larger fraction of matches
    or the largest number of matched tokens is
    accepted
  • ambiguous synonyms: only those matches are kept
    for which most additional synonym occurrences
    can be found
  • Parameter optimization
  • computation of weights with robust linear
    programming (RLP)
  • training set of positive and negative examples
  • computes separating hyperplane in vector space of
    scoring contributions

28
Literature
  • Giouli, V., Piperidis, S., Current trends in
    corpus processing and annotation.
    http://www.larflast.bas.bg/balric/eng_files/corpora7_1.php
    (website as of 5 July 2005)
  • Hanisch, D., Fluck, J., Mevissen, HT., Zimmer,
    R., Playing biology's name game: identifying
    protein names in scientific text. Pacific
    Symposium on Biocomputing 2003, pages 403-414.
  • Hanisch, D., Fundel, K., Mevissen, HT., Zimmer,
    R., Fluck, J., ProMiner: rule-based protein and
    gene entity recognition. BMC Bioinformatics 2005,
    6(Suppl 1):S14.
  • Ye, N. (ed.), The Handbook of Data Mining.
    Lawrence Erlbaum Publishers, 2003, ch. 21.