Mining Text Data presentation

1
Mining Text Data

Antje Wolf
Seminar: Applications and Special Topics in Data Mining
06.07.2005

2
Overview

Introduction
Architecture of Text Mining Systems
Tagging
Statistical Tagging
Semantic Tagging
Structural Tagging
Taxonomy Construction
Implementation Issues
Visualizations and Analytics for Text Mining
Summary
Example of a Text Mining Tool in Bioinformatics
ProMiner

3
Introduction

80% of digital data is unstructured
much information is in textual form with little or no formatting
growing interest in text mining
different approaches:
use the entire set of words in the documents as input
use tags associated with the documents
information extraction
Workflow: Input → Preprocessing → Data mining algorithms

4
Architecture of Text Mining Systems

3 major components
Business intelligence suite
Intelligent tagging
Information feeders

5
Architecture of Text Mining Systems: Intelligent tagging component

Statistical Tagging: categorization and term extraction
Semantic Tagging: information extraction
Structural Tagging: extraction from the visual layout of documents
each tagger has a separate training module based on annotated examples

6
Statistical Tagging: Text Categorization

the activity of labeling natural language texts with thematic categories from a predefined set
knowledge engineering approach:
the user manually defines a set of rules encoding expert knowledge on how to classify documents under given categories
typical rule of the CONSTRUE system (Hayes, 1992): if (DNF formula) then (category), where a DNF formula is a disjunction of conjunctive clauses
example:

if ((wheat AND farm) OR (wheat AND commodity) OR (bushels AND export) OR (wheat AND tonnes)) then WHEAT else NOT WHEAT
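
A rule of this shape can be evaluated mechanically. The following is a minimal Python sketch, assuming each parenthesized clause is a conjunction of words and a document is reduced to its set of lowercase words. The names `WHEAT_RULE`, `matches_dnf` and `categorize` are illustrative and are not taken from CONSTRUE.

```python
import re

# A DNF rule: the outer list is the disjunction, each tuple is one conjunctive clause.
WHEAT_RULE = [
    ("wheat", "farm"),
    ("wheat", "commodity"),
    ("bushels", "export"),
    ("wheat", "tonnes"),
]


def matches_dnf(text, rule):
    """Return True if all words of at least one clause occur in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return any(all(term in words for term in clause) for clause in rule)


def categorize(text):
    """Apply the rule: WHEAT if the DNF formula holds, NOT WHEAT otherwise."""
    return "WHEAT" if matches_dnf(text, WHEAT_RULE) else "NOT WHEAT"


print(categorize("Wheat exports rose to 2 million tonnes"))
print(categorize("Bushels of corn were sold"))
```

The matching is on exact word forms, so "exports" does not satisfy a clause that asks for "export"; a real system would normalize word forms first.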
7
Statistical Tagging: Text Categorization

machine learning approach:
training set of documents that are pretagged using the predefined set of categories

8
Statistical Tagging: Text Categorization

Example-based classifiers: k nearest neighbors (kNN)
the decision whether document d_j ∈ category c_i depends on the k training documents most similar to d_j
best value for k?
k = 20 (Larkey & Croft, 1996)
29 < k < 46 (Yang & Chute, 1994)
distance-weighted version:
categorization status value (CSV): formula not preserved in this transcript
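
Since the slide's formula is not available here, the following Python sketch shows one common way to compute a similarity-weighted kNN score, not necessarily the exact formula used in the talk. It assumes bag-of-words vectors and cosine similarity; the names `csv_knn` and `TRAINING` and the toy documents are invented for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def csv_knn(doc, category, training, k):
    """Sum the similarities of those k nearest training documents that carry `category`."""
    query = bag_of_words(doc)
    scored = [(cosine(query, bag_of_words(text)), labels) for text, labels in training]
    nearest = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    return sum(sim for sim, labels in nearest if category in labels)


# Toy pretagged training set (assumption: one label per document).
TRAINING = [
    ("wheat harvest tonnes export", {"WHEAT"}),
    ("wheat farm prices", {"WHEAT"}),
    ("oil barrel prices opec", {"OIL"}),
    ("crude oil export", {"OIL"}),
]

wheat_score = csv_knn("wheat export prices", "WHEAT", TRAINING, k=3)
oil_score = csv_knn("wheat export prices", "OIL", TRAINING, k=3)
print(wheat_score > oil_score)
```

A document is then assigned to a category when its score exceeds a threshold, which has to be tuned on held-out data.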

9
Statistical Tagging: Text Categorization

Support Vector Machines
find the hyperplane that separates a set of positive examples from a set of negative examples with maximum margin
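
To make "margin" concrete, the sketch below computes the margin of a given separating hyperplane w·x + b = 0 as the smallest signed distance of any labeled example to it. This only evaluates candidate hyperplanes; it is not an SVM training procedure, and the data points and the function name `margin` are made up for illustration.

```python
import math


def margin(w, b, examples):
    """Smallest value of y * (w.x + b) / ||w|| over all labeled examples (x, y)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in examples
    )


# Labels are +1 for positive and -1 for negative examples.
EXAMPLES = [((2.0, 2.0), +1), ((3.0, 3.0), +1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

# Both hyperplanes separate the data; the second one leaves the larger margin.
m1 = margin((1.0, 0.0), -1.5, EXAMPLES)
m2 = margin((1.0, 1.0), -2.0, EXAMPLES)
print(m1, m2)
```

An SVM searches over all separating hyperplanes for the one that maximizes this quantity.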

10
Statistical Tagging: Term Extraction

labeling each document with a set of terms extracted from the document
Linguistic preprocessing
tokenization: identifying text structure at the subparagraph level, that is, word boundaries, sentence boundaries, dates, abbreviations, etc.
part-of-speech tagging: associating words with morpho-syntactic categories such as noun, adjective, verb, along with case, number, person
lemmatization: assigning a lemma, i.e., the base form of an inflected word, to every word token
Term generation
Term filtering
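
A rough Python sketch of the tokenization, term generation and term filtering steps follows. Part-of-speech tagging and lemmatization are left out because they need linguistic resources. The stopword list, the bigram candidates and the frequency threshold are assumptions chosen for illustration, not the method of any particular system.

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "a", "and", "in", "is", "to"}  # toy list, an assumption


def tokenize(text):
    """Split text into lowercase word tokens (a very rough tokenizer)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def generate_terms(tokens):
    """Candidate terms: single tokens plus adjacent token pairs.

    Sentence boundaries are ignored here, so pairs may span two sentences.
    """
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return list(tokens) + bigrams


def filter_terms(candidates, min_count=2):
    """Keep candidates seen at least min_count times that contain no stopword."""
    counts = Counter(candidates)
    return {
        term: n
        for term, n in counts.items()
        if n >= min_count and not any(w in STOPWORDS for w in term.split())
    }


text = "Text mining extracts terms. Text mining is the mining of text."
terms = filter_terms(generate_terms(tokenize(text)))
print(terms)
```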

11
Semantic Tagging

allows for mining of the actual information
present within the text
requires trained developers, very laborious
extracted information is specific and precise

12
Semantic Tagging

DIAL (Declarative Information Analysis Language)
basic elements:
predefined strings, e.g. "merger"
word class elements, e.g. a list of countries
part-of-speech tags, like noun, adjective
scanner features, e.g. Capital, HtmlTag
constraints:
boolean checks for specific attributes
IE rule bases:
logic program
example:

FMergerCCM(C1, C2) :-
    Company(Comp1) OptCompanyDetails "and"
    skip(Company(x), SkipFail, 10)
    Company(Comp2) OptCompanyDetails
    skip(WCMergerVerbs, SkipFailComp, 20) WCMergerVerbs
    skip(WCMerger, SkipFail, 20) WCMerger
    verify(WholeNotInPredicate(Comp1, @PersonName))
    verify(WholeNotInPredicate(Comp2, @PersonName))
    C1 = Comp1
    C2 = Comp2

13
Semantic Tagging

rulebooks of DIAL
Financial rulebook (11,500 rules)
can identify more than 50 entity types such as
company names, people names, organizations,
products, locations
can identify events such as mergers, joint
ventures
Business intelligence rulebook (7,000 rules)
Intellectual property rulebook (100 rules)
can identify 30 different types of entities in
patent files
Protein relationship rulebook (500 rules)
can identify 30 different types of entities,
including proteins
can identify 10 different relationships,
including phosphorylation

14
Structural Tagging

ignores the content of words
focuses on superficial features, like size and position on the page
GIVEN:
template document A
set of primitives in A (annotated fields), denoted P_A, like "AUTHOR...TITLE..."
query document B
FIND:
degree of similarity between A and B
set of primitives in B that corresponds to P_A

15
Taxonomy Construction

tree with terms as leaves
enables construction of high-level association rules: rules between groups of terms rather than between individual terms
time-consuming task → need for semiautomatic construction
many taxonomies exist for different domains, such as the Gene Ontology, which provides a controlled vocabulary to describe gene and gene product attributes in any organism
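
The effect of a taxonomy on association mining can be illustrated with a small Python sketch: term pairs that are individually rare can add up to a frequent pair at the category level. The two-level taxonomy, the documents and the function names are invented for illustration, and pair support stands in for full association rule mining.

```python
from collections import Counter
from itertools import combinations

# Toy taxonomy: each leaf term maps to its parent category (names are made up).
TAXONOMY = {
    "aspirin": "drug",
    "ibuprofen": "drug",
    "headache": "symptom",
    "fever": "symptom",
}


def generalize(doc_terms):
    """Replace every term by its taxonomy category; unknown terms are dropped."""
    return {TAXONOMY[t] for t in doc_terms if t in TAXONOMY}


def pair_support(docs):
    """Fraction of documents in which each unordered pair of items co-occurs."""
    counts = Counter()
    for items in docs:
        counts.update(combinations(sorted(items), 2))
    return {pair: n / len(docs) for pair, n in counts.items()}


DOCS = [{"aspirin", "headache"}, {"ibuprofen", "fever"}, {"aspirin", "fever"}, {"ibuprofen"}]
term_level = pair_support(DOCS)
category_level = pair_support([generalize(d) for d in DOCS])
print(term_level)
print(category_level)
```

Here no pair of individual terms co-occurs in more than a quarter of the documents, while the pair of categories does so in three quarters of them.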

16
Implementation Issues: Soft Matching

problem: matching synonyms that refer to the same entity
examples: punctuation variations, spelling mistakes, abbreviations, formal vs. informal names
solutions:
Soundex algorithm: can match words that have a similar phonetic pronunciation
lookup table for all abbreviations and nicknames of a given entity
coding name conversion rules
example: "X Corporation" and "X Corp." are mapped to "X"
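
Two of these solutions can be sketched in a few lines of Python: the standard American Soundex code, and a name conversion rule that strips company suffixes. The suffix list and the function names `soundex` and `canonical_name` are assumptions for illustration; a production rule set would be much larger.

```python
import re

# Soundex digit for each consonant; vowels, h, w and y have no code.
CODES = {
    ch: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
    }.items()
    for ch in letters
}


def soundex(word):
    """American Soundex: first letter plus three digits, padded with zeros."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not separate letters with equal codes
            prev = code
    return (result + "000")[:4]


# Name conversion rule: drop a trailing legal-form suffix (toy list).
SUFFIX = re.compile(r"[\s,]+(corporation|corp|incorporated|inc|ltd)\.?$", re.IGNORECASE)


def canonical_name(name):
    """Map variants such as 'X Corporation' and 'X Corp.' to the same key."""
    return SUFFIX.sub("", name.strip())


print(soundex("Robert"), soundex("Rupert"))
print(canonical_name("X Corporation"), canonical_name("X Corp."))
```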

17
Implementation Issues: Temporal Resolution

"time-stamp" documents for temporal analysis (Trend Graphs)
problems:
large variety of possible date formats
relative date formats ("yesterday", "last month")
fuzzy temporal phrases ("in the very near future")
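
The first two problems can be approached with a list of known formats and a reference date, typically the publication date of the document. The sketch below uses only Python's standard `datetime` module. The format list, the handling of "last month" (mapped to the first day of the previous month) and the name `resolve_date` are assumptions; fuzzy phrases are simply left unresolved.

```python
from datetime import date, datetime, timedelta

# Formats tried in order. Reading 06/07/2005 as month/day/year is an assumption.
FORMATS = ["%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y", "%d %B %Y"]


def resolve_date(expression, reference):
    """Map a date expression to a calendar date, or None if it stays unresolved."""
    text = expression.strip().lower()
    if text == "today":
        return reference
    if text == "yesterday":
        return reference - timedelta(days=1)
    if text == "last month":
        # Coarse anchor: first day of the month before the reference date.
        first_of_month = reference.replace(day=1)
        return (first_of_month - timedelta(days=1)).replace(day=1)
    for fmt in FORMATS:
        try:
            return datetime.strptime(expression.strip(), fmt).date()
        except ValueError:
            continue
    return None


REFERENCE = date(2005, 7, 6)
for phrase in ["06.07.2005", "yesterday", "last month", "in the very near future"]:
    print(phrase, "->", resolve_date(phrase, REFERENCE))
```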

18
Implementation Issues: Anaphora Resolution

resolve coreferences:
pronominals ("he", "she", "we")
definite noun phrases ("the ruthless man")
solution:
collect all accessible antecedents for each referring phrase
heuristics: prefer the candidate that appears ...
... earlier in the current sentence.
... earlier in the previous sentence.
... later within other sentences.
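
These heuristics can be read as a ranking over candidate antecedents. The Python sketch below assumes that the three preferences apply in the listed order (current sentence first, then the previous sentence, then older sentences) and that a mention is identified by its sentence index and token offset. Both assumptions, and all names, are illustrative; agreement checks such as gender and number are ignored.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    text: str
    sentence: int  # index of the sentence containing the mention
    position: int  # token offset inside that sentence


def rank_key(candidate, anaphor):
    """Smaller keys are preferred."""
    distance = anaphor.sentence - candidate.sentence
    if distance == 0:
        return (0, 0, candidate.position)  # current sentence: earlier is better
    if distance == 1:
        return (1, 0, candidate.position)  # previous sentence: earlier is better
    return (2, distance, -candidate.position)  # older sentences: later is better


def resolve(anaphor, candidates):
    """Pick the best accessible antecedent, i.e. one that precedes the anaphor."""
    accessible = [
        c for c in candidates
        if (c.sentence, c.position) < (anaphor.sentence, anaphor.position)
    ]
    return min(accessible, key=lambda c: rank_key(c, anaphor), default=None)


CANDIDATES = [
    Mention("the merger", 0, 1),
    Mention("Mr. Smith", 1, 0),
    Mention("the offer", 1, 3),
]
print(resolve(Mention("he", 2, 0), CANDIDATES).text)
print(resolve(Mention("it", 3, 0), CANDIDATES).text)
```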

19
Visualization and Analytics: Visualizations

Category Connection Maps
concise visual representation of connections between different categories (taxonomy nodes), for example between companies and technologies
the user chooses a number of categories from the taxonomy
the system finds all connections between terms in these categories
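
The data behind such a map can be sketched as co-occurrence counts between the terms of two chosen categories. The Python example below assumes that a "connection" means two terms appearing in the same document; the categories, term names and documents are made up, and rendering the map itself is out of scope.

```python
from collections import Counter
from itertools import product

# Toy taxonomy nodes with their terms (all names are invented).
CATEGORIES = {
    "companies": {"acme", "globex"},
    "technologies": {"xml", "wireless"},
}


def connections(docs, cat_a, cat_b):
    """Count, for each cross-category term pair, the documents mentioning both."""
    counts = Counter()
    for terms in docs:
        hits_a = terms & CATEGORIES[cat_a]
        hits_b = terms & CATEGORIES[cat_b]
        counts.update(product(sorted(hits_a), sorted(hits_b)))
    return counts


DOCS = [{"acme", "xml"}, {"acme", "xml", "wireless"}, {"globex", "wireless"}, {"globex"}]
links = connections(DOCS, "companies", "technologies")
for (company, technology), weight in links.most_common():
    print(company, technology, weight)
```

Each counted pair would become an edge in the map, with the count as its weight.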

20
Visualization and Analytics: Visualizations

Relationship Maps
concise representation of the relationships
between many terms in a given context
taxonomy category determines nodes of circle
graph
optional context node determines type of
connection

21
Visualization and Analytics: Visualizations

Relationship Maps
Spring graph: two-dimensional graph in which the distance between two elements should reflect the strength of their relationship

22
Visualization and Analytics: Analytics

Clustering
identify nodes that are strongly interrelated
find dense subgraphs in a given graph
Trend Graphs
view changes in relationships over time
→ identify trends and patterns
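
The series behind a trend graph can be sketched as the number of time-stamped documents per month in which two terms co-occur. This is a minimal Python illustration with invented data and names; it assumes every document already carries a resolved date.

```python
from collections import defaultdict
from datetime import date


def trend(docs, term_a, term_b):
    """Number of documents per (year, month) in which both terms occur."""
    series = defaultdict(int)
    for stamp, terms in docs:
        if term_a in terms and term_b in terms:
            series[(stamp.year, stamp.month)] += 1
    return dict(sorted(series.items()))


# Each document is a (time stamp, set of terms) pair; the data is made up.
DOCS = [
    (date(2005, 5, 3), {"acme", "xml"}),
    (date(2005, 6, 10), {"acme", "xml"}),
    (date(2005, 6, 21), {"acme", "xml", "wireless"}),
    (date(2005, 6, 25), {"acme", "wireless"}),
]
series = trend(DOCS, "acme", "xml")
print(series)
```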

23
Summary

due to the abundance of available textual data, there is a growing need for efficient text mining tools
textual data require preprocessing
information extraction has proven to be an efficient method for this task
analysis techniques like clustering and trend graphs, in combination with visualization tools, facilitate trend and pattern detection

24
ProMiner

aim: detection of protein and gene names in scientific articles
nomenclature is highly variable and ambiguous:
mostly composed entries
phenotypical descriptions as protein names
definition of gene aliases as convenient abbreviations of corresponding protein names
parallel naming of genes and proteins
ProMiner consists of three parts:
dictionary generation
occurrence detection
filtering of matches

25
ProMiner: Implementation

Dictionary generation: construction and curation
gene names from HUGO, protein names from SWISSPROT and TREMBL
definition of token classes for curation of the dictionary and for the matching procedure
curation: expansion and pruning phase
tagging each token in the dictionary with its corresponding class
after curation:
38,200 entries with 151,700 synonyms

26
ProMiner: Implementation

Occurrence detection
processing one token at a time and keeping a set of candidate solutions for the present position
two scoring measures:
boundary score: increased on a token mismatch; if it rises above a threshold, the candidate is pruned from the candidate set and checked for reporting
acceptance score: determines whether the candidate is reported as a match; linear combination of token-class-specific match and mismatch terms; weights for match terms are set to a small value for non-descriptive tokens and a high one for modifier tokens

27
ProMiner: Implementation

Filtering of matches: match disambiguation
set of synonyms
overlapping matches: the match with the higher acceptance score, the larger fraction of matches or the largest number of matched tokens is accepted
ambiguous synonym: only those matches are kept for which most additional synonym occurrences can be found
Parameter optimization
computation of weights with robust linear programming (RLP)
training set of positive and negative examples
computes a separating hyperplane in the vector space of scoring contributions

28
Literature

Giouli, V., Piperidis, S., Current trends in corpus processing and annotation. http://www.larflast.bas.bg/balric/eng_files/corpora7_1.php (website as of 5 July 2005)
Hanisch, D., Fluck, J., Mevissen, HT., Zimmer, R., Playing biology's name game: identifying protein names in scientific text. Pacific Symposium on Biocomputing 2003, pages 403-414.
Hanisch, D., Fundel, K., Mevissen, HT., Zimmer, R., Fluck, J., ProMiner: rule-based protein and gene entity recognition. BMC Bioinformatics 2005, 6(Suppl 1):S14.
Ye, N. (ed.), The Handbook of Data Mining. Lawrence Erlbaum Publishers, 2003, ch. 21.
