Demonstration of Text Mining - PowerPoint PPT Presentation

About This Presentation
Title:

Demonstration of Text Mining

Slides: 42
Provided by: keaneanani
Learn more at: http://www.jisc.ac.uk
Transcript and Presenter's Notes


1
Demonstration of Text Mining
  • Bill Black
  • National Centre for Text Mining
  • W.Black_at_manchester.ac.uk
  • University of Manchester

2
Text mining tasks and resources
  • Information retrieval
  • Gather, select, filter, documents that may prove
    useful
  • Find what is known
  • Information extraction
  • Partial, shallow language analysis
  • Find relevant entities, facts about entities
  • Find only what you are looking for
  • Mining
  • Combine, link facts
  • Discover new knowledge, find new facts
  • Resources: ontologies, lexicons, terminologies,
    grammars, annotated corpora (machine learning,
    evaluation)

3
CAFETIERE
  • Conceptual annotations for facts, events, terms,
    individual entities and relations
  • Integration of terminological processing and
    ontological resources
  • Rule based temporal processing
  • Scalability issues
  • Distributed data and processing
  • Incremental processing
  • On the fly rapid access to ontologies
  • Annotation and rule editors

4
Common Annotation Scheme
  • XML-based Representation
  • Stores textual, linguistic and metadata
    information.
  • Document Structure
  • <front>: source, title, document timestamp
  • <body>: text, sections, paragraphs, sentences,
    tokens.
  • <back>: conceptual information: name instances,
    entities, relations, events.

5
Annotation Overview
  • <front>: Title, Abstract, Source, Date
  • <body>: Section, Head, Paragraph, Sentence, Token
  • <back>: Lexical, Conceptual
6
Lexical Annotations
Person, organization, location, money, percent, quantity,
definition, product, role, term, title, consumer, pcycle,
bcycle
7
Conceptual Annotations: relating lexical annotations
(Diagram: lexical annotations, e.g. Danisco (company), linked
by the relations DEFINED AS, PRODUCES, WORKS FOR and ROLE PLAYED)
8
Common Annotation Scheme Illustration
<ParDoc>
  <stage stage="1"/>
  <front>
    <dateline>22/11/1999</dateline>
    <title>22Nov1999 PHILIPPINES IN BRIEF - Rebels nab son</title>
  </front>
  <body><sec><p><s>
    <tok id="t16" pos="NIL" lem="zamboanga" lookup="NIL"
         orth="uppercase" zone="" sepAfter=" ">ZAMBOANGA</tok>
    <tok id="t17" pos="NIL" lem="city" lookup="NIL"
         orth="uppercase" zone="" sepAfter="">CITY</tok>
    . . .
  </s></p></sec> . . . </body>
  <back>
    <ParLex>
      <PNAMEX id="pn3" type="individual" tokref="t46 t47"
              features="idIndividual,path ThingIndividual,name IndividualPosition,Nation,Importance"/>
    </ParLex>
    <ParCon>
      <PEntity id="pe1" type="irish republican army"
               mnem="Irish Republican Army" refid="pn1"/>
      <PEvent id="ev1" type="abduction" text="was abducted"
              refId="t26 t27" class="OCCURRENCE" tense="PAST"
              polarity="POSITIVE" aspect="NONE"
              slot1="" slot2="" slot3="" slot4=""
              slot5="" slot6="" slot7="" slot8=""/>
    </ParCon>
  </back>
</ParDoc>
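
Because the scheme is plain XML, downstream tools can read it with any standard parser. The following is a minimal sketch, not part of the original slides: the sample document is a hypothetical, cut-down version of the illustration above, and the function name read_annotations is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down document modelled on the slide's illustration.
SAMPLE = """<ParDoc>
  <front><dateline>22/11/1999</dateline></front>
  <body><sec><p><s>
    <tok id="t16" pos="NIL" lem="zamboanga" orth="uppercase">ZAMBOANGA</tok>
    <tok id="t17" pos="NIL" lem="city" orth="uppercase">CITY</tok>
  </s></p></sec></body>
  <back>
    <ParCon>
      <PEvent id="ev1" type="abduction" text="was abducted"
              tense="PAST" polarity="POSITIVE"/>
    </ParCon>
  </back>
</ParDoc>"""


def read_annotations(xml_text):
    """Return (tokens, events) found in an annotated document."""
    root = ET.fromstring(xml_text)
    # Tokens live in <body>; keep their id and surface text.
    tokens = [(tok.get("id"), tok.text) for tok in root.iter("tok")]
    # Conceptual event annotations live in <back>; keep all attributes.
    events = [dict(ev.attrib) for ev in root.iter("PEvent")]
    return tokens, events


tokens, events = read_annotations(SAMPLE)
```

Here tokens holds the id and text of each token, and events holds one attribute dictionary per PEvent element.
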
9
Text Mining Architecture
10
Parmenides Resource Manager: defines pipelines,
queues documents between processes
11
Document capture and conversion
  • Web and directory crawling
  • Batch and interactive use
  • Format conversion
  • Word, HTML, PDF, etc. to Common Annotation
    Scheme
  • Text zoning
  • Separate front matter from body text
  • Attempt to annotate headings etc.

12
Cafetiere Information Extraction Framework
  • Texts are analysed at several levels leading to a
    template representation of events
  • Tokenization and tagging
  • Sentence splitting and optional term discovery
  • Ontology or gazetteer lookup
  • Phrasal analysis to classify name expressions
  • Phrasal analysis to fill slots of template
    representations of entities and events.

13
From Tokenization to Semantic Parsing
  • Tokenization
  • words, numbers, punct., tel.nos., chemical
    formulae, etc.
  • Tagging
  • Part-of-speech labelling: disambiguation in local
    context
  • Semantic dictionary/ontology lookup
  • Known names, terms, heads of terms and names
  • Partial parsing
  • Identify phrasal chunks: names, domain terms and
    other NPs; temporal elements: tensed verbs,
    adverbials.
  • Semantic information extraction
  • Build template or graph rep. of events/facts

14
The NLP components
  • Part of Speech Tagger
  • Based on Brill algorithm, locally trained from
    publicly available data
  • Ontology Lookup
  • Accesses semantic category and properties of
    application-interesting words and phrases.
  • Rule-based phrasal analyser
  • Finds and labels phrases of application-interest,
    using tag, lookup, orthography and output of
    other rules. Returns feature values as well as
    span labels.
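
A Brill-style tagger first assigns each word an initial tag and then applies an ordered list of learned contextual rewrite rules. The sketch below illustrates only the rule-application step; the rule format (from_tag, to_tag, prev_tag) and the single example rule are assumptions for illustration, not the rule set used in Cafetiere.

```python
def apply_rules(tagged, rules):
    """Apply contextual rewrite rules, in order, to a tagged sentence.

    tagged: list of (word, tag) pairs with initial tags.
    rules:  list of (from_tag, to_tag, prev_tag), read as
            "change from_tag to to_tag when the previous tag is prev_tag".
    """
    tags = [tag for _, tag in tagged]
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return [(word, tag) for (word, _), tag in zip(tagged, tags)]


# Hypothetical initial tagging: every "run" starts out as a noun.
initial = [("to", "TO"), ("run", "NN"), ("the", "DT"), ("run", "NN")]
rules = [("NN", "VB", "TO")]  # a noun directly after "to" becomes a verb
result = apply_rules(initial, rules)
```

Only the first "run" is retagged, because only it follows a TO tag.
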

15
Tokenization and POS tagging
  • Tokenizer separates words and other tokens,
    analyzes each orthographically.
  • Transformation-based Learning is used to train a
    part of speech tagger.
  • New Java implementation of Brill algorithm
  • Fast in operation, circa 100K words/sec.
  • Tagger available separately, distributed with
    VisualText.
  • Sentence splitter differentiates sentence
    punctuation from other usages of .?!
  • Next slide shows tabular view of token attributes

16
(No Transcript)
17
Ontology Lookup
  • IE systems typically consult lists of known names
    of places, people, organizations, artifacts,
    etc., and tokens that heuristically indicate
    class of name, e.g. Dr., Plc.
  • Cafetiere consults a knowledge base, which
    associates ontology class and/or entity
    identifier, as well as slot names and type
    constraints.
  • Previous slide showed ontology class in the
    lookup column of the token attributes.
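
A simple way to picture the lookup step is a longest-match search of token sequences against a table of known names. The sketch below assumes a tiny in-memory table; the entries, class names and the function name lookup are hypothetical, and the real knowledge base also carries entity identifiers, slot names and type constraints.

```python
# Hypothetical knowledge base: lower-cased token sequences -> ontology class.
KB = {
    ("irish", "republican", "army"): "organization",
    ("zamboanga", "city"): "location",
    ("dr",): "title_marker",
}
MAX_LEN = max(len(key) for key in KB)


def lookup(tokens):
    """Return (start, end, ontology_class) for the longest match at each position."""
    spans, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in KB:
                spans.append((i, i + n, KB[key]))
                i += n
                break
        else:
            i += 1  # no entry starts here; move on one token
    return spans


spans = lookup(["The", "Irish", "Republican", "Army", "left", "ZAMBOANGA", "CITY"])
```

Each span gives token offsets (end exclusive) plus the class that would appear in the lookup column.
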

18
Ontology/KB can be browsed within Cafetiere
19
Phrasal Analysis
  • Cafetiere finds instances of
  • Proper names of people, places, organizations, or
    other application-motivated named entities, e.g.
    genes, proteins.
  • Temporal expressions, including adverbials,
    dates, verb groups.
  • Descriptive phrases, e.g. NPs in apposition to
    names.
  • and classifies them by conceptual category.
  • The next slides show how these can be accessed after
    analysis in a document browser.

20
(No Transcript)
21
Entity extraction
  • Typically, named entities are mentioned several
    times in a text.
  • Cafetiere groups together the instances of each
    named entity when creating a conceptual
    annotation.
  • Resolves some co-references, especially variant
    forms of proper names.
  • An initial entity is created when a name
    expression is found. Later occurrences are added
    to form equivalence classes.
  • Next slide shows the phrasal instances of a
    single entity mentioned in the text.
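
The grouping of name variants into equivalence classes can be sketched as below. The matching heuristics (a later mention is a word subset of the first mention, or its acronym) are assumptions chosen for illustration; the slides do not specify the rules Cafetiere actually applies.

```python
def acronym(name):
    """Build an acronym from the initial letters of a name."""
    return "".join(word[0] for word in name.split()).upper()


def same_entity(mention, canonical):
    """Heuristic: mention is a word subset or the acronym of the canonical name."""
    m_words = set(mention.lower().split())
    c_words = set(canonical.lower().split())
    return m_words <= c_words or mention.upper() == acronym(canonical)


def group_mentions(mentions):
    """Group mentions in text order; the first mention of each entity is canonical."""
    entities = []
    for mention in mentions:
        for entity in entities:
            if same_entity(mention, entity[0]):
                entity.append(mention)
                break
        else:
            entities.append([mention])  # create an initial entity
    return entities


groups = group_mentions(
    ["Irish Republican Army", "IRA", "John Smith", "Smith", "Republican Army"]
)
```

An initial entity is created for the first full name, and later variants are added to its class.
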

22
(No Transcript)
23
Event extraction
  • At the phrasal analysis stage, verb groups and
    noun phrases denoting events have TimeML features
    assigned.
  • Event extraction is either rule-based or
    ontology-driven.
  • Each event type has a number of conceptual slots,
    and Cafetiere matches conceptually annotated text
    fragments within the sentence, where they match
    the slot types.
  • Next slide shows the event browser after
    analysis, with one event's slots in a detail
    view.
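
Matching annotated fragments to typed slots can be illustrated with a small sketch. The event definition, slot names and concept types below are hypothetical; they stand in for the event ontology, which the slides do not list in detail.

```python
# Hypothetical event ontology: event type -> {slot name: required concept type}.
EVENT_TYPES = {
    "abduction": {
        "perpetrator": "organization",
        "victim": "person",
        "place": "location",
    }
}


def fill_slots(event_type, annotations):
    """Fill each slot with the first unused annotation of the required type.

    annotations: list of (text, concept_type) in sentence order.
    """
    slots, used = {}, set()
    for slot, wanted in EVENT_TYPES[event_type].items():
        slots[slot] = None
        for idx, (text, ctype) in enumerate(annotations):
            if ctype == wanted and idx not in used:
                slots[slot] = text
                used.add(idx)
                break
    return slots


sentence_annotations = [
    ("the mayor's son", "person"),
    ("Zamboanga City", "location"),
    ("rebels", "organization"),
]
event = fill_slots("abduction", sentence_annotations)
```

Slots with no type-compatible fragment in the sentence stay empty (None), much like the empty slot attributes in the PEvent illustration.
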

24
(No Transcript)
25
Semi-automatic annotation
  • The Cafetiere browser also has editing controls,
    which enable a corrected annotation to be saved.
  • Useful for applications where validated extracted
    data are presented to the end users.
  • Useful for annotating documents to create
    training or evaluation gold standard data.
  • For events, browser shows all compatible entities
    as alternative slot fillers to the one selected
    by the automatic analysis.

26
Viewing Analysis Results: Press Release Analyst by
Biovista
27
(No Transcript)
28
(No Transcript)
29
Mining Scientific Literature
  • In the Parmenides project, a case study has been
    conducted with Unilever on mining scientific
    papers on weight management.
  • At NaCTeM, work on Term Management focuses on
    domain terminology for biological sciences and
    medicine.
  • Joint work with Lancaster University seeks to
    advance conceptual summarization by extracting
    causal relations expressed in text.

30
Unilever Case Study
  • Weight management experimental papers
  • Template-level representation of key features of
    the study
  • Clinical study subjects
  • Study population
  • Clinical study design
  • Nutritional metabolic phenomenon
  • Work function (effect)
  • Health benefit

31
Phrasal Analysis
  • Same techniques as for news NE analysis
  • Targeting descriptive phrases rather than complex
    proper names.
  • Domain terms important
  • Sentence-based fact extraction won't work,
    because information is distributed throughout the text.

32
(No Transcript)
33
(No Transcript)
34
Template slot filling
  • Each slot has many candidate fillers
  • Often benign paraphrases of each other, but can
    emphasize different attributes
  • Selection is heuristic, based on proximity to
    trigger word and similarity of terms.
  • For Unilever's purposes, it is acceptable to get a
    filler wrong sometimes if the candidates can be
    easily substituted in the template editor.
  • Not yet handled properly: multiple slot fillers
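
One way to read the heuristic above is as a score that adds closeness to the trigger word and term similarity, with all candidates kept in ranked order so an editor can swap in another. The sketch below is an assumption-laden illustration: the scoring formula, the window size, the Jaccard similarity and the sample candidates are all hypothetical, not the method used in the case study.

```python
def similarity(a, b):
    """Jaccard overlap between the word sets of two terms."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def rank_candidates(candidates, trigger_pos, slot_term, window=50):
    """Rank (text, token_position) candidates, best first.

    Score = proximity to the trigger word (1.0 at the trigger, 0.0 beyond
    `window` tokens away) + similarity to the slot's descriptive term.
    """
    def score(candidate):
        text, pos = candidate
        proximity = max(0.0, 1.0 - abs(pos - trigger_pos) / window)
        return proximity + similarity(text, slot_term)

    return sorted(candidates, key=score, reverse=True)


# Hypothetical candidates for a "study subjects" slot; trigger word at token 10.
ranked = rank_candidates(
    [("obese adult subjects", 12), ("healthy volunteers", 90), ("subjects", 14)],
    trigger_pos=10,
    slot_term="study subjects",
)
```

The first element is the automatic choice; the remaining candidates stay available as alternatives for manual substitution.
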

35
(No Transcript)
36
Conceptual Abstracting
  • Joint work with Chris Paice at Lancaster
    University.
  • Paice and Jones (1993) pioneered template-based
    abstracting of crop science papers.
  • Similar analysis to Unilever case study, leading
    to a template, and then a generated short
    informative abstract.
  • Major drawback is the domain-specific resource
    development needed.

37
Ameliorating the resource bottleneck
  • One approach (Paice and Oakes): develop rules
    by supervised machine learning
  • Transfers effort from rule-based analysis to
    corpus annotation.
  • Current approach (Paice and Black): develop
    domain-independent extraction of causal and other
    key relationships expressed in scientific papers
  • Incorporates term discovery, ontology lookup,
    stemming, tagging. Implemented in Cafetiere
    framework.

38
(No Transcript)
39
NLP Components as Services
  • Stratified processing modules are well defined
  • Common Annotation Scheme facilitates
    interoperability
  • Each module potentially a Web service
  • Queuing between modules should minimize network
    traffic
  • User defines pipeline; processing takes place on
    servers; users share cached common module
    analyses; users access curated data.
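
The idea of user-defined pipelines that share cached module analyses can be sketched in a few lines. Everything here is hypothetical: the Pipeline class, the cache keyed by document hash plus module path, and the toy modules stand in for real services, and the sketch assumes that modules with the same name compute the same function.

```python
import hashlib


class Pipeline:
    """A user-defined sequence of named modules with a shareable result cache."""

    def __init__(self, modules, cache=None):
        self.modules = modules  # list of (name, function)
        self.cache = cache if cache is not None else {}
        self.calls = 0  # counts real module invocations (cache misses)

    def run(self, text):
        data = text
        # The key grows with each stage, so a cached result is reused only
        # when the document and all earlier stages are identical.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        for name, func in self.modules:
            key = f"{key}/{name}"
            if key not in self.cache:
                self.cache[key] = func(data)
                self.calls += 1
            data = self.cache[key]
        return data


shared_cache = {}
lowercase = lambda toks: [t.lower() for t in toks]
user_a = Pipeline([("tokenize", str.split), ("lower", lowercase)], shared_cache)
user_b = Pipeline([("tokenize", str.split), ("count", len)], shared_cache)

result_a = user_a.run("Rebels nab son")
result_b = user_b.run("Rebels nab son")
```

The second user's pipeline reuses the cached tokenization and only runs its own final module.
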

40
Conclusion
  • Cafetiere is a framework for the information
    extraction phases of text mining.
  • Incorporates context-sensitive partial parsing of
    names, terms, chunks.
  • Linkage to event ontology enables template slot
    filling and hence fact/event extraction.
  • Domain-specific resources (ontology, rules) can
    be developed for diverse domains: business,
    science.

41
Future Developments
  • Use corpus-trained components for phrasal
    analysis of bio-medical literature.
  • Improve rule-application engine performance.
  • Cluster processing to balance IE processing
    against more superficial processes.