1
Scalable Information Extraction
  • Eugene Agichtein

2
Example: Angina treatments
Structured databases (e.g., drug info, WHO drug
adverse effects DB, etc)
Medical reference and literature
Web search results
3
Research Goal
  • Accurate, intuitive, and efficient access to
    knowledge in unstructured sources
  • Approaches
  • Information Retrieval
  • Retrieve the relevant documents or passages
  • Question answering
  • Human Reading
  • Construct domain-specific verticals (MedLine)
  • Machine Reading
  • Extract entities and relationships
  • Network of relationships → Semantic Web

4
Semantic Relationships Buried in Unstructured
Text
RecommendedTreatment
"A number of well-designed and -executed large-scale clinical trials have now shown that treatment with statins reduces recurrent myocardial infarction, reduces strokes, and lessens the need for revascularization or hospitalization for unstable angina pectoris."
  • Web, newsgroups, web logs
  • Text databases (PubMed, CiteSeer, etc.)
  • Newspaper Archives
  • Corporate mergers, succession, location
  • Terrorist attacks

5
What Structured Representation Can Do for You
Structured Relation
  • allow precise and efficient querying
  • allow returning answers instead of documents
  • support powerful query constructs
  • allow data integration with (structured) RDBMS
  • provide useful content for Semantic Web

6
Challenges in Information Extraction
  • Portability
  • Reduce effort to tune for new domains and tasks
  • MUC systems: experts would take 8-12 weeks to tune
  • Scalability, Efficiency, Access
  • Enable information extraction over large collections
  • 1 sec/document × 5 billion docs ≈ 158 CPU years
  • Approach: learn from data ("bootstrapping")
  • Snowball: Partially Supervised Information Extraction
  • Querying Large Text Databases for Efficient Information Extraction

7
Outline
  • Snowball: partially supervised information extraction (overview and key results)
  • Effective retrieval algorithms for information extraction (in detail)
  • Current: mining user behavior for web search
  • Future work

8
The Snowball System Overview
[Snowball system architecture diagram]
9
Snowball: Getting User Input
ACM DL 2000
  • User input
  • a handful of example instances
  • integrity constraints on the relation, e.g., Organization is a key, Age > 0, etc.

10
Snowball: Finding Example Occurrences
Can use any full-text search engine
Search Engine
"Computer servers at Microsoft's headquarters in Redmond…" "In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp…" "The Armonk-based IBM introduced a new line…" "Change of guard at IBM Corporation's headquarters near Armonk, NY…"
11
Snowball: Tagging Entities
Named entity taggers can recognize Dates, People, Locations, Organizations, … (MITRE's Alembic, IBM's TALENT, LingPipe, …)
"Computer servers at Microsoft's headquarters in Redmond… In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp… The Armonk-based IBM introduced a new line… Change of guard at IBM Corporation's headquarters near Armonk, NY…" (organizations and locations tagged)
12
Snowball: Extraction Patterns
  • General extraction pattern model
  • {acceptor0, Entity, acceptor1, Entity, acceptor2}
  • Acceptor instantiations (see the sketch below)
  • String Match (accepts the string "'s headquarters in")
  • Vector-Space (vector {('s, 0.5), (headquarters, 0.5), (in, 0.5)})
  • Classifier (estimate P(T = valid | 's, headquarters, in))
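A vector-space acceptor can be pictured as cosine similarity between term-weight vectors. Here is a minimal sketch (the helper name and toy weights are illustrative assumptions, not the original Snowball code):

  from collections import Counter
  from math import sqrt

  def cosine(a, b):
      # dot product over shared terms, normalized by vector lengths
      dot = sum(w * b[t] for t, w in a.items())
      na = sqrt(sum(w * w for w in a.values()))
      nb = sqrt(sum(w * w for w in b.values()))
      return dot / (na * nb) if na and nb else 0.0

  # Middle-context vector of a pattern <acceptor, ORGANIZATION, acceptor, LOCATION, acceptor>
  pattern_middle = Counter({"'s": 0.5, "headquarters": 0.5, "in": 0.5})
  # Middle context observed around a candidate (ORGANIZATION, LOCATION) pair
  occurrence_middle = Counter({"'s": 0.6, "new": 0.3, "headquarters": 0.6, "in": 0.4})

  print(round(cosine(pattern_middle, occurrence_middle), 2))  # accept if above a threshold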

13
Snowball: Generating Patterns
1. Represent occurrences as vectors of tags and terms.
2. Cluster similar occurrences.

14
Snowball: Generating Patterns
1. Represent occurrences as vectors of tags and terms.
2. Cluster similar occurrences.
3. Create patterns as filtered cluster centroids (sketched below).
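The filtered centroid of step 3 can be sketched as averaging the cluster's vectors and dropping low-weight terms; the toy vectors and the 0.2 cutoff below are assumptions:

  from collections import Counter

  cluster = [
      Counter({"'s": 0.5, "headquarters": 0.6, "in": 0.4}),
      Counter({"'s": 0.4, "headquarters": 0.5, "near": 0.3}),
  ]
  centroid = Counter()
  for vec in cluster:
      centroid.update(vec)  # Counter.update sums weights term-by-term
  # filter: keep terms whose average weight clears the threshold
  pattern = {t: w / len(cluster) for t, w in centroid.items() if w / len(cluster) >= 0.2}
  print(pattern)  # weak terms (here, "near") are filtered out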

15
Snowball: Extracting New Tuples
Match tagged text fragments against patterns. Example: "Google 's new headquarters in Mountain View are…" matches pattern P1 (<ORGANIZATION> … <LOCATION>) with similarity 0.8, P2 with 0.4, and P3 (a <LOCATION> … <ORGANIZATION> pattern) with 0.
16
Snowball: Evaluating Patterns
Automatically estimate pattern confidence: Conf(P4) = Positive / Total = 2/3 ≈ 0.66, checking extractions against the current seed tuples:
  • <IBM, Armonk> "reported…" → Positive
  • <Intel, Santa Clara> "introduced…" → Positive
  • "Bet on Microsoft", New York-based analyst Jane Smith said… → Negative (contradicts the seed tuple's headquarters for Microsoft)
17
Snowball: Evaluating Tuples
Automatically evaluate tuple confidence, Conf(T): a tuple has high confidence if it is generated by high-confidence patterns, combined as

Conf(T) = 1 - ∏i (1 - Conf(Pi) × Match(Ci, Pi))

Example: <3Com, Santa Clara> is extracted by P3 (Conf = 0.95, Match = 0.8) and P4 (Conf = 0.66, Match = 0.4), giving Conf(T) = 1 - (1 - 0.95×0.8)(1 - 0.66×0.4) ≈ 0.83 (see the worked sketch below).
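A worked version of that combination, as a minimal sketch (the formula follows the Snowball papers; the numbers are the ones above):

  def tuple_confidence(supports):
      # supports: (pattern confidence, context-to-pattern match) pairs;
      # the tuple is wrong only if every supporting match is spurious
      conf = 1.0
      for p_conf, match in supports:
          conf *= 1.0 - p_conf * match
      return 1.0 - conf

  # <3Com, Santa Clara>: P3 (0.95, match 0.8) and P4 (0.66, match 0.4)
  print(round(tuple_confidence([(0.95, 0.8), (0.66, 0.4)]), 2))  # 0.82, ~0.83 on the slide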
18
Snowball: Evaluating Tuples
Keep only high-confidence tuples for next
iteration
19
Snowball: Evaluating Tuples

Start a new iteration with the expanded example set; iterate until no new tuples are extracted.
20
Pattern-Tuple Duality
  • A good tuple
  • Extracted by good patterns
  • Tuple weight ≈ goodness
  • A good pattern
  • Generated by good tuples
  • Extracts good new tuples
  • Pattern weight ≈ goodness
  • Edge weight
  • Match/similarity of tuple context to pattern

21
How to Set Node Weights
  • Constraint violation (from before)
  • Conf(P) = log(Pos) × Pos/(Pos+Neg)
  • Conf(T) as defined above
  • HITS (Hassan et al., EMNLP 2006), see the sketch below
  • Conf(P) ∝ Σ Conf(T)
  • Conf(T) ∝ Σ Conf(P)
  • URNS (Downey et al., IJCAI 2005)
  • EM-Spy (Agichtein, SDM 2006)
  • Unknown tuples → Neg
  • Compute Conf(P), Conf(T)
  • Iterate
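The HITS-style option can be sketched as mutual reinforcement over the pattern-tuple bipartite graph (the toy graph and L2 normalization below are assumptions):

  edges = {"P1": ["t1", "t2"], "P2": ["t2", "t3"]}  # pattern -> extracted tuples
  tuples = sorted({t for ts in edges.values() for t in ts})
  conf_p = {p: 1.0 for p in edges}

  for _ in range(20):  # iterate to approximate convergence
      # a tuple is good if extracted by good patterns...
      conf_t = {t: sum(c for p, c in conf_p.items() if t in edges[p]) for t in tuples}
      # ...and a pattern is good if it extracts good tuples
      conf_p = {p: sum(conf_t[t] for t in ts) for p, ts in edges.items()}
      zt = sum(v * v for v in conf_t.values()) ** 0.5
      zp = sum(v * v for v in conf_p.values()) ** 0.5
      conf_t = {t: v / zt for t, v in conf_t.items()}
      conf_p = {p: v / zp for p, v in conf_p.items()}

  print(conf_p, conf_t)  # t2, which both patterns extract, scores highest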

22
Snowball: EM-based Pattern Evaluation
23
Evaluating Patterns and Tuples: Expectation Maximization
  • EM-Spy Algorithm
  • Hide labels for some seed tuples ("spies")
  • Iterate the EM algorithm to convergence on tuple/pattern confidence values
  • Set the threshold t so that 90% of spy tuples score above t
  • Re-initialize Snowball using the new seed tuples
24
Adapting Snowball for New Relations
  • Large parameter space
  • Initial seed tuples (randomly chosen, multiple runs)
  • Acceptor features: words, stems, n-grams, phrases, punctuation, POS
  • Feature selection techniques: OR, NB, Freq, support, combinations
  • Feature weights: TF·IDF, TF, TF·NB, NB
  • Pattern evaluation strategies: NN, constraint violation, EM, EM-Spy
  • Automatically estimate parameter values
  • Estimate operating parameters based on occurrences of seed tuples
  • Run cross-validation on hold-out sets of seed tuples for optimal performance
  • Seed occurrences that do not have close neighbors are discarded

25
Example Task 1: DiseaseOutbreaks
SDM 2006
Proteus: 0.409; Snowball: 0.415
26
Example Task 2: Bioinformatics, a.k.a. mining the bibliome
ISMB 2003
"APO-1, also known as DR6…"; "MEK4, also called SEK1…"
  • 100,000 gene and protein synonyms extracted from 50,000 journal articles
  • Approximately 40% of confirmed synonyms were not previously listed in the curated authoritative reference (SWISS-PROT)

27
Snowball Used in Various Domains
  • News: NYT, WSJ, AP (DL 2000, SDM 2006)
  • CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
  • Medical literature: PDRhealth, Micromedex (Thesis)
  • AdverseEffects, DrugInteractions, RecommendedTreatments
  • Biological literature: GeneWays corpus (ISMB 2003)
  • Gene and Protein Synonyms

28
Limits of Bootstrapping for Extraction
CIKM 2005
  • The task is easy when context term distributions diverge from the background
  • Quantify divergence as relative entropy (Kullback-Leibler divergence), as sketched below
  • After calibration, the metric predicts whether bootstrapping is likely to work
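A minimal sketch of that divergence computation (the toy distributions are assumptions; real context distributions come from the terms around seed occurrences):

  from math import log2

  def kl_divergence(p, q):
      # D(p || q) in bits; assumes q covers p's support (smooth q in practice)
      return sum(pw * log2(pw / q[t]) for t, pw in p.items() if pw > 0)

  context = {"headquarters": 0.4, "based": 0.3, "in": 0.3}        # terms around seeds
  background = {"headquarters": 0.01, "based": 0.04, "in": 0.95}  # collection-wide
  print(f"{kl_divergence(context, background):.2f} bits")  # large divergence -> "easy" task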

29
Few Relations Cover Common Questions
SIGIR 2005
  • 25 relations cover ~50% of question types; 5 relations cover ~55% of question instances

30
Outline
  • Snowball, a domain-independent, partially
    supervised information extraction system
  • Retrieval algorithms for scalable information
    extraction
  • Current: mining user behavior for web search
  • Future work

31
Extracting a Relation From a Large Text Database
Information Extraction System
Structured Relation
  • Brute-force approach: feed all docs to the information extraction system
  • Often only a tiny fraction of the documents is useful
  • Many databases are not crawlable
  • Often a search interface is available, with an existing keyword index
  • How to identify useful documents?

32
Accessing Text DBs via Search Engines
Information Extraction System
Search Engine
  • Search engines impose limitations
  • Limit on documents retrieved per query
  • Support simple keywords and phrases
  • Ignore stopwords (e.g., "a", "is")

Structured Relation
33
QXtract: Querying Text Databases for Robust, Scalable Information EXtraction
User-Provided Seed Tuples → Query Generation → Queries → Promising Documents → Information Extraction System → Extracted Relation
Problem: learn keyword queries to retrieve promising documents
34
Learning Queries to Retrieve Promising Documents
User-Provided Seed Tuples
  • Get a document sample with likely negative and likely positive examples.
  • Label sample documents using the information extraction system as an oracle.
  • Train classifiers to recognize useful documents.
  • Generate queries from the classifier model/rules.

Seed Sampling
Information Extraction System
Classifier Training
Query Generation
Queries
35
Training Classifiers to Recognize Useful
Documents
Document features: words. Label sample documents (+ useful, − not useful) using the extraction system, then train classifiers (Ripper, SVM, Okapi (IR)). The learned models separate useful-document terms (disease, reported, epidemic, infected, virus) from useless ones (products, exported, used, far); e.g., the Ripper rule disease AND reported → USEFUL.
36
Generating Queries from Classifiers
From the classifier models (SVM, Ripper, Okapi (IR)), generate keyword queries, e.g.: disease AND reported (from the Ripper rule disease AND reported → USEFUL); epidemic virus; virus infected.
QCombined: disease AND reported, epidemic, virus
37
SIGMOD 2003 Demonstration
38
Tuples: A Simple Querying Strategy
"Ebola" AND "Zaire"
Search Engine
Information Extraction System
  • Convert given tuples into queries
  • Retrieve matching documents
  • Extract new tuples from documents and iterate

39
Comparison of Document Access Methods
QXtract: 60% of the relation extracted from 10% of the documents of a 135,000-article newspaper database. Tuples strategy: recall at most 46%.
40
How to choose the best strategy?
  • Tuples: simple, no training, but limited recall
  • QXtract: robust, but has training and query overhead
  • Scan: no overhead, but must process all documents

41
Predicting Recall of Tuples Strategy
WebDB 2003
Starting from a seed tuple, the strategy sometimes reaches most of the relation (success) and sometimes stalls after a few documents (failure). Can we predict if Tuples will succeed?
42
Abstract the problem: the Querying Graph
Tuples and documents form a bipartite graph: a tuple (e.g., <Ebola, Zaire>, sent as the query "Ebola" AND "Zaire") retrieves documents, and documents contain tuples. Note: only the top K docs are returned for each query.
A query may retrieve many documents that do not contain tuples, and searching for an extracted tuple may not retrieve its source document.
43
Information Reachability Graph
In the reachability graph, an edge t → d means that querying with tuple t retrieves document d, and d → t' means that d contains t'. Example: t1 retrieves document d1, which contains t2; following such edges, t2, t3, and t4 are reachable from t1.
44
Connected Components
  • Out: reachable tuples that do not retrieve tuples in the Core
  • Core: tuples that retrieve other tuples and themselves
  • In: tuples that retrieve other tuples but are not themselves reachable
45
Sizes of Connected Components
How many tuples are in the largest Core + Out?
[Diagram: In → Core (strongly connected) → Out]
  • Conjecture
  • Degree distribution in reachability graphs follows a power law.
  • Then the reachability graph has at most one giant component.
  • Define reachability as the fraction of tuples in the largest Core + Out

46
NYT Reachability Graph: Outdegree Distribution
The outdegree distribution matches a power law (shown for MaxResults = 10 and MaxResults = 50).
47
NYT Component Size Distribution
[Component-size distributions, reachable vs. not reachable]
MaxResults = 10: |CG| / |T| = 0.297. MaxResults = 50: |CG| / |T| = 0.620.
48
Connected Components Visualization
DiseaseOutbreaks, New York Times 1995
49
Estimating Reachability
  • In a power-law random graph G, a giant component CG emerges if d (the average outdegree) > 1 (Chung and Lu, Annals of Combinatorics, 2002)
  • Estimate reachability as |CG| / |T|
  • Depends only on d (the average outdegree)

50
Estimating Reachability Algorithm
  • Pick some random tuples
  • Use the tuples to query the database
  • Extract tuples from matching documents to compute the reachability-graph edges
  • Estimate the average outdegree (d = 1.5 in the example graph)
  • Estimate reachability using the results of Chung and Lu, Annals of Combinatorics, 2002 (a sketch follows)
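A minimal sketch of this procedure over a toy document collection (the data, MaxResults value, and helper names are assumptions, not the paper's code):

  import random

  # Toy database: each document mentions a few tuples (tuples are just ints here)
  DOCS = {f"d{i}": random.sample(range(40), 3) for i in range(200)}

  def search(t, max_results=10):
      # query the engine with tuple t; only the top max_results docs are returned
      return [d for d, ts in DOCS.items() if t in ts][:max_results]

  def extract(d):
      return DOCS[d]  # tuples contained in document d

  def estimate_avg_outdegree(sample_size=15):
      sample = random.sample(range(40), sample_size)
      edges = 0
      for t in sample:
          reached = {t2 for d in search(t) for t2 in extract(d)}
          edges += len(reached - {t})  # outgoing edges t -> reachable tuples
      return edges / sample_size

  d = estimate_avg_outdegree()
  print(f"estimated average outdegree d = {d:.2f}")
  # d > 1 suggests a giant component, i.e., high reachability (Chung and Lu, 2002)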
51
Estimating Reachability of NYT
The reachability estimate converges to approximately 0.46 after about 50 queries; this can be used to predict success (or failure) of the Tuples querying strategy.
52
To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks, Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006
  • Information extraction applications extract
    structured relations from unstructured text

"May 19, 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"
Disease Outbreaks in The New York Times
Information Extraction System (e.g., NYU's Proteus)
53
An Abstract View of Text-Centric Tasks
Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006
Text Database
Extraction System
  • Retrieve documents from database
  • Process documents
  • Extract output tuples

54
Executing a Text-Centric Task
Text Database
Extraction System
  • Retrieve documents from the database
  • Process documents
  • Extract output tuples
  • Two major execution paradigms
  • Scan-based: retrieve and process documents sequentially
  • Index-based: query the database (e.g., "case fatality rate"), retrieve and process the documents in the results
  • Similar to the relational world: the underlying data distribution dictates which plan is best
  • Unlike the relational world:
  • Indexes are only approximate: the index is on keywords, not on the tuples of interest
  • Choice of execution plan affects output completeness (not only speed)
55
Execution Plan Characteristics
Question: how do we choose the fastest execution plan for reaching a target recall?
Text Database
Extraction System
  • Retrieve documents from the database
  • Process documents
  • Extract output tuples
  • Execution plans have two main characteristics
  • Execution time
  • Recall (fraction of tuples retrieved)

What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?
56
Outline
  • Description and analysis of crawl- and
    query-based plans
  • Scan
  • Filtered Scan
  • Iterative Set Expansion
  • Automatic Query Generation
  • Optimization strategy
  • Experimental results and conclusions

Crawl-based
Query-based
(Index-based)
57
Scan
Extraction System
Text Database
  • Retrieve docs from the database
  • Process documents
  • Extract output tuples
  • Scan retrieves and processes documents sequentially (until reaching the target recall)
  • Execution time = |Retrieved Docs| × (R + P), where R is the time to retrieve a document and P the time to process it

Question: how many documents does Scan retrieve to reach the target recall?
Filtered Scan uses a classifier to identify and process only promising documents (details in paper)
58
Estimating Recall of Scan
  • Modeling Scan for tuple t
  • What is the probability of seeing t (with frequency g(t)) after retrieving S documents?
  • A sampling-without-replacement process
  • After retrieving S documents, the frequency of tuple t follows a hypergeometric distribution
  • Recall for tuple t is the probability that the frequency of t in the S docs is > 0 (see the sketch below)
  • g(t): frequency of tuple t
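As a sketch, the per-tuple recall under this model can be computed directly from the hypergeometric zero-term (D, S, and g below are toy values, not measurements):

  from math import comb

  def p_seen(D, S, g):
      # P(frequency of t in the S sampled docs > 0) = 1 - P(frequency = 0)
      return 1.0 - comb(D - g, S) / comb(D, S)

  D = 100_000  # documents in the database
  g = 5        # documents that contain tuple t, i.e., g(t)
  for S in (10_000, 50_000, 90_000):
      print(S, round(p_seen(D, S, g), 3))  # recall for t grows with S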

59
Estimating Recall of Scan

  • Modeling Scan
  • Multiple sampling without replacement
    processes, one for each tuple
  • Overall recall is average recall across tuples
  • ? We can compute number of documents required to
    reach target recall

Execution time Retrieved Docs (R P)
60
Iterative Set Expansion
Text Database
Extraction System
Query Generation
  • Query the database with seed tuples (e.g., <Ebola, Zaire> becomes the query "Ebola" AND "Zaire")
  • Process the retrieved documents
  • Extract tuples from the docs
  • Augment the seed tuples with the new tuples and iterate
  • Execution time = |Retrieved Docs| × (R + P) + |Queries| × Q, where Q is the time to answer a query, R the time to retrieve a document, and P the time to process it

Question: how many queries and how many documents does Iterative Set Expansion need to reach the target recall? (A toy cost calculation follows.)
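For concreteness, here is a toy instantiation of the two cost formulas (all constants are assumptions, not measured values):

  R, P, Q = 0.1, 1.0, 0.5  # seconds to retrieve a doc, process a doc, answer a query

  def time_scan(retrieved_docs):
      return retrieved_docs * (R + P)

  def time_ise(retrieved_docs, queries):
      return retrieved_docs * (R + P) + queries * Q

  # Scan may need far more documents than Iterative Set Expansion for the same recall
  print(time_scan(100_000))    # 110000.0 s
  print(time_ise(5_000, 200))  # 5600.0 s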
61
Using Querying Graph for Analysis
Tuples and documents (the bipartite querying graph)
  • We need to compute the
  • Number of documents retrieved after sending Q tuples as queries (estimates time)
  • Number of tuples that appear in the retrieved documents (estimates recall)
  • To estimate these we need to compute the
  • Degree distribution of the tuples discovered by retrieving documents
  • Degree distribution of the documents retrieved by the tuples
  • (Not the same as the degree distribution of a randomly chosen tuple or document: it is easier to discover documents and tuples with high degrees)
62
Summary of Cost Analysis
  • Our analysis so far
  • Takes as input a target recall
  • Gives as output the time for each plan to reach the target recall (time = infinity if a plan cannot reach it)
  • Time and recall depend on task-specific properties of the database
  • Tuple degree distribution
  • Document degree distribution
  • Next, we show how to estimate degree distributions on-the-fly

63
Estimating Cost Model Parameters
  • Tuple and document degree distributions belong to known distribution families (e.g., power law)

We can characterize the distributions with only a few parameters!
64
Parameter Estimation
  • Naïve solution for parameter estimation
  • Start with a separate parameter-estimation phase
  • Perform random sampling on the database
  • Stop when cross-validation indicates high confidence
  • We can do better than this!
  • No need for a separate sampling phase
  • Sampling is equivalent to executing the task
  • ⇒ Piggyback parameter estimation onto execution
65
On-the-fly Parameter Estimation
[Plot: correct (but unknown) distribution vs. on-the-fly estimates]
  • Pick the most promising execution plan for the target recall, assuming default parameter values
  • Start executing the task
  • Update parameter estimates during execution
  • Switch plans if the updated statistics indicate so
  • Important
  • Only Scan acts as random sampling
  • All other execution plans need parameter adjustment (see paper)

66
Outline
  • Description and analysis of crawl- and
    query-based plans
  • Optimization strategy
  • Experimental results and conclusions

67
Correctness of Theoretical Analysis
Task: Disease Outbreaks; Snowball IE system; 182,531 documents from NYT; 16,921 tuples
  • Solid lines: actual time
  • Dotted lines: predicted time with correct parameters

68
Experimental Results (Information Extraction)
  • Solid lines: actual time
  • Green line: time with the optimizer
  • (results are similar in other experiments; see paper)

69
Conclusions
  • Common execution plans for multiple text-centric tasks
  • Analytic models for predicting the execution time and recall of various crawl- and query-based plans
  • Techniques for on-the-fly parameter estimation
  • An optimization framework that picks, on the fly, the fastest plan for a target recall

70
Can we do better?
  • Yes, for some information extraction systems

71
Bindings Engine (BE) (slides adapted from Cafarella 2005)
  • Bindings Engine (BE) is a search engine where
  • No downloads occur during query processing
  • Disk seeks are constant in corpus size
  • Queries are phrases with typed variables
  • BE's approach
  • Variabilized search query language
  • Pre-processes all documents before query-time
  • Integrates variable/type data with the inverted index, minimizing query seeks
72
BE Query Support
  • cities such as <NounPhrase>
  • President Bush <Verb>
  • <NounPhrase> is the capital of <NounPhrase>
  • reach me at <phone-number>
  • Any sequence of concrete terms and typed variables
  • NEAR is insufficient
  • Functions (e.g., head())

73
BE Operation
  • Like a generic search engine, BE
  • Downloads a corpus of pages
  • Creates an index
  • Uses index to process queries efficiently
  • BE further requires
  • Set of indexed types (e.g., NounPhrase), with a
    recognizer for each
  • String processing functions (e.g., head())
  • A BE system can only process types and functions
    that its index supports

74
75
Query: "such as"
Walk the two sorted docid posting lists (for "such" and "as") in parallel:
  • Test for equality
  • Advance the smaller pointer
  • Abort when a list is exhausted
Docids present in both lists (e.g., 19, 322) are returned. A sketch of this merge follows.
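This is the standard sorted-list intersection; the toy docids below echo the slide's example (this is a sketch, not BE's actual code):

  def intersect(a, b):
      out, i, j = [], 0, 0
      while i < len(a) and j < len(b):  # abort when a list is exhausted
          if a[i] == b[j]:              # test for equality
              out.append(a[i]); i += 1; j += 1
          elif a[i] < b[j]:             # advance the smaller pointer
              i += 1
          else:
              j += 1
      return out

  print(intersect([2, 19, 322, 400], [19, 321, 322]))  # [19, 322]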
76
Query: "such as"
In phrase queries, match positions as well.
77
Neighbor Index
  • At each position in the index, store neighbor text that might be useful
  • Let's index adjacent terms (AdjT) and noun phrases (NP)

Example: "I love cities such as Atlanta." (e.g., at "cities", the left adjacent term AdjT is "love")
78
Neighbor Index
  • At each position in the index, store neighbor text that might be useful
  • Let's index adjacent terms (AdjT) and noun phrases (NP)

Example: "I love cities such as Atlanta." Each position stores its AdjT and NP neighbors (e.g., AdjT = I / NP = I, then AdjT = cities / NP = cities at successive positions).
79
Neighbor Index
Query: cities such as <NounPhrase>
"I love cities such as Atlanta." The position of "as" stores AdjT = such on the left and NP = Atlanta on the right, so the variable can be bound from the index alone.
80
cities such as <NounPhrase>
Each posting stores, per document, a list of positions, and each position carries a block of neighbors (a blk_offset, then entries such as AdjT_left = such and NP_right = Atlanta). In doc 19, starting at position 8: "I love cities such as Atlanta."
  • Find phrase query positions, as with phrase queries
  • If a term is adjacent to a variable, extract the typed value from the stored neighbor (a sketch follows)
81
Current Research Directions
  • Modeling explicit and implicit network structures
  • Modeling evolution of explicit structure on the web, blogspace, Wikipedia
  • Modeling implicit link structures in text, collections, the web
  • Exploiting implicit and explicit social networks (e.g., for epidemiology)
  • Knowledge discovery from biological and medical data
  • Automatic sequence annotation → bioinformatics, genetics
  • Actionable knowledge extraction from medical articles
  • Robust information extraction, retrieval, and query processing
  • Integrating information in structured and unstructured sources
  • Robust search/question answering for medical applications
  • Confidence estimation for extraction from text and other sources
  • Detecting reliable signals from (noisy) text data (e.g., medical surveillance)
  • Accuracy (≠ authority) of online sources
  • Information diffusion/propagation in online sources
  • Information propagation on the web
  • In collaborative sources (Wikipedia, MedLine)
82
Page Quality: In Search of an Unbiased Web Ranking (Cho, Roy, Adams, SIGMOD 2005)
  • "Popular pages tend to get even more popular, while unpopular pages get ignored by an average user"

83
Sic Transit Gloria Telae: Towards an Understanding of the Web's Decay (Bar-Yossef, Broder, Kumar, Tomkins, WWW 2004)
84
Modeling Social Networks
  • For epidemiology, security, …

Email exchange mapped onto cubicle locations.
85
Some Research Directions
(Repeats the research-directions list from slide 81, adding: Query processing over unstructured text.)

86
Mining Text and Sequence Data
Agichtein and Eskin, PSB 2004
ROC50 scores for each class and method
87
Some Research Directions
(Repeats the research-directions list from slide 81.)

88
Structure and Evolution of Blogspace (Kumar, Novak, Raghavan, Tomkins, CACM 2004, KDD 2006)
Fraction of nodes in components of various sizes
within Flickr and Yahoo! 360 timegraph, by week.
89
Current Research Directions
(Repeats the research-directions list from slide 81, with "Information propagation on the web, news".)

90
Thank You
  • Details: http://www.mathcs.emory.edu/~eugene/