Entity-Based Data Mining from Spatio-Temporal Events and Text Sources: Presentation at KDD Program Review

Transcript and Presenter's Notes


1
Entity-Based Data Mining from Spatio-Temporal
Events and Text Sources: Presentation at KD-D
Program Review, Nov 18-19, 2003
  • Padhraic Smyth, Sharad Mehrotra
  • Information and Computer Science
  • University of California, Irvine
  • smyth, sharad_at_ics.uci.edu
  • www.datalab.uci.edu

2
Project Participants
  • Principal Investigators
  • Padhraic Smyth: Data mining
  • Sharad Mehrotra: Databases
  • Collaborators
  • Mark Steyvers: Text and Author Modeling
  • Postdoctoral Researchers
  • Michal Rosen-Zvi, Dmitri Kalashnikov
  • Staff Programmer
  • Amnon Meyers: Information Extraction
  • Students
  • PhD: Joshua O'Madadhain, Scott White, Yiming Ma,
    Dawit Seid
  • Undergraduates: Yan-Biao Boey, Momo Alhazzazi
  • Acknowledgements
  • Steve Lawrence for CiteSeer data

3
Problem of Interest
  • Intelligence Analysis today
  • Massive volumes/streams of data
  • Text (newswire, reports, etc)
  • Web data
  • Transactions/events
  • Central problems
  • Need flexible tools to support an analyst's
    exploration of the data
  • Automatically focus an analyst's attention on
    interesting parts of the data space
  • Need new theories/methods/tools.

4
Entities and Events
  • Entities = Individuals, groups, communities,
    organizations, etc.
  • Events = Contacts, collaborations, meetings,
    products, etc.
  • Working hypothesis
  • A large component of intelligence work is
    centered on entities and events
  • Extracting entity-information from text streams
    and transaction data
  • Predicting entity behavior
  • Detecting groups of related entities
  • Our broad goal
  • Develop next-generation data management,
    exploration, and analysis tools for entity-event
    data

5
Nodes = Entities = Biotech-Related Organizations
Edges = Events = Collaborations
6
Red indicates nodes selected by the data analyst
as important
7
Algorithm determines blue nodes are important
relative to red nodes (Oxford and Cambridge)
8
  • Research Issues
  • Information extraction
  • Data management tools
  • Visualization techniques
  • Interactive ad hoc querying and mining
  • Statistical modeling of graph data
  • Query languages for graphs
  • Scalability to large graphs

9
Focus of Our Research
Text Sources
Information Extraction
Entity-Event Databases
Statistical Modeling and Data Mining
Visualization
Query Languages
User Modeling
10
Major Themes in Our Work
  • Focus on data in the form of graphs
  • Nodes = entities, edges = events
  • Nodes and edges have attributes (e.g., temporal)
  • Year 1: entities = computer science researchers
  • Year 1: limited spatio-temporal aspects
  • Integration and coupling of
  • Statistical modeling and data mining
  • Visualization
  • Query languages and data management
  • Scalability
  • Methods should scale to millions of nodes and
    edges
  • User Interaction
  • Conditional query-driven analysis and mining
  • Contrast with offline global modeling

11
Accomplishments
  • Infrastructure and Data Sets
  • Created testbed data sets, e.g., 100k entities,
    400k events
  • Developed suite of text information extraction
    tools
  • Developed and released a general
    public-domain Java API for graph data analysis
    and visualization
  • Statistical Modeling and Data Mining
  • Developed new statistical technique for modeling
    entities based on authored text
  • Developed new class of scalable algorithms for
    interactive graph-based data mining

12
Accomplishments
  • Graph-based Querying
  • Developed framework for general graph-based query
    language
  • New accurate and efficient algorithms for
    interactive similarity queries and query
    refinement on graphs
  • Software Tools
  • Netsight: Java-based graph visualization and
    analysis tool
  • Browser tool for exploring author-topic models
  • Interactive query refinement system
  • Prototype system for graph-based query language
    for interacting with heterogeneous graph data

13
Publications in Year 1
  • Data Mining on Graphs
  • S. White and P. Smyth, Algorithms for Discovering
    Relative Importance In Graphs, Proceedings of the
    Ninth International ACM SIGKDD Conference, August
    2003. Extended version submitted to JICRD, June
    2003.
  • J. O'Madadhain, D. Fisher, S. White, and Y. Boey,
    The JUNG (Java Universal Network/Graph)
    Framework, UCI-ICS Tech Report 03-17, October
    2003; invited presentation, Stanford Workshop on
    Statistical Inference, Computing and
    Visualization for Graphs, August 2003.
  • Modeling the Internet and the Web: Probabilistic
    Methods and Algorithms, P. Baldi, P. Frasconi,
    and P. Smyth, Wiley, June 2003.
  • Statistical Author-Topic Models
  • T. Griffiths and M. Steyvers (in press). Finding
    Scientific Topics. Proceedings of the National
    Academy of Sciences
  • M. Steyvers, M. Rosen-Zvi, T. Griffiths, P.
    Smyth, Author Attribution with LDA, NIPS workshop
    on Syntax, Semantics, and Statistics, December
    2003
  • Data Management and Graph Querying
  • Y. Ma, S. Mehrotra, D. Seid, A Framework for
    Refining Similarity Queries Using Learning
    Techniques, UCI-ICS Tech Report 03-19, Nov. 2003.
    Extended version submitted to EDBT 2004.
  • Y. Ma, D. Seid, S. Mehrotra, Interactive
    Filtering of Data Streams by Refining Similarity
    Queries, UCI-ICS Tech Report 03-07, June 2003.

14
Data Sets
15
Information Extraction
16
Author Database Schema
Note: individual-centric, not document-centric
17
Focus of Our Research
Text Sources
Information Extraction
Entity-Event Databases
Statistical Modeling and Data Mining
Visualization
Query Languages
User Modeling
18
9/11 Network
19
From graphs to Markov chains
[Figure: weighted graph on nodes A, B, C, D with edge weights 2, 2, 3, 4]
  • Importance = recursive function of nodes pointing
    at you

20
From graphs to Markov chains
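[Figure: weighted graph on nodes A, B, C, D (edge weights 2, 2, 3, 4) next to the corresponding Markov chain on A, B, C, D with transition probabilities 0.33, 0.33, 0.4, 0.5, 0.5, 0.6, 0.77, 1.0]
  • Importance = recursive function of nodes pointing
    at you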

21
From graphs to Markov chains
[Figure: same weighted graph and Markov chain as on the previous slide]
  • Importance = recursive function of nodes pointing
    at you
  • Markov approach
  • Notion of a token circulating around in Markov
    fashion
  • Important actors see the token more often
  • Importance = stationary probability of each node
  • PageRank: surfer randomly following links on the
    Web
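
The Markov view above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stationary_distribution` and the toy graph are hypothetical (the slide's own edge list did not survive in the transcript), every node is assumed to have at least one outgoing edge, and the chain is assumed irreducible and aperiodic so that plain power iteration converges.

```python
def stationary_distribution(weights, iters=500):
    """Power iteration for the stationary distribution of a weighted digraph.

    weights maps each node to a dict {successor: positive edge weight}.
    Assumes every node has an outgoing edge and the chain is irreducible
    and aperiodic (otherwise the iteration may not converge).
    """
    nodes = sorted(weights)
    # Row-normalize edge weights into transition probabilities.
    trans = {}
    for u in nodes:
        total = sum(weights[u].values())
        trans[u] = {v: w / total for v, w in weights[u].items()}
    # Start the token at a uniformly random node and let it circulate.
    p = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iters):
        nxt = {u: 0.0 for u in nodes}
        for u in nodes:
            for v, t in trans[u].items():
                nxt[v] += p[u] * t
        p = nxt
    return p


# Hypothetical toy graph, not the one drawn on the slide.
toy_graph = {
    "A": {"B": 2.0, "C": 3.0},
    "B": {"A": 4.0, "D": 2.0},
    "C": {"A": 1.0},
    "D": {"A": 1.0, "C": 1.0},
}
importance = stationary_distribution(toy_graph)
ranking = sorted(importance, key=importance.get, reverse=True)
```

Nodes that the token visits more often end up earlier in `ranking`; in this toy graph that is A, since every cycle passes through it.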

24
Relative importance of node V to A: trade off
distance from A against structural importance of V
25
Add backlinks to A with probability b (e.g., 0.3)
26
Algorithms for Relative Importance (S. White and
P. Smyth, ACM KDD 2003; also JICRD, submitted)
  • PageRank with Priors (PRankP)
  • Random walks that start from A and return to A
    periodically
  • Relative importance = stationary probability
  • Iterative algorithm (e.g., Haveliwala, 2002)
  • HITS with priors
  • Formulate HITS as Markov chain, same idea.
  • K-Step Markov
  • Use the transient probability distribution
    starting from A
  • Faster than stationary probability methods
  • Weighted Paths
  • Heuristic approximation to K-step Markov; even
    faster
  • All algorithms scale linearly in number of edges
  • Different constant factors
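
Two of the ideas listed above, the random walk that returns to the root set with probability b and the K-step transient distribution, can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names, the toy graph, the uniform prior over root nodes, and the choice to sum the K transient distributions are all assumptions made for illustration.

```python
def transition_probs(weights):
    """Row-normalize {node: {successor: weight}} into transition probabilities."""
    trans = {}
    for u, out in weights.items():
        total = sum(out.values())
        trans[u] = {v: w / total for v, w in out.items()}
    return trans


def walk_one_step(trans, p):
    """Push the probability mass p one step along the chain."""
    nxt = {u: 0.0 for u in trans}
    for u, out in trans.items():
        for v, t in out.items():
            nxt[v] += p[u] * t
    return nxt


def root_prior(trans, roots):
    """Uniform prior over the root set (an assumption of this sketch)."""
    return {u: (1.0 / len(roots) if u in roots else 0.0) for u in trans}


def pagerank_with_priors(trans, roots, beta=0.3, iters=100):
    """Random walk that jumps back to the root set with probability beta."""
    prior = root_prior(trans, roots)
    p = dict(prior)
    for _ in range(iters):
        walked = walk_one_step(trans, p)
        p = {u: (1.0 - beta) * walked[u] + beta * prior[u] for u in trans}
    return p


def k_step_markov(trans, roots, k=3):
    """Sum of the transient distributions after 1..k steps from the roots."""
    p = root_prior(trans, roots)
    total = {u: 0.0 for u in trans}
    for _ in range(k):
        p = walk_one_step(trans, p)
        for u in trans:
            total[u] += p[u]
    return total


# Hypothetical toy graph; every node appears as a key with outgoing edges.
toy = transition_probs({
    "A": {"B": 2.0, "C": 3.0},
    "B": {"A": 4.0, "D": 2.0},
    "C": {"A": 1.0},
    "D": {"A": 1.0, "C": 1.0},
})
prp = pagerank_with_priors(toy, {"A"}, beta=0.3)
ksm_one = k_step_markov(toy, {"A"}, k=1)
ksm_two = k_step_markov(toy, {"A"}, k=2)
```

Each iteration touches every edge once, which is consistent with the linear-in-edges scaling noted above; the K-step variant simply stops after K passes instead of iterating to convergence.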

27
Experiments on Real-World Data
  • Terrorist Network
  • 63 nodes (terrorists)
  • 308 edges (known interactions)
  • Biotech Collaboration Network
  • 2700 nodes (biotech companies and collaborators)
  • 8690 edges (known collaborations)
  • CiteSeer Co-authorship Network
  • 85k nodes (authors)
  • 168k edges (collaborations)

28
Computation Times for Ranking Algorithms (in
seconds)
PRankP and HITS converged in 20-30 iterations
32
Weighted versus Unweighted Graphs
33
Visualization and Analysis Software
34
JUNG: Java Universal Network/Graph API
  • API for modeling, analyzing, and visualizing
    graphs
  • extendible object-oriented framework
  • makes use of existing Java APIs
  • provides a common language for handling graphs
  • open-source (encourages collaboration, reduces
    duplicated effort)
  • well-suited for building network data mining
    tools/applications
  • Features and contributions
  • Annotation of nodes and edges, filtering of
    graphs
  • support for multiple network types (directed,
    bipartite, affiliation)
  • visualization API for creating custom layouts and
    renderers
  • Multiple algorithms for clustering, connectivity,
    distances, flows, and importance ranking
  • Netsight graph analysis and visualization tool
  • Developed using the JUNG framework

35
JUNG: Java Universal Network/Graph Framework
  • http://jung.sourceforge.net

16,000 page visits, 800 downloads since August
36
Demo of Netsight software
37
Entity Models from Text Data
38
Authors
Words
Can we model authors, given documents? (more
generally, build statistical profiles of
entities given sparse observed data)
39
Authors
Hidden Topics
Words
Model: Author-Topic distributions, Topic-Word
distributions. Parameters learned via Bayesian
learning.
46
Hidden Topics
Words
Topic Model: a document can be generated from
multiple topics (Hofmann, SIGIR 99; Blei,
Jordan, Ng, JMLR 2003)
47
Authors
Hidden Topics
Words
Model: Author-Topic distributions, Topic-Word
distributions. NOTE: documents can be composed of
multiple topics.
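
One way to read the Authors, Hidden Topics, Words picture is as a generative process, sketched below. The sketch assumes that each word picks one of the document's authors uniformly at random, then a topic from that author's topic distribution, then a word from that topic's word distribution; the uniform author choice, the function name, and the toy distributions are assumptions for illustration, not taken from the slides.

```python
import random


def generate_document(authors, author_topics, topic_words, n_words, rng):
    """Generate (author, topic, word) triples for one document.

    author_topics[a] is a list of topic probabilities for author a.
    topic_words[z] is a dict {word: probability} for topic z.
    """
    tokens = []
    for _ in range(n_words):
        # Assumed: each word is attributed to one co-author, chosen uniformly.
        author = rng.choice(authors)
        topic_probs = author_topics[author]
        topic = rng.choices(range(len(topic_probs)), weights=topic_probs)[0]
        vocab = list(topic_words[topic])
        word = rng.choices(vocab, weights=[topic_words[topic][v] for v in vocab])[0]
        tokens.append((author, topic, word))
    return tokens


# Hypothetical toy model with two authors and two topics.
toy_author_topics = {"author_1": [0.9, 0.1], "author_2": [0.2, 0.8]}
toy_topic_words = [
    {"graph": 0.5, "node": 0.3, "edge": 0.2},
    {"query": 0.6, "index": 0.4},
]
tokens = generate_document(
    ["author_1", "author_2"], toy_author_topics, toy_topic_words,
    n_words=8, rng=random.Random(0),
)
```

Because different words in the same document can draw different topics, a single document ends up as a mixture of topics, which is the point made in the note above.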
48
Author Modeling Data Sets
49
Topic Models from CiteSeer
  • WORDS: probabilistic, Bayesian, carlo, monte,
    distribution, inference, conditional, prior,
    mixture, Markov, posterior, belief
  • AUTHORS: N_Friedman, D_Heckerman, Z_Ghahramani,
    D_Koller, M_Jordan, R_Neal, A_Raftery,
    T_Lukasiewicz, J_Halpern
  • WORDS: retrieval, text, document, information,
    content, indexing, relevance, collection, query,
    IR, feedback
  • AUTHORS: D. Oard, W_Croft, K_Jones, P_Schauble,
    E_Voorhees, A_Singhal, D_Hawking, J_Allan,
    A_Smeaton, M_Hearst

50
Topic Models from CiteSeer
  • WORDS: Web, user, world, wide, pages, www, site,
    internet, hypertext, hypermedia, content, links,
    page, navigation, ...
  • AUTHORS: S. Lawrence, B. Mobasher, M. Levene, D.
    Florescu, O. Etzioni, R_Studer, W. Hall, R.
    Fielding, J. Pitkow, M. Crovella
  • WORDS: data, mining, attributes, discovery,
    association, large, knowledge, databases,
    dataset, interesting, frequent, discover, sets
  • AUTHORS: J. Han, R. Rastogi, M. Zaki, R. Ng, B.
    Liu, H. Mannila, S. Brin, H. Liu, L. Holder, H.
    Toivonen

51
Author-Topic Models from CiteSeer
  • Author: A McCallum
  • Topic 1: classification, training,
    generalization, decision, data, ...
  • Topic 2: learning, machine, examples,
    reinforcement, inductive, ...
  • Topic 3: retrieval, text, document, information,
    content, ...
  • Author: H Garcia-Molina
  • Topic 1: query, index, data, join, processing,
    aggregate, ...
  • Topic 2: transaction, concurrency, copy,
    permission, distributed, ...
  • Topic 3: source, separation, paper,
    heterogeneous, merging, ...
  • Author: P Cohen
  • Topic 1: agent, multi, coordination,
    autonomous, intelligent, ...
  • Topic 2: planning, action, goal, world,
    execution, situation
  • Topic 3: human, interaction, people,
    cognitive, social, natural, ...

52
Author-Topic Browser
  • Interesting scalability issues
  • CiteSeer model exceeds 1 Gbyte
  • Real-time query answering demands Gibbs sampling
    (not well suited to SQL!)
  • Solution
  • Coupling of Gibbs sampling and relational DB (it
    works!)

[Diagram components: Java Query GUI; SQL Interface; Bayesian Sampling; MySQL DB (Original Text, Statistical Model)]
53
Demo of Author-Topic Browser
  • Note
  • Real-time querying on CiteSeer authors/documents
  • 85,000 authors
  • 163,000 documents
  • 30,000 unique words
  • 300 topics
  • Can query on
  • Authors, topics, words, documents
  • Topic distribution given documents/words requires
    sampling to estimate
  • Gibbs sampling is fast enough to answer queries
    in real-time

54
Applications of Author-Topic Models
  • Expert Finder
  • Find researchers who are knowledgeable in
    cryptography and machine learning within 100
    miles of Washington DC
  • Find reviewers for this set of NSF proposals who
    are active in relevant topics and have no
    conflicts of interest
  • Prediction
  • Given a document and some subset of known authors
    for the paper (k = 0, 1, 2), predict the other
    authors
  • Predict how many papers in different topics will
    appear next year
  • Change Detection/Monitoring
  • Which authors are on the leading edge of new
    topics?
  • Characterize the topic trajectory of this
    author over time

55
Data and Topic Models
  • Topic-author model with 300 topics built from
    162,489 CiteSeer abstracts
  • Each word in each document assigned to a topic
  • For the subset of 131,602 documents for which we
    know the year
  • Group documents by year
  • Calculate the fraction of words each year
    assigned to a topic
  • Plot the resulting time-series, 1990 to 2002
  • Caveats
  • Data set is incomplete (see next slide)
  • Relatively few documents from 2001 and 2002
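
The per-year computation described above (group documents by year, then take the fraction of that year's words assigned to each topic) can be sketched as follows. The input layout, a list of `(year, topic id per word)` pairs, and the function name are assumptions for illustration.

```python
from collections import Counter, defaultdict


def topic_fractions_by_year(documents):
    """documents: iterable of (year, [topic id assigned to each word]).

    Returns {year: {topic: fraction of that year's words assigned to topic}}.
    """
    counts = defaultdict(Counter)
    for year, word_topics in documents:
        counts[year].update(word_topics)
    fractions = {}
    for year, topic_counts in counts.items():
        total = sum(topic_counts.values())
        fractions[year] = {z: n / total for z, n in topic_counts.items()}
    return fractions


# Hypothetical toy corpus: three short documents over two years.
toy_docs = [(1990, [0, 0, 1]), (1990, [1]), (1991, [0, 2, 2, 2])]
fractions = topic_fractions_by_year(toy_docs)
# Time series for topic 0, ready to plot against year.
topic0_series = [(year, fractions[year].get(0, 0.0)) for year in sorted(fractions)]
```

Normalizing by each year's total word count keeps years with few documents (such as 2001 and 2002 here) comparable in scale, though their estimates are noisier.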

56
(No Transcript)
57
Rise in Web, Mobile, JAVA
Web
58
Rise of Machine Learning
59
Bayes lives on.
60
Decline in Languages, OS,
61
Decline in CS Theory,
62
Trends in Database Research
63
Trends in NLP and IR
NLP
IR
64
Security Research Reborn
65
(Not so) Hot Topics
Neural Networks
GAs
Wavelets
66
Decline in use of Greek Letters ?
67
Graph-based Query Refinement and Query Languages
68
Heterogeneous Event-Entity Querying
  • Problem
  • Most existing graph/link mining approaches assume
    single node types (e.g. people, documents, etc.)
    and restricted link types (e.g. collaboration,
    html links, etc.)
  • Solution
  • Single framework that enables analysts to mine
    heterogeneous event-entity data

69
Supporting Exploratory Event-Entity Graph Analysis
Example tasks
  • Influence/dependence analysis
  • Prediction of links between entity type 1 and
    entity type 2, given their relation to entity 3.
  • Compute strength of relationship between a given
    pair of individuals or groups with varying edge
    and node types.
Our Approach
  • Given the overall schema and graph data
  • Subschema selection
  • Subgraph selection (data filtering)
  • Decoration of Data Graph Nodes and Edges
  • Structural Grouping and Aggregation
  • May also involve aggregation of decoration
    values.
  • Progressive/Interactive Refinement

70
The GrAQ System (built using JUNG library)
71
Status of Work
  • Achievements
  • query language for interactive graph analysis
  • Aggregation operators for graph data analysis.
  • Similarity predicates and ranking for analysis
    involving imprecise matching
  • Integration of concept hierarchies in graph data
    analysis
  • System development over a commercial ORDBMS
  • Future Work
  • Model and language extensions to support
    spatio-temporal graph analysis
  • Efficient support for graph analysis queries
  • Graph indexing strategies
  • Query processing and optimization
  • Integration of feedback based query refinement in
    graph analysis queries

72
Interactive Querying and Refinement
  • Relevance-based retrieval
  • Queries approximately capture the user's information
    need
  • Ranked retrieval based on relevance of object to
    query
  • Query Refinement
  • Customization based on the user's subjectivity,
    information need, and preferences
  • Existing Search Technologies
  • Database systems do not support relevance-based
    retrieval (only exact search)
  • IR systems support a (limited) aspect of
    similarity retrieval but are limited to textual
    data.

74
Similarity Queries in SQL are Complex!
75
Evaluation of Query Refinement
  • Tested on multiple real data sets
  • Average precision on 400 queries over 4
    refinements
  • The new methods outperform existing methods
  • substantially fewer iterations required
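
For readers unfamiliar with the metric, a common definition of average precision is sketched below, together with its mean over several queries. Whether the evaluation above used exactly this variant is not stated in the slides, so treat the formula, the function names, and the toy rankings as assumptions.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k at which a relevant item appears,
    divided by the total number of relevant items."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked list, set of relevant items), one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical example: one query before and after a refinement step.
relevant_docs = {"d1", "d2"}
ap_initial = average_precision(["d3", "d1", "d4", "d2"], relevant_docs)
ap_refined = average_precision(["d1", "d2", "d3", "d4"], relevant_docs)
```

Tracking this value after each refinement round gives a curve of the kind the slide summarizes: a method that needs fewer iterations reaches a high average precision in earlier rounds.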

76
Other work in progress...
  • Edge prediction in graphs
  • Given a graph with attributes on nodes and edges
  • Assume some edges are missing (or remove them)
  • Predict the probability of edge(i,j)
  • E.g., what is likelihood that A and B have
    interacted given everything else we know, or that
    they will interact within the next 6 months
  • Note: runtime querying, avoid O(N^2) complexity
  • Data cleaning
  • multiple names for a single entity
  • multiple entities mapped to the same name, e.g.,
    J_Wang
  • How many unique P_Smyths are there?
  • Use heterogeneous data sources and probabilistic
    models to iteratively produce consistent data
  • E.g., combine CiteSeer, Web information, topic
    models, institution, etc
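
To illustrate how a runtime edge-prediction query can avoid scoring all O(N^2) pairs, the sketch below scores only candidates within two hops of a query node. It swaps in the Adamic-Adar common-neighbor heuristic as a stand-in; it is not the probabilistic model the slides have in mind, it ignores node and edge attributes, and it returns a score rather than a probability. The function name and toy graph are hypothetical.

```python
import math
from collections import defaultdict


def two_hop_scores(adj, source):
    """Score non-neighbors of `source` that share at least one neighbor with it.

    adj maps each node to the set of its neighbors (undirected graph).
    Each shared neighbor z contributes 1 / log(degree(z)) (Adamic-Adar).
    """
    scores = defaultdict(float)
    for z in adj[source]:
        if len(adj[z]) < 2:
            continue  # z connects to nothing except source
        weight = 1.0 / math.log(len(adj[z]))
        for target in adj[z]:
            if target != source and target not in adj[source]:
                scores[target] += weight
    return dict(scores)


# Hypothetical toy graph.
toy_adj = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
scores_a = two_hop_scores(toy_adj, "a")
scores_e = two_hop_scores(toy_adj, "e")
```

The work per query is bounded by the size of the two-hop neighborhood of the query node, so it can be run at query time without touching the rest of the graph.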

77
Conclusions
78
Summary of Accomplishments
  • Infrastructure
  • Developed entity-event testbed data sets and IE
    tools
  • Released JUNG API for graph data analysis and
    visualization
  • Graph Data Analysis/Querying Research
  • Novel author-topic models
  • New class of relative importance algorithms
  • Efficient similarity query refinement system
  • New general framework for graph schemas
  • Software
  • Netsight
  • Topic-Author Browser
  • Interactive query refinement system
  • Prototype graph-based DB language system

79
What's ready for the KD-D TestBed?
  • Netsight
  • Built on JUNG API
  • Can handle any standard network data set
  • Supports both visualization and analysis
  • Relative importance algorithms
  • Relative betweenness algorithms
  • Graph layout and browsing
  • Graph filtering
  • Easily extendible
  • Integrated database support is planned in Year 2
  • Other software is also in principle available
  • Author-topic applications
  • e.g., find experts in South Florida in virus
    research
  • GrAQ tool for graph DB interface

80
Proposed Year 2 Work
  • Basic research: extend theory and algorithms to
  • Extend to temporal and spatial semantics
  • Handle missing/noisy network data
  • Multi-edge types (multiple edges on same
    entities)
  • Scalability: graphs with millions of edges
  • Interaction tools that support exploration and
    querying
  • Integration and Coupling of
  • Statistical topic models, querying, graph
    visualization, and databases
  • Software Tools and Applications for the
    KDD-testbed
  • Netsight as an analysis tool
  • Application of Author-topic type model (e.g.,
    expert finder)
  • Entity Monitoring application (monitor data
    sources over time with focused Web crawling)
  • Data Sets/Types (TBD)
  • KDD-provided testbed data sets
  • Digital libraries: more CiteSeer, possibly Patent
    DB, MEDLINE
  • Less structured text sources such as email
    streams

81
BACKUP SLIDES
91
Perplexities for true author and any random
author
[Figure: curves for A = true author and A = any author; axis: percentiles in distribution]
92
  • Accuracy of author prediction as a function of
    topics

of documents for which correct author was picked
93
Heterogeneous Event-Entity Graph Analysis and
Query Language
  • Analysis of link/graph data involves
  • Subschema selection
  • Selecting node and edge types of interest from
    the graph schema
  • Subgraph selection
  • Identifying relevant members of a group based on
    (possibly imprecise) matching of edge/node
    attributes or involvement in a given pattern of
    relationship.
  • Decoration
  • E.g. computation of pair-wise association
    measures between individual entities (conditioned
    on a context or third entity type)
  • Structural Grouping and Aggregation
  • Node/edge grouping
  • combination of decorations (or other attribute
    values) for groups of entities at various levels.
  • Progressive Refinement
  • carrying out the above operations in a
    progressive and interactive manner. In
    particular, user should be able to ask queries
    based on results of previous queries.

94
P(author and topic given a word)
  • $P(A_i, Z_i \mid W, Z_{\setminus i}, A_{\setminus i}) \propto \frac{(C^{WZ} + \beta)(C^{AZ} + \alpha)}{\sum_{W} C^{WZ} + V\beta}$

$C^{WZ}$ counts the number of times the same word, W,
(in the same or other documents) is assigned to
topic Z.
$C^{AZ}$ counts the number of times the same author,
A, (in the same or other documents) is assigned
to topic Z.
Keeping these counts speeds up the algorithm!
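
As a rough illustration of why the cached counts help, the sketch below evaluates the conditional for one word directly from count tables. It reads the lost symbols as smoothing constants `alpha` and `beta` and `V` as the vocabulary size; these readings, the function name, and the table layout are assumptions. It follows the expression as it appears on the slide, which normalizes over words only; published author-topic samplers also divide by a per-author total, which is left out here.

```python
def author_topic_conditional(word, authors, cwz, caz, word_totals,
                             alpha, beta, vocab_size):
    """Normalized P(author, topic) for one word, given all other assignments.

    cwz[w][z]: times word w is assigned to topic z (current token excluded)
    caz[a][z]: times author a is assigned to topic z (current token excluded)
    word_totals[z]: sum over all words of cwz[w][z]
    """
    scores = {}
    for a in authors:
        for z in range(len(word_totals)):
            scores[(a, z)] = ((cwz[word][z] + beta) * (caz[a][z] + alpha)
                              / (word_totals[z] + vocab_size * beta))
    norm = sum(scores.values())
    return {pair: s / norm for pair, s in scores.items()}


# Hypothetical toy counts: two authors, two topics, vocabulary of 5 words.
probs = author_topic_conditional(
    word="graph",
    authors=["x", "y"],
    cwz={"graph": [3, 0]},
    caz={"x": [2, 0], "y": [0, 2]},
    word_totals=[10, 10],
    alpha=1.0, beta=1.0, vocab_size=5,
)
```

Each evaluation is a handful of table lookups per (author, topic) pair, so no pass over the corpus is needed when a single word is resampled.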
95
Sampling over a query document
Preprocessing: assign to each word in the query
document an Author and a Topic
K iterations (typically K = 10)
  • For each word out of the N query words
  • Derive the probability P(A, Z) conditioned on
    the current assignments of query words and the
    database words
  • Assign a new author, A, and topic, Z, according
    to P(A,Z)

The probability for a topic is the averaged ratio
of words assigned to the topic per total words:
$P(Z) \approx \sum_{t=1}^{K} C_t^{Z} / (KN)$, where $C_t^{Z}$ is the number of words
assigned to topic Z in iteration t
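
The query-time procedure above can be sketched end to end as follows. This is a minimal sketch under stated assumptions: the function name, the default values of `alpha` and `beta`, the random initial assignment, and the toy count tables are hypothetical, and the sampling weights use the same count-based expression as the previous slide. The database counts are copied so that the query does not modify the stored model.

```python
import random


def estimate_query_topics(query_words, authors, cwz, caz, num_topics,
                          vocab_size, alpha=0.1, beta=0.01, k_iters=10, seed=0):
    """Estimate P(Z) for a query document by K sweeps of Gibbs sampling."""
    rng = random.Random(seed)
    # Work on copies so the database counts stay untouched.
    cwz = {w: list(c) for w, c in cwz.items()}
    caz = {a: list(c) for a, c in caz.items()}
    for w in query_words:
        cwz.setdefault(w, [0] * num_topics)
    for a in authors:
        caz.setdefault(a, [0] * num_topics)
    totals = [sum(c[z] for c in cwz.values()) for z in range(num_topics)]
    pairs = [(a, z) for a in authors for z in range(num_topics)]

    def add(w, a, z, delta):
        cwz[w][z] += delta
        caz[a][z] += delta
        totals[z] += delta

    # Preprocessing: give every query word a random author and topic.
    assign = []
    for w in query_words:
        a, z = rng.choice(pairs)
        add(w, a, z, 1)
        assign.append((a, z))

    topic_counts = [0] * num_topics
    for _ in range(k_iters):
        for i, w in enumerate(query_words):
            a, z = assign[i]
            add(w, a, z, -1)  # exclude the current token from the counts
            weights = [(cwz[w][t] + beta) * (caz[b][t] + alpha)
                       / (totals[t] + vocab_size * beta) for b, t in pairs]
            a, z = rng.choices(pairs, weights=weights)[0]
            add(w, a, z, 1)
            assign[i] = (a, z)
        # Record how many words sit in each topic after this sweep.
        for _, z in assign:
            topic_counts[z] += 1
    n = len(query_words)
    return [c / (k_iters * n) for c in topic_counts]


# Hypothetical toy model: "graph" belongs to topic 0, "query" to topic 1.
p_topics = estimate_query_topics(
    query_words=["graph", "graph", "graph"], authors=["x"],
    cwz={"graph": [50, 0], "query": [0, 50]}, caz={"x": [50, 0]},
    num_topics=2, vocab_size=2,
)
```

Only the N query words are resampled in each sweep, so the cost is proportional to K times N times the number of (author, topic) pairs, independent of the size of the database.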