Title: EntityBased Data Mining from SpatioTemporal Events and Text Sources Presentation at KDD Program Revi
1 Entity-Based Data Mining from Spatio-Temporal Events and Text Sources
Presentation at KD-D Program Review, Nov 18-19 2003
- Padhraic Smyth, Sharad Mehrotra
- Information and Computer Science
- University of California, Irvine
- smyth, sharad_at_ics.uci.edu
- www.datalab.uci.edu
2Project Participants
- Principal Investigators
- Padhraic Smyth: Data mining
- Sharad Mehrotra: Databases
- Collaborators
- Mark Steyvers: Text and Author Modeling
- Postdoctoral Researchers
- Michal Rosen-Zvi, Dmitri Kalashnikov
- Staff Programmer
- Amnon Meyers: Information Extraction
- Students
- PhD: Joshua O'Madadhain, Scott White, Yiming Ma, Dawit Seid
- Undergraduates: Yan-Biao Boey, Momo Alhazzazi
- Acknowledgements
- Steve Lawrence for CiteSeer data
3Problem of Interest
- Intelligence Analysis today
- Massive volumes/streams of data
- Text (newswire, reports, etc)
- Web data
- Transactions/events
- Central problems
- Need flexible tools to support an analyst's exploration of the data
- Automatically focus an analyst's attention on interesting parts of the data space
- Need new theories/methods/tools
4Entities and Events
- Entities: individuals, groups, communities, organizations, etc.
- Events: contacts, collaborations, meetings, products, etc.
- Working hypothesis
- A large component of intelligence work is centered on entities and events
- Extracting entity information from text streams and transaction data
- Predicting entity behavior
- Detecting groups of related entities
- Our broad goal
- Develop next-generation data management, exploration, and analysis tools for entity-event data
5Nodes = Entities (Biotech-Related Organizations), Edges = Events (Collaborations)
6Red indicates nodes selected by the data analyst
as important
7Algorithm determines blue nodes are important
relative to red nodes (Oxford and Cambridge)
8Research Issues
- Information extraction
- Data management tools
- Visualization techniques
- Interactive ad hoc querying and mining
- Statistical modeling of graph data
- Query languages for graphs
- Scalability to large graphs
9Focus of Our Research
Text Sources
Information Extraction
Entity-Event Databases
Statistical Modeling and Data Mining
Visualization
Query Languages
User Modeling
10Major Themes in Our Work
- Focus on data in the form of graphs
- Nodes = entities, edges = events
- Nodes and edges have attributes (e.g., temporal)
- Year 1 entities: computer science researchers
- Year 1: limited spatio-temporal aspects
- Integration and coupling of
- Statistical modeling and data mining
- Visualization
- Query languages and data management
- Scalability
- Methods should scale to millions of nodes and edges
- User Interaction
- Conditional query-driven analysis and mining
- Contrast with offline global modeling
11 Accomplishments
- Infrastructure and Data Sets
- Created testbed data sets, e.g., 100k entities, 400k events
- Developed suite of text information extraction tools
- Developed and released a general public-domain Java API for graph data analysis and visualization
- Statistical Modeling and Data Mining
- Developed new statistical technique for modeling entities based on authored text
- Developed new class of scalable algorithms for interactive graph-based data mining
12 Accomplishments
- Graph-based Querying
- Developed framework for general graph-based query language
- New accurate and efficient algorithms for interactive similarity queries and query refinement on graphs
- Software Tools
- Netsight: Java-based graph visualization and analysis tool
- Browser tool for exploring author-topic models
- Interactive query refinement system
- Prototype system for graph-based query language for interacting with heterogeneous graph data
13Publications in Year 1
- Data Mining on Graphs
- S. White and P. Smyth, Algorithms for Discovering Relative Importance in Graphs, Proceedings of the Ninth International ACM SIGKDD Conference, August 2003. Extended version submitted to JICRD, June 2003.
- J. O'Madadhain, D. Fisher, S. White, and Y. Boey, The JUNG (Java Universal Network/Graph) Framework, UCI-ICS Tech Report 03-17, October 2003; invited presentation, Stanford Workshop on Statistical Inference, Computing and Visualization for Graphs, August 2003.
- Modeling the Internet and the Web: Probabilistic Methods and Algorithms, P. Baldi, P. Frasconi, and P. Smyth, Wiley, June 2003.
- Statistical Author-Topic Models
- T. Griffiths and M. Steyvers (in press), Finding Scientific Topics, Proceedings of the National Academy of Sciences.
- M. Steyvers, M. Rosen-Zvi, T. Griffiths, and P. Smyth, Author Attribution with LDA, NIPS Workshop on Syntax, Semantics, and Statistics, December 2003.
- Data Management and Graph Querying
- Y. Ma, S. Mehrotra, and D. Seid, A Framework for Refining Similarity Queries Using Learning Techniques, UCI-ICS Tech Report 03-19, Nov. 2003. Extended version submitted to EDBT 2004.
- Y. Ma, D. Seid, and S. Mehrotra, Interactive Filtering of Data Streams by Refining Similarity Queries, UCI-ICS Tech Report 03-07, June 2003.
14Data Sets
15Information Extraction
16Author Database Schema
Note: individual-centric, not document-centric
18 9/11 Network
19From graphs to Markov chains
[Figure, built up over slides 19-21: a weighted graph on nodes A, B, C, D shown next to the corresponding Markov chain with transition probabilities on its edges]
- Importance: recursive function of nodes pointing at you
- Markov approach
- Notion of a token circulating around in Markov fashion
- Important actors see the token more often
- Importance = stationary probability of each node
- PageRank: surfer randomly following links on the Web
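The Markov view above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-node graph and its weights are hypothetical (the slide's own graph is not recoverable from the transcript), and `stationary_distribution` is a name chosen here, not part of any released tool. Edge weights are row-normalized into transition probabilities, and the token's long-run visit frequencies are found by repeated propagation.

```python
def stationary_distribution(weights, iterations=200):
    """Power iteration for the stationary distribution of a random walk.

    weights[u][v] is the non-negative weight of edge u -> v.
    Every node is assumed to have at least one outgoing edge.
    """
    nodes = sorted(weights)
    # Row-normalize edge weights into transition probabilities.
    trans = {}
    for u in nodes:
        total = sum(weights[u].values())
        trans[u] = {v: w / total for v, w in weights[u].items()}
    # The token starts at a uniformly random node.
    p = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(iterations):
        nxt = {u: 0.0 for u in nodes}
        for u in nodes:
            for v, t in trans[u].items():
                nxt[v] += p[u] * t
        p = nxt
    return p


# Hypothetical graph, not the one drawn on the slide.
graph = {
    "A": {"B": 2.0, "C": 3.0},
    "B": {"A": 1.0, "D": 1.0},
    "C": {"A": 4.0},
    "D": {"A": 2.0, "C": 2.0},
}
importance = stationary_distribution(graph)
```

In this toy graph node A receives mass from every other node, so it ends up with the largest stationary probability, which matches the intuition that important actors see the token more often.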
24Relative importance of node V to A: trade off distance from A against structural importance of V
25Add backlinks to A with probability b (e.g., 0.3)
26Algorithms for Relative Importance (S. White and P. Smyth, ACM KDD 2003; also JICRD, submitted)
- PageRank with Priors (PRankP)
- Random walks that start from A and return to A periodically
- Relative importance = stationary probability
- Iterative algorithm (e.g., Haveliwala, 2002)
- HITS with priors
- Formulate HITS as Markov chain, same idea
- K-Step Markov
- Use the transient probability distribution starting from A
- Faster than stationary probability methods
- Weighted Paths
- Heuristic approximation to K-step Markov, even faster
- All algorithms scale linearly in number of edges
- Different constant factors
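Two of the ideas listed above can be sketched as follows. This is a minimal reading of the slide, not the published algorithms: the graph is hypothetical, the function names are chosen here, and the K-step score is taken to be the accumulated occupancy over the first k steps from the root, which is an assumption about how the transient distribution is used. The back-probability `b` follows the "e.g., 0.3" value mentioned on the earlier slide.

```python
def normalize(weights):
    """Turn edge weights into per-node transition probabilities."""
    return {u: {v: w / sum(out.values()) for v, w in out.items()}
            for u, out in weights.items()}


def step(trans, p):
    """Move the probability mass p one step along the chain."""
    nxt = {u: 0.0 for u in trans}
    for u, out in trans.items():
        for v, t in out.items():
            nxt[v] += p[u] * t
    return nxt


def pagerank_with_priors(weights, root, b=0.3, iterations=100):
    """Random walk that jumps back to `root` with probability b at each step."""
    trans = normalize(weights)
    p = {u: (1.0 if u == root else 0.0) for u in trans}
    for _ in range(iterations):
        walked = step(trans, p)
        p = {u: (1.0 - b) * walked[u] for u in trans}
        p[root] += b  # total mass is 1, so a share b returns to the root
    return p


def k_step_markov(weights, root, k=3):
    """Accumulate the transient distribution over the first k steps from root."""
    trans = normalize(weights)
    p = {u: (1.0 if u == root else 0.0) for u in trans}
    score = {u: 0.0 for u in trans}
    for _ in range(k):
        p = step(trans, p)
        for u in score:
            score[u] += p[u]
    return score


# Hypothetical graph; every node has at least one outgoing edge.
graph = {
    "A": {"B": 2.0, "C": 3.0},
    "B": {"A": 1.0, "D": 1.0},
    "C": {"A": 4.0},
    "D": {"A": 2.0, "C": 2.0},
}
relative_to_b = pagerank_with_priors(graph, "B")
near_a = k_step_markov(graph, "A", k=3)
```

Each pass touches every edge once, which is consistent with the linear-in-edges scaling noted above; the K-step variant simply stops after k passes instead of iterating to convergence.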
27Experiments on Real-World Data
- Terrorist Network
- 63 nodes (terrorists)
- 308 edges (known interactions)
- Biotech Collaboration Network
- 2700 nodes (biotech companies collaborators)
- 8690 edges (known collaborations)
- CiteSeer Co-authorship Network
- 85k nodes (authors)
- 168k edges (collaborations)
28Computation Times for Ranking Algorithms (in
seconds)
PRankP and HITS converged in 20-30 iterations
32Weighted versus Unweighted Graphs
33Visualization and Analysis Software
34JUNG: Java Universal Network/Graph API
- API for modeling, analyzing, and visualizing graphs
- extendible object-oriented framework
- makes use of existing Java APIs
- provides a common language for handling graphs
- open-source (encourages collaboration, reduces duplicated effort)
- well-suited for building network data mining tools/applications
- Features and contributions
- Annotation of nodes and edges, filtering of graphs
- support for multiple network types (directed, bipartite, affiliation)
- visualization API for creating custom layouts and renderers
- Multiple algorithms for clustering, connectivity, distances, flows, and importance ranking
- Netsight graph analysis and visualization tool
- Developed using the JUNG framework
35JUNG: Java Universal Network/Graph Framework
- http://jung.sourceforge.net
- 16,000 page visits, 800 downloads since August
36Demo of Netsight software
37Entity Models from Text Data
38Authors
Words
Can we model authors, given documents? (more
generally, build statistical profiles of
entities given sparse observed data)
39Authors
Hidden Topics
Words
Model: Author-Topic distributions, Topic-Word distributions
Parameters learned via Bayesian learning
46Hidden Topics
Words
Topic Model
- document can be generated from multiple topics
- Hofmann (SIGIR 99), Blei, Jordan, Ng (JMLR, 2003)
47Authors
Hidden Topics
Words
Model: Author-Topic distributions, Topic-Word distributions
NOTE: documents can be composed of multiple topics
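One way to read the Authors / Hidden Topics / Words picture is as a generative recipe for each word in a document. The sketch below is illustrative only: the author names, topics, and probabilities are invented, and choosing an author uniformly among the document's co-authors for each word is an assumption that the slide does not state.

```python
import random


def generate_document(authors, author_topic, topic_word, n_words, rng):
    """Generate (author, topic, word) triples for one document."""
    tokens = []
    for _ in range(n_words):
        # Assumption: each word is attributed to a co-author chosen uniformly.
        author = rng.choice(authors)
        # Draw a hidden topic from that author's topic distribution.
        topics, topic_probs = zip(*author_topic[author].items())
        topic = rng.choices(topics, weights=topic_probs)[0]
        # Draw the observed word from that topic's word distribution.
        vocab, word_probs = zip(*topic_word[topic].items())
        word = rng.choices(vocab, weights=word_probs)[0]
        tokens.append((author, topic, word))
    return tokens


# Hypothetical toy parameters, for illustration only.
author_topic = {
    "author_1": {"databases": 0.8, "learning": 0.2},
    "author_2": {"databases": 0.1, "learning": 0.9},
}
topic_word = {
    "databases": {"query": 0.5, "index": 0.3, "join": 0.2},
    "learning": {"training": 0.6, "classification": 0.4},
}
tokens = generate_document(["author_1", "author_2"], author_topic,
                           topic_word, n_words=20, rng=random.Random(0))
```

Because each word draws its own topic, a single document mixes several topics, which is the point of the NOTE above. Learning runs this process in reverse, inferring the two sets of distributions from observed words and author lists.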
48Author Modeling Data Sets
49Topic Models from CiteSeer
- WORDS: probabilistic, Bayesian, carlo, monte, distribution, inference, conditional, prior, mixture, Markov, posterior, belief
- AUTHORS: N_Friedman, D_Heckerman, Z_Ghahramani, D_Koller, M_Jordan, R_Neal, A_Raftery, T_Lukasiewicz, J_Halpern
- WORDS: retrieval, text, document, information, content, indexing, relevance, collection, query, IR, feedback
- AUTHORS: D. Oard, W_Croft, K_Jones, P_Schauble, E_Voorhees, A_Singhal, D_Hawking, J_Allan, A_Smeaton, M_Hearst, ...
50Topic Models from CiteSeer
- WORDS: Web, user, world, wide, pages, www, site, internet, hypertext, hypermedia, content, links, page, navigation, ...
- AUTHORS: S. Lawrence, B. Mobasher, M. Levene, D. Florescu, O. Etzioni, R_Studer, W. Hall, R. Fielding, J. Pitkow, M. Crovella, ...
- WORDS: data, mining, attributes, discovery, association, large, knowledge, databases, dataset, interesting, frequent, discover, sets
- AUTHORS: J. Han, R. Rastogi, M. Zaki, R. Ng, B. Liu, H. Mannila, S. Brin, H Liu, L. Holder, H. Toivonen
51Author-Topic Models from CiteSeer
- Author: A McCallum
- Topic 1: classification, training, generalization, decision, data, ...
- Topic 2: learning, machine, examples, reinforcement, inductive, ...
- Topic 3: retrieval, text, document, information, content, ...
- Author: H Garcia-Molina
- Topic 1: query, index, data, join, processing, aggregate, ...
- Topic 2: transaction, concurrency, copy, permission, distributed, ...
- Topic 3: source, separation, paper, heterogeneous, merging, ...
- Author: P Cohen
- Topic 1: agent, multi, coordination, autonomous, intelligent, ...
- Topic 2: planning, action, goal, world, execution, situation
- Topic 3: human, interaction, people, cognitive, social, natural, ...
52Author-Topic Browser
- Interesting scalability issues
- CiteSeer model exceeds 1 Gbyte
- Real-time query answering demands Gibbs sampling (not well suited to SQL!)
- Solution
- Coupling of Gibbs sampling and relational DB (it works!)
[Architecture diagram components: Java Query GUI, SQL Interface, Bayesian Sampling, MySQL DB, Original Text, Statistical Model]
53Demo of Author-Topic Browser
- Note
- Real-time querying on CiteSeer authors/documents
- 85,000 authors
- 163,000 documents
- 30,000 unique words
- 300 topics
- Can query on
- Authors, topics, words, documents
- Topic distribution given documents/words requires sampling to estimate
- Gibbs sampling is fast enough to answer queries in real time
54Applications of Author-Topic Models
- Expert Finder
- Find researchers who are knowledgeable in cryptography and machine learning within 100 miles of Washington DC
- Find reviewers for this set of NSF proposals who are active in relevant topics and have no conflicts of interest
- Prediction
- Given a document and some subset of known authors for the paper (k = 0, 1, 2), predict the other authors
- Predict how many papers in different topics will appear next year
- Change Detection/Monitoring
- Which authors are on the leading edge of new topics?
- Characterize the topic trajectory of this author over time
55Data and Topic Models
- Topic-author model with 300 topics, built from 162,489 CiteSeer abstracts
- Each word in each document assigned to a topic
- For the subset of 131,602 documents for which we know the year
- Group documents by year
- Calculate the fraction of words each year assigned to a topic
- Plot the resulting time series, 1990 to 2002
- Caveats
- Data set is incomplete (see next slide)
- Relatively few documents from 2001 and 2002
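The trend computation described above reduces to counting. A minimal sketch, assuming the per-word topic assignments are already available as (year, topic) pairs; the function name and the toy data are hypothetical:

```python
from collections import Counter, defaultdict


def topic_fractions_by_year(assignments):
    """Fraction of word tokens per year assigned to each topic.

    assignments: iterable of (year, topic) pairs, one per word token.
    Returns {year: {topic: fraction}}.
    """
    per_year = defaultdict(Counter)
    for year, topic in assignments:
        per_year[year][topic] += 1
    series = {}
    for year, counts in per_year.items():
        total = sum(counts.values())
        series[year] = {topic: n / total for topic, n in counts.items()}
    return series


# Toy token-level assignments, for illustration only.
toy = [(1999, "web"), (1999, "web"), (1999, "theory"), (2000, "web")]
series = topic_fractions_by_year(toy)
```

Normalizing within each year keeps years with few documents comparable to the others, although their fractions are noisier, which is why the caveat about 2001 and 2002 matters when reading the plots.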
56(No Transcript)
57Rise in Web, Mobile, JAVA
Web
58Rise of Machine Learning
59Bayes lives on.
60Decline in Languages, OS,
61Decline in CS Theory,
62Trends in Database Research
63Trends in NLP and IR
NLP
IR
64Security Research Reborn
65(Not so) Hot Topics
Neural Networks
GAs
Wavelets
66Decline in use of Greek Letters ?
67Graph-based Query Refinement and Query Languages
68Heterogeneous Event-Entity Querying
- Problem
- Most existing graph/link mining approaches assume single node types (e.g. people, documents, etc.) and restricted link types (e.g. collaboration, html links, etc.)
- Solution
- Single framework that enables analysts to mine heterogeneous event-entity data
69Supporting Exploratory Event-Entity Graph Analysis
Example tasks
- Influence/dependence analysis
- Prediction of links between entity type 1 and entity type 2, given their relation to entity 3
- Compute strength of relationship between a given pair of individuals or groups with varying edge and node types
Our Approach
- Given the overall schema and graph data
- Subschema selection
- Subgraph selection (data filtering)
- Decoration of Data Graph Nodes and Edges
- Structural Grouping and Aggregation
- May also involve aggregation of decoration values
- Progressive/Interactive Refinement
70The GrAQ System (built using JUNG library)
71Status of Work
- Achievements
- Query language for interactive graph analysis
- Aggregation operators for graph data analysis
- Similarity predicates and ranking for analysis involving imprecise matching
- Integration of concept hierarchies in graph data analysis
- System development over a commercial ORDBMS
- Future Work
- Model and language extensions to support spatio-temporal graph analysis
- Efficient support for graph analysis queries
- Graph indexing strategies
- Query processing and optimization
- Integration of feedback-based query refinement in graph analysis queries
72Interactive Querying and Refinement
- Relevance-based retrieval
- Queries approximately capture the user's information need
- Ranked retrieval based on relevance of object to query
- Query Refinement
- Customization based on the user's subjectivity, information need, and preferences
- Existing Search Technologies
- Database systems do not support relevance-based retrieval (only exact search)
- IR systems support a (limited) form of similarity retrieval but are limited to textual data
74Similarity Queries in SQL are Complex!
75Evaluation of Query Refinement
- Tested on multiple real data sets
- Average precision on 400 queries over 4 refinements
- The new methods outperform existing methods
- Substantially fewer iterations required
76Other work in progress
- Edge prediction in graphs
- Given a graph with attributes on nodes and edges
- Assume some edges are missing (or remove them)
- Predict the probability of edge (i, j)
- E.g., what is the likelihood that A and B have interacted given everything else we know, or that they will interact within the next 6 months?
- Note: runtime querying, avoid O(N^2) complexity
- Data cleaning
- Multiple names for a single entity
- Multiple entities mapped to the same name, e.g., J_Wang
- How many unique P_Smyths are there?
- Use heterogeneous data sources and probabilistic models to iteratively produce consistent data
- E.g., combine CiteSeer, Web information, topic models, institution, etc.
77Conclusions
78Summary of Accomplishments
- Infrastructure
- Developed entity-event testbed data sets and IE tools
- Released JUNG API for graph data analysis and visualization
- Graph Data Analysis/Querying Research
- Novel author-topic models
- New class of relative importance algorithms
- Efficient similarity query refinement system
- New general framework for graph schemas
- Software
- Netsight
- Topic-Author Browser
- Interactive query refinement system
- Prototype graph-based DB language system
79What's ready for the KD-D TestBed?
- Netsight
- Built on JUNG API
- Can handle any standard network data set
- Supports both visualization and analysis
- Relative importance algorithms
- Relative betweenness algorithms
- Graph layout and browsing
- Graph filtering
- Easily extendible
- Integrated database support is planned in Year 2
- Other software is also in principle available
- Author-topic applications
- e.g., find experts in South Florida in virus research
- GrAQ tool for graph DB interface
80Proposed Year 2 Work
- Basic research: extend theory and algorithms to
- Extend to temporal and spatial semantics
- Handle missing/noisy network data
- Multi-edge types (multiple edges on same entities)
- Scalability: graphs with millions of edges
- Interaction tools that support exploration and querying
- Integration and Coupling of
- Statistical topic models, querying, graph visualization, and databases
- Software Tools and Applications for the KDD-testbed
- Netsight as an analysis tool
- Application of Author-topic type model (e.g., expert finder)
- Entity Monitoring application (monitor data sources over time with focused Web crawling)
- Data Sets/Types (TBD)
- KDD-provided testbed data sets
- Digital libraries: more CiteSeer, possibly Patent DB, MEDLINE
- Less structured text sources such as email streams
81BACKUP SLIDES
91Perplexities for true author and any random author
[Figure: perplexity for A = true author versus A = any author, shown as percentiles in the distribution]
92Accuracy of author prediction as a function of number of topics
[Figure: y-axis is the percentage of documents for which the correct author was picked]
93Heterogeneous Event-Entity Graph Analysis and
Query Language
- Analysis of link/graph data involves
- Subschema selection
- Selecting node and edge types of interest from the graph schema
- Subgraph selection
- Identifying relevant members of a group based on (possibly imprecise) matching of edge/node attributes or involvement in a given pattern of relationship
- Decoration
- E.g. computation of pair-wise association measures between individual entities (conditioned on a context or third entity type)
- Structural Grouping and Aggregation
- Node/edge grouping
- Combination of decorations (or other attribute values) for groups of entities at various levels
- Progressive Refinement
- Carrying out the above operations in a progressive and interactive manner. In particular, the user should be able to ask queries based on results of previous queries.
94P(author and topic given a word)
- $P(A_i, Z_i \mid W, Z \setminus Z_i, A \setminus A_i) \propto \dfrac{(C_{WZ} + \beta)\,(C_{AZ} + \alpha)}{\sum_{W} C_{WZ} + V\beta}$
$C_{WZ}$ counts the number of times the same word, W, (in the same or other documents) is assigned to topic Z
$C_{AZ}$ counts the number of times the same author, A, (in the same or other documents) is assigned to topic Z
Keeping these counts speeds up the algorithm!
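The role of the cached counts can be made concrete with a small sketch. It evaluates the proportionality above for every (author, topic) pair using only dictionary lookups. The smoothing constants are called `alpha` and `beta` here and V is taken to be the vocabulary size; those names, the function names, and the toy counts are assumptions for illustration. The counts are expected to exclude the word currently being resampled.

```python
def author_topic_weights(word, authors, topics, c_wz, c_az, z_totals,
                         vocab_size, alpha, beta):
    """Unnormalized P(author, topic | word, all other assignments).

    c_wz[(word, topic)]   : times `word` is assigned to `topic`
    c_az[(author, topic)] : times `author` is assigned to `topic`
    z_totals[topic]       : sum of c_wz over all words for `topic`
    """
    weights = {}
    for a in authors:
        for z in topics:
            word_part = c_wz.get((word, z), 0) + beta
            author_part = c_az.get((a, z), 0) + alpha
            weights[(a, z)] = (word_part * author_part
                               / (z_totals.get(z, 0) + vocab_size * beta))
    return weights


def to_probabilities(weights):
    """Normalize the weights so they sum to one."""
    total = sum(weights.values())
    return {key: w / total for key, w in weights.items()}


# Hypothetical counts, for illustration only.
c_wz = {("query", "db"): 4}
c_az = {("a1", "db"): 2}
z_totals = {"db": 10, "ml": 5}
w = author_topic_weights("query", ["a1"], ["db", "ml"], c_wz, c_az,
                         z_totals, vocab_size=3, alpha=0.5, beta=0.1)
probs = to_probabilities(w)
```

Each evaluation costs one pass over the (author, topic) pairs with constant-time lookups, so resampling a word never requires rescanning the corpus.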
95Sampling over a query document
Preprocessing: assign to each word in the query document an Author and a Topic
K iterations (typically K = 10)
- For each word out of the N query words
- Derive the probability P(A, Z) conditioned on the current assignments of query words and the database words
- Assign a new author, A, and topic, Z, according to P(A, Z)
The probability for a topic is the averaged ratio of words assigned to the topic per total words:
$P(Z) = \sum_{t=1}^{K} C^{t}_{Z} / (KN)$, where $C^{t}_{Z}$ is the number of words assigned in iteration t to topic Z
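The query-time procedure above can be sketched as a short sampler. This is an illustration under assumptions, not the browser's implementation: the database counts stay fixed, only the query words are resampled, the preprocessing step uses random initial assignments, and the smoothing constants `alpha` and `beta`, the function name, and the argument layout are all chosen here.

```python
import random
from collections import Counter


def sample_query_topics(words, authors, topics, db_wz, db_az, vocab_size,
                        alpha=0.1, beta=0.01, k_iters=10, seed=0):
    """Estimate P(topic) for a query document over K sampling sweeps."""
    rng = random.Random(seed)
    pairs = [(a, z) for a in authors for z in topics]
    # Per-topic totals of the fixed database word counts.
    db_z = Counter()
    for (_, z), count in db_wz.items():
        db_z[z] += count
    # Query-local counts, layered on top of the database counts.
    q_wz, q_az, q_z = Counter(), Counter(), Counter()

    def bump(delta, w, a, z):
        q_wz[w, z] += delta
        q_az[a, z] += delta
        q_z[z] += delta

    # Preprocessing: give every query word an initial author and topic.
    assign = [rng.choice(pairs) for _ in words]
    for w, (a, z) in zip(words, assign):
        bump(+1, w, a, z)

    topic_hits = Counter()
    for _ in range(k_iters):
        for i, w in enumerate(words):
            bump(-1, w, *assign[i])  # leave out the word being resampled
            weights = [
                (db_wz.get((w, z), 0) + q_wz[w, z] + beta)
                * (db_az.get((a, z), 0) + q_az[a, z] + alpha)
                / (db_z[z] + q_z[z] + vocab_size * beta)
                for a, z in pairs
            ]
            assign[i] = rng.choices(pairs, weights=weights)[0]
            bump(+1, w, *assign[i])
        # Record this sweep's topic assignments.
        for _, z in assign:
            topic_hits[z] += 1
    n = len(words)
    return {z: topic_hits[z] / (k_iters * n) for z in topics}
```

A call such as `sample_query_topics(["query", "index"], ["a1", "a2"], ["db", "ml"], db_wz, db_az, vocab_size=3)` returns one averaged fraction per topic. With only N query words and about ten sweeps the work is small, which is consistent with the claim that sampling can answer queries in real time.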