Tools for Large Graph Mining - PowerPoint PPT Presentation

About This Presentation
Title:

Tools for Large Graph Mining

Description:

Tools for Large Graph Mining - Deepayan Chakrabarti. Thesis Committee: Christos Faloutsos, Chris Olston, Guy Blelloch, Jon Kleinberg (Cornell).

Date added: 9 October 2019
Slides: 152
Provided by: Deepayan
Learn more at: http://www.cs.cmu.edu
Category:
Tags: graph | mine | mining | tools

Transcript and Presenter's Notes

Title: Tools for Large Graph Mining


1
Tools for Large Graph Mining
- Deepayan Chakrabarti
  • Thesis Committee
  • Christos Faloutsos
  • Chris Olston
  • Guy Blelloch
  • Jon Kleinberg (Cornell)

2
Introduction
Protein Interactions genomebiology.com
Internet Map lumeta.com
Food Web Martinez 91
→ Graphs are ubiquitous
Friendship Network Moody 01
3
Introduction
  • What can we do with graphs?
  • How quickly will a disease spread on this graph?

Needle exchange networks of drug users [Weeks et al. 2002]
4
Introduction
Key terrorist
  • What can we do with graphs?
  • How quickly will a disease spread on this graph?
  • Who are the strange bedfellows?
  • Who are the key people?

Hijacker network Krebs 01
→ Graph analysis can have great impact
5
Graph Mining: Two Paths
  • General issues
  • Realistic graph generation
  • Graph patterns and laws
  • Graph evolution over time?
  • Specific applications
  • Node grouping
  • Viral propagation
  • Frequent pattern mining
  • Fast message routing

6
Our Work
  • General issues
  • Realistic graph generation
  • Graph patterns and laws
  • Graph evolution over time?
  • Specific applications
  • Node grouping
  • Viral propagation
  • Frequent pattern mining
  • Fast message routing

7
Our Work
  • Node Grouping
  • Find natural partitions and outliers
    automatically.
  • Viral Propagation
  • Will a virus spread and become an epidemic?
  • Graph Generation
  • How can we mimic a given real-world graph?
  • General issues
  • Realistic graph generation
  • Graph patterns and laws
  • Graph evolution over time?
  • Specific applications
  • Node grouping
  • Viral propagation
  • Frequent pattern mining
  • Fast message routing

8
Roadmap
Focus of this talk
  • Specific applications
  • Node grouping
  • Viral propagation
  • General issues
  • Realistic graph generation
  • Graph patterns and laws

1
2
Find natural partitions and outliers
automatically
4
Conclusions
9
Node Grouping [KDD 04]
Simultaneously group customers and products, or documents and words, or users and preferences
10
Node Grouping KDD 04
Both are fine
Customer Groups
Customer Groups
Product Groups
Product Groups
  • Row and column groups
  • need not be along a diagonal, and
  • need not be equal in number

11
Motivation
  • Visualization
  • Summarization
  • Detection of outlier nodes and edges
  • Compression, and others

12
Node Grouping
  • Desiderata
  • Simultaneously discover row and column groups
  • Fully Automatic: no magic numbers
  • Scalable to large matrices
  • Online: new data should not require full
    recomputations

13
Closely Related Work
  • Information Theoretic Co-clustering
    [Dhillon/2003]
  • Number of row and column groups must be specified
  • Desiderata
  • Simultaneously discover row and column groups
  • Fully Automatic: no magic numbers
  • Scalable to large graphs
  • Online

14
Other Related Work
  • K-means and variants [Pelleg/2000, Hamerly/2003]: do not cluster rows and cols simultaneously
  • Frequent itemsets [Agrawal/1994]: user must specify support
  • Information Retrieval [Deerwester/1990, Hofmann/1999]: choosing the number of concepts
  • Graph Partitioning [Karypis/1998]: number of partitions; measure of imbalance between clusters
15
What makes a cross-association good?
Why is this better?
  1. Similar nodes are grouped together
  2. As few groups as necessary

A few, homogeneous blocks
Good Clustering
Good Compression
implies
16
Main Idea
Good Compression implies Good Clustering
Binary matrix: block i has density $p_i^1$ of dots
Code Cost: $\sum_i \text{size}_i \cdot H(p_i^1)$
Description Cost: $\sum_i$ (cost of describing $n_i^1$, $n_i^0$ and groups)
17
Examples
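Total Encoding Cost = Code Cost + Description Cost
Code Cost: $\sum_i \text{size}_i \cdot H(p_i^1)$
Description Cost: $\sum_i$ (cost of describing $n_i^1$, $n_i^0$ and groups)
[Figure: example groupings, each annotated with high/low code and description costs; one is labeled "m row groups, n column groups"]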
18
What makes a cross-association good?
Why is this better?
versus
Row groups
Row groups
Column groups
Column groups
low
low
19
Formal problem statement
Given a binary matrix, re-organize the rows and
columns into groups, and choose the number of row
and column groups, to minimize the total encoding
cost.
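To make the objective concrete, here is a minimal sketch of the code-cost part of the encoding cost (block size times the binary entropy of the block density, summed over blocks). The function names `binary_entropy` and `code_cost` and the 4x4 example matrix are illustrative assumptions, not taken from the slides, and the description cost is left out.

```python
import numpy as np

def binary_entropy(p):
    """Shannon entropy in bits of a Bernoulli(p) source; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

def code_cost(matrix, row_groups, col_groups):
    """Sum over all blocks of (block size) * H(density of ones in the block)."""
    matrix = np.asarray(matrix)
    row_groups = np.asarray(row_groups)
    col_groups = np.asarray(col_groups)
    total = 0.0
    for r in np.unique(row_groups):
        for c in np.unique(col_groups):
            block = matrix[np.ix_(row_groups == r, col_groups == c)]
            if block.size:
                total += block.size * binary_entropy(block.mean())
    return total

# Hypothetical example: two perfectly homogeneous diagonal blocks.
example = np.array([[1, 1, 0, 0],
                    [1, 1, 0, 0],
                    [0, 0, 1, 1],
                    [0, 0, 1, 1]])
grouped = code_cost(example, [0, 0, 1, 1], [0, 0, 1, 1])
ungrouped = code_cost(example, [0, 0, 0, 0], [0, 0, 0, 0])
```

Under these assumptions the homogeneous grouping costs fewer bits than a single all-in-one block, which is the intuition behind "good compression implies good clustering".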
20
Formal problem statement
Note: No Parameters
Given a binary matrix, re-organize the rows and
columns into groups, and choose the number of row
and column groups, to minimize the total encoding
cost.
21
Algorithms
l = 5 col groups
k = 5 row groups
22
Algorithms
Find good groups for fixed k and l
Start with initial matrix
Final cross-association
Lower the encoding cost
Choose better values for k and l
23
Fixed k and l
Find good groups for fixed k and l
Start with initial matrix
Final cross-association
Lower the encoding cost
Choose better values for k and l
24
Fixed k and l
Re-assign: for each row x, re-assign it to the row
group which minimizes the code cost
  1. Row re-assigns
  2. Column re-assigns
  3. and repeat
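The row re-assign step can be sketched as follows. This is one plausible reading, not the authors' code: it assumes the cost of placing row x in row group i is the number of bits needed to code the row's ones and zeros using the current block densities of group i. The name `reassign_rows`, the `eps` clipping and the example data are hypothetical. Column re-assigns would be the same procedure applied to the transposed matrix.

```python
import numpy as np

def reassign_rows(matrix, row_groups, col_groups, k, l, eps=1e-9):
    """Return, for every row, the row group whose densities code it most cheaply."""
    matrix = np.asarray(matrix, dtype=float)
    row_groups = np.asarray(row_groups)
    col_groups = np.asarray(col_groups)

    # p[i, j] = density of ones in block (row group i, column group j).
    p = np.full((k, l), 0.5)
    for i in range(k):
        for j in range(l):
            block = matrix[np.ix_(row_groups == i, col_groups == j)]
            if block.size:
                p[i, j] = block.mean()
    p = np.clip(p, eps, 1 - eps)  # avoid log(0); an implementation choice

    # ones[x, j] = number of ones of row x inside column group j.
    ones = np.stack([matrix[:, col_groups == j].sum(axis=1) for j in range(l)], axis=1)
    widths = np.array([(col_groups == j).sum() for j in range(l)])
    zeros = widths - ones

    # bits[x, i] = bits needed to code row x with the densities of row group i.
    bits = -(ones @ np.log2(p).T + zeros @ np.log2(1 - p).T)
    return bits.argmin(axis=1)

# Hypothetical example: row 2 starts in the wrong group and is moved.
example = np.array([[1, 1, 0, 0],
                    [1, 1, 0, 0],
                    [0, 0, 1, 1],
                    [0, 0, 1, 1]])
new_groups = reassign_rows(example, [0, 0, 0, 1], [0, 0, 1, 1], k=2, l=2)
```

The densities are held fixed while all rows are re-assigned, and are recomputed on the next pass; that ordering is also an assumption of this sketch.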

25
Choosing k and l
Find good groups for fixed k and l
Start with initial matrix
Final cross-association
Lower the encoding cost
Choose better values for k and l
26
Choosing k and l
  • Split
  • Find the most inhomogeneous group.
  • Remove the rows/columns which make it
    inhomogeneous.
  • Create a new group for these rows/columns.

27
Algorithms
Find good groups for fixed k and l
Re-assigns
Start with initial matrix
Final cross-association
Lower the encoding cost
Choose better values for k and l
Splits
28
Experiments
l = 5 col groups
k = 5 row groups
Customer-Product graph with Zipfian sizes, no
noise
29
Experiments
l = 8 col groups
k = 6 row groups
Quasi block-diagonal graph with Zipfian sizes,
noise = 10
30
Experiments
l = 3 col groups
k = 2 row groups
White Noise graph: we find the existing
spurious patterns
31
Experiments
  • CLASSIC
  • 3,893 documents
  • 4,303 words
  • 176,347 dots
  • Combination of 3 sources
  • MEDLINE (medical)
  • CISI (info. retrieval)
  • CRANFIELD (aerodynamics)

Documents
Words
32
Experiments
Documents
Words
CLASSIC graph of documents and words: k = 15, l = 19
33
Experiments
blood, disease, clinical, cell,
insipidus, alveolar, aortic, death,
MEDLINE (medical)
CLASSIC graph of documents and words: k = 15, l = 19
34
Experiments
abstract, notation, works, construct,
providing, studying, records, development,
MEDLINE (medical)
CISI (Information Retrieval)
CLASSIC graph of documents and words: k = 15, l = 19
35
Experiments
shape, nasa, leading, assumed,
MEDLINE (medical)
CISI (Information Retrieval)
CRANFIELD (aerodynamics)
CLASSIC graph of documents and words: k = 15, l = 19
36
Experiments
paint, examination, fall, raise, leave, based,
MEDLINE (medical)
CISI (Information Retrieval)
CRANFIELD (aerodynamics)
CLASSIC graph of documents and words: k = 15, l = 19
37
Experiments
  • GRANTS
  • 13,297 documents
  • 5,298 words
  • 805,063 dots

NSF Grant Proposals
Words in abstract
38
Experiments
NSF Grant Proposals
Words in abstract
GRANTS graph of documents and words: k = 41, l = 28
39
Experiments
encoding, characters, bind, nucleus
  • The Cross-Associations refer to topics
  • Genetics

GRANTS graph of documents and words: k = 41, l = 28
40
Experiments
coupling, deposition, plasma, beam
  • The Cross-Associations refer to topics
  • Genetics
  • Physics

GRANTS graph of documents and words: k = 41, l = 28
41
Experiments
manifolds, operators, harmonic
  • The Cross-Associations refer to topics
  • Genetics
  • Physics
  • Mathematics

GRANTS graph of documents and words: k = 41, l = 28
42
Experiments
Splits
Time (secs)
Re-assigns
Number of dots
Linear on the number of dots: Scalable
43
Summary of Node Grouping
  • Desiderata
  • Simultaneously discover row and column groups
  • Fully Automatic: no magic numbers
  • Scalable to large matrices
  • Online: new data does not need full recomputation

44
Extensions
  • We can use the same MDL-based framework for other
    problems
  • Self-graphs
  • Detection of outlier edges

45
Extension 1 PKDD 04
  • Self-graphs, such as
  • Co-authorship graphs
  • Social networks
  • The Internet, and the World-wide Web

Authors
Bipartite graph
Self-graph
46
Extension 1 PKDD 04
  • Self-graphs
  • Rows and columns represent the same nodes
  • so row re-assigns affect column re-assigns

Authors
Bipartite graph
Self-graph
47
Experiments
  • DBLP dataset
  • 6,090 authors in
  • SIGMOD
  • ICDE
  • VLDB
  • PODS
  • ICDT
  • 175,494 co-citation or co-authorship links

Authors
Authors
48
Experiments
Authors
Author groups
Authors
Author groups
Stonebraker, DeWitt, Carey
k = 8 author groups found
49
Extension 2 PKDD 04
  • Outlier edges
  • Which links should not exist? (illegal
    contact/access?)
  • Which links are missing? (missing data?)

50
Extension 2 PKDD 04
Deviations from normality
Lower quality compression
Outliers
Find edges whose removal maximally reduces cost
51
Roadmap
  • Specific applications
  • Node grouping
  • Viral propagation
  • General issues
  • Realistic graph generation
  • Graph patterns and laws

1
2
Will a virus spread and become an epidemic?
4
Conclusions
52
The SIS (or flu) model
  • (Virus) birth rate $\beta$: probability that an
    infected neighbor attacks
  • (Virus) death rate $\delta$: probability that an
    infected node heals
  • Cured = Susceptible

Healthy
N2
N
N1
Infected
N3
Undirected network
53
The SIS (or flu) model
  • Competition between virus birth and death
  • Epidemic or extinction?
  • depends on the ratio ß/d
  • but also on the network topology
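A small simulation can illustrate the competition between virus birth and death on a given topology. The sketch below is not the authors' simulator: it assumes synchronous time-steps, independent attacks from each infected neighbor, and that a node healed in a step can be re-infected in the same step. The names `sis_step` and `simulate`, and the 5-node star graph, are hypothetical.

```python
import random

def sis_step(adj, infected, beta, delta, rng):
    """One synchronous SIS time-step.

    adj maps each node to a list of its neighbors (undirected graph);
    infected is the set of currently infected nodes.
    """
    next_infected = set()
    for node, neighbors in adj.items():
        # Each infected neighbor attacks independently with probability beta.
        attacked = any(nb in infected and rng.random() < beta for nb in neighbors)
        # An infected node heals with probability delta.
        cured = node in infected and rng.random() < delta
        if attacked or (node in infected and not cured):
            next_infected.add(node)
    return next_infected

def simulate(adj, seeds, beta, delta, steps, seed=0):
    """Return the number of infected nodes at each time-step."""
    rng = random.Random(seed)
    infected = set(seeds)
    history = [len(infected)]
    for _ in range(steps):
        infected = sis_step(adj, infected, beta, delta, rng)
        history.append(len(infected))
    return history

# Hypothetical 5-node star: node 0 is the hub.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
history = simulate(star, seeds=[0], beta=0.3, delta=0.5, steps=20)
```

Running this on different topologies with the same $\beta/\delta$ is one way to observe that the ratio alone does not settle whether the infection survives.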

Epidemic or Extinction
Example of the effect of network topology
54
Epidemic threshold
  • The epidemic threshold $\tau$ is the value such that
  • If $\beta/\delta < \tau$, there is no epidemic
  • where $\beta$ = birth rate, and $\delta$ = death rate

55
Previous models
Question: What is the epidemic threshold?
Answer 1: $1/\langle k \rangle$ [Kephart and White 91, 93]
BUT: Homogeneity assumption: all nodes have the same degree (but most graphs have power laws)
Answer 2: $\langle k \rangle / \langle k^2 \rangle$ [Pastor-Satorras and Vespignani 01]
BUT: Mean-field assumption: all nodes of the same degree are equally affected (but susceptibility should depend on position in network too)
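Both earlier answers depend only on the degree sequence, which a short sketch makes visible. The function name `degree_based_thresholds` and the star-graph example are illustrative assumptions; only the two formulas, one over the mean degree and the mean degree over the mean squared degree, come from the slide.

```python
def degree_based_thresholds(degrees):
    """Return (1/<k>, <k>/<k^2>) for a list of node degrees."""
    n = len(degrees)
    mean_k = sum(degrees) / n
    mean_k2 = sum(d * d for d in degrees) / n
    return 1.0 / mean_k, mean_k / mean_k2

# Hypothetical example: a 100-node star (one hub of degree 99, 99 leaves of degree 1).
star_degrees = [99] + [1] * 99
answer_1, answer_2 = degree_based_thresholds(star_degrees)
```

Two graphs with the same degree sequence but different wiring get identical values from both formulas, which is the limitation the "BUT" annotations point at.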
56
The full solution is intractable!
  • The full Markov Chain
  • has $2^N$ states → intractable
  • so, a simplification is needed.
  • Independence assumption
  • Probability that two neighbors are infected =
    product of individual probabilities of infection
  • This is a point estimate of the full Markov Chain.

57
Our model
  • A non-linear dynamical system (NLDS)
  • which makes no assumptions about the topology

$1 - p_{i,t} = \left[ 1 - p_{i,t-1} + \delta\, p_{i,t-1} \right] \cdot \prod_{j=1}^{N} \left( 1 - \beta\, A_{ji}\, p_{j,t-1} \right)$
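The update rule on this slide can be iterated numerically. The sketch below assumes `A` is the adjacency matrix, `p` holds the per-node infection probabilities, and each node stays healthy if it was healthy (or was infected and healed) and no neighbor infected it. The function name `nlds_step` and the 5-node star example with its parameter values are hypothetical.

```python
import numpy as np

def nlds_step(A, p, beta, delta):
    """One step of the non-linear dynamical system.

    zeta[i] = prod_j (1 - beta * A[j, i] * p[j]) is the probability that
    node i receives no infection from its neighbors in this step.
    """
    zeta = np.prod(1.0 - beta * A * p[:, None], axis=0)
    return 1.0 - (1.0 - p + delta * p) * zeta

# Hypothetical example: 5-node star with node 0 as the hub.
A = np.zeros((5, 5))
A[0, 1:] = 1.0
A[1:, 0] = 1.0

p = np.full(5, 0.5)          # initial infection probabilities
for _ in range(100):
    p = nlds_step(A, p, beta=0.1, delta=0.5)
```

With these illustrative values the probabilities shrink toward zero, since the largest eigenvalue of this star is 2 and 0.1/0.5 is below 1/2.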
58
Epidemic threshold
  • Theorem 1 We have no epidemic if

$\beta/\delta < \tau = 1/\lambda_{1,A}$
→ $\lambda_{1,A}$ alone decides viral epidemics!
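As a quick check of the theorem's threshold, the sketch below computes one over the largest adjacency eigenvalue for a 100-node star, the topology used in the experiments that follow. The helper names `star_adjacency` and `epidemic_threshold` are illustrative; the graph is assumed undirected, so the symmetric eigenvalue solver applies.

```python
import numpy as np

def star_adjacency(n):
    """Adjacency matrix of an n-node star with node 0 as the hub."""
    A = np.zeros((n, n))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    return A

def epidemic_threshold(A):
    """Return 1 / (largest eigenvalue) of a symmetric adjacency matrix."""
    largest = np.linalg.eigvalsh(A)[-1]  # eigvalsh returns ascending eigenvalues
    return 1.0 / largest

tau = epidemic_threshold(star_adjacency(100))
```

For a star with n nodes the largest eigenvalue is the square root of n - 1, so here the threshold is one over the square root of 99.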
59
Recall the definition of eigenvalues
$A x = \lambda_A x$, where $\lambda_A$ is an eigenvalue of $A$
$\lambda_{1,A}$ = largest eigenvalue ≈ size of the
largest blob
60
Experiments (100-node Star)


61
Experiments (Oregon)
10,900 nodes and 31,180 edges
$\beta/\delta > \tau$ (above threshold)
$\beta/\delta = \tau$ (at the threshold)
$\beta/\delta < \tau$ (below threshold)
62
Extensions
  • This dynamical-systems framework can be exploited
    further
  • The rate of decay of the infection
  • Information survival thresholds in sensor/P2P
    networks

63
Extension 1
  • Below the threshold: how quickly does an
    infection die out?
  • Theorem 2: Exponentially quickly

64
Experiment (10K Star Graph)
Linear on log-lin scale → exponential decay
Number of infected nodes (log-scale)
Time-steps (linear-scale)
Score $s = (\beta/\delta) \cdot \lambda_{1,A}$ = fraction of threshold
65
Experiment (Oregon Graph)
Linear on log-lin scale → exponential decay
Number of infected nodes (log-scale)
Time-steps (linear-scale)
Score $s = (\beta/\delta) \cdot \lambda_{1,A}$ = fraction of threshold
68
Extension 2
  • Information survival in sensor networks
    Leskovec, Faloutsos, Guestrin, Madden
  • Sensors gain new information
  • but they may die due to harsh environment or
    battery failure
  • so they occasionally try to transmit data to
    nearby sensors
  • and failed sensors are occasionally replaced.
  • Under what conditions does the information
    survive?

69
Extension 2
  • Theorem 1 The information dies out
    exponentially quickly if

70
Roadmap
  • Specific applications
  • Node grouping
  • Viral propagation
  • General issues
  • Realistic graph generation
  • Graph patterns and laws

1
3
2
How can we generate a realistic graph that
mimics a given real-world graph?
4
Conclusions
Skip
71
Experiments (Clickstream bipartite graph)
Some personal webpage
Clickstream R-MAT
x
Count
Yahoo, Google and others
Websites
Users
In-degree
72
Experiments (Clickstream bipartite graph)
Email-checking surfers
Clickstream R-MAT
x
Count
All-night surfers
Websites
Users
Out-degree
73
Experiments (Clickstream bipartite graph)
Count vs Out-degree
Count vs In-degree
Hop-plot
Singular value vs Rank
Left Network value
Right Network value
→ R-MAT can match real-world graphs
74
Roadmap
  • Specific applications
  • Node grouping
  • Viral propagation
  • General issues
  • Realistic graph generation
  • Graph patterns and laws

1
2
4
Conclusions
75
Conclusions
  • Two paths in graph mining
  • Specific applications
  • Viral Propagation → non-linear dynamical system,
    epidemic depends on largest eigenvalue
  • Node Grouping → MDL-based approach for automatic
    grouping
  • General issues
  • Graph Patterns → Marks of realism in a graph
  • Graph Generators → R-MAT, a scalable generator
    matching many of the patterns

76
Software
  • http//www-2.cs.cmu.edu/deepay/Sw
  • CrossAssociations
  • To find natural node groups.
  • Used by anonymous large accounting firm.
  • Used by Intel Research, Cambridge, UK.
  • Used at UC, Riverside (net intrusion detection).
  • Used at the University of Porto, Portugal
  • NetMine
  • To extract graph patterns quickly and build
    realistic graphs.
  • Used by Northrop Grumman corp.
  • F4
  • A non-linear time series forecasting package.

77
CROSS-ASSOCIATIONS
  • Why simultaneous grouping?
  • Differences from co-clustering and others?
  • Other parameter-fitting criteria?
  • Cost surface
  • Exact cost function
  • Exact complexity, wall-clock times
  • Soft clustering
  • Different weights for code and description costs?
  • Precision-recall for CLASSIC
  • Inter-group affinities
  • Collaborative filtering and recommendation
    systems?
  • CA versus bipartite cores
  • Extras
  • General comments on CA communities

78
Viral Propagation
  • Comparison with previous methods
  • Accuracy of dynamical system
  • Relationship with full Markov chain
  • Experiments on information survival threshold
  • Comparison with Infinite Particle Systems
  • Intuition behind the largest eigenvalue
  • Correlated failures

79
R-MAT
  • Graph patterns
  • Generator desiderata
  • Description of R-MAT
  • Experiments on a directed graph
  • R-MAT communities via Cross-Associations?
  • R-MAT versus tree-based generators

80
Graphs in general
  • Relational learning
  • Graph Kernels

81
Simultaneous grouping is useful
Sparse blocks, with little in common between rows
Index
82
Cross-Associations ≠ Co-clustering!
Information-theoretic co-clustering:
  • Lossy compression.
  • Approximates the original matrix, while trying to minimize KL-divergence.
  • The number of row and column groups must be given by the user.
Cross-Associations:
  • Lossless compression.
  • Always provides complete information about the matrix, for any number of row and column groups.
  • The number of row and column groups is chosen automatically using the MDL principle.
Index
83
Other parameter-fitting methods
  • The Gap statistic [Tibshirani 01]
  • Minimize the gap of log-likelihood of
    intra-cluster distances from the expected
    log-likelihood.
  • But
  • Needs a distance function between graph nodes
  • Needs a reference distribution
  • Needs multiple MCMC runs to remove variance due
    to sampling → more time.

Index
84
Other parameter-fitting methods
  • Stability-based method [Ben-Hur 02, 03]
  • Run clustering multiple times on samples of data,
    for several values of k
  • For low k, clustering is stable; for high k,
    unstable
  • Choose this transition point.
  • But
  • Needs many runs of the clustering algorithm
  • Arguments possible over definition of transition
    point

Index
85
Precision-Recall for CLASSIC
Index
86
Cost surface (total cost)
Surface plot
Contour plot
l
k
l
k
With increasing k and l: total cost decays very
rapidly initially, but then starts increasing
slowly
Index
87
Cost surface (code cost only)
Surface plot
Contour plot
l
k
l
k
With increasing k and l: code cost decays very
rapidly
Index
88
Encoding Cost Function
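Total encoding cost = Description cost + Code cost
Description cost:
$\log(k) + \log(l)$ (cluster number)
$+\ N \log(N) + M \log(M)$ (row/col order)
$+\ \sum_i \log(a_i) + \sum_j \log(b_j)$ (cluster sizes)
$+\ \sum_i \sum_j \log(a_i b_j + 1)$ (block densities)
Code cost:
$\sum_i \sum_j a_i b_j \cdot H(p_{i,j})$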
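The terms listed on this slide can be added up directly. The sketch below follows the terms as they appear in this transcript, with base-2 logarithms assumed throughout; the original slide may use a different logarithm variant for some terms, so treat the description cost here as approximate. The function name `total_encoding_cost` and the example matrix are hypothetical.

```python
import numpy as np

def total_encoding_cost(matrix, row_groups, col_groups):
    """Return (description_cost, code_cost) in bits for a grouping."""
    matrix = np.asarray(matrix)
    row_groups = np.asarray(row_groups)
    col_groups = np.asarray(col_groups)
    n_rows, n_cols = matrix.shape
    row_ids = np.unique(row_groups)
    col_ids = np.unique(col_groups)
    k, l = len(row_ids), len(col_ids)
    a = np.array([(row_groups == r).sum() for r in row_ids])  # row group sizes
    b = np.array([(col_groups == c).sum() for c in col_ids])  # column group sizes

    description = float(
        np.log2(k) + np.log2(l)                              # cluster number
        + n_rows * np.log2(n_rows) + n_cols * np.log2(n_cols)  # row/col order
        + np.log2(a).sum() + np.log2(b).sum()                # cluster sizes
    )
    code = 0.0
    for i, r in enumerate(row_ids):
        for j, c in enumerate(col_ids):
            block = matrix[np.ix_(row_groups == r, col_groups == c)]
            description += float(np.log2(a[i] * b[j] + 1))   # block densities
            p = block.mean()
            if 0.0 < p < 1.0:
                entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
                code += float(a[i] * b[j] * entropy)
    return description, code

# Hypothetical example: two homogeneous diagonal blocks.
example = np.array([[1, 1, 0, 0],
                    [1, 1, 0, 0],
                    [0, 0, 1, 1],
                    [0, 0, 1, 1]])
description_cost, code_cost_bits = total_encoding_cost(
    example, [0, 0, 1, 1], [0, 0, 1, 1])
```

Adding groups tends to lower the code cost while raising the description cost, which is the trade-off the cost-surface slides show.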
Index
89
Complexity of CA
  • $O(E \cdot (k^2 + l^2))$, ignoring the number of re-assign
    iterations, which is typically low.

Index
90
Complexity of CA
Time / $\sum (k + l)$
Number of edges
Index
91
Inter-group distances
Node Groups
Nodes
Nodes
Node Groups
Two groups are close
Merging them does not increase cost by much
distance(i,j) = relative increase in cost on
merging i and j
Index
92
Inter-group distances
Grp1
5.5
Grp2
Node Groups
4.5
5.1
Grp3
Node Groups
Two groups are close
Merging them does not increase cost by much
distance(i,j) = relative increase in cost on
merging i and j
Index
93
Experiments
Grp8
Grp1
Author groups
Author groups
Stonebraker, DeWitt, Carey
Inter-group distances can aid in visualization
Index
94
Collaborative filtering and recommendation systems
  • Q: If someone likes a product X, will (s)he like
    product Y?
  • A: Check if others who liked X also liked Y.
  • Focus on distances between people, typically
    cosine similarity
  • and not on clustering
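For contrast with clustering, the cosine similarity used in such systems is a pairwise measure between two people's preference vectors. A minimal sketch, with a hypothetical function name and made-up preference vectors:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two preference vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty preference vectors
    return dot / (norm_u * norm_v)

# Hypothetical users: 1 means the user liked the product, 0 otherwise.
alice = [1, 1, 0, 0]
bob = [1, 1, 0, 0]
carol = [0, 0, 1, 1]
same_taste = cosine_similarity(alice, bob)
no_overlap = cosine_similarity(alice, carol)
```

This answers "how close are these two people", whereas the cross-association approach asks which groups of people and products belong together.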

Index
95
CA and bipartite cores: related but different
Hubs
Authorities
A 3x2 bipartite core
Kumar et al. 1999 say that bipartite cores
correspond to communities.
Index
96
CA and bipartite cores: related but different
  • CA finds two communities there: one for hubs, and
    one for authorities.
  • We gracefully handle cases where a few links are
    missing.
  • CA considers connections between all sets of
    clusters, and not just two sets.
  • Not each node need belong to a non-trivial
    bipartite core.

CA is (informally) a generalization
Index
97
Comparison with soft clustering
  • Soft clustering → each node belongs to each
    cluster with some probability
  • Hard clustering → one cluster per node

Index
98
Comparison with soft clustering
  • Far more degrees of freedom
  • Parameter fitting is harder
  • Algorithms can be costlier
  • Hard clustering is better for exploratory data
    analysis
  • Some real-world problems require hard clustering,
    e.g., fraud detection for accountants

Index
99
Weights for code cost vs description cost
  • Total = 1 · (code cost) + 1 · (description cost)
  • Physical meaning: total number of bits
  • Total = $\alpha$ · (code cost) + $\beta$ · (description cost)
  • Physical meaning: number of encoding bits
    under some prior

Index
100
Formula for re-assigns
Re-assign for each row x
Index
101
Choosing k and l
  • Split
  • Find the row group R with the maximum entropy per
    row
  • Choose the rows in R whose removal reduces the
    entropy per row in R
  • Send these rows to the new row group, and set
    k = k + 1
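The split step described above can be sketched as follows. This is an illustrative reading, not the authors' implementation: "entropy per row" is taken to be the code cost of a row group divided by its number of rows, and each candidate row is tested against the group as it was before the split. The names `entropy_per_row` and `split_row_group` and the example data are hypothetical.

```python
import numpy as np

def _entropy_bits(p):
    """Binary entropy in bits, with clipping to avoid log(0)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-(p * np.log2(p) + (1 - p) * np.log2(1 - p)))

def entropy_per_row(rows, col_groups, l):
    """Code cost of a set of rows divided by the number of rows."""
    if rows.shape[0] == 0:
        return 0.0
    total = 0.0
    for j in range(l):
        block = rows[:, col_groups == j]
        if block.size:
            total += block.size * _entropy_bits(block.mean())
    return total / rows.shape[0]

def split_row_group(matrix, row_groups, col_groups, k, l):
    """Split the row group with maximum entropy per row; return (groups, new k)."""
    matrix = np.asarray(matrix, dtype=float)
    row_groups = np.asarray(row_groups).copy()
    col_groups = np.asarray(col_groups)
    per_row = [entropy_per_row(matrix[row_groups == i], col_groups, l)
               for i in range(k)]
    worst = int(np.argmax(per_row))
    members = np.flatnonzero(row_groups == worst)
    if len(members) < 2:
        return row_groups, k  # nothing to split
    for x in members:
        rest = members[members != x]
        # Move row x if the group is more homogeneous without it.
        if entropy_per_row(matrix[rest], col_groups, l) < per_row[worst]:
            row_groups[x] = k
    return row_groups, k + 1

# Hypothetical example: one row group containing a single odd row.
example = np.array([[1, 1, 0, 0],
                    [1, 1, 0, 0],
                    [1, 1, 0, 0],
                    [0, 0, 1, 1]])
new_groups, new_k = split_row_group(example, [0, 0, 0, 0], [0, 0, 1, 1], k=1, l=2)
```

A column split would be the same procedure applied to the transposed matrix, and a full search would follow each split with re-assign passes.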

Index
102
Experiments
  • Epinions dataset
  • 75,888 users
  • 508,960 dots, one dot per trust
    relationship
  • k = 19 groups found

Small dense core
Index
103
Comparison with previous methods
  • Our threshold subsumes the homogeneous model ?
    Proof
  • We are more accurate than the Mean-Field
    Assumption model.

Index
104
Comparison with previous methods
10K Star Graph
Index
105
Comparison with previous methods
Oregon Graph
Index
106
Accuracy of dynamical system
10K Star Graph
Index
107
Accuracy of dynamical system
Oregon Graph
Index
108
Accuracy of dynamical system
10K Star Graph
Index
109
Accuracy of dynamical system
Oregon Graph
Index
110
Relationship with full Markov Chain
  • In the full Markov Chain, Prob(infection at time t)
    is expressed in terms of $X_{t-1}$, $Y_{t-1}$ and
    $Z_{t-1}$
  • Independence assumption leads to a point estimate
    for $Z_{t-1}$ → non-linear dynamical system.
  • Still non-linear, but now tractable

Non-linear component
Index
111
Experiments Information survival
  • INTEL sensor map (54 nodes)
  • MIT sensor map (40 nodes)
  • and others

Index
112
Experiments Information survival
INTEL sensor map
Index
113
Survival threshold on INTEL
Index
114
Survival threshold on INTEL
Index
115
Experiments Information survival
MIT sensor map
Index
116
Survival threshold on MIT
Index
117
Survival threshold on MIT
Index
118
Infinite Particle Systems
  • Contact Process ≈ SIS model
  • Differences
  • Infinite graphs only → the questions asked are
    different
  • Very specific topologies → lattices, trees
  • Exact thresholds have not been found for these;
    proving existence of thresholds is important
  • Our results match those on the finite line graph
    [Durrett 88]

Index
119
Intuition behind the largest eigenvalue
  • Approximately, the size of the largest blob
  • Consider the special case of a caveman graph

Largest eigenvalue = 4
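The value 4 is what one gets if the caveman graph is made of disconnected cliques of 5 nodes each, since a clique on m nodes has largest eigenvalue m - 1. The cave size of 5 is an assumption made here to match the slide's number; the slide itself does not state it. A minimal sketch with a hypothetical helper name:

```python
import numpy as np

def caveman_adjacency(n_caves, cave_size):
    """Block-diagonal adjacency matrix of disconnected cliques (caves)."""
    clique = np.ones((cave_size, cave_size)) - np.eye(cave_size)
    return np.kron(np.eye(n_caves), clique)

A = caveman_adjacency(n_caves=3, cave_size=5)
largest = np.linalg.eigvalsh(A)[-1]  # eigvalsh returns ascending eigenvalues
```

Adding a few links between caves perturbs this value only slightly, which fits the "size of the largest blob" intuition.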
Index
120
Intuition behind the largest eigenvalue
  • Approximately, the size of the largest blob

Largest eigenvalue = 4.016
Index
121
Graph Patterns
  • Power Laws

The epinions graph with 75,888 nodes
and 508,960 edges
Count vs Indegree
Index
123
Graph Patterns
  • Power Laws and deviations (DGX/Lognormals Bi
    01)

Count
Degree
Index
124
Graph Patterns
  • Power Laws and deviations
  • Small-world
  • Community effect

reachable pairs
hops
Index
125
Graph Generator Desiderata
  • Other desiderata
  • Few parameters
  • Fast parameter-fitting
  • Scalable graph generation
  • Simple extension to undirected, bipartite and
    weighted graphs
  • Power Laws and deviations
  • Small-world
  • Community effect

Most current graph generators fail to match some
of these.
Index
126
The R-MAT generator
  • SIAM DM04

Intuition: the 80-20 law
  • Subdivide the adjacency matrix
  • and choose one quadrant with probability (a,b,c,d)

b (0.1)
a (0.5)
d (0.25)
c (0.15)
Index
127
The R-MAT generator
  • SIAM DM04

Intuition: the 80-20 law
  • Subdivide the adjacency matrix
  • and choose one quadrant with probability
    (a,b,c,d)
  • Recurse till we reach a 1×1 cell
  • where we place an edge
  • and repeat for all edges.
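The recursive procedure above can be sketched in a few lines. This is a minimal illustration, not the NetMine implementation: it assumes a square matrix with 2^scale nodes per side, the common quadrant layout (a top-left, b top-right, c bottom-left, d bottom-right), and it keeps duplicate edges. The function names and the default probabilities (taken from the example values on the earlier slide) are assumptions.

```python
import random

def rmat_edge(scale, a, b, c, rng):
    """Pick one cell of a 2^scale x 2^scale adjacency matrix recursively."""
    row = col = 0
    for _ in range(scale):
        r = rng.random()
        row <<= 1
        col <<= 1
        if r < a:
            pass                # quadrant a: top-left
        elif r < a + b:
            col |= 1            # quadrant b: top-right
        elif r < a + b + c:
            row |= 1            # quadrant c: bottom-left
        else:
            row |= 1            # quadrant d: bottom-right
            col |= 1
    return row, col

def rmat_graph(scale, n_edges, a=0.5, b=0.1, c=0.15, seed=0):
    """Generate n_edges (row, col) pairs; d is implied as 1 - a - b - c."""
    rng = random.Random(seed)
    return [rmat_edge(scale, a, b, c, rng) for _ in range(n_edges)]

edges = rmat_graph(scale=4, n_edges=1000)
```

Each edge costs one random draw per level of recursion, so generation time grows with the number of edges times the logarithm of the number of nodes.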

a
b
a
c
d
d
c
Index
128
The R-MAT generator
  • SIAM DM04

Intuition: the 80-20 law
  • Only 3 parameters: a, b and c (d = 1-a-b-c).
  • We have a fast parameter fitting algorithm.

a
b
a
c
d
d
c
Index
129
Experiments (Epinions directed graph)
Count vs Indegree
Count vs Outdegree
Hop-plot
Count vs Stress
Eigenvalue vs Rank
Network value
→ R-MAT matches directed graphs
Index
130
R-MAT communities and Cross-Associations
  • R-MAT builds communities in graphs, and
    Cross-Associations finds them.
  • Relationship?
  • R-MAT builds a hierarchy of communities, while CA
    finds a flat set of communities
  • Linkage in the sizes of communities found by CA
  • When the R-MAT parameters are very skewed, the
    community sizes for CA are skewed
  • and vice versa

Index
131
R-MAT and tree-based generators
  • Recursive splitting in R-MAT is like following a tree
    from root to leaf.
  • Relationship with other tree-based generators
    [Kleinberg 01, Watts 02]?
  • The R-MAT tree has edges as leaves, the others
    have nodes
  • Tree-distance between nodes is used to connect
    nodes in other generators, but what does
    tree-distance between edges mean?

Index
132
Comparison with relational learning
Relational Learning (typical):
  • Aims to find small structure/patterns at the local level
  • Labeled nodes and edges
  • Semantics of labels are important
  • Algorithms are typically costlier
Graph Mining (typical):
  • Emphasis on global aspects of large graphs
  • Unlabeled graphs
  • More focused on topological structure and properties
  • Scalability is more important
Index
133
OTHER WORK
  • OTHER WORK

134
Other Work
  • Time Series Prediction CIKM 2002
  • We use the fractal dimension of the data
  • This is related to chaos theory
  • and Lyapunov exponents

135
Other Work
Logistic Parabola
  • Time Series Prediction CIKM 2002

136
Other Work
Lorenz attractor
  • Time Series Prediction CIKM 2002

137
Other Work
Laser fluctuations
  • Time Series Prediction CIKM 2002

138
Other Work
  • Adaptive histograms with error guarantees
    Ashraf Aboulnaga, Yufei Tao, Christos Faloutsos

Insertions, deletions
Count
  • Maintain count probabilities for buckets
  • to give statistically correct query result-size
    estimation
  • and query feedback

Salary
139
Other Work
  • User-personalization
  • Patent number 6,611,834 (IBM)
  • Relevance feedback in multimedia image search
  • Filed for patent (IBM)
  • Building 3D models using robot camera and
    rangefinder data ICML 2001

140
EXTRAS
141
Conclusions
  • Two paths in graph mining
  • Specific applications
  • Viral Propagation → Resilience testing,
    information dissemination, rumor spreading
  • Node Grouping → automatically grouping nodes, AND
    finding the correct number of groups
  • References
  • Fully automatic Cross-Associations, by
    Chakrabarti, Papadimitriou, Modha and Faloutsos,
    in KDD 2004
  • AutoPart: Parameter-free graph partitioning and
    Outlier detection, by Chakrabarti, in PKDD
    2004
  • Epidemic spreading in real networks: An
    eigenvalue viewpoint, by Wang, Chakrabarti,
    Wang and Faloutsos, in SRDS 2003

142
Conclusions
  • Two paths in graph mining
  • Specific applications
  • General issues
  • Graph Patterns → Marks of realism in a graph
  • Graph Generators → R-MAT, a fast, scalable
    generator matching many of the patterns
  • References
  • R-MAT: A recursive model for graph mining, by
    Chakrabarti, Zhan and Faloutsos, in SIAM Data
    Mining 2004.
  • NetMine: New mining tools for large graphs, by
    Chakrabarti, Zhan, Blandford, Faloutsos and
    Blelloch, in the SIAM 2004 Workshop on Link
    analysis, counter-terrorism and privacy

143
Other References
  • F4: Large Scale Automated Forecasting using
    Fractals, by D. Chakrabarti and C. Faloutsos, in
    CIKM 2002.
  • Using EM to Learn 3D Models of Indoor
    Environments with Mobile Robots, by Y. Liu, R.
    Emery, D. Chakrabarti, W. Burgard and S. Thrun,
    in ICML 2001
  • Graph Mining: Laws, Generators and Algorithms, by
    D. Chakrabarti and C. Faloutsos, under
    submission to ACM Computing Surveys

144
References --- graphs
  1. R-MAT: A recursive model for graph mining, by D.
    Chakrabarti, Y. Zhan, C. Faloutsos, in SIAM Data
    Mining 2004.
  2. Epidemic spreading in real networks: An
    eigenvalue viewpoint, by Y. Wang, D. Chakrabarti,
    C. Wang and C. Faloutsos, in SRDS 2003
  3. Fully automatic Cross-Associations, by D.
    Chakrabarti, S. Papadimitriou, D. Modha and C.
    Faloutsos, in KDD 2004
  4. AutoPart: Parameter-free graph partitioning and
    Outlier detection, by D. Chakrabarti, in PKDD
    2004
  5. NetMine: New mining tools for large graphs, by D.
    Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos
    and G. Blelloch, in the SIAM 2004 Workshop on
    Link analysis, counter-terrorism and privacy

145
Roadmap
  • Specific applications
  • Node grouping
  • Viral propagation
  • General issues
  • Realistic graph generation
  • Graph patterns and laws

2
4
Other Work
Conclusions
146
Experiments (Clickstream bipartite graph)
Some personal webpage
Clickstream

Count
Yahoo, Google and others
Websites
Users
In-degree
147
Experiments (Clickstream bipartite graph)
Email-checking surfers
Clickstream

Count
All-night surfers
Websites
Users
Out-degree
148
Experiments (Clickstream bipartite graph)
Clickstream R-MAT
Reachable pairs
Websites
Users
Hops
149
Graph Generation
  • Important for
  • Simulations of new algorithms
  • Compression using a good graph generation model
  • Insight into the graph formation process
  • Our R-MAT (Recursive MATrix) generator can match
    many common graph patterns.

150
Recall the definition of eigenvalues
$A x = \lambda_A x$
$\lambda_A$ = eigenvalue of $A$; $\lambda_{1,A}$ = largest eigenvalue
$\beta/\delta < \tau = 1/\lambda_{1,A}$
151
Tools for Large Graph Mining
  • Deepayan Chakrabarti
  • Carnegie Mellon University