Tools for Large Graph Mining

- Deepayan Chakrabarti

- Thesis Committee
- Christos Faloutsos
- Chris Olston
- Guy Blelloch
- Jon Kleinberg (Cornell)

Introduction

Protein Interactions genomebiology.com

Internet Map lumeta.com

Food Web Martinez 91

→ Graphs are ubiquitous

Friendship Network Moody 01

Introduction

- What can we do with graphs?
- How quickly will a disease spread on this graph?

Needle exchange networks of drug users Weeks et al. 2002

Introduction

Key terrorist

- What can we do with graphs?
- How quickly will a disease spread on this graph?
- Who are the strange bedfellows?
- Who are the key people?

Hijacker network Krebs 01

→ Graph analysis can have great impact

Graph Mining Two Paths

- General issues
- Realistic graph generation
- Graph patterns and laws
- Graph evolution over time?

- Specific applications
- Node grouping
- Viral propagation
- Frequent pattern mining
- Fast message routing

Our Work

- General issues
- Realistic graph generation
- Graph patterns and laws
- Graph evolution over time?

- Specific applications
- Node grouping
- Viral propagation
- Frequent pattern mining
- Fast message routing

Our Work

- Node Grouping
- Find natural partitions and outliers automatically.
- Viral Propagation
- Will a virus spread and become an epidemic?
- Graph Generation
- How can we mimic a given real-world graph?

- General issues
- Realistic graph generation
- Graph patterns and laws
- Graph evolution over time?

- Specific applications
- Node grouping
- Viral propagation
- Frequent pattern mining
- Fast message routing

Roadmap

Focus of this talk

- Specific applications
- Node grouping
- Viral propagation

- General issues
- Realistic graph generation
- Graph patterns and laws

1

2

Find natural partitions and outliers automatically

4

Conclusions

Node Grouping KDD 04

Simultaneously group customers and products, or, documents and words, or, users and preferences

Node Grouping KDD 04

Both are fine

Customer Groups

Customer Groups

Product Groups

Product Groups

- Row and column groups
- need not be along a diagonal, and
- need not be equal in number

Motivation

- Visualization
- Summarization
- Detection of outlier nodes and edges
- Compression, and others

Node Grouping

- Desiderata
- Simultaneously discover row and column groups
- Fully Automatic: No magic numbers
- Scalable to large matrices
- Online: New data should not require full recomputations

Closely Related Work

- Information Theoretic Co-clustering (Dhillon/2003)
- Number of row and column groups must be specified

- Desiderata
- Simultaneously discover row and column groups
- Fully Automatic: No magic numbers
- Scalable to large graphs
- Online

Other Related Work

- K-means and variants (Pelleg/2000, Hamerly/2003): Do not cluster rows and cols simultaneously
- Frequent itemsets (Agrawal/1994): User must specify support
- Information Retrieval (Deerwester/1990, Hoffman/1999): Choosing the number of concepts
- Graph Partitioning (Karypis/1998): Number of partitions; measure of imbalance between clusters

What makes a cross-association good?

Why is this better?

- Similar nodes are grouped together
- As few groups as necessary

A few, homogeneous blocks

Good Clustering implies Good Compression

Main Idea

Good Compression implies Good Clustering

Binary Matrix

Block $i$ has density $p_i^1$ of dots.

Code Cost: $\sum_i \text{size}_i \cdot H(p_i^1)$

Description Cost: $\sum_i$ (cost of describing $n_i^1$, $n_i^0$ and groups)

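Examples

Total Encoding Cost = Code Cost + Description Cost = $\sum_i \text{size}_i \cdot H(p_i^1)$ + $\sum_i$ (cost of describing $n_i^1$, $n_i^0$ and groups)

[Figure: example groupings annotated with high and low code and description costs; one example is labelled "m row groups, n column groups"]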

What makes a cross-association good?

Why is this better?

[Figure: two alternative groupings shown side by side ("versus"), each with row groups and column groups; both annotated "low"]

Formal problem statement

Given a binary matrix, re-organize the rows and columns into groups, and choose the number of row and column groups, to minimize the total encoding cost.

Formal problem statement

Note: No Parameters

Given a binary matrix, re-organize the rows and columns into groups, and choose the number of row and column groups, to minimize the total encoding cost.

Algorithms

l = 5 col groups

k = 5 row groups

Algorithms

Find good groups for fixed k and l

Start with initial matrix

Final cross-association

Lower the encoding cost

Choose better values for k and l

Fixed k and l

Find good groups for fixed k and l

Start with initial matrix

Final cross-association

Lower the encoding cost

Choose better values for k and l

Fixed k and l

Re-assign: for each row x, re-assign it to the row group which minimizes the code cost

- Row re-assigns
- Column re-assigns
- and repeat
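The row re-assign step above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function name `reassign_rows`, the list-of-lists matrix layout, and the clipping constant `eps` are assumptions, and "code cost of a row under a group" is taken to mean the number of bits needed to encode that row with the group's current block densities.

```python
import math

def reassign_rows(D, row_groups, col_groups, k, l, eps=1e-9):
    """One pass of row re-assignment for fixed k and l (illustrative sketch)."""
    n_rows, n_cols = len(D), len(D[0])
    col_sizes = [0] * l
    for c in col_groups:
        col_sizes[c] += 1
    # ones_in[x][j]: number of 1s that row x has inside column group j
    ones_in = [[0] * l for _ in range(n_rows)]
    for x in range(n_rows):
        for y in range(n_cols):
            if D[x][y]:
                ones_in[x][col_groups[y]] += 1
    # Block densities p[i][j] under the current grouping, clipped away from 0 and 1
    ones = [[0] * l for _ in range(k)]
    rows_in = [0] * k
    for x, g in enumerate(row_groups):
        rows_in[g] += 1
        for j in range(l):
            ones[g][j] += ones_in[x][j]
    p = [[min(max(ones[i][j] / max(rows_in[i] * col_sizes[j], 1), eps), 1 - eps)
          for j in range(l)] for i in range(k)]

    def bits(x, i):
        # Bits needed to encode row x using the densities of row group i
        return -sum(ones_in[x][j] * math.log2(p[i][j])
                    + (col_sizes[j] - ones_in[x][j]) * math.log2(1 - p[i][j])
                    for j in range(l))

    # Each row moves to the row group that encodes it most cheaply
    return [min(range(k), key=lambda i: bits(x, i)) for x in range(n_rows)]
```

Column re-assigns would apply the same function to the transposed matrix with the roles of the two groupings swapped; alternating the two passes until the grouping stops changing gives the "and repeat" loop.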

Choosing k and l

Find good groups for fixed k and l

Start with initial matrix

Final cross-association

Lower the encoding cost

Choose better values for k and l

Choosing k and l

- Split
- Find the most inhomogeneous group.
- Remove the rows/columns which make it inhomogeneous.
- Create a new group for these rows/columns.

Algorithms

Find good groups for fixed k and l

Re-assigns

Start with initial matrix

Final cross-association

Lower the encoding cost

Choose better values for k and l

Splits

Experiments

l = 5 col groups

k = 5 row groups

Customer-Product graph with Zipfian sizes, no noise

Experiments

l = 8 col groups

k = 6 row groups

Quasi block-diagonal graph with Zipfian sizes, noise = 10%

Experiments

l = 3 col groups

k = 2 row groups

White Noise graph: we find the existing spurious patterns

Experiments

- CLASSIC
- 3,893 documents
- 4,303 words
- 176,347 dots
- Combination of 3 sources
- MEDLINE (medical)
- CISI (info. retrieval)
- CRANFIELD (aerodynamics)

Documents

Words

Experiments

Documents

Words

CLASSIC graph of documents and words: k = 15, l = 19

Experiments

blood, disease, clinical, cell,

insipidus, alveolar, aortic, death,

MEDLINE (medical)

CLASSIC graph of documents and words: k = 15, l = 19

Experiments

abstract, notation, works, construct,

providing, studying, records, development,

MEDLINE (medical)

CISI (Information Retrieval)

CLASSIC graph of documents and words: k = 15, l = 19

Experiments

shape, nasa, leading, assumed,

MEDLINE (medical)

CISI (Information Retrieval)

CRANFIELD (aerodynamics)

CLASSIC graph of documents and words: k = 15, l = 19

Experiments

paint, examination, fall, raise, leave, based,

MEDLINE (medical)

CISI (Information Retrieval)

CRANFIELD (aerodynamics)

CLASSIC graph of documents and words: k = 15, l = 19

Experiments

- GRANTS
- 13,297 documents
- 5,298 words
- 805,063 dots

NSF Grant Proposals

Words in abstract

Experiments

NSF Grant Proposals

Words in abstract

GRANTS graph of documents and words: k = 41, l = 28

Experiments

encoding, characters, bind, nucleus

- The Cross-Associations refer to topics
- Genetics

GRANTS graph of documents and words: k = 41, l = 28

Experiments

coupling, deposition, plasma, beam

- The Cross-Associations refer to topics
- Genetics
- Physics

GRANTS graph of documents and words: k = 41, l = 28

Experiments

manifolds, operators, harmonic

- The Cross-Associations refer to topics
- Genetics
- Physics
- Mathematics

GRANTS graph of documents and words: k = 41, l = 28

Experiments

Splits

Time (secs)

Re-assigns

Number of dots

Linear in the number of dots → Scalable

Summary of Node Grouping

- Desiderata
- Simultaneously discover row and column groups
- Fully Automatic: No magic numbers
- Scalable to large matrices
- Online: New data does not need full recomputation

Extensions

- We can use the same MDL-based framework for other problems
- Self-graphs
- Detection of outlier edges

Extension 1 PKDD 04

- Self-graphs, such as
- Co-authorship graphs
- Social networks
- The Internet, and the World-wide Web

Authors

Bipartite graph

Self-graph

Extension 1 PKDD 04

- Self-graphs
- Rows and columns represent the same nodes
- so row re-assigns affect column re-assigns

Authors

Bipartite graph

Self-graph

Experiments

- DBLP dataset
- 6,090 authors in
- SIGMOD
- ICDE
- VLDB
- PODS
- ICDT
- 175,494 co-citation or co-authorship links

Authors

Authors

Experiments

Authors

Author groups

Authors

Author groups

Stonebraker, DeWitt, Carey

k = 8 author groups found

Extension 2 PKDD 04

- Outlier edges
- Which links should not exist? (illegal contact/access?)
- Which links are missing? (missing data?)

Extension 2 PKDD 04

Deviations from normality

Lower quality compression

Outliers

Find edges whose removal maximally reduces cost

Roadmap

- Specific applications
- Node grouping
- Viral propagation

- General issues
- Realistic graph generation
- Graph patterns and laws

1

2

Will a virus spread and become an epidemic?

4

Conclusions

The SIS (or flu) model

- (Virus) birth rate β: probability that an infected neighbor attacks
- (Virus) death rate δ: probability that an infected node heals
- Cured = Susceptible

[Figure: undirected network of nodes N, N1, N2 and N3, each marked Healthy or Infected]

The SIS (or flu) model

- Competition between virus birth and death
- Epidemic or extinction?
- depends on the ratio β/δ
- but also on the network topology

Epidemic or Extinction

Example of the effect of network topology

Epidemic threshold

- The epidemic threshold τ is the value such that
- If β/δ < τ → there is no epidemic
- where β = birth rate, and δ = death rate

Previous models

Question: What is the epidemic threshold?

Answer 1: 1/⟨k⟩ (Kephart and White 91, 93)

BUT Homogeneity assumption: All nodes have the same degree (but most graphs have power laws)

Answer 2: ⟨k⟩/⟨k²⟩ (Pastor-Satorras and Vespignani 01)

BUT Mean-field assumption: All nodes of the same degree are equally affected (but susceptibility should depend on position in network too)

The full solution is intractable!

- The full Markov Chain
- has 2^N states → intractable
- so, a simplification is needed.
- Independence assumption
- Probability that two neighbors are infected = Product of individual probabilities of infection
- This is a point estimate of the full Markov Chain.

Our model

- A non-linear dynamical system (NLDS)
- which makes no assumptions about the topology

$$1 - p_{i,t} = \left[\,1 - p_{i,t-1} + \delta\, p_{i,t-1}\,\right] \cdot \prod_{j=1}^{N} \left(1 - \beta \cdot A_{ji} \cdot p_{j,t-1}\right)$$
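One update of this dynamical system can be sketched in plain Python. The sketch reads the slide's equation as: node i is healthy at time t if it was healthy (or was infected and got cured) at time t-1 and no infected neighbor transmitted the virus to it. The function names `nlds_step` and `nlds_run`, and the dense list-of-lists adjacency matrix `A`, are assumptions made for illustration.

```python
def nlds_step(p, A, beta, delta):
    """Advance infection probabilities p[i] by one time step.

    p     : list of current infection probabilities, one per node
    A     : adjacency matrix as a list of lists (A[j][i] = 1 if j links to i)
    beta  : virus birth rate, delta : virus death rate
    """
    n = len(p)
    new_p = []
    for i in range(n):
        # Probability that no neighbor infects node i in this step
        no_attack = 1.0
        for j in range(n):
            no_attack *= 1.0 - beta * A[j][i] * p[j]
        # Healthy at t = (healthy at t-1, or infected and cured) and not attacked
        healthy = (1.0 - p[i] + delta * p[i]) * no_attack
        new_p.append(1.0 - healthy)
    return new_p

def nlds_run(p, A, beta, delta, steps):
    """Iterate the system for a number of steps and return the final probabilities."""
    for _ in range(steps):
        p = nlds_step(p, A, beta, delta)
    return p
```

For example, on a two-node graph joined by one edge, starting from `p = [1.0, 0.0]` with `beta = delta = 0.5`, one step gives `[0.5, 0.5]`: the infected node heals with probability 0.5 and the healthy node is attacked with probability 0.5.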

Epidemic threshold

- Theorem 1: We have no epidemic if

β/δ < τ = 1/λ_{1,A}

→ λ_{1,A} alone decides viral epidemics!

Recall the definition of eigenvalues

A X = λ_A X, where λ_A is an eigenvalue of A

λ_{1,A} = largest eigenvalue ≈ size of the largest blob
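As a small worked example of the threshold τ = 1/λ_{1,A}, the sketch below estimates the largest eigenvalue of an undirected graph by power iteration and applies it to a 100-node star. The function names and the adjacency-list input format are assumptions for illustration. The iteration runs on A + I rather than A, because a bipartite graph such as a star has eigenvalues +λ and -λ of equal magnitude, which would make plain power iteration oscillate.

```python
import math

def largest_eigenvalue(neighbors, iterations=500):
    """Estimate the largest adjacency eigenvalue of an undirected graph.

    neighbors[i] is the list of nodes adjacent to node i.
    """
    n = len(neighbors)
    x = [1.0] * n
    lam = 0.0
    for _ in range(iterations):
        # y = (A + I) x; the shift by I breaks the +lambda / -lambda tie
        y = [x[i] + sum(x[j] for j in neighbors[i]) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
        # Rayleigh quotient x^T A x for the unit vector x
        lam = sum(x[i] * sum(x[j] for j in neighbors[i]) for i in range(n))
    return lam

def epidemic_threshold(neighbors):
    """tau = 1 / (largest eigenvalue of the adjacency matrix)."""
    return 1.0 / largest_eigenvalue(neighbors)

# 100-node star: node 0 is the hub, nodes 1..99 are the leaves
star = [list(range(1, 100))] + [[0] for _ in range(99)]
tau = epidemic_threshold(star)
```

A star with n leaves has largest eigenvalue sqrt(n), so for this graph the estimate should approach sqrt(99) and the threshold 1/sqrt(99), which is roughly 0.1.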

Experiments (100-node Star)

Experiments (Oregon)

10,900 nodes and 31,180 edges

β/δ > τ (above threshold)

β/δ = τ (at the threshold)

β/δ < τ (below threshold)

Extensions

- This dynamical-systems framework can be exploited further
- The rate of decay of the infection
- Information survival thresholds in sensor/P2P networks

Extension 1

- Below the threshold: How quickly does an infection die out?
- Theorem 2: Exponentially quickly

Experiment (10K Star Graph)

Linear on log-lin scale → exponential decay

Number of infected nodes (log-scale)

Time-steps (linear-scale)

Score s = β/δ · λ_{1,A} = fraction of threshold

Experiment (Oregon Graph)

Linear on log-lin scale → exponential decay

Number of infected nodes (log-scale)

Time-steps (linear-scale)

Score s = β/δ · λ_{1,A} = fraction of threshold

Extension 2

- Information survival in sensor networks (Leskovec, Faloutsos, Guestrin, Madden)
- Sensors gain new information
- but they may die due to harsh environment or battery failure
- so they occasionally try to transmit data to nearby sensors
- and failed sensors are occasionally replaced.
- Under what conditions does the information survive?

Extension 2

- Theorem 1: The information dies out exponentially quickly if

Roadmap

- Specific applications
- Node grouping
- Viral propagation

- General issues
- Realistic graph generation
- Graph patterns and laws

1

3

2

How can we generate a realistic graph that mimics a given real-world graph?

4

Conclusions

Skip

Experiments (Clickstream bipartite graph)

Some personal webpage

Clickstream R-MAT

x

Count

Yahoo, Google and others

Websites

Users

In-degree

Experiments (Clickstream bipartite graph)

Email-checking surfers

Clickstream R-MAT

x

Count

All-night surfers

Websites

Users

Out-degree

Experiments (Clickstream bipartite graph)

Count vs Out-degree

Count vs In-degree

Hop-plot

Singular value vs Rank

Left Network value

Right Network value

→ R-MAT can match real-world graphs

Roadmap

- Specific applications
- Node grouping
- Viral propagation

- General issues
- Realistic graph generation
- Graph patterns and laws

1

2

4

Conclusions

Conclusions

- Two paths in graph mining
- Specific applications
- Viral Propagation → non-linear dynamical system, epidemic depends on largest eigenvalue
- Node Grouping → MDL-based approach for automatic grouping
- General issues
- Graph Patterns → Marks of realism in a graph
- Graph Generators → R-MAT, a scalable generator matching many of the patterns

Software

- http//www-2.cs.cmu.edu/deepay/Sw
- CrossAssociations
- To find natural node groups.
- Used by anonymous large accounting firm.
- Used by Intel Research, Cambridge, UK.
- Used at UC, Riverside (net intrusion detection).
- Used at the University of Porto, Portugal
- NetMine
- To extract graph patterns quickly and build realistic graphs.
- Used by Northrop Grumman corp.
- F4
- A non-linear time series forecasting package.

CROSS-ASSOCIATIONS

- Why simultaneous grouping?
- Differences from co-clustering and others?
- Other parameter-fitting criteria?
- Cost surface
- Exact cost function
- Exact complexity, wall-clock times
- Soft clustering
- Different weights for code and description costs?

- Precision-recall for CLASSIC
- Inter-group affinities
- Collaborative filtering and recommendation systems?
- CA versus bipartite cores
- Extras
- General comments on CA communities

Viral Propagation

- Comparison with previous methods
- Accuracy of dynamical system
- Relationship with full Markov chain
- Experiments on information survival threshold
- Comparison with Infinite Particle Systems
- Intuition behind the largest eigenvalue
- Correlated failures

R-MAT

- Graph patterns
- Generator desiderata
- Description of R-MAT
- Experiments on a directed graph
- R-MAT communities via Cross-Associations?
- R-MAT versus tree-based generators

Graphs in general

- Relational learning
- Graph Kernels

Simultaneous grouping is useful

Sparse blocks, with little in common between rows

Index

Cross-Associations ≠ Co-clustering!

Information-theoretic co-clustering
- Lossy Compression.
- Approximates the original matrix, while trying to minimize KL-divergence.
- The number of row and column groups must be given by the user.

Cross-Associations
- Lossless Compression.
- Always provides complete information about the matrix, for any number of row and column groups.
- The number of row and column groups is chosen automatically using the MDL principle.

Index

Other parameter-fitting methods

- The Gap statistic (Tibshirani 01)
- Minimize the gap of log-likelihood of intra-cluster distances from the expected log-likelihood.
- But
- Needs a distance function between graph nodes
- Needs a reference distribution
- Needs multiple MCMC runs to remove variance due to sampling → more time.

Index

Other parameter-fitting methods

- Stability-based method (Ben-Hur 02, 03)
- Run clustering multiple times on samples of data, for several values of k
- For low k, clustering is stable; for high k, unstable
- Choose this transition point.
- But
- Needs many runs of the clustering algorithm
- Arguments possible over definition of transition point

Index

Precision-Recall for CLASSIC

Index

Cost surface (total cost)

Surface plot

Contour plot

l

k

l

k

With increasing k and l: Total cost decays very rapidly initially, but then starts increasing slowly

Index

Cost surface (code cost only)

Surface plot

Contour plot

l

k

l

k

With increasing k and l: Code cost decays very rapidly

Index

Encoding Cost Function

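$$\text{Total encoding cost} = \underbrace{\log(k) + \log(l)}_{\text{cluster number}} + \underbrace{N \log(N) + M \log(M)}_{\text{row/col order}} + \underbrace{\sum_i \log(a_i) + \sum_j \log(b_j)}_{\text{cluster sizes}} + \underbrace{\sum_i \sum_j \log(a_i b_j + 1)}_{\text{block densities}} + \underbrace{\sum_i \sum_j a_i b_j \cdot H(p_{i,j})}_{\text{code cost}}$$

Description cost: the first four terms (cluster number, row/col order, cluster sizes, block densities)

Code cost: the last term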
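The cost terms on this slide can be evaluated directly for a small binary matrix. The sketch below is illustrative only: it assumes base-2 logarithms (so costs are in bits), takes a_i and b_j to be the sizes of row group i and column group j, and takes p_{i,j} to be the fraction of ones in block (i, j). The function name `encoding_cost` and the input layout are assumptions.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) source; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def encoding_cost(D, row_groups, col_groups, k, l):
    """Return (description cost, code cost) for a grouping of binary matrix D."""
    N, M = len(D), len(D[0])
    a = [row_groups.count(i) for i in range(k)]   # row group sizes
    b = [col_groups.count(j) for j in range(l)]   # column group sizes
    ones = [[0] * l for _ in range(k)]            # number of 1s per block
    for x in range(N):
        for y in range(M):
            if D[x][y]:
                ones[row_groups[x]][col_groups[y]] += 1

    description = math.log2(k) + math.log2(l)               # cluster number
    description += N * math.log2(N) + M * math.log2(M)      # row/col order
    description += sum(math.log2(s) for s in a if s > 0)    # row cluster sizes
    description += sum(math.log2(s) for s in b if s > 0)    # col cluster sizes
    description += sum(math.log2(a[i] * b[j] + 1)           # block densities
                       for i in range(k) for j in range(l))

    code = sum(a[i] * b[j] * binary_entropy(ones[i][j] / (a[i] * b[j]))
               for i in range(k) for j in range(l) if a[i] * b[j] > 0)
    return description, code
```

On a 4x4 block-diagonal matrix, the grouping that matches the two blocks has zero code cost and a lower total than putting everything in one group, which is the trade-off the cost function is meant to capture.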

Index

Complexity of CA

- O(E · (k² + l²)), ignoring the number of re-assign iterations, which is typically low.

Index

Complexity of CA

Time / Σ(k+l)

Number of edges

Index

Inter-group distances

Node Groups

Nodes

Nodes

Node Groups

Two groups are close if merging them does not increase cost by much

distance(i,j) = relative increase in cost on merging i and j

Index

Inter-group distances

[Figure: node groups Grp1, Grp2 and Grp3 with pairwise inter-group distances 5.5, 4.5 and 5.1]

Two groups are close if merging them does not increase cost by much

distance(i,j) = relative increase in cost on merging i and j

Index

Experiments

Grp8

Grp1

Author groups

Author groups

Stonebraker, DeWitt, Carey

Inter-group distances can aid in visualization

Index

Collaborative filtering and recommendation systems

- Q: If someone likes a product X, will (s)he like product Y?
- A: Check if others who liked X also liked Y.
- Focus on distances between people, typically cosine similarity
- and not on clustering

Index

CA and bipartite cores related but different

Hubs

Authorities

A 3x2 bipartite core

Kumar et al. 1999 say that bipartite cores

correspond to communities.

Index

CA and bipartite cores related but different

- CA finds two communities there: one for hubs, and one for authorities.
- We gracefully handle cases where a few links are missing.
- CA considers connections between all sets of clusters, and not just two sets.
- Not each node need belong to a non-trivial bipartite core.

CA is (informally) a generalization

Index

Comparison with soft clustering

- Soft clustering → each node belongs to each cluster with some probability
- Hard clustering → one cluster per node

Index

Comparison with soft clustering

- Far more degrees of freedom
- Parameter fitting is harder
- Algorithms can be costlier
- Hard clustering is better for exploratory data analysis
- Some real-world problems require hard clustering → e.g., fraud detection for accountants

Index

Weights for code cost vs description cost

- Total = 1·(code cost) + 1·(description cost)
- Physical meaning: Total number of bits
- Total = α·(code cost) + β·(description cost)
- Physical meaning: Number of encoding bits under some prior

Index

Formula for re-assigns

Re-assign for each row x

Index

Choosing k and l

- Split
- Find the row group R with the maximum entropy per row
- Choose the rows in R whose removal reduces the entropy per row in R
- Send these rows to the new row group, and set k = k+1
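The split step above can be sketched as follows. This is one possible reading, stated as an assumption: "entropy per row" of a row group is taken to be the group's code cost (block size times binary entropy of the block density, summed over column groups) divided by the number of rows in the group, and each row is tested individually against the group's current value. The function names and data layout are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) source; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def entropy_per_row(rows, col_sizes):
    """Code cost of a row group divided by its number of rows.

    rows[x][j] = number of 1s that row x has inside column group j.
    """
    a = len(rows)
    if a == 0:
        return 0.0
    bits = 0.0
    for j, b in enumerate(col_sizes):
        if b == 0:
            continue
        ones = sum(r[j] for r in rows)
        bits += a * b * binary_entropy(ones / (a * b))
    return bits / a

def split_row_group(D, row_groups, col_groups, k, l):
    """Split the row group with maximum entropy per row; returns (groups, new k)."""
    n_cols = len(col_groups)
    col_sizes = [col_groups.count(j) for j in range(l)]
    ones_in = [[sum(D[x][y] for y in range(n_cols) if col_groups[y] == j)
                for j in range(l)] for x in range(len(D))]
    members = {g: [x for x, rg in enumerate(row_groups) if rg == g]
               for g in range(k)}
    # Row group R with the maximum entropy per row
    worst = max(range(k), key=lambda g: entropy_per_row(
        [ones_in[x] for x in members[g]], col_sizes))
    base = entropy_per_row([ones_in[x] for x in members[worst]], col_sizes)
    new_groups = list(row_groups)
    moved = False
    for x in members[worst]:
        rest = [ones_in[r] for r in members[worst] if r != x]
        # Move row x if R without x has lower entropy per row
        if rest and entropy_per_row(rest, col_sizes) < base:
            new_groups[x] = k
            moved = True
    return (new_groups, k + 1) if moved else (list(row_groups), k)
```

The column split would be the same procedure applied to the transposed matrix. After a split, the re-assign passes are run again for the new k and l, and the split is kept only if the total encoding cost goes down.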

Index

Experiments

- Epinions dataset
- 75,888 users
- 508,960 dots, one dot per trust relationship
- k = 19 groups found

Small dense core

Index

Comparison with previous methods

- Our threshold subsumes the homogeneous model → Proof
- We are more accurate than the Mean-Field Assumption model.

Index

Comparison with previous methods

10K Star Graph

Index

Comparison with previous methods

Oregon Graph

Index

Accuracy of dynamical system

10K Star Graph

Index

Accuracy of dynamical system

Oregon Graph

Index

Accuracy of dynamical system

10K Star Graph

Index

Accuracy of dynamical system

Oregon Graph

Index

Relationship with full Markov Chain

- The full Markov Chain expresses Prob(infection at time t) in terms of X_{t-1}, Y_{t-1} and Z_{t-1}, where Z_{t-1} is the non-linear component
- Independence assumption leads to a point estimate for Z_{t-1} → non-linear dynamical system.
- Still non-linear, but now tractable

Index

Experiments Information survival

- INTEL sensor map (54 nodes)
- MIT sensor map (40 nodes)
- and others

Index

Experiments Information survival

INTEL sensor map

Index

Survival threshold on INTEL

Index

Survival threshold on INTEL

Index

Experiments Information survival

MIT sensor map

Index

Survival threshold on MIT

Index

Survival threshold on MIT

Index

Infinite Particle Systems

- Contact Process ≈ SIS model
- Differences
- Infinite graphs only → the questions asked are different
- Very specific topologies → lattices, trees
- Exact thresholds have not been found for these; proving existence of thresholds is important
- Our results match those on the finite line graph (Durrett 88)

Index

Intuition behind the largest eigenvalue

- Approximately the size of the largest blob
- Consider the special case of a caveman graph

Largest eigenvalue = 4

Index

Intuition behind the largest eigenvalue

- Approximately the size of the largest blob

Largest eigenvalue = 4.016

Index

Graph Patterns

- Power Laws

The epinions graph with 75,888 nodes and 508,960 edges

Count vs Indegree

Index

Graph Patterns

- Power Laws and deviations (DGX/Lognormals Bi 01)

Count

Degree

Index

Graph Patterns

- Power Laws and deviations
- Small-world
- Community effect

reachable pairs

hops

Index

Graph Generator Desiderata

- Other desiderata
- Few parameters
- Fast parameter-fitting
- Scalable graph generation
- Simple extension to undirected, bipartite and weighted graphs

- Power Laws and deviations
- Small-world
- Community effect

Most current graph generators fail to match some of these.

Index

The R-MAT generator

- SIAM DM04

Intuition: The 80-20 law

- Subdivide the adjacency matrix
- and choose one quadrant with probability (a,b,c,d)

[Figure: adjacency matrix split into four quadrants with probabilities a (0.5), b (0.1), c (0.15), d (0.25)]

Index

The R-MAT generator

- SIAM DM04

Intuition: The 80-20 law

- Subdivide the adjacency matrix
- and choose one quadrant with probability (a,b,c,d)
- Recurse till we reach a 1×1 cell
- where we place an edge
- and repeat for all edges.
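The recursive procedure in this list can be sketched in a few lines of Python. The sketch makes several assumptions that the slide does not state: the matrix has 2^scale rows and columns, quadrant a is top-left, b top-right, c bottom-left and d bottom-right, and duplicate edges are simply kept. The default probabilities are the example values shown on the earlier R-MAT slide; the function names are hypothetical.

```python
import random

def rmat_edge(scale, a, b, c, rng):
    """Pick one cell of a 2^scale x 2^scale adjacency matrix by recursion."""
    row = col = 0
    for _ in range(scale):
        r = rng.random()
        row *= 2
        col *= 2
        if r < a:
            pass                 # quadrant a: top-left
        elif r < a + b:
            col += 1             # quadrant b: top-right
        elif r < a + b + c:
            row += 1             # quadrant c: bottom-left
        else:
            row += 1             # quadrant d: bottom-right, d = 1 - a - b - c
            col += 1
    return row, col

def rmat_graph(scale, n_edges, a=0.5, b=0.1, c=0.15, seed=0):
    """Generate n_edges (source, target) pairs; duplicates are not removed."""
    rng = random.Random(seed)
    return [rmat_edge(scale, a, b, c, rng) for _ in range(n_edges)]

edges = rmat_graph(scale=10, n_edges=5000)
```

Each edge costs `scale` random draws, so generation time grows linearly with the number of edges and logarithmically with the number of nodes. A practical generator would also need a policy for duplicate edges, which this sketch leaves out.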

[Figure: recursive subdivision of the adjacency matrix into quadrants a, b, c and d]

Index

The R-MAT generator

- SIAM DM04

Intuition: The 80-20 law

- Only 3 parameters a, b and c (d = 1-a-b-c).
- We have a fast parameter fitting algorithm.

[Figure: recursive subdivision of the adjacency matrix into quadrants a, b, c and d]

Index

Experiments (Epinions directed graph)

Count vs Indegree

Count vs Outdegree

Hop-plot

Count vs Stress

Eigenvalue vs Rank

Network value

→ R-MAT matches directed graphs

Index

R-MAT communities and Cross-Associations

- R-MAT builds communities in graphs, and Cross-Associations finds them.
- Relationship?
- R-MAT builds a hierarchy of communities, while CA finds a flat set of communities
- Linkage in the sizes of communities found by CA
- When the R-MAT parameters are very skewed, the community sizes for CA are skewed
- and vice versa

Index

R-MAT and tree-based generators

- Recursive splitting in R-MAT is like following a tree from root to leaf.
- Relationship with other tree-based generators (Kleinberg 01, Watts 02)?
- The R-MAT tree has edges as leaves, the others have nodes
- Tree-distance between nodes is used to connect nodes in other generators, but what does tree-distance between edges mean?

Index

Comparison with relational learning

Relational Learning (typical)
- Aims to find small structure/patterns at the local level
- Labeled nodes and edges
- Semantics of labels are important
- Algorithms are typically costlier

Graph Mining (typical)
- Emphasis on global aspects of large graphs
- Unlabeled graphs
- More focused on topological structure and properties
- Scalability is more important

Index

OTHER WORK


Other Work

- Time Series Prediction CIKM 2002
- We use the fractal dimension of the data
- This is related to chaos theory
- and Lyapunov exponents

Other Work

Logistic Parabola

- Time Series Prediction CIKM 2002

Other Work

Lorenz attractor

- Time Series Prediction CIKM 2002

Other Work

Laser fluctuations

- Time Series Prediction CIKM 2002

Other Work

- Adaptive histograms with error guarantees

Ashraf Aboulnaga, Yufei Tao, Christos Faloutsos

Insertions, deletions

Count

- Maintain count probabilities for buckets
- to give statistically correct query result-size estimation
- and query feedback

Salary

Other Work

- User-personalization
- Patent number 6,611,834 (IBM)
- Relevance feedback in multimedia image search
- Filed for patent (IBM)
- Building 3D models using robot camera and rangefinder data ICML 2001

EXTRAS

Conclusions

- Two paths in graph mining
- Specific applications
- Viral Propagation → Resilience testing, information dissemination, rumor spreading
- Node Grouping → automatically grouping nodes, AND finding the correct number of groups

- References
- Fully automatic Cross-Associations, by Chakrabarti, Papadimitriou, Modha and Faloutsos, in KDD 2004
- AutoPart: Parameter-free graph partitioning and Outlier detection, by Chakrabarti, in PKDD 2004
- Epidemic spreading in real networks: An eigenvalue viewpoint, by Wang, Chakrabarti, Wang and Faloutsos, in SRDS 2003

Conclusions

- Two paths in graph mining
- Specific applications
- General issues
- Graph Patterns → Marks of realism in a graph
- Graph Generators → R-MAT, a fast, scalable generator matching many of the patterns

- References
- R-MAT: A recursive model for graph mining, by Chakrabarti, Zhan and Faloutsos, in SIAM Data Mining 2004.
- NetMine: New mining tools for large graphs, by Chakrabarti, Zhan, Blandford, Faloutsos and Blelloch, in the SIAM 2004 Workshop on Link analysis, counter-terrorism and privacy

Other References

- F4: Large Scale Automated Forecasting using Fractals, by D. Chakrabarti and C. Faloutsos, in CIKM 2002.
- Using EM to Learn 3D Models of Indoor Environments with Mobile Robots, by Y. Liu, R. Emery, D. Chakrabarti, W. Burgard and S. Thrun, in ICML 2001
- Graph Mining: Laws, Generators and Algorithms, by D. Chakrabarti and C. Faloutsos, under submission to ACM Computing Surveys

References --- graphs

- R-MAT: A recursive model for graph mining, by D. Chakrabarti, Y. Zhan, C. Faloutsos, in SIAM Data Mining 2004.
- Epidemic spreading in real networks: An eigenvalue viewpoint, by Y. Wang, D. Chakrabarti, C. Wang and C. Faloutsos, in SRDS 2003
- Fully automatic Cross-Associations, by D. Chakrabarti, S. Papadimitriou, D. Modha and C. Faloutsos, in KDD 2004
- AutoPart: Parameter-free graph partitioning and Outlier detection, by D. Chakrabarti, in PKDD 2004
- NetMine: New mining tools for large graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SIAM 2004 Workshop on Link analysis, counter-terrorism and privacy

Roadmap

- Specific applications
- Node grouping
- Viral propagation

- General issues
- Realistic graph generation
- Graph patterns and laws

2

4

Other Work

Conclusions

Experiments (Clickstream bipartite graph)

Some personal webpage

Clickstream

Count

Yahoo, Google and others

Websites

Users

In-degree

Experiments (Clickstream bipartite graph)

Email-checking surfers

Clickstream

Count

All-night surfers

Websites

Users

Out-degree

Experiments (Clickstream bipartite graph)

Clickstream R-MAT

Reachable pairs

Websites

Users

Hops

Graph Generation

- Important for
- Simulations of new algorithms
- Compression using a good graph generation model
- Insight into the graph formation process
- Our R-MAT (Recursive MATrix) generator can match many common graph patterns.

Recall the definition of eigenvalues

λ_A = eigenvalue of A; λ_{1,A} = largest eigenvalue

A X = λ_A X

β/δ < τ = 1/λ_{1,A}

Tools for Large Graph Mining

- Deepayan Chakrabarti
- Carnegie Mellon University