Structural Analysis in Large Networks: Observations and Applications

- Mary McGlohon
- Committee
- Christos Faloutsos, co-chair
- Alan Montgomery, co-chair
- Geoffrey Gordon
- David Jensen, University of Massachusetts, Amherst

Motivation

- Network (a.k.a. graph, relational, social network) data has become ubiquitous. We want to know:
  - How do networks form and structure themselves?
  - How does information propagate through networks?
  - How do sub-communities form?

[Figure: example networks, including computer networks and the IMDB actor-movie network]

Outline for thesis

Motivation Topology

- How do these network structures form?
- Example: identify topological properties common to many different types of graphs (citations, friendships, etc.).
- Developing models of these properties allows for forecasting.

Motivation Cascades

- Once the networks form, how does information propagate through the graph?
- Example: extract, analyze, and model cascades.

Motivation Community

- How do we compare communities, or sub-networks?
- Example: for a set of online groups (Usenet), which ones continue to thrive over time?

Thesis statement

- We propose to
  - investigate how interactions in graphs occur, how these interactions lead to diffusion and community behavior, and
  - to model these behaviors and apply these findings to real-world problems.

Impact

- Understanding the relations found in networks has many applications, such as
  - Fraud/anomaly detection
    - Given typical behavior and information about nodes/edges, how suspicious is a node or group of nodes?
  - Ad personalization/recommendation systems
    - Given some information about an individual and their friends, which ads to display?
  - Resource allocation
    - Given typical patterns of network growth, how can we allocate resources (hardware, advertising budget, etc.)?

Completed Work

- KDD08
- ICDM08
- SDM07
- ICWSM07
- ICWSM09-1
- ICWSM09-2
- ICWSM09-3
- KDD09 (to appear)

Proposed Work

- P1a: How do cascades compare across network structures?
- P1b: Can we use cascades to model product adoption?
- P2: Can we predict success/failure of groups?

The rest of the talk

- Motivation and thesis statement
- Completed work
- Proposed work
- Conclusions and impact
- Audience participation!

Completed Work

- What patterns are common to networks?

Topological Observations

- Diameter over time

- Connected components

(Kevin Bacon)

- Edge weights

Topological Observations Data

- Analyze unipartite and bipartite networks
  - Unipartite: citations, blogs, router traffic
  - Bipartite: IMDB actor-movie, campaign contributions
- Networks are evolving over time
- Networks may be weighted
  - Repeated edges
  - Edge weights

Topological Observations Gelling Point

- When does a graph begin displaying expected patterns, such as the giant connected component? How can we tell when this happens?

Topological Observations Gelling Point

- Observation: Most real graphs display a gelling point, where the graph begins to come together and the giant connected component forms. After that point, they exhibit typical behavior.

[Figure: IMDB, diameter over time; gelling point at t=1914]

Topological Observations NLCCs

- In graphs a giant connected component emerges.
- We look at sizes of the next-largest connected components (NLCCs)
  - After the gelling point, do they continue to grow? Do they shrink?

Topological Observations NLCCs

- Observation: After the gelling point, the giant connected component takes off, but next-largest connected components remain constant or oscillate.

[Figure: IMDB, sizes of the 2nd and 3rd connected components over time; gelling point at t=1914]
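One way to trace the giant component and the next-largest components as edges arrive is an incremental union-find. The sketch below is only an illustration: the class, the function name, and the toy edge stream are hypothetical and are not taken from the talk or from the KDD08 code.

```python
class UnionFind:
    """Disjoint-set structure that keeps the size of every component root."""

    def __init__(self):
        self.parent = {}
        self.size = {}  # only roots have an entry here

    def find(self, x):
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size.pop(rb)

    def top_sizes(self, k=3):
        """Sizes of the k largest components, largest first."""
        return sorted(self.size.values(), reverse=True)[:k]


def component_sizes_over_time(timed_edges, k=3):
    """timed_edges: iterable of (t, u, v) sorted by t.

    Returns {t: top-k component sizes after all edges with timestamp t}.
    """
    uf = UnionFind()
    history = {}
    for t, u, v in timed_edges:
        uf.union(u, v)
        history[t] = uf.top_sizes(k)
    return history


# Toy edge stream (hypothetical data): (time, node, node)
toy_edges = [(1, "a", "b"), (1, "c", "d"), (2, "e", "f"), (3, "b", "c"), (4, "g", "h")]
history = component_sizes_over_time(toy_edges)
for t in sorted(history):
    print(t, history[t])
```

In this toy stream the largest component jumps at t=3 while the runner-up components stay at size 2, which is the qualitative pattern the observation describes.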

Topological Observations Weights

- How are edges in a graph repeated, or otherwise weighted?
- As the number of edges increases, does the total edge weight grow linearly?

Topological Observations Weights

- Observation: Weight additions follow a power law with respect to the number of edges
  - $W(t) \propto E(t)^{w}$
  - $W(t)$: total weight of graph at time $t$
  - $E(t)$: total edges of graph at time $t$
  - $w$ is the power-law exponent ($w > 1$)
- Many other weighted laws
  - see KDD08, ICDM08

[Figure: Orgs-Candidates, log(Weights) vs. log(Edges); slope = 1.3]
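A power law relating total weight to edge count, W(t) proportional to E(t)^w, shows up as a straight line of slope w on log-log axes. The sketch below estimates that slope with ordinary least squares. The data is synthetic, generated with an assumed exponent of 1.3 and an assumed constant of 2.0; the function name is hypothetical.

```python
import math


def fit_power_law_exponent(edges, weights):
    """Least-squares fit of log(weights) = slope * log(edges) + intercept."""
    xs = [math.log(e) for e in edges]
    ys = [math.log(w) for w in weights]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic snapshots (not real data): W = 2.0 * E^1.3
edge_counts = [10, 100, 1000, 10000]
total_weights = [2.0 * e ** 1.3 for e in edge_counts]

w, log_c = fit_power_law_exponent(edge_counts, total_weights)
print(f"estimated exponent w = {w:.3f}, constant = {math.exp(log_c):.3f}")
```

On real snapshots the points scatter around the line, so the fitted slope is an estimate and not an exact recovery.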

Completed Work

- Gelling point, CCs
- Weighted laws

- Can we develop generative models?

Topological Models Butterfly

- Goals are to generate
- Constant/oscillating NLCCs
- Densification power law Leskovec05
- Shrinking diameter (after gelling point)
- Power-law degree distribution
- Emergent, local, intuitive behavior

Topological Models Butterfly

- Main idea: uses 3 parameters
  - Curiosity: how much to explore the local network (U(0,1), creates power-law degree distribution)
  - Flyout: how many local networks to explore (global, joins components)
  - Friendliness: how often to connect (global, allows new components)
- Details: see KDD08

Completed Work

- Gelling point, CCs
- Weighted laws

- Butterfly
- RTM
- Oddball

- What are patterns of cascades in networks?

Cascade Observations Data

- Gathered from August-September 2005
- Used set of 44,362 blogs, traced cascades
- 2.4 million posts
- 245,404 blog-to-blog links

[Figure: number of posts per day over time, Jul 4 to Sep 29]

Cascade Observations Prelims

[Figure: example blogosphere (nodes a-e); star and chain cascade shapes]

- How quickly does a link to a post occur?
- What size do cascades typically reach?
- What are typical shapes: how often are stars and chains occurring?

Temporal Observations

- How quickly does a link to a post occur?
- Does popularity decay at a constant rate?
- With an exponential (half life)?

[Figure panels: linear-linear scale, log-linear scale, log-log scale]

Cascade Observations Link Popularity

- Observation: The probability that a post written at time $t_p$ acquires a link at time $t_p + \Delta$ is
  - $p(t_p + \Delta) \propto \Delta^{-1.5}$
- Similar to Vazquez06

Cascade Observations Cascade Size

- Q: What size distribution do cascades follow? Are large cascades frequent?
- Observation: The probability of observing a cascade of $n$ blog posts follows a Zipf distribution
  - $p(n) \propto n^{-2}$

[Figure: log(Count) vs. log(Cascade size) (# of nodes); slope = -2]
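To get a feel for how rare large cascades are under a size law with p(n) proportional to n^-2, the sketch below normalizes the law and computes a tail probability. It assumes, purely for illustration, that the law holds exactly for every size n >= 1, so the normalizing constant is the sum of 1/n^2, which equals pi^2/6. The function names are hypothetical.

```python
import math

ZETA_2 = math.pi ** 2 / 6  # sum of 1/n^2 over all n >= 1


def cascade_size_pmf(n):
    """P(size = n) under the idealized law p(n) = n^-2 / zeta(2)."""
    return 1.0 / (ZETA_2 * n ** 2)


def tail_probability(n_min):
    """P(size >= n_min), computed as 1 minus the mass below n_min."""
    return 1.0 - sum(cascade_size_pmf(n) for n in range(1, n_min))


for n_min in (2, 10, 100):
    print(n_min, round(tail_probability(n_min), 4))
```

Under this idealization roughly 61% of cascades have a single post, and well under 1% reach 100 posts.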

Cascade Observations Cascade Size

- Q: What is the distribution of particular cascade shapes?
- Observation: Stars and chains in blog cascades also follow a power law, with different exponents (star: -3.1, chain: -8.5).
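Counting stars and chains first requires labeling each cascade by shape. The sketch below is one simple way to do that from node degrees, assuming each cascade is given as a tree of (child, parent) links with at least one link. The labeling convention (anything with maximum degree 2 or less is a chain, so two- and three-node cascades count as chains) is an assumption made here, not necessarily the one used in SDM07.

```python
from collections import Counter


def classify_cascade(edges):
    """Label a cascade tree as 'chain', 'star', or 'other'.

    edges: non-empty list of (child, parent) links that form a tree.
    """
    degree = Counter()
    for child, parent in edges:
        degree[child] += 1
        degree[parent] += 1
    n = len(degree)
    max_degree = max(degree.values())
    if max_degree <= 2:
        return "chain"  # a path: every post has at most two neighbors
    if max_degree == n - 1:
        return "star"   # one post is linked by all the others
    return "other"


chain = [("b", "a"), ("c", "b"), ("d", "c")]
star = [("b", "a"), ("c", "a"), ("d", "a")]
mixed = [("b", "a"), ("c", "a"), ("d", "a"), ("e", "b")]
print(classify_cascade(chain), classify_cascade(star), classify_cascade(mixed))
```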

Completed Work

- Gelling point, CCs
- Weighted laws

- Butterfly
- RTM
- Oddball

- Cascades laws
- Cascades as features

- Can we develop predictive models for cascades?

Cascade Models CGM

- Cascade Generation Model
- Overview: produce realistic cascades through an emergent viral model
- Details: see SDM07

[Figure: most frequent cascades, model vs. data]

Completed Work

- Gelling point, CCs
- Weighted laws

- Butterfly
- RTM
- Oddball

- Cascades laws
- Cascades as features

- Cascade generation model
- ZC model

- Political Usenet study

- Can we detect anomalies?

Community Tools SNARE

- Problem: Given a network and some domain knowledge about suspicious nodes (flags), determine which nodes are most risky.
- Data: Accounting transaction data. Nodes are accounts, edges are transactions between accounts.

[Figure: account network with Accounts Payable, Revenue Accts, and Accounts Receivable]

Community Tools SNARE

- Example: Channel stuffing
  - Some accounts overstated
  - But other accounts also involved.
  - Since many accounts are slightly affected, it is easy to cover up activity.

[Figure: Accounts Payable, Revenue Accts, and Accounts Receivable, with a risk scale from "not risky" to "very risky"]

Community Tools SNARE

- Social Network Analytic Risk Evaluation
- Use domain knowledge to flag certain nodes.
- Assume homophily between nodes (guilt by association).
- Then, using initial risk as initial node potentials, use belief propagation (message passing between nodes) to determine end risk scores.

Community Tools SNARE

- Belief Propagation
  - Flags are node potentials, or initial risk scores
  - All nodes send messages back and forth with beliefs
  - Upon convergence, end result will reflect riskiest nodes.
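The message-passing idea can be sketched with a small sum-product belief propagation over two states (not risky, risky). Everything specific here is an assumption for illustration, not SNARE's actual configuration: the homophily strength `eps`, the neutral prior of 0.5 for unflagged nodes, the fixed number of iterations, and the toy account names.

```python
def belief_propagation(edges, prior_risk, default_prior=0.5, eps=0.1, iterations=20):
    """Sum-product BP on an undirected graph with states 0 (ok) and 1 (risky).

    prior_risk: {node: initial risk in (0, 1)} for flagged nodes.
    eps: homophily strength; neighbors agree with weight 0.5 + eps.
    Returns {node: final risk score}.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    def phi(node):
        # Node potential: (P(ok), P(risky)) from the flag, or neutral.
        p = prior_risk.get(node, default_prior)
        return (1.0 - p, p)

    def psi(a, b):
        # Edge potential encoding "guilt by association".
        return 0.5 + eps if a == b else 0.5 - eps

    # One message per directed edge, initialized to uniform.
    messages = {(u, v): (0.5, 0.5) for u in neighbors for v in neighbors[u]}

    for _ in range(iterations):
        updated = {}
        for (src, dst) in messages:
            out = []
            for x_dst in (0, 1):
                total = 0.0
                for x_src in (0, 1):
                    term = phi(src)[x_src] * psi(x_src, x_dst)
                    for k in neighbors[src]:
                        if k != dst:
                            term *= messages[(k, src)][x_src]
                    total += term
                out.append(total)
            norm = out[0] + out[1]
            updated[(src, dst)] = (out[0] / norm, out[1] / norm)
        messages = updated

    risk = {}
    for node in neighbors:
        belief = list(phi(node))
        for k in neighbors[node]:
            for x in (0, 1):
                belief[x] *= messages[(k, node)][x]
        risk[node] = belief[1] / (belief[0] + belief[1])
    return risk


# Toy chain of accounts (hypothetical): only A is flagged.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D")]
risk = belief_propagation(toy_edges, prior_risk={"A": 0.9})
for node in sorted(risk):
    print(node, round(risk[node], 4))
```

In the toy chain, risk decays with distance from the flagged account, so unflagged neighbors still receive elevated scores.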

Community Tools SNARE

- Produces improvement over simply using flags
  - Up to 6.5 lift
  - Improvement especially for low false positive rate

[Figure: results for accounts data (ROC curve), true positive rate vs. false positive rate; curves: Ideal, SNARE, Baseline (flags only)]

Community Tools SNARE

- Accurate: produces large improvement over simply using flags
- Flexible: can be applied to other domains
- Scalable: one iteration of BP runs in time linear in the number of edges
- Robust: works on large range of parameters

Completed Work

- Gelling point, CCs
- Weighted laws

- Butterfly
- RTM
- Oddball

- Cascades laws
- Cascades as features

- Cascade generation model
- ZC model

- Political Usenet study

- SNARE

Proposed Work

- 2 main problems
  - P1: Cascades and product adoption
    - How do cascades vary according to network structure?
    - Can we use cascades to model product adoption?
  - P2: Predicting success/failure of online groups

P1a Cascades Network Structure

- In different networks, how does the starting point of an epidemic affect the epidemic size?
- Which modifications to the current model change the cascades (weights, self-infection)?
- Can we reverse-engineer network properties based on observed cascades?

P1b Cascades Product Adoption

- Examine adoption of Caller Ringback Tones (CRBT)
  - User buys ringtone
  - Friend calls user, hears CRBT
- Phone call data
  - Nodes: user ID, DOB, salutation (Mr/Ms), date of joining, data plan
  - Call edges: src/dest ID, call time, duration
  - SMS edges: src/dest ID, time
  - CRBT purchases: purchase date, song name, cost

P1b Cascades Product Adoption

- Can we fit the Bass Model for different CRBTs?
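For reference, the standard Bass diffusion curve that such a fit would target can be written in closed form. The sketch below only evaluates the curve; it does not fit anything to CRBT data. The coefficients p = 0.03 and q = 0.38 are illustrative placeholder values, not estimates from the phone-call data set.

```python
import math


def bass_cumulative(t, p, q):
    """Fraction of eventual adopters who have adopted by time t.

    p: coefficient of innovation, q: coefficient of imitation.
    """
    decay = math.exp(-(p + q) * t)
    return (1.0 - decay) / (1.0 + (q / p) * decay)


def bass_rate(t, p, q):
    """Adoption rate at time t (derivative of bass_cumulative)."""
    decay = math.exp(-(p + q) * t)
    return ((p + q) ** 2 / p) * decay / (1.0 + (q / p) * decay) ** 2


def bass_peak_time(p, q):
    """Time of peak adoption rate; meaningful when q > p."""
    return math.log(q / p) / (p + q)


# Illustrative coefficients only (assumed, not fitted).
p_innovation, q_imitation = 0.03, 0.38
peak = bass_peak_time(p_innovation, q_imitation)
print(f"peak adoption at t = {peak:.2f}")
print(f"adopted by then: {bass_cumulative(peak, p_innovation, q_imitation):.2%}")
```

Fitting would then mean choosing p, q, and a market size so that this curve tracks the observed cumulative purchases of each song.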

P1b Cascades Product Adoption

- Are some CRBTs more viral than others? Does the footprint follow a skewed distribution?
- How long after purchase is a CRBT infective?

[Figure: survival function P(X > x) vs. number of downloads (per song)]
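An empirical survival function P(X > x) is a common way to check for a skewed distribution, since heavy tails stay visible where a histogram would be empty. The sketch below computes it from per-song download counts; the counts are made-up toy values, not taken from the CRBT data.

```python
from collections import Counter


def survival_function(values):
    """Return sorted (x, P(X > x)) pairs for the empirical distribution."""
    n = len(values)
    counts = Counter(values)
    remaining = n
    result = []
    for x in sorted(counts):
        remaining -= counts[x]  # observations strictly greater than x
        result.append((x, remaining / n))
    return result


# Toy per-song download counts (hypothetical).
downloads = [1, 1, 1, 2, 2, 5, 10, 100]
survival = survival_function(downloads)
for x, prob in survival:
    print(x, prob)
```

Plotting these pairs on log-log axes gives a roughly straight line when the tail follows a power law.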

P1b Cascades Product Adoption

- How does the weight of a link, homophily, or other factors affect the likelihood of transmission?
- Can we explicitly test whether a purchase is a result of basic similarity of neighbors or a result of viral propagation?
- How can we build and verify a model for this propagation?

P2 Success Failure of Online Groups

- Use data over 4 years from nearly 200 newsgroups (Political Usenet).
- Many discussion groups stop posting by the third year.
  - Why?

P2 Success Failure of Online Groups

- P2 Questions
  - If structural network characteristics can be traced to success or failure, which features are most predictive?
  - Can we test causality in the predictive characteristics?

Timeline

- May 09: P1 preliminaries
- Jun 09: Internship at Google
- Sep 09: P1a Cascades and network structure
- Nov 09: P1b Cascades and product adoption
- Mar 10: P2 Success/failure of online groups
- Jul 10: Complete document
- Aug 10: Defend

Related work

- Topology
  - Heavy-tailed degree distributions: Faloutsos99, Albert02, Kleinberg99
  - Shrinking diameter, densification: Leskovec05
  - Random graphs model: Erdos60
  - Forest Fire model: Leskovec05
  - "Winners do not take all" model: Pennock02
- Cascades
  - Recommendations: Leskovec06
  - Diffusion in blogs: Adar03, Gruhl04, Kempe03, Kumar03
  - Marketing: product adoption (Bass69), word-of-mouth (Godes04)
  - Virus propagation: populations (Hethcote), networks (Boguna, Pastor-Satorras, Chakrabarti)
- Communities and other applications
  - Securities fraud detection: Neville05, Fast07
  - Author identification: Hill04
  - Online group behavior: Backstrom08

Conclusions Completed

- Demonstrated several properties common to networks in a wide range of domains.
  - Oscillating sizes of next-largest connected components
  - Power laws for weighted graphs
- Butterfly model generates these properties

Conclusions Completed

- Studied and modeled cascades in blogs
  - Several power laws for cascade shapes and size
  - Cascade Generation Model
- Devised SNARE for anomaly detection for accounting data (lift factor up to 6.5)

Conclusions Proposed

- P1a: Continue cascade studies across network structures
- P1b: Use cascades to model purchases in phone-call graph
- P2: Build predictive models for success and failure in online groups

References

- Topology
  - KDD08: M. McGlohon, L. Akoglu, and C. Faloutsos. Weighted Graphs and Disconnected Components: Patterns and a Generator. SIG-KDD. Las Vegas, Nev., August 2008.
  - ICDM08: L. Akoglu, M. McGlohon, and C. Faloutsos. RTM: Laws and a Recursive Generator for Weighted Time-Evolving Graphs. ICDM. Pisa, Italy, Dec. 2008.
- Cascades
  - SDM07: J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst. Patterns of Cascading Behavior in Large Blog Graphs. SDM. Minneapolis, Minn., April 2007.
  - ICWSM07: M. McGlohon, J. Leskovec, C. Faloutsos, N. Glance, and M. Hurst. Finding patterns in blog shapes and blog evolution. ICWSM. Boulder, Colo., March 2007.
  - ICWSM09-1: M. Goetz, J. Leskovec, M. McGlohon, and C. Faloutsos. Modeling Blog Dynamics. ICWSM. San Jose, Calif., May 2009.

References

- Community
  - KDD09: M. McGlohon, S. Bay, M. Anderle, D. Steier, and C. Faloutsos. SNARE: A Link Analytic System for Evaluating Fraud Risk. ACM Special Interest Group on Knowledge Discovery and Data Mining (SIG-KDD). Paris, France, June 2009.
  - ICWSM09-2: M. McGlohon and M. Hurst. Community Structure and Information Flow in Usenet: Improving analysis with a thread ownership model. International Conference on Weblogs and Social Media (ICWSM). San Jose, Calif., May 2009.
  - ICWSM09-3: M. McGlohon and M. Hurst. Considering the Sources: Comparing linking patterns in Usenet and blogs. International Conference on Weblogs and Social Media (ICWSM). San Jose, Calif., May 2009.

- Acknowledgments
- Leman Akoglu
- Markus Anderle
- Stephen Bay
- Polo Chau
- Christos Faloutsos
- Natalie Glance
- Mila Goetz
- Geoff Gordon
- Matthew Hurst
- i-Lab
- David Jensen
- Ramayya Krishnan
- Jure Leskovec
- Austin McDonald
- Alan Montgomery
- Chris Neff
- Nachi Sahoo
- Purna Sarkar

- Support
- PricewaterhouseCoopers
- Microsoft Live Labs
- NSF Graduate Research Fellowship
- Yahoo! Key Technical Challenges Grant, Pennsylvania Infrastructure Technology Alliance (PITA)
- Hewlett-Packard
- NSF Grants No. IIS-0705359, IIS-0534205, and CNS-0721736, 0209107, SENSOR-0329549, EF-0331657, IIS-0326322
- U.S. Department of Energy Lawrence Livermore National Laboratory contract No. W-7405-ENG-48.

Audience participation!

Talk expansion pack

P1b Other Cascade Data

- Post data from corporate blogs
  - Demographic data on bloggers (employee ID, location, job description)
  - Read data (timestamped)
  - Write data (timestamped)
- CRBT adoption in general
  - Perhaps people do not adopt particular songs, but the CRBT mechanism
- More public blog data (spinn3r)
  - Also use edge information from blogrolls/comments

P2 Potential features to examine

- Posting behavior
  - Which users are posting, how often are they posting, and how skewed is the distribution?
- Linking behavior
  - How long are cascades (threads), in terms of posts and time?
- Content
  - Topics, keywords, sentence length, other textual features, sentiment analysis

Unipartite Networks

- Postnet: posts in blogs, hyperlinks between them
- Blognet: aggregated Postnet, repeated edges
- Patent: patent citations
- NIPS: academic citations
- Arxiv: academic citations
- NetTraffic: packets, repeated edges
- Autonomous Systems (AS): packets, repeated edges

4 million nodes, 8 million edges, 17 years

Bipartite Networks

- IMDB: actor-movie network
- Netflix: user-movie ratings
- DBLP conference- repeated edges
  - Author-Keyword
  - Keyword-Conference
  - Author-Conference
- US Election Donations: weights, repeated edges
  - Orgs-Candidates
  - Individuals-Orgs

6 million nodes, 10 million edges, 22 years

Topological Models Butterfly

- Nodes may have multiple hosts.
  - Joins components

Topological Models RTM

- Recursive Tensor Model
- Goal: to introduce time and burstiness
- Main idea: begin with a core tensor (multidimensional array), and use self-similarity to reproduce observed power laws.

Topological Models RTM

- Self-similarity arises from the Kronecker product
  - 2D: adjacency matrix (Leskovec06)
  - 3D: use the Kronecker product on a core tensor, with time as the 3rd dimension
- Reproduced power laws as found in ICDM08
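The 2D case of Kronecker self-similarity can be illustrated in a few lines: repeatedly taking the Kronecker product of a small seed adjacency matrix with itself yields a larger matrix in which the seed pattern recurs at every scale. The sketch below covers only the matrix case with a made-up 2x2 seed; RTM applies the same idea to a 3D core tensor, which is not shown here.

```python
def kronecker(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    rows_b, cols_b = len(b), len(b[0])
    return [
        [a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
         for j in range(len(a[0]) * cols_b)]
        for i in range(len(a) * rows_b)
    ]


def kronecker_power(seed, k):
    """k-fold Kronecker product of seed with itself (k >= 1)."""
    result = seed
    for _ in range(k - 1):
        result = kronecker(result, seed)
    return result


# Hypothetical 2x2 seed adjacency matrix.
seed = [[1, 1],
        [1, 0]]

squared = kronecker(seed, seed)
cubed = kronecker_power(seed, 3)
for row in cubed:
    print("".join("#" if cell else "." for cell in row))
```

The seed has 3 nonzero entries, so its k-th Kronecker power has 3^k edges on 2^k nodes.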

Topological Applications Oddball

- Main ideas
  - Use local neighborhood of node
  - Find common patterns
  - Score how much a node deviates from common patterns
- Results
  - Identified anomalous nodes, such as Ken Lay in Enron and particularly unusual blog posts

Cascade Models Zero-crossing

- Main ideas
  - Models blogs in both network growth and network diffusion
  - Choose to post based on random walk (produces burstiness)
  - Link based on recency and popularity (reproduces -1.5 law and skewed degree)
- Improvement over CGM because network is generated

Community Observations Newsgroups

- Observation: Threads introduced to a group later in the thread tended to have more activity from that group.
- Observation: Discussions tended to flow from main groups (can.politics) into subgroups (ab.politics, bc.politics).

Community Observations Newsgroups

- 189 newsgroups ("polit" in name), January 2004-June 2008
- 37 million posts
- Includes many countries, provinces, states, topical groups (alt.politics.guns)

Major issue: over half are cross-posted to multiple groups. Where is conversation truly occurring?

Community Observations Newsgroups

- Solution: Introduce thread ownership, by assigning threads according to where authors exclusively post.
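One possible reading of thread ownership can be sketched as a two-step vote. The rules below are assumptions made for illustration and may differ from the method in ICWSM09-2: an author's home group is taken to be the single group that appears on every one of their posts, and a thread is assigned to the home group with the most posts from such authors. The function names and the toy posts are hypothetical.

```python
from collections import Counter, defaultdict


def author_home_groups(posts):
    """Map each author to a home group, if exactly one group is on all their posts.

    posts: list of (author, thread_id, groups), groups being a set of newsgroups.
    """
    common = {}
    for author, _thread, groups in posts:
        groups = set(groups)
        common[author] = common[author] & groups if author in common else groups
    return {a: next(iter(g)) for a, g in common.items() if len(g) == 1}


def thread_owners(posts):
    """Assign each thread to the home group with the most posts in it.

    Threads with no posts from exclusive authors are left unassigned.
    """
    home = author_home_groups(posts)
    votes = defaultdict(Counter)
    for author, thread, _groups in posts:
        if author in home:
            votes[thread][home[author]] += 1  # one vote per post
    return {thread: c.most_common(1)[0][0] for thread, c in votes.items()}


# Toy cross-posted data (hypothetical).
both = {"can.politics", "ab.politics"}
posts = [
    ("alice", "t1", both), ("alice", "t2", {"ab.politics"}),
    ("bob", "t1", both), ("bob", "t3", {"can.politics"}),
    ("carol", "t1", both), ("carol", "t4", {"ab.politics"}),
]
owners = thread_owners(posts)
print(owners["t1"])
```

Here thread t1 is cross-posted to both groups, but two of its three authors otherwise post only in ab.politics, so the sketch assigns it there.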

Completed Work

- What patterns are common to networks?
- Can we develop generative models and detect anomalies?
- What are patterns of cascades in networks?
- Can we develop predictive models for cascades?
- How can we compare communities?
- Can we detect anomalies, and predict group behavior?