Structural Analysis in Large Networks Observations and Applications - PowerPoint PPT Presentation

Slides: 97
Provided by: marymc7
Learn more at: http://www.cs.cmu.edu
Transcript and Presenter's Notes

1
Structural Analysis in Large Networks: Observations
and Applications
  • Mary McGlohon
  • Committee
  • Christos Faloutsos, co-chair
  • Alan Montgomery, co-chair
  • Geoffrey Gordon
  • David Jensen, University of Massachusetts, Amherst

2
Motivation
  • Network (a.k.a. graph, relational, social
    network) data has become ubiquitous. We want to
    know
  • How do networks form and structure themselves?
  • How does information propagate through networks?
  • How do sub-communities form?

[Figure: example networks: (1) computer networks, (2) Facebook, (3) IMDB actor-movie]
3
Outline for thesis
4
Motivation: Topology
  • How do these network structures form?
  • Example: Identify topological properties common
    to many different types of graphs (citations,
    friendships, etc.)
  • Developing models of these properties allows for
    forecasting.

5
Motivation: Cascades
  • Once the networks form, how does information
    propagate through the graph?
  • Example: Extract, analyze, and model cascades.

6
Motivation: Community
  • How do we compare communities, or sub-networks?
  • Example: For a set of online groups (Usenet),
    which ones continue to thrive over time?

7
Thesis statement
  • We propose to
  • investigate how interactions in graphs occur, how
    these interactions lead to diffusion and
    community behavior, and
  • to model these behaviors and apply these findings
    to real-world problems.

8
We propose to
  • investigate how interactions in graphs occur,
  • how these interactions lead to diffusion
    and community behavior, and
  • to model these behaviors and apply these
    findings to real-world problems.

9
Impact
  • Understanding the relations found in networks has
    many applications, such as
  • Fraud/anomaly detection
  • Given typical behavior and information about
    nodes/edges, how suspicious is a node or group
    of nodes?
  • Ad personalization/recommendation systems
  • Given some information about an individual and
    their friends, which ads to display?
  • Resource allocation
  • Given typical patterns of network growth, how can
    we allocate resources (hardware, advertising
    budget, etc.)?

10
Completed Work
  • KDD08
  • ICDM08
  • SDM07
  • ICWSM07
  • ICWSM09-1
  • ICWSM09-2
  • ICWSM09-3
  • KDD09

- to appear
11
Proposed Work
P1a: How do cascades compare across network
structures?
P1b: Can we use cascades to model product
adoption?
P2: Can we predict success/failure of groups?
12
The rest of the talk
  • Motivation and thesis statement
  • Completed work
  • Proposed work
  • Conclusions and impact
  • Audience participation!

13
Completed Work
  • What patterns are common to networks?

14
Topological Observations
  • Diameter over time
  • Connected components

(Kevin Bacon)
  • Edge weights

15
Topological Observations: Data
  • Analyze unipartite and bipartite networks
  • Networks are evolving over time
  • Networks may be weighted
  • Repeated edges
  • Edge weights

[Figure: unipartite graph (nodes n1-n4): citations, blogs, router traffic; bipartite graph (nodes n1-n4, m1-m3): IMDB actor-movie, campaign contributions]
16
Topological Observations: Gelling Point
  • When does a graph begin displaying expected
    patterns, such as the giant connected component?
    How can we tell when this happens?

17
Topological Observations: Gelling Point
  • Observation: Most real graphs display a gelling
    point, where the graph begins to come together
    and the giant connected component forms. After
    that point, they exhibit typical behavior.

[Figure: IMDB, diameter vs. time, with t = 1914 marked]
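
A minimal sketch of how a diameter-over-time curve like the one on this slide could be computed for a small graph. The function names, the `(t, u, v)` edge format, and the choice to read the gelling point off as the time of peak diameter are assumptions for illustration, not the method from the thesis. The exact all-pairs BFS used here only scales to small graphs.

```python
from collections import defaultdict, deque


def largest_component_diameter(adj):
    """Exact diameter (longest shortest path) of the largest connected component."""
    seen, best = set(), []
    for start in adj:
        if start in seen:
            continue
        # Collect one connected component with BFS.
        comp, queue = [start], deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.append(v)
                    queue.append(v)
        if len(comp) > len(best):
            best = comp
    diameter = 0
    for src in best:
        # BFS distances from src; the largest one is src's eccentricity.
        dist, queue = {src: 0}, deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        diameter = max(diameter, max(dist.values()))
    return diameter


def diameter_over_time(timed_edges):
    """timed_edges: (t, u, v) tuples sorted by t. Returns [(t, diameter)] per timestamp."""
    adj = defaultdict(set)
    series, last_t = [], None
    for t, u, v in timed_edges:
        if last_t is not None and t != last_t:
            series.append((last_t, largest_component_diameter(adj)))
        adj[u].add(v)
        adj[v].add(u)
        last_t = t
    if last_t is not None:
        series.append((last_t, largest_component_diameter(adj)))
    return series


def gelling_point(series):
    """Assumption: take the (first) time at which the diameter peaks."""
    return max(series, key=lambda point: point[1])[0]
```

On real data an approximate (effective) diameter would be the practical choice; the exact version is kept here only because it is easy to verify by hand.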
18
Topological Observations: NLCCs
  • In graphs a giant connected component emerges.
  • We look at sizes of the next-largest connected
    components (NLCCs)
  • After gelling point, do they continue to grow? Do
    they shrink?

19
Topological Observations: NLCCs
  • Observation: After the gelling point, the giant
    connected component takes off, but next-largest
    connected components remain constant or oscillate.

[Figure: IMDB, sizes of the 2nd and 3rd connected components vs. time, with t = 1914 marked]
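
The sizes of the giant and next-largest components can be tracked incrementally with a union-find structure. The sketch below is illustrative only; the function name, the `(t, u, v)` edge format, and the `top_k` parameter are assumptions, not taken from the slides.

```python
def component_sizes_over_time(timed_edges, top_k=3):
    """timed_edges: (t, u, v) tuples sorted by t.

    Returns [(t, sizes)] where sizes lists the top_k largest component
    sizes after all edges with timestamp t have been added (zero-padded).
    """
    parent, size = {}, {}  # size is kept only for component roots

    def find(x):
        if x not in parent:
            parent[x] = x
            size[x] = 1
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        if size[ra] < size[rb]:
            ra, rb = rb, ra
        parent[rb] = ra
        size[ra] += size.pop(rb)

    series, last_t = [], None

    def snapshot(t):
        sizes = sorted(size.values(), reverse=True)[:top_k]
        sizes += [0] * (top_k - len(sizes))
        series.append((t, sizes))

    for t, u, v in timed_edges:
        if last_t is not None and t != last_t:
            snapshot(last_t)
        union(u, v)
        last_t = t
    if last_t is not None:
        snapshot(last_t)
    return series
```

In each snapshot the first entry is the giant component and the remaining entries are the NLCCs, so plotting entries two and three against time gives a curve of the kind described on this slide.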
20
Topological Observations: Weights
  • How are edges in a graph repeated, or otherwise
    weighted?
  • As the number of edges increases, does the total
    edge weight grow linearly?

21
Topological Observations: Weights
  • Observation: Weight additions follow a power law
    with respect to the number of edges
  • $W(t) \propto E(t)^{w}$
  • $W(t)$: total weight of graph at time $t$
  • $E(t)$: total edges of graph at time $t$
  • $w$ is the power-law exponent ($w > 1$)
  • Many other weighted laws
  • see [KDD08], [ICDM08]

[Figure: Orgs-Candidates, log(Weights) vs. log(Edges), slope = 1.3]
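
One simple way to estimate the exponent w from paired measurements of E(t) and W(t) is an ordinary least-squares line through the log-log points. This is a sketch under that assumption; the slides do not say which fitting procedure was used, and the function name is made up for illustration.

```python
import math


def fit_power_law(xs, ys):
    """Fit y ~ c * x**w by least squares on (log x, log y). Returns (w, c)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    s_xx = sum((a - mean_x) ** 2 for a in log_x)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    w = s_xy / s_xx                       # slope of the log-log line
    c = math.exp(mean_y - w * mean_x)     # intercept, mapped back from log space
    return w, c
```

Passing the edge counts as `xs` and the total weights as `ys` returns the slope of the log-log line, which is the quantity reported as 1.3 for the Orgs-Candidates graph.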
22
Completed Work
  • What patterns are common to networks?

23
Completed Work
  • Gelling point, CCs
  • Weighted laws

24
Completed Work
  • Gelling point, CCs
  • Weighted laws
  • Can we develop generative models?

25
Topological Models: Butterfly
  • Goals are to generate
  • Constant/oscillating NLCCs
  • Densification power law [Leskovec05]
  • Shrinking diameter (after gelling point)
  • Power-law degree distribution
  • Emergent, local, intuitive behavior

26
Topological Models: Butterfly
  • Main idea: Uses 3 parameters
  • Curiosity: how much to explore local network
    (U(0,1), creates power-law degree distribution)
  • Flyout: how many local networks to explore
    (global, joins components)
  • Friendliness: how often to connect (global,
    allows new components)
  • Details: see [KDD08]

27
Topological Models: Butterfly
28
Completed Work
  • Gelling point, CCs
  • Weighted laws
  • Can we develop generative models?

29
Completed Work
  • Gelling point, CCs
  • Weighted laws
  • Butterfly
  • RTM
  • Oddball

30
Completed Work
  • Gelling point, CCs
  • Weighted laws
  • Butterfly
  • RTM
  • Oddball
  • What are patterns of cascades in networks?

31
Cascade Observations: Data
  • Gathered from August-September 2005
  • Used set of 44,362 blogs, traced cascades
  • 2.4 million posts
  • 245,404 blog-to-blog links

[Figure: number of posts per day over time; marked dates: Jul 4, Aug 1, Sep 29]
32
Cascade Observations: Prelims
[Figure: blogosphere with posts a-e; example cascade shapes: star, chain]
  • How quickly does a link to a post occur?
  • What size do cascades typically reach?
  • What are typical shapes? How often do stars
    and chains occur?
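
To make the shape question concrete, here is a small sketch that labels a single cascade as a star, a chain, or something else. The definitions are assumptions chosen for illustration: the input is the list of (linking post, linked-to post) pairs of one connected cascade, a star needs at least four posts with one post attached to all others, and a chain is a simple path of at least three posts. The slides do not spell out the exact definitions used in the study.

```python
from collections import Counter


def classify_cascade(links):
    """links: (child_post, parent_post) pairs of ONE connected cascade.

    Returns "star", "chain", or "other" (assumed definitions, see lead-in).
    """
    if not links:
        return "other"
    degree = Counter()
    for child, parent in links:
        degree[child] += 1
        degree[parent] += 1
    n = len(degree)
    if len(links) != n - 1:
        return "other"  # not a tree (given the connectivity assumption)
    max_degree = max(degree.values())
    if n >= 4 and max_degree == n - 1:
        return "star"   # one hub, every other post attached directly to it
    if n >= 3 and max_degree <= 2:
        return "chain"  # every post has at most two neighbours: a path
    return "other"
```

Counting how often each label occurs per cascade size gives the kind of shape distribution that the later slides summarize with power-law exponents.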

33
Temporal Observations
  • How quickly does a link to a post occur?
  • Does popularity decay at a constant rate?
  • With an exponential (half life)?

[Figure: link popularity over time on linear-linear, log-linear, and log-log scales]
34
Cascade Observations: Link Popularity
  • Observation: The probability that a post written
    at time $t_p$ acquires a link at time $t_p + \Delta$ is
  • $p(t_p + \Delta) \propto \Delta^{-1.5}$
  • Similar to [Vazquez06]
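
A short worked note on why the choice of plotting scale distinguishes the candidate decay laws. The constants $c$ and $\lambda$ are generic placeholders introduced here, not values from the slides.

```latex
% Power law: a straight line on log-log axes, with slope -1.5
p(t_p + \Delta) \propto \Delta^{-1.5}
\;\Longrightarrow\;
\log p(t_p + \Delta) = c - 1.5 \log \Delta

% Exponential decay (half-life): a straight line on log-linear axes instead
p(t_p + \Delta) \propto e^{-\lambda \Delta}
\;\Longrightarrow\;
\log p(t_p + \Delta) = c - \lambda \Delta
```

So a curve that straightens out only on the log-log plot points to the power law rather than to a constant rate or a half-life.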

35
Cascade Observations: Cascade Size
  • Q: What size distribution do cascades follow? Are
    large cascades frequent?
  • Observation: The probability of observing a
    cascade of n blog posts follows a Zipf
    distribution
  • $p(n) \propto n^{-2}$

[Figure: log(Count) vs. log(Cascade size) (# of nodes), slope = -2]
36
Cascade Observations: Cascade Size
  • Q: What is the distribution of particular cascade
    shapes?
  • Observation: Stars and chains in blog cascades
    also follow a power law, with different exponents
    (star -3.1, chain -8.5).

37
Completed Work
  • Gelling point, CCs
  • Weighted laws
  • Butterfly
  • RTM
  • Oddball
  • What are patterns of cascades in networks?

38
Completed Work
  • Gelling point, CCs
  • Weighted laws
  • Butterfly
  • RTM
  • Oddball
  • Cascades laws
  • Cascades as features

39
Completed Work
  • Gelling point, CCs
  • Weighted laws
  • Butterfly
  • RTM
  • Oddball
  • Cascades laws
  • Cascades as features
  • Can we develop predictive models for cascades?

40
Cascade Models: CGM
  • Cascade Generation Model
  • Overview: Produce realistic cascades through an
    emergent viral model
  • Details: See [SDM07]

41
Cascade Models: CGM
[Figure: most frequent cascades, model vs. data]
42
Completed Work
  • Gelling point, CCs
  • Weighted laws
  • Butterfly
  • RTM
  • Oddball
  • Cascades laws
  • Cascades as features
  • Can we develop predictive models for cascades?

43
Completed Work
  • Gelling point, CCs
  • Weighted laws
  • Butterfly
  • RTM
  • Oddball
  • Cascades laws
  • Cascades as features
  • Cascade generation model
  • ZC model

44
Completed Work
  • Gelling point, CCs
  • Weighted laws
  • Butterfly
  • RTM
  • Oddball
  • Cascades laws
  • Cascades as features
  • Cascade generation model
  • ZC model
  • How can we compare communities?

45
Completed Work
  • Gelling point, CCs
  • Weighted laws
  • Butterfly
  • RTM
  • Oddball
  • Cascades laws
  • Cascades as features
  • Cascade generation model
  • ZC model
  • Political Usenet study

46
Completed Work
  • Gelling point, CCs
  • Weighted laws
  • Butterfly
  • RTM
  • Oddball
  • Cascades laws
  • Cascades as features
  • Cascade generation model
  • ZC model
  • Political Usenet study
  • Can we detect anomalies?

47
Community Tools: SNARE
  • Problem: Given a network and some domain
    knowledge about suspicious nodes (flags),
    determine which nodes are most risky.
  • Data: Accounting transaction data. Nodes are
    accounts, edges are transactions between
    accounts.

[Figure: account network: Accounts Payable, Revenue Accts, Accounts Receivable]
48
Community Tools: SNARE
  • Example: Channel stuffing
  • Some accounts overstated
  • But other accounts also involved.
  • Since many accounts are slightly affected, it is
    easy to cover up activity.

[Figure: account network (Accounts Payable, Revenue Accts, Accounts Receivable), shaded from "Not risky" to "Very risky"]
49
Community Tools: SNARE
  • Social Network Analytic Risk Evaluation
  • Use domain knowledge to flag certain nodes.
  • Assume homophily between nodes (guilt by
    association)
  • Then, using initial risk as initial node
    potentials, use belief propagation (message
    passing between nodes) to determine end risk
    scores.

50
Community Tools: SNARE
  • Belief Propagation
  • Flags are node potentials, or initial risk
    scores
  • All nodes send messages back and forth with
    beliefs
  • Upon convergence, the end result will reflect
    the riskiest nodes.
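
The message-passing idea can be sketched with standard sum-product (loopy) belief propagation over two states, "not risky" and "risky". This is a generic textbook formulation, not the SNARE implementation: the function name, the `homophily` edge-potential value, the fixed iteration count, and the assumption that every prior lies strictly between 0 and 1 are all choices made here for illustration.

```python
def belief_propagation(edges, prior_risk, homophily=0.6, iters=20):
    """edges: undirected (u, v) pairs; prior_risk: node -> P(risky) from flags.

    Returns node -> belief P(risky) after a fixed number of BP iterations.
    """
    neighbors = {node: set() for node in prior_risk}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Node potentials: index 0 = not risky, index 1 = risky.
    phi = {node: (1.0 - p, p) for node, p in prior_risk.items()}
    # Edge potential favouring equal states on both ends (guilt by association).
    psi = ((homophily, 1.0 - homophily), (1.0 - homophily, homophily))

    # messages[(i, j)] is the message from i to j, initialised to uniform.
    messages = {(i, j): (0.5, 0.5) for i in neighbors for j in neighbors[i]}

    for _ in range(iters):
        updated = {}
        for (i, j) in messages:
            out = []
            for x_j in (0, 1):
                total = 0.0
                for x_i in (0, 1):
                    prod = phi[i][x_i] * psi[x_i][x_j]
                    for k in neighbors[i]:
                        if k != j:  # exclude the recipient's own message
                            prod *= messages[(k, i)][x_i]
                    total += prod
                out.append(total)
            norm = out[0] + out[1]
            updated[(i, j)] = (out[0] / norm, out[1] / norm)
        messages = updated

    beliefs = {}
    for i in neighbors:
        b0, b1 = phi[i]
        for k in neighbors[i]:
            b0 *= messages[(k, i)][0]
            b1 *= messages[(k, i)][1]
        beliefs[i] = b1 / (b0 + b1)
    return beliefs
```

On a three-account chain where only the first account is flagged, the resulting risk decays with distance from the flag, which is the guilt-by-association effect the slide describes. This naive version recomputes neighbour products for every message, so it is meant to show the update rule rather than an efficient implementation.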

51
Community Tools: SNARE
  • Produces improvement over simply using flags
  • Lift of up to 6.5
  • Improvement especially at low false positive
    rates

[Figure: ROC curve for accounts data, true positive rate vs. false positive rate; curves: Ideal, SNARE, Baseline (flags only)]
52
Community Tools: SNARE
  • Accurate: Produces large improvement over simply
    using flags
  • Flexible: Can be applied to other domains
  • Scalable: One iteration of BP runs in time linear
    in the number of edges
  • Robust: Works on a large range of parameters

53
Completed Work
  • Gelling point, CCs
  • Weighted laws
  • Butterfly
  • RTM
  • Oddball
  • Cascades laws
  • Cascades as features
  • Cascade generation model
  • ZC model
  • Political Usenet study
  • Can we detect anomalies?

54
Completed Work
  • Gelling point, CCs
  • Weighted laws
  • Butterfly
  • RTM
  • Oddball
  • Cascades laws
  • Cascades as features
  • Cascade generation model
  • ZC model
  • Political Usenet study
  • SNARE

55
The rest of the talk
  • Motivation and thesis statement
  • Completed work
  • Proposed work
  • Conclusions and impact
  • Audience participation!

56
Proposed Work
  • 2 main problems
  • P1: Cascades and product adoption
  • How do cascades vary according to network
    structure?
  • Can we use cascades to model product adoption?
  • P2: Predicting success/failure of online groups

57
  • P1a: How do cascades compare across network
    structures?
  • P1b: Can we use cascades to model product
    adoption?
  • P2: Can we predict success/failure of groups?

58
  • P1a: How do cascades compare across network
    structures?
  • P1b: Can we use cascades to model product
    adoption?
  • P2: Can we predict success/failure of groups?

59
P1a: Cascades and Network Structure
  • In different networks, how does the starting point
    of an epidemic affect the epidemic size?
  • What modifications to the current model change the
    cascades (weights, self-infection)?
  • Can we reverse-engineer network properties based
    on observed cascades?

60
  • P1a: How do cascades compare across network
    structures?
  • P1b: Can we use cascades to model product
    adoption?
  • P2: Can we predict success/failure of groups?

61
P1b: Cascades and Product Adoption
  • Examine adoption of Caller Ringback Tones (CRBT)
  • User buys ringtone
  • Friend calls user, hears CRBT
  • Phone call data
  • Nodes: User ID, DOB, salutation (Mr/Ms), date of
    joining, data plan
  • Call edges: src/dest ID, call time, duration
  • SMS edges: src/dest ID, time
  • CRBT purchases: purchase date, song name, cost

62
P1b: Cascades and Product Adoption
  • Can we fit the Bass Model for different CRBTs?
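
For reference, the standard Bass diffusion curve gives the cumulative fraction of adopters as F(t) = (1 - e^{-(p+q)t}) / (1 + (q/p) e^{-(p+q)t}), where p is the coefficient of innovation and q the coefficient of imitation. The sketch below evaluates that curve and fits p and q by a coarse grid search against observed cumulative counts. The grid ranges, the known market size `market_size`, and the function names are assumptions for illustration; this is not the fitting procedure proposed in the thesis.

```python
import math


def bass_cumulative(t, p, q):
    """Cumulative adoption fraction F(t) of the Bass model (p > 0, q >= 0)."""
    decay = math.exp(-(p + q) * t)
    return (1.0 - decay) / (1.0 + (q / p) * decay)


def fit_bass(times, cumulative_adopters, market_size):
    """Grid-search (p, q) minimising squared error against observed counts.

    market_size is treated as known here, which is a simplifying assumption.
    """
    p_grid = [i / 100 for i in range(1, 11)]       # 0.01 .. 0.10
    q_grid = [i / 100 for i in range(10, 91, 2)]   # 0.10 .. 0.90

    def squared_error(params):
        p, q = params
        return sum(
            (market_size * bass_cumulative(t, p, q) - observed) ** 2
            for t, observed in zip(times, cumulative_adopters)
        )

    candidates = [(p, q) for p in p_grid for q in q_grid]
    return min(candidates, key=squared_error)
```

Fitting one (p, q) pair per ringback tone would allow a comparison of how strongly each tone's adoption is driven by imitation, which is one way to read the question on this slide.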

63
P1b: Cascades and Product Adoption
  • Are some CRBTs more viral than others? Does
    the footprint follow a skewed distribution?
  • How long after purchase is a CRBT infective?

[Figure: survival function P(X > x) vs. number of downloads (per song)]
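
The survival function on this slide is the empirical P(X > x), which can be computed directly from per-song download counts. A minimal sketch, with a made-up function name:

```python
from collections import Counter


def survival_function(values):
    """Empirical P(X > x) at each distinct observed x, in increasing order of x."""
    total = len(values)
    counts = Counter(values)
    remaining = total
    points = []
    for x in sorted(counts):
        remaining -= counts[x]          # observations strictly greater than x
        points.append((x, remaining / total))
    return points
```

Plotting these points on log-log axes is a common way to check for a skewed, heavy-tailed distribution, since the survival function is less noisy in the tail than a raw histogram.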
64
P1b: Cascades and Product Adoption
  • How does the weight of a link, homophily, or
    other factors affect the likelihood of
    transmission?
  • Can we explicitly test whether a purchase is a
    result of basic similarity of neighbors or a
    result of viral propagation?
  • How can we build and verify a model for this
    propagation?

65
  • P1a: How do cascades compare across network
    structures?
  • P1b: Can we use cascades to model product
    adoption?
  • P2: Can we predict success/failure of groups?

66
P2: Success and Failure of Online Groups
  • Use data over 4 years from nearly 200 newsgroups.
    (Political Usenet)
  • Many discussion groups stop posting by the third
    year.
  • Why?

67
P2: Success and Failure of Online Groups
  • P2 Questions
  • If structural network characteristics can be
    traced to success or failure, which features are
    most predictive?
  • Can we test causality in the predictive
    characteristics?

68
Timeline
  • May 09: P1 preliminaries
  • Jun 09: Internship at Google
  • Sep 09: P1a Cascades and network structure
  • Nov 09: P1b Cascades and product adoption
  • Mar 10: P2 Success/failure of online groups
  • Jul 10: Complete document
  • Aug 10: Defend
69
Related work
  • Topology
  • Heavy-tailed degree distributions [Faloutsos99]
    [Albert02] [Kleinberg99]
  • Shrinking diameter, densification [Leskovec05]
  • Random graphs model [Erdos60]
  • Forest Fire model [Leskovec05]
  • "Winners do not take all" model [Pennock02]
  • Cascades
  • Recommendations [Leskovec06]
  • Diffusion in blogs [Adar03] [Gruhl04]
    [Kempe03] [Kumar03]
  • Marketing: Product adoption [Bass69],
    Word-of-mouth [Godes04]
  • Virus propagation: Populations [Hethcote],
    Networks [Boguna, Pastor-Satorras] [Chakrabarti]
  • Communities and other applications
  • Securities fraud detection [Neville05] [Fast07]
  • Author identification [Hill04]
  • Online group behavior [Backstrom08]

70
Conclusions: Completed
  • Demonstrated several properties common to
    networks in a wide range of domains.
  • Oscillating sizes of next-largest connected
    components
  • Power laws for weighted graphs
  • Butterfly model generates properties

71
Conclusions: Completed
  • Studied and modeled cascades in blogs
  • Several power laws for cascade shapes and size
  • Cascade Generation Model
  • Devised SNARE for anomaly detection for
    accounting data (lift factor up to 6.5)

72
Conclusions: Proposed
  • P1a: Continue cascade studies across network
    structures
  • P1b: Use cascades to model purchases in
    phone-call graph
  • P2: Build predictive models for success and
    failure in online groups

73
References
  • Topology
  • [KDD08] M. McGlohon, L. Akoglu, and C. Faloutsos.
    Weighted Graphs and Disconnected Components:
    Patterns and a Generator. SIG-KDD. Las Vegas,
    Nev., August 2008.
  • [ICDM08] L. Akoglu, M. McGlohon, and C.
    Faloutsos. RTM: Laws and a Recursive Generator
    for Weighted Time-Evolving Graphs. ICDM. Pisa,
    Italy, Dec. 2008.
  • Cascades
  • [SDM07] J. Leskovec, M. McGlohon, C.
    Faloutsos, N. Glance, and M. Hurst. Patterns of
    Cascading Behavior in Large Blog Graphs. SDM.
    Minneapolis, Minn., April 2007.
  • [ICWSM07] M. McGlohon, J. Leskovec, C. Faloutsos,
    N. Glance, and M. Hurst. Finding patterns in blog
    shapes and blog evolution. ICWSM. Boulder, Colo.,
    March 2007.
  • [ICWSM09-1] M. Goetz, J. Leskovec, M. McGlohon,
    and C. Faloutsos. Modeling Blog Dynamics. ICWSM.
    San Jose, Calif., May 2009.

74
References
  • Community
  • [KDD09] M. McGlohon, S. Bay, M. Anderle, D.
    Steier, and C. Faloutsos. SNARE: A Link Analytic
    System for Evaluating Fraud Risk. ACM Special
    Interest Group on Knowledge Discovery and Data
    Mining (SIG-KDD). Paris, France. June 2009.
  • [ICWSM09-2] M. McGlohon and M. Hurst. Community
    Structure and Information Flow in Usenet:
    Improving analysis with a thread ownership model.
    International Conference on Weblogs and Social
    Media (ICWSM). San Jose, CA. May 2009.
  • [ICWSM09-3] M. McGlohon and M. Hurst. Considering
    the Sources: Comparing linking patterns in Usenet
    and blogs. International Conference on Weblogs
    and Social Media (ICWSM09). San Jose, CA. May
    2009.

75
  • Acknowledgments
  • Leman Akoglu
  • Markus Anderle
  • Stephen Bay
  • Polo Chau
  • Christos Faloutsos
  • Natalie Glance
  • Mila Goetz
  • Geoff Gordon
  • Matthew Hurst
  • i-Lab
  • David Jensen
  • Ramayya Krishnan
  • Jure Leskovec
  • Austin McDonald
  • Alan Montgomery
  • Chris Neff
  • Nachi Sahoo
  • Purna Sarkar
  • Support
  • PricewaterhouseCoopers
  • Microsoft Live Labs
  • NSF Graduate Research Fellowship
  • Yahoo! Key Technical Challenges Grant,
    Pennsylvania Infrastructure Technology Alliance
    (PITA)
  • Hewlett-Packard
  • NSF Grants No. IIS- 0705359, IIS-0534205, and
    CNS-0721736, 0209107, SENSOR-0329549, EF-0331657,
    IIS-0326322
  • U.S. Department of Energy Lawrence Livermore
    National Laboratory contract No.W-7405-ENG-48.

76
Audience participation!
78
Talk expansion pack
79
P1b: Other Cascade Data
  • Post data from corporate blogs
  • Demographic data on bloggers (employee ID,
    location, job description)
  • Read data (timestamped)
  • Write data (timestamped)
  • CRBT adoption in general
  • Perhaps people do not adopt particular songs, but
    the CRBT mechanism
  • More public blog data (spinn3r)
  • Also use edge information from blogrolls/comments

80
P2: Potential features to examine
  • Posting behavior
  • Which users are posting, how often are they
    posting, and how skewed is the distribution?
  • Linking behavior
  • How long are cascades (threads), in terms of
    posts and time?
  • Content
  • Topics, keywords, sentence length, other textual
    features, sentiment analysis

81
Unipartite Networks
  • Postnet: Posts in blogs, hyperlinks between
  • Blognet: Aggregated Postnet, repeated edges
  • Patent: Patent citations
  • NIPS: Academic citations
  • Arxiv: Academic citations
  • NetTraffic: Packets, repeated edges
  • Autonomous Systems (AS): Packets, repeated edges

4 million nodes, 8 million edges, 17 years
82
Bipartite Networks
  • IMDB: Actor-movie network
  • Netflix: User-movie ratings
  • DBLP conference- repeated edges
  • Author-Keyword
  • Keyword-Conference
  • Author-Conference
  • US Election Donations: weights, repeated edges
  • Orgs-Candidates
  • Individuals-Orgs

6 million nodes, 10 million edges, 22 years
83
Topological Models Butterfly
84
Topological Models Butterfly
  • Nodes may have multiple hosts ( ).
  • Joins components

85
Topological Models: RTM
  • Recursive Tensor Model
  • Goal: Introduce time and burstiness
  • Main idea: Begin with a core tensor
    (multidimensional array), and use self-similarity
    to reproduce observed power laws.

86
Topological Models: RTM
  • Self-similarity arises from the Kronecker product
  • 2D

[Figure: 2D Kronecker product example, from [Leskovec06]]
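
As a small illustration of the 2D case, the Kronecker product of a matrix with itself replaces every entry with a scaled copy of the whole matrix, which is where the self-similarity comes from. The sketch below uses plain nested lists; the function names and the example seed matrix are made up for illustration, and the 3D tensor version used by RTM is not shown.

```python
def kronecker(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in row_a for b_kl in row_b]
        for row_a in a
        for row_b in b
    ]


def kronecker_power(seed, k):
    """k-th Kronecker power of seed (k >= 1): seed (x) seed (x) ... (x) seed."""
    result = seed
    for _ in range(k - 1):
        result = kronecker(result, seed)
    return result


# Example seed adjacency matrix (hypothetical); its 2nd power is a 4x4 matrix
# in which each nonzero entry of the seed is replaced by a copy of the seed.
seed = [[1, 1],
        [1, 0]]
second_power = kronecker_power(seed, 2)
```

Each additional power multiplies the number of rows and columns by the size of the seed, so a small seed yields a large self-similar adjacency matrix after a few steps.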
87
Topological Models: RTM
  • 3D: Use Kronecker product on a core tensor
  • Reproduced power laws as found in [ICDM08]

[Figure: adjacency matrix]
88
Topological Models: RTM
  • 3D: Use Kronecker product on a core tensor
  • Reproduced power laws as found in [ICDM08]

[Figure: tensor whose third dimension is time]
89
Topological Applications: Oddball
  • Main ideas
  • Use local neighborhood of node
  • Find common patterns
  • Score how much a node deviates from common
    patterns
  • Results
  • Identified anomalous nodes such as Ken Lay in
    Enron, as well as particularly unusual blog posts

90
Cascade Models: CGM
91
Cascade Models: Zero-crossing
  • Main ideas
  • Models blogs in both network growth and network
    diffusion
  • Choose to post based on random walk (produces
    burstiness)
  • Link based on recency and popularity (reproduces
    -1.5 law and skewed degree)
  • Improvement over CGM because network is generated

92
Community Observations: Newsgroups
  • Observation: Threads introduced to a group later
    in the thread tended to have more activity from
    that group.
  • Observation: Discussions tended to flow from
    main groups (can.politics) into subgroups
    (ab.politics, bc.politics)

93
Community Observations: Newsgroups
  • 189 newsgroups ("polit" in name), January
    2004-June 2008
  • 37 million posts
  • Includes many countries, provinces, states,
    topical groups (alt.politics.guns)

Major issue: Over half are cross-posted to
multiple groups. Where is the conversation truly
occurring?
94
Community Observations: Newsgroups
  • Solution: Introduce thread ownership, by
    assigning threads according to where authors
    exclusively post.

95
Community Observations: Newsgroups
  • Observation: Discussions tended to flow from
    main groups (can.politics) into subgroups
    (ab.politics, bc.politics)

96
Completed Work
  • What patterns are common to networks?
  • Can we develop generative models and detect
    anomalies?
  • What are patterns of cascades in networks?
  • Can we develop predictive models for cascades?
  • How can we compare communities?
  • Can we detect anomalies, and predict group
    behavior?