Exploring Blog Networks - PowerPoint PPT Presentation

1
Exploring Blog Networks
  • Patterns and a Model for Information Propagation

(As seen at SIAM Data Mining 2007)
Mary McGlohon, in collaboration with Jure
Leskovec, Christos Faloutsos, Natalie Glance, and
Matthew Hurst
Sandia National Labs, July 6, 2007
2
Long-term Goals
  • How does information on the Web propagate?
  • With what pattern do ideas catch on, diffuse, and
    decrease in popularity?
  • Can we build a model for this propagation?

3
Why blogs?
  • Blogs are a widely used medium of information for
    many topics and have become an important mode of
    communication.
  • Blogs cite one another, creating a record of how
    information and ideas spread through a social
    network.
  • This record is publicly available.

4
Why do we care?
  • Understanding how the blog network works is
    important for:
  • Social issues: political mapping, social trends
    and change, reactions to mass media.
  • Economic issues: marketing, predicting commercial
    success, discovering links between companies.

Example: blogs in the 2004 election [Adamic,
Glance 2005].
5
Immediate Goals
  • Temporal questions: Does popularity have a
    half-life? Is there periodicity?
  • Topological questions: What topological patterns
    do posts and blogs follow? What shapes do
    cascades take on? Stars? Chains? Something
    else?
  • Generative model: Can we build a generative model
    that mimics properties of cascades?

6
Outline
  • Motivation
  • Preliminaries
  • Concepts and terminology
  • Data
  • Temporal Observations
  • Topological Observations
  • Cascade Generation Model
  • Discussion and Conclusions

7
What is a blog?
  • A blog is a frequently-updated webpage.
  • A blog's author updates the blog using posts.
  • Each post has a permanent hyperlink, and may
    contain links to other blog posts.

slashdot
boingboing
10
What is a blog?
  • A blog is a frequently-updated webpage.
  • A blog's author updates the blog using posts.
  • Each post has a permanent hyperlink, and may
    contain links to other blog posts.

[Figure: example posts]
slashdot: "The iPhone is here, hooray!"
boingboing: "At this link, Slashdot says the iPhone has
arrived. But I'm not buying one, because"
A third blog: "Here Boingboing says they're not buying an
iPhone. They're just jealous."
11
From blogs to networks
[Figure: blogosphere network of four blogs (slashdot, boingboing, MichelleMalkin, Dlisted), shown as a blog network, a post network, and cascades]
  • Non-trivial vs. trivial cascades.
  • Stars vs. chains.
  • Nodes a, b, c, d are cascade initiators; e is
    a connector.
12
From networks to cascades
[Figure: blogosphere network (slashdot, boingboing, MichelleMalkin, Dlisted) and the cascades extracted from it]
  • Non-trivial vs. trivial cascades.

13
From networks to cascades
[Figure: blogosphere network (slashdot, boingboing, MichelleMalkin, Dlisted) and the cascades extracted from it]
  • Non-trivial vs. trivial cascades.
  • Cascade initiators are the first sources of information.
  • We also have stars and chains.
14
Dataset (Nielsen Buzzmetrics)
  • Gathered from August-September 2005
  • Used set of 44,362 blogs, traced cascades
  • 2.4 million posts, 5 million out-links, 245,404
    blog-to-blog links

[Figure: number of posts vs. time (1 day)]
15
Outline
  • Motivation
  • Preliminaries
  • Concepts and terminology
  • Data
  • Temporal Observations
  • Does blog traffic behave periodically?
  • How does popularity change over time?
  • Topological Observations
  • Cascade Generation Model
  • Discussion and Conclusions
  • Future Work

16
Temporal Observations
  • Does blog traffic behave periodically?
  • Posts show a weekend effect: there is less traffic
    on Saturday and Sunday.

17
Temporal Observations
  • Does blog traffic behave periodically?
  • Monday appears to compensate for this behavior,
    but this is not actually the case.
  • We normalize the data: $\mathrm{count}_{\mathrm{norm}} = \mathrm{count} / p_d$,
    where $p_d$ is the percentage of links on that day.

[Figure: Monday post dropoff, number of in-links (log) vs. days after post; two panels: raw data and the same data, normalized]
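
The normalization on this slide can be sketched in a few lines. This is a minimal illustration, assuming p_d is given as a fraction of all links per weekday; the function name `normalize_by_day` and the sample numbers are hypothetical and do not come from the dataset.

```python
def normalize_by_day(counts, day_fraction):
    """Divide each day's count by the fraction of all links made on that weekday.

    counts: list of (weekday, count) pairs.
    day_fraction: dict weekday -> p_d, the fraction of all links on that weekday.
    """
    return [(day, count / day_fraction[day]) for day, count in counts]


# Hypothetical numbers for illustration only.
day_fraction = {"Sat": 0.08, "Sun": 0.08, "Mon": 0.16}
raw = [("Sat", 40), ("Sun", 40), ("Mon", 80)]
normalized = normalize_by_day(raw, day_fraction)
```

In this made-up example the raw counts double on Monday, but after dividing by p_d all three days carry the same normalized value, which is the effect the slide describes.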
18
Temporal Observations
  • How does post popularity change over time?
  • Post popularity dropoff follows a power law
    identical to that found in communication response
    times in [Vazquez2006].

Observation 1: The probability that a post
written at time $t_p$ acquires a link at time $t_p + \Delta$
is $p(t_p + \Delta) \propto \Delta^{-1.5}$.
[Figure: number of in-links vs. days after post]
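
One simple way to estimate such an exponent is a least-squares line fit on log-log axes. The sketch below uses that technique on synthetic data; it is not necessarily how the exponent in Observation 1 was obtained, and the name `fit_power_law_exponent` is hypothetical.

```python
import math


def fit_power_law_exponent(xs, ys):
    """Least-squares slope of log(y) against log(x); skips non-positive points."""
    pts = [(math.log(x), math.log(y)) for x, y in zip(xs, ys) if x > 0 and y > 0]
    n = len(pts)
    mean_x = sum(px for px, _ in pts) / n
    mean_y = sum(py for _, py in pts) / n
    sxx = sum((px - mean_x) ** 2 for px, _ in pts)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in pts)
    return sxy / sxx


# Synthetic check: in-link counts generated exactly from delta ** -1.5.
days = list(range(1, 31))
in_links = [1000.0 * d ** -1.5 for d in days]
exponent = fit_power_law_exponent(days, in_links)
```

On noisy real data a log-log line fit can be biased by the sparse tail, so the recovered slope should be read as a rough estimate.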
19
Outline
  • Motivation
  • Preliminaries
  • Temporal Observations
  • Does blog traffic behave periodically?
  • How does post popularity change over time?
  • Topological Observations
  • What are graph properties for blog networks?
  • What shapes do cascades take on? Stars, chains,
    or something else?
  • Cascade Generation Model
  • Discussion and Conclusions
  • Future Work

21
Topological Observations
  • What graph properties does the blog network
    exhibit? How connected?
  • 44,356 nodes, 122,153 edges
  • Half of blogs belong to largest connected
    component.

22
Topological Observations
  • What power laws does the blog network exhibit?

[Figure: count (log scale) vs. number of blog in-links (log scale), and count (log scale) vs. number of blog out-links (log scale)]
Both in- and out-degree follow a power-law
distribution: the in-link exponent is -1.7 and the
out-degree exponent is near -3. This suggests a
strong rich-get-richer phenomenon.
23
Topological Observations
  • How are blog in- and out-degree related?

In-links and out-links are not correlated
(correlation coefficient 0.16).
[Figure: scatter plot of number of blog out-links and number of blog in-links (log scale)]
25
Topological Observations
What graph properties does the post network
exhibit?
  • Very sparsely connected: 98% of posts are
    isolated.

26
Topological Observations
What power laws does the post network exhibit?
  • Both in- and out-degree follow power laws.
  • In-degree has PL exponent -2.15; out-degree has
    PL exponent -2.95.

[Figure: count vs. post in-degree, and count vs. post out-degree]
27
Topological Observations
How do we measure how information flows through
the network?
  • We gather cascades using the following procedure:
  • Find all initiators (out-degree 0).

[Figure: example post network with nodes a, b, c, d, e]
28
Topological Observations
How do we measure how information flows through
the network?
  • We gather cascades using the following procedure:
  • Find all initiators (out-degree 0).
  • Follow in-links.

[Figure: example post network with nodes a, b, c, d, e, with in-links followed from the initiators]
29
Topological Observations
How do we measure how information flows through
the network?
  • We gather cascades using the following procedure:
  • Find all initiators (out-degree 0).
  • Follow in-links.
  • This produces a directed acyclic graph.

[Figure: example post network with nodes a, b, c, d, e and the cascades extracted from it]
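
The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the post network is given as (citing post, cited post) pairs, isolated posts (trivial cascades) do not appear in the input, and the name `extract_cascades` and the example graph are hypothetical.

```python
from collections import defaultdict, deque


def extract_cascades(links):
    """links: iterable of (citing_post, cited_post) pairs.

    Returns a dict mapping each initiator to the set of edges
    (citing_post, cited_post) reached by following in-links from it.
    """
    in_links = defaultdict(list)   # cited post -> posts that link to it
    has_out = set()                # posts with at least one out-link
    nodes = set()
    for citing, cited in links:
        in_links[cited].append(citing)
        has_out.add(citing)
        nodes.update((citing, cited))

    cascades = {}
    for initiator in sorted(nodes - has_out):   # initiators: out-degree 0
        edges, seen, queue = set(), {initiator}, deque([initiator])
        while queue:
            post = queue.popleft()
            for citing in in_links[post]:       # follow in-links
                edges.add((citing, post))
                if citing not in seen:
                    seen.add(citing)
                    queue.append(citing)
        cascades[initiator] = edges
    return cascades


# Hypothetical post graph: b and c cite a; e cites both b and d.
links = [("b", "a"), ("c", "a"), ("e", "b"), ("e", "d")]
cascades = extract_cascades(links)
```

In this example a and d are initiators, and e appears in both cascades, which is the role the slides call a connector.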
30
Topological Observations
How do we measure how information flows through
the network?
  • Common cascade shapes are extracted using
    algorithms in [Leskovec2006].

31
Topological Observations
How do we measure how information flows through
the network?
  • The number of edges increases linearly with
    cascade size, while effective diameter increases
    logarithmically, suggesting tree-like structures.

[Figure: number of edges vs. cascade size (number of nodes), and effective diameter vs. cascade size]
32
Topological Observations
How do we measure how information flows through
the network?
  • We work with a bag of cascades: each cascade is
    a separate, disconnected subgraph.
  • We now explore some graph properties of cascades.

33
Topological Observations
What graph properties do cascades exhibit?
  • As before, in- and out-degree in the bag of cascades
    follow power laws.

[Figure: count vs. cascade node out-degree, and count vs. cascade node in-degree]
34
Topological Observations
What graph properties do cascades exhibit?
  • Cascade size distributions also follow power law.

Observation 2: The probability of observing a
cascade on n nodes follows a Zipf
distribution: $p(n) \propto n^{-2}$.
[Figure: count vs. cascade size (number of nodes)]
36
Topological Observations
What graph properties do cascades exhibit?
Stars and chains also follow a power law, with
different exponents (star -3.1, chain -8.5).
[Figure: count vs. size of chain (number of nodes), and count vs. size of star (number of nodes)]
38
Outline
  • Motivation
  • Preliminaries
  • Temporal Observations
  • Topological Observations
  • What are graph properties for blog networks?
  • What shapes and patterns do cascades take on?
  • Cascade Generation Model
  • Epidemiological Background
  • Proposed Model
  • Experimental Validation
  • Discussion and Conclusions
  • Future Work

39
Epidemiological models
  • We consider modeling cascade generation as an
    epidemic, with ideas as viruses.
  • We use the SIS model:
  • At any time, an entity is in one of two states:
    susceptible or infected.
  • One parameter, β, determines how easily
    conversations spread.
  • [Hethcote2000]

47
Cascade Generation Model
0. Begin with Blog Net.
[Figure: blog network on blogs B1, B2, B3, B4 with weighted edges]
48
Cascade Generation Model
0. Begin with Blog Net, but ignore edge weights.
Example: B1 links to B2, B2 links to B1, and B4 links
to B2 and B1, as well as to itself. B3 is isolated,
linking only to itself.
[Figure: unweighted blog network on B1, B2, B3, B4]
49
Cascade Generation Model
1. Randomly pick a blog to infect, and add a node to the
cascade.
[Figure: B1 is picked and added to the cascade; blog network B1 to B4]
50
Cascade Generation Model
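2. Infect each in-linked neighbor with
probability β.
[Figure: blog network B1 to B4 with B1 infected]
51
Cascade Generation Model
2. Infect each in-linked neighbor with
probability β.
[Figure: blog network B1 to B4; one in-linked neighbor of B1 is marked INFECT, the other DO NOT INFECT]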
52
Cascade Generation Model
3. Add infected neighbors to cascade.
[Figure: cascade now containing B1 and B4; blog network B1 to B4]
53
Cascade Generation Model
4. Set old infected nodes to uninfected.
[Figure: cascade containing B1 and B4; blog network B1 to B4]
54
Cascade Generation Model
4. Set old infected nodes to uninfected.
Repeat steps 2-4 until no nodes are infected.
[Figure: cascade containing B1 and B4; blog network B1 to B4]
55
Cascade Generation Model
4. Set old infected nodes to uninfected.
Repeat steps 2-4 until no nodes are infected.
[Figure: cascade containing B1 and B4; blog network B1 to B4 with a neighbor marked DO NOT INFECT]
56
Cascade Generation Model
4. Set old infected nodes to uninfected.
Repeat steps 2-4 until no nodes are infected.
Completed cascade!
[Figure: completed cascade containing B1 and B4; blog network B1 to B4]
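
Steps 0 to 4 can be sketched as a small simulation. This is an illustration under stated assumptions, not the authors' implementation: a blog may be infected more than once (so cascade nodes are (blog, id) pairs), the `max_rounds` cap is a safety guard added here, and the name `generate_cascade` is hypothetical. The example network encodes the links listed on the step 0 slide.

```python
import random


def generate_cascade(in_neighbors, beta, rng, max_rounds=1000):
    """One run of the cascade generation steps on an unweighted blog network.

    in_neighbors: dict blog -> list of blogs that link to it.
    Returns a list of cascade edges (new_node, parent_node), where each node
    is a (blog, node_id) pair. max_rounds is a cap added for this sketch.
    """
    start = rng.choice(sorted(in_neighbors))         # step 1: pick a blog
    next_id = 1
    infected = [(start, 0)]
    edges = []
    for _ in range(max_rounds):
        if not infected:
            break
        newly_infected = []
        for blog, node_id in infected:
            for neighbor in in_neighbors[blog]:      # step 2: in-linked neighbors
                if rng.random() < beta:
                    child = (neighbor, next_id)
                    next_id += 1
                    edges.append((child, (blog, node_id)))   # step 3
                    newly_infected.append(child)
        infected = newly_infected                    # step 4: old nodes recover
    return edges


# Network from the example: B1 <-> B2, B4 links to B1, B2 and itself, B3 to itself.
in_neighbors = {"B1": ["B2", "B4"], "B2": ["B1", "B4"], "B3": ["B3"], "B4": ["B4"]}
rng = random.Random(0)
sizes = [len(generate_cascade(in_neighbors, 0.025, rng)) + 1 for _ in range(1000)]
```

Each entry of `sizes` is the number of nodes in one generated cascade (edges plus the initiator), so a histogram of `sizes` is the kind of distribution the next slides compare against the data.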
57
CGM matches observations
  • After trying several values, we decide on β = 0.025.
  • 10 simulations, 2 million cascades each.
  • Most frequent cascades: 7 of 10 matched exactly.

[Figure: most frequent cascade shapes, model vs. data]
58
CGM matches observations
Cascade size in this model also follows a power
law. The model distribution is shown with the
real data points.
[Figure: count vs. cascade size (number of nodes)]
59
CGM matches observations
  • Stars and chains both follow power laws, close to
    those observed in real data.

[Figure: count vs. star size, and count vs. chain size]
60
Results in brief
  • Analyzed one of the largest available collections of
    blog information.
  • Two networks: the post network and the blog network.
  • Discovered several properties of the networks.
  • Also analyzed properties of cascades.
  • Presented a generative model for cascades.

61
Immediate questions answered
  • Temporal questions: Does popularity have a
    half-life? Is there periodicity?
  • Popularity dropoff follows a power-law
    distribution, exactly as found for response times
    in other work. We do find that posts follow a
    weekly periodicity.

[Figure: number of in-links vs. days after post]
62
Immediate questions answered
  • Topology: What topological patterns do posts and
    blogs follow? What shapes do cascades take on?
    Stars? Chains? Something else?
  • We find power-law distributions in almost every
    topological property. In cascade shapes, stars
    are more common than chains, and the size of cascades
    follows a power law. Cascades are tree-like.

[Figure: count vs. size of chain (number of nodes), and count vs. size of star (number of nodes)]
63
Immediate questions answered
  • Can a simple model replicate this behavior?
  • Yes. We developed a model based on the SIS model
    in epidemiology. It is a simple model with only
    one parameter, and it produces behavior
    remarkably similar to that found in the dataset.

[Figure: count vs. star size, and count vs. chain size]
64
Future work and applications
  • This work suggested that ideas may behave like
    viruses under an SIS model.
  • This may be useful for mapping social/political
    trends.
  • Further investigation into these properties may
    also allow early detection of changes in
    social or economic structure.

65
Related work
  • For an explanation of the SIS model:
  • [Hethcote2000] H. W. Hethcote. The mathematics
    of infectious diseases. SIAM Rev., 42(4):599-653,
    2000.
  • For algorithms for extracting cascade shapes:
  • [Leskovec2006] J. Leskovec, A. Singh, and J.
    Kleinberg. Patterns of influence in a
    recommendation network. PAKDD 2006.
  • For some modeling of power laws:
  • [Vazquez2006] A. Vazquez, J. G. Oliveira, Z.
    Dezso, K. I. Goh, I. Kondor, and A. L. Barabasi.
    Modeling bursts and heavy tails in human
    dynamics. Physical Review E, 73:036127, 2006.

66
Additional Info
  • Mary McGlohon
  • www.cs.cmu.edu/mmcgloho
  • mcglohon_at_cmu.edu

67
Acknowledgments
  • Mary McGlohon was partially supported by an NSF
    Graduate Fellowship.
  • Jure Leskovec was partially supported by a
    Microsoft Fellowship.

68
Questions?
69
  • EXTRA SLIDES BEGIN HERE!

70
Preliminaries- PCA
  • We will work with very high-dimensional data
    (9,000 dimensions).
  • Principal Component Analysis is a method of
    dimensionality reduction.

[Figure: hypothetically, for each blog: depth upwards vs. conversation mass upwards]
73
Preliminaries- PCA
We can represent any real N x M matrix X as
$X = U \Sigma V^T$,
where U is of size N x r, r is the rank of matrix
X, $\Sigma$ is a diagonal r x r matrix, and V is of size M x r.
[Figure: block diagram of the factorization of X into U, Σ, and V^T]
74
Preliminaries- PCA
  • Reduce dimensionality by keeping only the largest
    components of $\Sigma$ and setting all other
    components to zero.

75
Preliminaries- PCA
  • Reference: Fukunaga, K. (1990). Introduction to
    Statistical Pattern Recognition, Academic Press.
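
Written out, the reduction described on these slides is the rank-k truncation of the decomposition. The block below is a sketch under the usual conventions, which the slides do not state explicitly: singular values are sorted in decreasing order, the columns of X are assumed centered so that the result corresponds to PCA, and k is a hypothetical symbol for the number of components kept.

```latex
X = U \Sigma V^{T}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

X_k = U_k \Sigma_k V_k^{T} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T}, \qquad k \le r

X V_k = U_k \Sigma_k \quad \text{(coordinates of each row of } X \text{ in the first } k \text{ components)}
```

With k = 2, the rows of $U_k \Sigma_k$ give the two-dimensional coordinates used for a scatter plot.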

76
Preliminaries- Regularizing data
  • Not everything in life is normally distributed.

[Figure: blog properties, linear-linear scale; total in-links vs. total conversation mass downwards]
77
Preliminaries- Regularizing data
  • Not everything in life is normally distributed.

[Figure: blog properties, linear-linear scale; total in-links vs. total conversation mass downwards; one region is marked "99.4% of points!"]
78
Preliminaries- Regularizing data
  • Not everything in life is normally distributed.

[Figure: blog properties, linear-linear scale; total in-links vs. total conversation mass downwards]
Try to fit a line...
79
Preliminaries- Regularizing data
  • Not everything in life is normally distributed.

[Figure: blog properties, linear-linear scale; total in-links vs. total conversation mass downwards]
Try to fit a line... Outliers dramatically
affect the fit.
80
Preliminaries- Regularizing data
  • Not everything in life is normally distributed.
  • Therefore, we propose to take log(count + 1).

[Figure: blog properties, log-log scale; total in-links vs. total conversation mass downwards]
81
Preliminaries- Regularizing data
  • Not everything in life is normally distributed.
  • Therefore, we propose to take log(count + 1).

[Figure: blog properties, log-log scale; total in-links vs. total conversation mass downwards]
Outlier effects are minimized.
82
  • Suppose we want to cluster blogs based on
    content. What features do we use per blog?

83
CascadeType
  • Perform PCA on sparse matrix.
  • Use log(count + 1).
  • Project onto 2 principal components.

[Figure: sparse matrix of 44,000 blogs by 9,000 cascade types]
84
CascadeType Results
  • Observation: Content of blogs and cascade
    behavior are often related.
  • Distinct clusters for conservative and
    humorous blogs (hand-labeling).

86
  • Suppose we want to cluster blog posts. What
    features do we use?

87
Preliminaries- Blogs
  • There are several terms we use to describe
    cascades:
  • In-link, out-link: the green node has one
    out-link; the yellow node has one in-link.
  • Depth downwards/upwards: the pink node has an
    upward depth of 1 and a downward depth of 2.
  • Conversation mass (CM) upwards/downwards: the pink
    node has upward CM 1 and downward CM 3.

88
PostFeatures
[Figure: matrix of 2,400,000 posts by six features (in-links, out-links, CM up, CM down, depth up, depth down); run PCA]
89
PostFeatures Results
  • Observation: Posts within a blog tend to retain
    similar network characteristics.

90
PostFeatures Results
  • Observation: Posts within a blog tend to retain
    similar network characteristics.
  • PC1: CM upward
  • PC2: CM downward
  • We show this scatter plot instead.

[Figure: scatter plot of posts from MichelleMalkin and Dlisted]
91
Ranking blogs by PostFeatures
  • Conversation mass up/down gives a better
    understanding of the blog posts than in-links and
    out-links.
  • Therefore, we may choose to rank blogs based on
    these attributes.

92
Blogs ranked by CM vs in-links
Top blogs by conversation mass
  • michellemalkin.com
  • boingboing.net
  • imao.us (75)
  • captainsquartersblog.com/mt
  • instapundit.com
  • radioequalizer.blogspot.com (53)
  • powerlineblog.com
  • waxy.org/links
  • washingtonmonthly.com
  • kottke.org/reminder

Top blogs by in-links
  • boingboing.net
  • michellemalkin.com
  • instapundit.com
  • waxy.org/links
  • kottke.org/reminder
  • patriotdaily.com (11)
  • captainsquartersblog.com/mt
  • powerlineblog.com
  • washingtonmonthly.com
  • petashon.com (30)

93
Blogs ranked by CM vs in-links
Top blogs by conversation mass
  • michellemalkin.com
  • boingboing.net
  • imao.us (75)
  • captainsquartersblog.com/mt

Top blogs by in-links
  • boingboing.net
  • michellemalkin.com
  • instapundit.com
  • waxy.org/links
  • ..... 10 petashon.com (30)

in-links 2 CM 6
in-links 5 CM 5
  • Perhaps IMAO has longer cascades, just fewer
    in-links,
  • while petashon has stars.

94
BlogTimeFractal some time series
  • Problem: time series data is nonuniform and
    difficult to analyze.
  • Any patterns?
  • Any measures?

[Figure: in-links over time]
95
BlogTimeFractal Definitions
  • Any patterns?
  • Self-similarity!
  • The 80-20 law describes self-similarity.
  • For any sequence, we divide it into two
    equal-length subsequences: 80% of the traffic is in
    one, 20% in the other.
  • Repeat recursively.
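
The recursive split can be sketched as a small generator of self-similar traffic. This is an illustration only: the name `b_model` is hypothetical, and choosing at random which half receives the larger share is an assumption, since the slide does not say which half gets the 80%.

```python
import random


def b_model(total, levels, b, rng):
    """Recursively split `total` into 2**levels intervals with bias b."""
    series = [float(total)]
    for _ in range(levels):
        next_series = []
        for mass in series:
            heavy, light = b * mass, (1 - b) * mass
            # Assumption: the half that gets the larger share is chosen at random.
            halves = [heavy, light] if rng.random() < 0.5 else [light, heavy]
            next_series.extend(halves)
        series = next_series
    return series


# 1000 units of traffic, split three times with the 80-20 law (b = 0.8).
traffic = b_model(total=1000, levels=3, b=0.8, rng=random.Random(0))
```

After three levels there are eight intervals, and the busiest one holds 0.8 × 0.8 × 0.8 of the total traffic.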

96
Self-similarity
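  • The bias factor for the 80-20 law is b = 0.8.

[Figure: a sequence split into two halves holding 20% and 80% of the traffic]
97
Self-similarity
  • The bias factor for the 80-20 law is b = 0.8.

[Figure: a sequence split into two halves holding 20% and 80% of the traffic]
Q: How do we estimate b?
98
Self-similarity
  • The bias factor for the 80-20 law is b = 0.8.

[Figure: a sequence split into two halves holding 20% and 80% of the traffic]
Q: How do we estimate b?
A: Entropy plots!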
99
BlogTimeFractal
  • An entropy plot plots entropy vs. resolution.
  • From time series data, begin with resolution R =
    T/2.
  • Record entropy $H_R$.

100
BlogTimeFractal
  • An entropy plot plots entropy vs. resolution.
  • From time series data, begin with resolution R =
    T/2.
  • Record entropy $H_R$.
  • Recursively take finer resolutions.

101
BlogTimeFractal
  • An entropy plot plots entropy vs. resolution.
  • From time series data, begin with resolution R =
    T/2.
  • Record entropy $H_R$.
  • Recursively take finer resolutions.

102
BlogTimeFractal Definitions
  • Entropy measures the non-uniformity of the histogram
    at a given resolution.
  • We define the entropy of our sequence at a given R as
    $H_R = -\sum_{t=1}^{2^R} p(t) \log_2 p(t)$,
  • where p(t) is the percentage of posts from a blog in
    interval t, R is the resolution, and $2^R$ is the number of
    intervals.

103
BlogTimeFractal
  • For a b-model (and self-similar cases), the entropy
    plot is linear. The slope s will tell us the
    bias factor.
  • Lemma: For traffic generated by a b-model, the
    bias factor b obeys the equation
  • $s = -b \log_2 b - (1-b) \log_2 (1-b)$
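
The lemma can be checked numerically. The sketch below computes the entropy of a series at a given resolution and then inverts the lemma for b by bisection. It assumes the standard base-2 entropy over 2^R equal-width intervals and a series length divisible by 2^R; the names `entropy_at_resolution` and `bias_from_slope` are hypothetical.

```python
import math


def entropy_at_resolution(series, resolution):
    """Entropy (in bits) of the series aggregated into 2**resolution intervals."""
    bins = 2 ** resolution
    width = len(series) // bins
    total = sum(series)
    entropy = 0.0
    for i in range(bins):
        p = sum(series[i * width:(i + 1) * width]) / total
        if p > 0:
            entropy -= p * math.log2(p)
    return entropy


def bias_from_slope(slope, tol=1e-9):
    """Solve slope = -b log2 b - (1-b) log2 (1-b) for b in [0.5, 1] by bisection."""
    def h(b):
        return -sum(q * math.log2(q) for q in (b, 1 - b) if q > 0)

    lo, hi = 0.5, 1.0          # h decreases from 1 to 0 on this interval
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if h(mid) > slope:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# A two-level b-model series with b = 0.8 (fractions of total traffic).
series = [0.64, 0.16, 0.16, 0.04]
slope = entropy_at_resolution(series, 2) - entropy_at_resolution(series, 1)
recovered_b = bias_from_slope(slope)

# Slope quoted later in the deck for Michelle Malkin in-links.
b_for_reported_slope = bias_from_slope(0.85)
```

For the synthetic series the recovered bias is the 0.8 used to build it, and a slope of 0.85 gives a bias close to the 0.72 quoted for the Michelle Malkin in-links.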

104
Entropy Plots
  • Linear plot → self-similarity

[Figure: entropy vs. resolution]
105
Entropy Plots
  • Linear plot → self-similarity
  • Uniform: slope s = 1, bias b = 0.5
  • Point mass: slope s = 0, bias b = 1

[Figure: entropy vs. resolution]
106
Entropy Plots
  • Linear plot → self-similarity
  • Uniform: slope s = 1, bias b = 0.5
  • Point mass: slope s = 0, bias b = 1

Michelle Malkin in-links: s = 0.85. By Lemma 1, b =
0.72.
[Figure: entropy vs. resolution]
107
BlogTimeFractal Results
  • Observation: Most time series of interest are
    self-similar.
  • Observation: Bias factor is approximately 0.7,
    that is, more bursty than uniform (70/30 law).

Entropy plots for MichelleMalkin:
in-links, b = 0.72; conversation mass, b = 0.76; number
of posts, b = 0.70
108
  • Other related work

109
Ali-Hasan, Adamic 2007
Expressing Social Relationships on the Blog
through Links and Comments
Analyzed three blog communities:
  • Dallas-Fort Worth: most links are external to the
    community (91%); low centralization; low reciprocity.
  • UAE: fewer links external to the community; more
    centralization; obvious hub structure.
  • Kuwait: fewest links external to the community
    (53%); highly centralized; much reciprocity.
110
Duarte et al. 2007
  • Classified blogs into parlor, register, and
    broadcast.

[Figure: fraction of sessions with comments vs. total sessions, with regions labeled register, parlor, and broadcast]
111
Adar et al. 2004
  • Implicit Structure and the Dynamics of Blogspace
  • Suggested that ideas behaved like epidemics.
  • Presented iRank based on how infectious a blog
    was.

(giant microbes, a site infectious in more ways
than one)