Masters Thesis Defense

1
Generative Model To Construct Blog and Post
Networks In Blogosphere
  • Masters Thesis Defense
  • Amit Karandikar
  • Advisor: Dr. Anupam Joshi
  • Committee: Dr. Finin, Dr. Yesha, Dr. Oates
  • Date: 1st May 2007
  • Time: 9:30 am
  • Place: ITE 325B

http://prefuse.org/gallery/
2
Outline
  • Introduction
  • Motivation
  • Thesis Contribution
  • Interactions in Blogosphere
  • Proposed Model
  • Experiments and Results
  • Conclusion

3
Introduction: Generative Model To Construct Blog
and Post Networks In Blogosphere
  • Generative model
  • A generative model is a model for randomly /
    systematically generating the observed data using
    some input parameters.
  • Parameters could be latent or input to the model.

Blogosphere: the collective term encompassing all
blogs linked together to form a community or
social network.
Blog network: the network formed by treating each
blog as a single node.
Post network: the network formed by treating each
post as a node, ignoring its parent blog.
Example blog nodes (figure labels): oates.myspace.com,
yesha.blogspot.com, joshi.blogspot.com,
finin.livejournal.com
4
Basics ..
  • Graphs are everywhere .. and so are Power laws!!

In simple words, a power law can be explained by
the "rich get richer" phenomenon, or "20% of the
population holds 80% of the wealth".
Considering the web as a graph
Figure labels: Internet Mapping Project (lumeta.com),
Friendship Network [Moody 01]
Scale-free network: structure and properties are
independent of network size; a few
high-connectivity nodes (hubs)
http://www.prefuse.org/gallery/
Properties of interest (graph theory): average
degree of node, degree distribution, degree
correlation, distribution of strongly/weakly
connected components, clustering coefficient and
reciprocity
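Two of the properties listed above, average degree and reciprocity, can be computed directly from a directed edge list. The sketch below is only an illustration: the function names and the edge-list representation are assumptions, not taken from the thesis.

```python
def average_degree(edges, num_nodes):
    """Average out-degree (equal to average in-degree) of a directed graph."""
    # Duplicate edges are counted once.
    return len(set(edges)) / num_nodes


def reciprocity(edges):
    """Fraction of directed edges (u, v) whose reverse edge (v, u) also exists."""
    # Self-loops are ignored (an assumption made for this sketch).
    edge_set = {(u, v) for u, v in edges if u != v}
    if not edge_set:
        return 0.0
    mutual = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)
```

For example, in a three-blog graph with edges 0→1, 1→0 and 1→2, two of the three edges are reciprocated.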
5
Motivation: Why simulate blog graphs?
  • Reduce time to generate data
  • - crawling the blogosphere over a few weeks
  • - sampling the right blogs to get a
    representative sample
  • Reduce time in preprocessing and data cleaning
  • - removing links pointing outside the dataset,
    outside the time frame
  • - splog removal [1]
  • Generate graphs of different properties/sizes
  • - average degree of node, degree distributions
  • Testing of new algorithms for blog graphs
  • - e.g. spread of influence in blogosphere [2],
    community detection [3]
  • Extrapolation
  • - how will fast growth affect the blogosphere
    properties?
  • - how does this affect the connected components?

6
Thesis Contribution
  • To propose a generative model for a blog-blog
    network using preferential attachment and uniform
    random attachment by modeling the interactions
    among bloggers
  • To generate a post-post network as part of the
    generative model for blog graphs.
  • To compare the properties of the simulated blog
    and post networks with the properties observed in
    the available real blog datasets.
  • Datasets
  • Workshop on the Weblogging Ecosystem (WWE 2006)
  • http://weblogging2006.blogspot.com/
  • International Conference on Weblogs and Social
    Media (ICWSM 2007)
  • http://ebiquity.umbc.edu/blogger/icwsm-2007-blogs-dataset/

7
Why are existing models not enough?
Erdos-Renyi random model
Barabasi-Albert preferential attachment web model
Preferential attachment: the likelihood of
linking to a popular website is higher
  • Two-level network: blog level and post level
  • Inlinks and outlinks to and from posts
  • NEED to model blogger interactions

[1] M. Newman, The structure and function of
complex networks, 2003
[3] R. Albert, Statistical mechanics of complex
networks, PhD thesis, 2001
[7] J. Leskovec, M. McGlohon, C. Faloutsos, N.
Glance, and M. Hurst, Cascading behavior in large
blog graphs, ICWSM, 2007
[32] X. Shi, B. Tseng, and L. Adamic, Looking at
the blogosphere topology through different lenses,
ICWSM, 2007
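The difference between preferential attachment and uniform random attachment can be sketched as two ways of picking a link target. This is a minimal illustration, not the thesis code: the function names are hypothetical, and the +1 added to each indegree is an assumption so that nodes with no inlinks can still be chosen.

```python
import random


def pick_preferential(indegree, rng):
    """Pick a node with probability proportional to (indegree + 1)."""
    nodes = list(indegree)
    # The +1 smoothing is an assumption of this sketch.
    weights = [indegree[n] + 1 for n in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]


def pick_uniform(indegree, rng):
    """Pick a node uniformly at random, ignoring popularity."""
    return rng.choice(list(indegree))
```

With `indegree = {"a": 98, "b": 0}`, the preferential picker chooses "a" about 99 times out of 100, while the uniform picker chooses each node about half of the time.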
8
Interactions in blogosphere
  • Interesting findings from the PEW Internet
    survey [1]
  • - Blog writers are enthusiastic blog readers
  • - Most bloggers post infrequently
  • - Linking in the neighborhood: preferential or
    random?
  • (friends' blogs, blogroll)
  • Bloggers tend to link to some (how many?) of the
    posts that they read recently (often
    preferentially, sometimes randomly)
  • Is popularity (inlinks) proportional to blogger
    activity (outlinks)? NO [2]
  • [1] A. Lenhart and S. Fox, Bloggers: A portrait
    of the internet's new storytellers.
  • [2] J. Leskovec, M. McGlohon, C. Faloutsos, N.
    Glance, and M. Hurst, Cascading behavior in
    large blog graphs, ICWSM 2007

Model parameters
9
Model Parameters
  • Probability of random reads (rR)
  • Probability of randomly selecting writer (rW)
  • Probability that new node does not link to the
    existing network (pD)
  • Growth exponent (g)
  • how many links should be added every step?

10
Proposed Model Blog view
1. Add new blog node
2. Select writer
3. Writers read blog posts, write posts
Step 1
New blog node: "I will not link to anyone!"
Reciprocal links
Strongly connected components: subsets of nodes
having a directed path from every node to every
other node
Weakly connected components: information flow
Step 2
Example blog nodes (figure labels): dailykos,
michellemalkin
Should I read randomly or preferentially?
Should I link to someone? If yes, to whom?
>> Preferentially, based on the indegree of the node
Writer selection: randomly, or
>> preferentially, based on outdegree?
Figure labels: random destination, random writer
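The three steps of the blog view can be sketched as a small simulation loop. This is a rough illustration under stated assumptions, not the algorithm from the thesis: the function name, the default parameter values (other than the names rR, rW and pD), the +1 smoothing of degrees, and the fixed `links_per_step` (used here in place of the growth exponent g) are all hypothetical, and the post level is left out entirely.

```python
import random


def simulate_blog_graph(num_blogs, rR=0.2, rW=0.35, pD=0.1,
                        links_per_step=1, seed=0):
    """Grow a directed blog graph; returns (edges, indegree, outdegree)."""
    rng = random.Random(seed)
    indeg, outdeg, edges = {0: 0}, {0: 0}, set()

    def pick(degree, p_random):
        # Uniform choice with probability p_random, else degree-proportional.
        nodes = list(degree)
        if rng.random() < p_random:
            return rng.choice(nodes)
        weights = [degree[n] + 1 for n in nodes]  # +1 is an assumption
        return rng.choices(nodes, weights=weights, k=1)[0]

    def add_link(src, dst):
        if src != dst and (src, dst) not in edges:
            edges.add((src, dst))
            outdeg[src] += 1
            indeg[dst] += 1

    for new in range(1, num_blogs):
        # Step 1: add a new blog; with probability pD it links to no one.
        target = None if rng.random() < pD else pick(indeg, 0.0)
        indeg[new], outdeg[new] = 0, 0
        if target is not None:
            add_link(new, target)
        for _ in range(links_per_step):
            # Step 2: select a writer (random with probability rW,
            # otherwise preferentially by outdegree).
            writer = pick(outdeg, rW)
            # Step 3: the writer reads a blog (random with probability rR,
            # otherwise preferentially by indegree) and links to it.
            add_link(writer, pick(indeg, rR))
    return edges, indeg, outdeg
```

Calling `simulate_blog_graph(200)` returns an edge set whose size equals both the total indegree and the total outdegree, which is a quick sanity check on the bookkeeping.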
11
Proposed Model Post view
[Figure: Blogger A and Blogger B with their posts
(Post 1, Post 2, Post 3)]
Number of links?
12
Growth of blog graphs: Densification
Densification [1] has been observed in various
real networks, including the blogosphere.
The number of edges grows faster than the number
of nodes (super-linear growth function).
Reciprocity and clustering coefficient increase
with the growth exponent.
Average degree increases with growth (evolution
time).
[1] J. Leskovec, M. McGlohon, C. Faloutsos, N.
Glance, and M. Hurst, Cascading behavior in
large blog graphs, ICWSM 2007
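Super-linear growth is commonly written as edges ∝ nodes^a with a > 1. Assuming that form, the exponent implied by two snapshots of a growing graph can be estimated as below; the function name and the two-snapshot approach are illustrative assumptions, not the procedure used in the thesis.

```python
import math


def densification_exponent(n1, e1, n2, e2):
    """Exponent a in e ∝ n**a implied by snapshots (n1, e1) and (n2, e2)."""
    # Taking logs of e2/e1 = (n2/n1)**a and solving for a.
    return math.log(e2 / e1) / math.log(n2 / n1)
```

A value above 1 indicates densification: for instance, growing from 100 nodes and 1,000 edges to 10,000 nodes and 1,000,000 edges gives an exponent of 1.5.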
13
Properties of simulated blog network
14
Properties of simulated post network
15
Blogosphere: Blog inlinks distribution
The blogosphere follows a power law distribution
for blog inlinks and outlinks, post inlinks and
post outlinks, component sizes, posts per blog,
and size of cascades.
A large number of blog nodes have very few inlinks.
Power law distribution, slope -2.07
Very few blog nodes have a very high number of
inlinks.
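The slides do not say how the slope was obtained. One common quick approach, shown here only as a sketch, is a least-squares line fit on the log-log inlink histogram; the function names are hypothetical, and this simple fit is known to be a rough estimate compared with maximum-likelihood methods.

```python
import math
from collections import Counter


def inlink_distribution(edges):
    """Map k -> number of blogs with exactly k inlinks (k >= 1)."""
    indeg = Counter(dst for _, dst in edges)
    return Counter(indeg.values())


def loglog_slope(points):
    """Least-squares slope of log10(y) against log10(x) for (x, y) pairs."""
    xs = [math.log10(x) for x, _ in points]
    ys = [math.log10(y) for _, y in points]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Passing `inlink_distribution(edges).items()` to `loglog_slope` gives a slope that can be compared with the values reported on the slides.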
16
Simulation: Blog inlinks distribution
Power law distribution, slope -1.72
Similar curves are observed for properties of the
simulated blog and post networks
17
Power law distributions for various network sizes
Similar shape of curves for degree distributions
as observed by Shi et al. [1] in the real
blogosphere.
[1] X. Shi, B. Tseng, and L. Adamic, Looking at
the blogosphere topology through different
lenses, in ICWSM, 2007
18
Hop plot: Average neighborhood size vs. hop count
The hop plot shows the reachability of nodes in
the network. After N hops, the hop plot becomes
constant.
Reachability?
Comparison of hop plots for ICWSM, WWE and
Blogosphere (650K blog nodes, 1.4 million links)
pD: probability that a new node remains
disconnected
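A hop plot can be computed with a breadth-first search from every node. The sketch below assumes the graph is a dictionary mapping each node to a list of its out-neighbors (every node present as a key) and counts the source node itself in its neighborhood; both choices are assumptions, and definitions of the hop plot vary.

```python
from collections import deque


def hop_plot(adj, max_hops):
    """Average number of nodes reachable within h hops, for h = 0..max_hops."""
    totals = [0] * (max_hops + 1)
    for source in adj:
        # Breadth-first search limited to max_hops levels.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == max_hops:
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # Turn per-distance counts into cumulative neighborhood sizes.
        reached = [0] * (max_hops + 1)
        for d in dist.values():
            reached[d] += 1
        running = 0
        for h in range(max_hops + 1):
            running += reached[h]
            totals[h] += running
    return [t / len(adj) for t in totals]
```

On the chain 0→1→2 the curve flattens after two hops, which is the plateau described on the slide. For graphs of the size quoted above, running the search from a random sample of sources would be a more practical choice than visiting every node.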
19
Simulation: Scatter plot and degree correlations
Correlation coefficients: ICWSM 0.056, WWE 0.02,
Simulation 0.1
Popular blogs (high inlinks)
Popular avid writers (high inlinks and outlinks)
Avid writers (high outlinks)
BA model correlation coefficient 1
Random writers (rW) help to model a low
correlation coefficient
A correlation coefficient close to zero means
there is NO definite relation between the indegree
and outdegree of blog nodes
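The slides do not state which correlation measure was used. Assuming the Pearson coefficient between each blog's indegree and outdegree, the computation can be sketched as follows; the function name and input format are illustrative.

```python
import math


def degree_correlation(edges, nodes):
    """Pearson correlation between indegree and outdegree over all nodes."""
    indeg = {n: 0 for n in nodes}
    outdeg = {n: 0 for n in nodes}
    for u, v in set(edges):  # duplicate edges are counted once
        outdeg[u] += 1
        indeg[v] += 1
    xs = [indeg[n] for n in nodes]
    ys = [outdeg[n] for n in nodes]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined; 0.0 is a convention chosen here
    return cov / math.sqrt(var_x * var_y)
```

A graph in which every link is reciprocated gives identical indegree and outdegree per node and therefore a coefficient of 1, while a star in which many blogs link to a single popular blog gives -1.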
20
Distribution of SCC in blog and post network
(WWE and Simulation)
Community detection and influence modeling use
connected components
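The size distribution of strongly connected components can be obtained with Kosaraju's two-pass algorithm. The sketch below is an illustration rather than the thesis implementation; it assumes the graph is a dictionary mapping every node to a list of its out-neighbors, and it is written iteratively to avoid recursion limits on large graphs.

```python
def scc_sizes(adj):
    """Sizes of strongly connected components, largest first."""
    # Pass 1: record nodes in order of depth-first search completion.
    order, seen = [], set()
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(adj[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(adj[nxt])))
                    break
            else:
                # All neighbours explored: this node is finished.
                order.append(node)
                stack.pop()
    # Build the reversed graph.
    rev = {n: [] for n in adj}
    for u, targets in adj.items():
        for v in targets:
            rev[v].append(u)
    # Pass 2: search the reversed graph in reverse completion order.
    sizes, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nxt in rev[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        sizes.append(size)
    return sorted(sizes, reverse=True)
```

For a three-blog cycle with one extra blog that is only linked to, the result is one component of size 3 and one of size 1.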
21
Distribution of WCC in post network (WWE and
Simulation)
Power law distribution in WCC for post network
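Weakly connected components ignore link direction, so their sizes can be computed with a union-find structure over the edge list. This is a minimal sketch with hypothetical names, not the code used for the plots.

```python
from collections import Counter


def wcc_sizes(nodes, edges):
    """Sizes of weakly connected components, largest first."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merge the two components

    counts = Counter(find(n) for n in nodes)
    return sorted(counts.values(), reverse=True)
```

The sorted sizes can then be binned and plotted on log-log axes to check for the power law shape mentioned on the slide.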
22
Simulation: Posts per blog distribution
Posts per blog also follows a power law
distribution [1]
Power law distribution, slope -1.71
[1] J. Leskovec, M. McGlohon, C. Faloutsos, N.
Glance, and M. Hurst, Cascading behavior in
large blog graphs, ICWSM 2007
23
Effect of increase in blogs
Degree distributions remain almost the same
Reciprocity increases
Average degree increases
Clustering coefficient and reciprocity of the
post network are much lower than those of the
blog network
24
Effect of parameters: Random reads (rR), random
writers (rW), disconnected nodes (pD)
Increasing rR (random reads) decreases
reciprocity because it reduces the likelihood of
getting a reverse link
Empirically, an rW (random writers) of 0.35 gives
a low degree correlation and values similar to
the blogosphere for other parameters
Increasing pD reduces the size of the largest WCC
25
Conclusion
  1. The simulation resembles the blogosphere in
    degree distributions, degree correlations,
    reciprocity, average degree, clustering
    coefficient, and component distribution for blog
    and post networks.
  2. The simulated post network is sparse compared to
    the blog network, and posts per blog follows a
    power law distribution, as observed in the
    blogosphere.
  3. Useful tool for analysis of the blogosphere,
    testing new algorithms, and extrapolation (how
    will an increase in X affect some Y?)

26
Future work
  • Can we model buzz and popularity in the post
    network?
  • What is the effect of buzz on the properties of
    the network?
  • In-depth temporal analysis of evolving blog
    graphs
  • Can we enrich the model with topical information?
  • How can we model the blogroll?

27
Questions?
  • Thank you!
  • Acknowledgements
  • Advisor, committee members, coauthors, friends
    at UMBC
  • Data
  • BlogPulse, ICWSM, WWE