Group Formation in Large Social Networks: Membership, Growth, and Evolution - PowerPoint PPT Presentation

About This Presentation
Title:

Group Formation in Large Social Networks: Membership, Growth, and Evolution

Description:

... of on-line communities and social media (MySpace, Facebook, LiveJournal, Flickr) ... Cumulative set of words in titles is a proxy for top-level topics ...

Slides: 51
Provided by: csK4
Learn more at: https://www.cs.kent.edu

Transcript and Presenter's Notes


1
Group Formation in Large Social Networks
Membership, Growth, and Evolution
  • Lars Backstrom, Dan Huttenlocher, Jon Kleinberg,
    Xiangyang Lan
  • Presented by Natalia Dragan
  • Data Mining Techniques (CS6/73015-001), Fall 2006
  • Kent State University

2
Outline
  • Introduction
  • Motivation
  • Present work contributions
  • Related work
  • Membership, Growth, Evolution
  • Conclusions

3
Introduction
  • Structure of society: people tend to come
    together and form groups
  • Why it is important to study:
  • To better understand decision-making behavior
  • To track the early stages of an epidemic
  • To follow the popularity of new ideas and
    technologies
  • Significant growth in the scale and richness of
    on-line communities and social media (MySpace,
    Facebook, LiveJournal, Flickr)

4
Problems with studying social processes
  • Easy to build theoretical models (e.g., a subgraph
    branching out rapidly over links in the network,
    or a collection of small disconnected components
    growing in a speckled fashion)
  • Hard to make concrete empirical statements about
    these types of processes
  • Lack of reasonable vocabulary for studying group
    evolution

5
Present work
  • Principles by which groups develop and evolve in
    large-scale social networks
  • Crucial point: focusing on networks where the
    members have explicitly identified themselves as
    a group or community
  • (vs. an unsupervised graph clustering problem of
    inferring community structures in a network)
  • 3 main types of questions: Membership, Growth,
    Change

6
Membership, Growth, Change
  • Membership
  • Structural features that influence whether a
    given individual will join a particular group
  • Growth
  • Structural features that influence whether a
    given group will grow significantly over time
  • Change
  • How focus of interest changes over time
  • How these changes are correlated with changes in
    the set of group members

7
Related work
  • Identifying tightly-connected clusters within a
    given graph
  • Dill et al. consider implicitly identified
    communities
  • For a set of features (ZIP code, particular
    keyword, etc.) they consider a subgraph of the
    Web consisting of all pages containing this
    feature
  • Use of online social networks for data mining
  • The structure of the communities is not exploited
  • Diffusion of innovation studies
  • Property which is diffusing in the present work:
    membership in a given group

8
Diffusion of Innovations
  • Diffusion of Innovations is a theory that
    analyzes, and helps explain, the adoption of an
    innovation. In other words, it helps to explain
    the process of social change.
  • An innovation is an idea, practice, or object
    that is perceived as new by an individual or
    other unit of adoption. The perceived newness of
    the idea for the individual determines his/her
    reaction to it (Rogers, 1995).
  • In addition, diffusion is the process by which an
    innovation is communicated through certain
    channels over time among the members of a social
    system. Thus, the four main elements of the
    theory are the innovation, communication
    channels, time, and the social system.
  • http://hsc.usf.edu/kmbrown/Diffusion_of_Innovations_Overview.htm

9
Related work on Diffusion of Innovation
  • How a social network evolves as its members'
    attributes change (Sarkar and Moore, Holme and
    Newman)
  • Social network evolution in a university setting
    (Kossinets and Watts)
  • Evolution of topics over time (Wang and McCallum)
  • Property which is diffusing in the present work:
    membership in a given group

10
Road map
  • Introduction
  • Motivation
  • Present work contributions
  • Related work
  • Membership, Growth, Evolution
  • Conclusions

11
Sources of data
  • LiveJournal
  • Free on-line community with 10 million members
  • About 300,000 members update their content in a
    24-hour period
  • Members maintain journals and individual and
    group blogs
  • Members declare who their friends are and to
    which communities they belong
  • DBLP
  • On-line database of computer science publications
    (about 400,000 papers)
  • Friendship network: co-authors of a paper
  • Community: a conference

12
Community Membership
  • Study of the processes by which individuals join
    communities in a social network
  • Fundamental question about the evolution of
    communities: who will join in the future?
  • Membership in a community: a behavior that
    spreads through the network
  • Diffusion of innovation studies offer a
    perspective on this question

13
Dependence on number of friends: a start towards
membership prediction
  • Underlying premise in diffusion studies: an
    individual's probability of adopting a new
    behavior increases with the number of friends (K)
    already engaging in the behavior
  • Theoretical models concentrate on the effect of
    K, while structural properties are more
    influential in determining membership

14
Approach hypothesis
  • For moderate values of K, an individual with K
    friends in a group is significantly more likely
    to join if these K friends are themselves mutual
    friends than if they are not

15
Dependence on number of friends
  • [Figure: user u and community C at the 1st and
    2nd snapshots; legend: user (u), community (C),
    friend. In the example, K = 3 and P(k) = 2/3]
  • Probability P(k) of joining a community: the
    fraction of triples (u, C, k) in which u, having
    k friends in C at the 1st snapshot, has joined C
    by the 2nd snapshot

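
The estimate P(k) can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes each observation is a hypothetical tuple (user, community, k, joined), where `joined` records whether the user belonged to the community at the 2nd snapshot. The function name `join_probability_by_k` and the toy data are assumptions made for this example.

```python
from collections import defaultdict

def join_probability_by_k(observations):
    """Return {k: fraction of (u, C, k) triples in which u joined C}."""
    totals = defaultdict(int)
    joins = defaultdict(int)
    for _user, _community, k, joined in observations:
        totals[k] += 1
        if joined:
            joins[k] += 1
    return {k: joins[k] / totals[k] for k in totals}

# Toy data (hypothetical): three users each had 3 friends in C at the
# 1st snapshot, and two of them had joined C by the 2nd snapshot.
observations = [
    ("u1", "C", 3, True),
    ("u2", "C", 3, True),
    ("u3", "C", 3, False),
]
p = join_probability_by_k(observations)
```

With this toy input, `p[3]` is the fraction 2/3 used in the slide's example.
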
16
Law of Diminishing returns
  • In economics, diminishing returns is the short
    form of diminishing marginal returns. In a
    production system, having fixed and variable
    inputs, keeping the fixed inputs constant, as
    more of a variable input is applied, each
    additional unit of input yields less and less
    additional output. This concept is also known as
    the law of increasing opportunity cost or the law
    of diminishing returns.
  • http://en.wikipedia.org/wiki/Diminishing_returns

17
Dependence on number of friends: LiveJournal
18
Dependence on number of friends: DBLP
19
Discussion of results
  • The plots for LJ and DBLP have similar shapes,
    dominated by the diminishing returns property
    (the curve continues increasing, but more and
    more slowly, even for large K)
  • P(2) > 2P(1): the benefit of having a second
    friend is particularly strong (S-shaped behavior)
  • The curve for LJ is quite smooth (1/2 billion
    triples vs. 7.8 million for DBLP)

20
A broader range of features
  • Features related to the community C (11)
  • Number of members (|C|)
  • Number of individuals with a friend in C (the
    fringe of C)
  • Number of edges with both ends in the community
    (|Ec|)
  • etc.
  • Features related to an individual u and her set S
    of friends in community C (8)
  • Number of friends in the community (|S|)
  • Number of adjacent pairs in S
  • Number of pairs in S connected via a path in Ec
  • etc.
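
The three friend-set features listed above can be computed from a friendship graph as in the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the graph is a plain dict mapping each id to a set of friend ids (undirected), and the function name `friend_set_features`, the dictionary keys, and the toy graph are all hypothetical.

```python
from itertools import combinations

def friend_set_features(user, community, friends):
    """Features of the set S of `user`'s friends inside `community`.

    community: set of member ids.
    friends: dict mapping id -> set of friend ids (undirected graph).
    """
    s = friends.get(user, set()) & community
    # E_C: edges with both ends in the community (adjacency restricted to C).
    adj_c = {m: friends.get(m, set()) & community for m in community}

    # Pairs of friends in S that are directly linked to each other.
    adjacent_pairs = sum(
        1 for a, b in combinations(sorted(s), 2) if b in adj_c[a]
    )

    # Label connected components of the subgraph induced by the community.
    component = {}
    for start in community:
        if start in component:
            continue
        component[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in adj_c[node]:
                if nxt not in component:
                    component[nxt] = start
                    stack.append(nxt)

    # Pairs of friends in S joined by some path that uses only edges in E_C.
    connected_pairs = sum(
        1 for a, b in combinations(sorted(s), 2)
        if component[a] == component[b]
    )
    return {
        "num_friends": len(s),
        "adjacent_pairs": adjacent_pairs,
        "connected_pairs": connected_pairs,
    }

# Toy graph (hypothetical): u is outside C and has friends a, b, c inside it.
community = {"a", "b", "c", "e"}
friends = {
    "u": {"a", "b", "c"},
    "a": {"u", "b", "e"},
    "b": {"u", "a"},
    "c": {"u", "e"},
    "e": {"a", "c"},
}
features = friend_set_features("u", community, friends)
```

Here only a and b are directly linked, but all three friends are connected through paths inside the community (a to c via e), so the two pair counts differ.
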

21
Estimating probability on a broader range of
features
  • Decision-tree techniques were applied to these
    features to improve the estimate of the
    probability of an individual joining a community
  • The technique incorporates
  • Individuals' positions in the network (structural
    features)
  • Level of activity among members (group
    features)

22
Predictions for LJ and DBLP
  • [Figure: community C and its fringe at the 1st
    and 2nd snapshots. A data point is a pair (u, C)
    with u in the fringe of C; the quantity to
    estimate is the probability that u joins C]
  • LJ: 17,076,344 data points, 875 communities;
    14,448 joined a community
  • DBLP: 7,651,013 data points; 71,618 joined a
    community
  • 20 decision trees were built to estimate the
    probability of joining

23
Splitting process for LJ
  • For each of the 875 communities, about half of
    its fringe members are included in the training
    set (each one independently with probability
    0.5)
  • At each node in the decision tree, every
    possible feature and every binary split
    threshold for that feature were examined
  • Of all such (feature, threshold) pairs, the
    split that produces the largest decrease in
    entropy was chosen
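
The split search described above can be sketched as an exhaustive scan over (feature, threshold) pairs. This is a generic entropy-based split, written as an illustration and not taken from the presentation; the function names `entropy` and `best_split` and the toy rows are assumptions.

```python
import math

def entropy(pos, total):
    """Binary entropy (in bits) of a node with `pos` positives out of `total`."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_split(rows, labels):
    """Try every feature and every threshold; keep the largest entropy decrease.

    rows: list of equal-length feature lists; labels: list of 0/1.
    Returns (feature_index, threshold, entropy_decrease).
    """
    n = len(rows)
    base = entropy(sum(labels), n)
    best = (None, None, 0.0)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        for t in values[:-1]:  # split: value <= t goes left, value > t goes right
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            weighted = (
                len(left) * entropy(sum(left), len(left))
                + len(right) * entropy(sum(right), len(right))
            ) / n
            decrease = base - weighted
            if decrease > best[2]:
                best = (f, t, decrease)
    return best

# Toy data (hypothetical): feature 0 could be the number of friends in C.
rows = [[1, 5], [2, 5], [3, 7], [4, 7]]
labels = [0, 0, 1, 1]
split = best_split(rows, labels)
```

Ties are broken in favor of the first pair examined; the slides do not say how ties were handled.
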

24
Splitting process for LJ
  • New splits were installed until there were
    fewer than 100 positive cases at the node
  • Leaf nodes predict the ratio of positive to total
    cases for that node
  • Averaging technique
  • For every case, only the decision trees whose
    training set does not include that case are
    used
  • The average of these trees' predictions is the
    prediction for the case
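
The averaging step can be sketched as follows. This is a simplified illustration under stated assumptions: cases are sampled individually with probability 0.5 rather than per community fringe, and `train_tree` is a hypothetical interface that takes training cases and returns a prediction function. A trivial base-rate learner stands in for a real decision tree so the example stays self-contained.

```python
import random

def averaged_predictions(cases, train_tree, num_trees=20, seed=0):
    """Average, for each case, the predictions of trees that did not train on it.

    cases: list of (features, label).
    train_tree: callable(training_cases) -> callable(features) -> probability
                (hypothetical interface for this sketch).
    """
    rng = random.Random(seed)
    sums = [0.0] * len(cases)
    counts = [0] * len(cases)
    for _ in range(num_trees):
        # Each case enters the training set independently with probability 0.5.
        in_train = [rng.random() < 0.5 for _ in cases]
        training = [c for c, keep in zip(cases, in_train) if keep]
        if not training:
            continue
        predict = train_tree(training)
        for i, (features, _label) in enumerate(cases):
            if not in_train[i]:  # only trees that never saw this case
                sums[i] += predict(features)
                counts[i] += 1
    # None marks a case that happened to be in every training set.
    return [s / c if c else None for s, c in zip(sums, counts)]

def base_rate_learner(training):
    """Stand-in learner: always predicts the positive rate of its training set."""
    rate = sum(label for _features, label in training) / len(training)
    return lambda _features: rate

# Toy cases (hypothetical): one feature, label is 1 when the feature is >= 3.
cases = [((k,), int(k >= 3)) for k in range(1, 7)]
predictions = averaged_predictions(cases, base_rate_learner)
```
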

25
Averaging model (Simple description)
  • When selecting a model to explain the data from
    among all possible models, the one that best
    fits the data is usually chosen.
  • But sometimes several models explain the data
    comparably well, creating model selection
    uncertainty, which is usually ignored.
  • BMA (Bayesian Model Averaging) provides a
    coherent mechanism for accounting for this model
    uncertainty, combining predictions from the
    different models according to their probability.
  • Hoeting, J. A., Madigan, D., Raftery, A. E., and
    Volinsky, C. T. Bayesian model averaging: a
    tutorial (with discussion). Statistical Science,
    14(4):382-417, 1999

26
Averaging model (Simple description)
  • Example: we have evidence D and 3 possible
    hypotheses h1, h2 and h3. The posterior
    probabilities for those hypotheses are
    P(h1 | D) = 0.4, P(h2 | D) = 0.3 and
    P(h3 | D) = 0.3
  • Given a new observation, suppose h1 classifies it
    as true and h2 and h3 classify it as false; the
    result of the global classifier (BMA) is then
    obtained by weighting each hypothesis's
    classification by its posterior probability
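
The calculation for this example can be written out as below. The formula from the original slide is not preserved in this transcript, so this is a reconstruction that assumes the usual rule in which each hypothesis votes with weight equal to its posterior probability.

```latex
\begin{align*}
P(\text{true} \mid D)  &= P(h_1 \mid D) = 0.4 \\
P(\text{false} \mid D) &= P(h_2 \mid D) + P(h_3 \mid D) = 0.3 + 0.3 = 0.6
\end{align*}
```

Since 0.6 > 0.4, the averaged classifier labels the observation as false, even though h1 is the single most probable hypothesis.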

27
Top two level splits for predicting single
individuals joining communities in LJ
28
Performance achieved with the decision trees
  • [Table: prediction performance for single
    individuals joining communities in LJ]
  • [Table: prediction performance for single
    individuals joining communities in DBLP]
29
Internal connectedness of friends
  • Individuals whose friends in a community are
    linked to one another are significantly more
    likely to join the community
30
Road map
  • Introduction
  • Motivation
  • Present work contributions
  • Related work
  • Membership, Growth, Evolution
  • Conclusions

31
Community Growth
  • Prediction task: identifying which communities
    will grow significantly over a given period of
    time
  • Binary classification problem
  • Training set: communities labeled by growth,
    with thresholds at 9% and 18%; Class 1 (49.4%),
    Class 0 (50.6%)
  • Data set: 13,570 communities
  • To make predictions, 100 decision trees were
    built on 100 independent samples using the
    community features
  • Binary splits are installed until a node has
    fewer than 50 data points

32
Top two levels of decision tree splits for
predicting community growth in LJ
  • The features and splits varied depending on the
    sample, but the top 2 splits were stable

33
Solution to the problem
  • The averaged decision-tree technique was used
  • Three baselines, each with a single feature,
    were considered
  • Size of the community
  • Number of people in the fringe of the community
  • Ratio of these two features
  • A combination of all three features was also
    considered

34
Results
  • [Table: predicting community growth; baselines
    based on three different features, and
    performance using all features]
  • Including the full set of features gives
    predictions with reasonably good performance

35
Road map
  • Introduction
  • Motivation
  • Present work contributions
  • Related work
  • Membership, Growth, Evolution
  • Conclusions

36
Movement between communities
  • How people and topics move between communities
  • Fundamental question, given a set of overlapping
    communities:
  • do topics tend to follow people,
  • or do people tend to follow topics?
  • Experiment setup: 87 conferences for which there
    is DBLP data over at least a 15-year period
  • The cumulative set of words in titles is a proxy
    for top-level topics

37
Time Series and Detected Bursts: Term Bursts
  • [Figure: time series Tw,C(y) for y = 2000 to
    2005, with term bursts marked. Example shown:
    OOPSLA 2003, with the titles "Micro-Pattern
    Evolution" and "Micro-Pattern in Java";
    Tw,C(y) = 2/6, and Micro-Pattern is hot at
    OOPSLA in 2003]

38
Time Series and Detected Bursts: Term Bursts
  • Tw,C(y): the fraction of paper titles at
    conference C in year y that contain the word w
  • Bursts in the usage of w
  • A burst in a time series Tw,C is an interval in
    which Tw,C(y) is twice the average rate (the
    burst rate)
  • A burst detection technique that exploits a
    stochastic model for term generation is used
  • Burst intervals serve to identify the hot
    topics (the focus of interest at a conference)
  • Word w is hot at conference C in year y if the
    year y is contained in a burst interval of the
    time series Tw,C
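
The time series Tw,C(y) and a rough notion of "hot" years can be sketched as below. Note the simplification: the presentation uses a burst detection technique based on a stochastic model, whereas `hot_years` here is only a plain threshold test (at least twice the average rate), swapped in to keep the example short. The function names and all titles other than the two Micro-Pattern titles from the slide are hypothetical.

```python
def term_fraction_series(titles_by_year, word):
    """Return {year: fraction of titles containing `word`} for one conference."""
    word = word.lower()
    series = {}
    for year, titles in titles_by_year.items():
        hits = sum(1 for title in titles if word in title.lower().split())
        series[year] = hits / len(titles) if titles else 0.0
    return series

def hot_years(series, factor=2.0):
    """Simplified stand-in for burst detection: years at >= factor * average."""
    average = sum(series.values()) / len(series)
    if average == 0:
        return []
    return sorted(y for y, v in series.items() if v >= factor * average)

# Toy conference (hypothetical titles, except the two Micro-Pattern ones).
titles_by_year = {
    2000: ["Aspect Weaving Basics"],
    2001: ["Garbage Collection Revisited"],
    2002: ["Refactoring Large Systems"],
    2003: [
        "Micro-Pattern Evolution",
        "Micro-Pattern in Java",
        "Generic Types in Practice",
        "Tracing Object Lifetimes",
        "Modular Program Analysis",
        "Testing Concurrent Code",
    ],
    2004: ["Dynamic Language Runtimes"],
    2005: ["Traits and Mixins"],
}
series = term_fraction_series(titles_by_year, "Micro-Pattern")
hot = hot_years(series)
```

A threshold test marks isolated years, while a model-based detector produces contiguous intervals and is less sensitive to single-year noise.
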

39
Time Series and Detected Bursts: Movement Bursts
  • Author movement
  • Authors do not publish every year
  • Movement is asymmetric
  • An author is a member of a conference C in year y
    if he or she has published there in the 5 years
    leading up to y
  • Author a moves into C from B in year y (B → C) if
  • a has a paper in conference C in year y, and
  • a is a member of B in year y-1
  • Movement is a property of two conferences and a
    year

  • [Figure: example of author Smith moving from B
    (2002) to C (2003) with the paper "Micro-Pattern
    Evolution"]

40
Time Series and Detected Bursts: Movement Bursts
  • [Figure: time series MB,C(y) for y = 2000 to
    2005, with movement bursts marked. Example
    shown: authors Brown and Dill moving from B to C
    (2001, 2002); MB,C(y) = 2/5]

41
Time Series and Detected Bursts: Movement Bursts
  • MB,C(y): the fraction of authors at conference C
    in year y with the property that they are moving
    into C from B (B → C)
  • MB,C: the time series representing author
    movement
  • B → C movement burst:
  • an interval of y in which the value MB,C(y)
    exceeds the overall average by an absolute
    difference of 0.10
  • Burst detection is used to find burst intervals
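
The movement fraction MB,C(y) can be sketched from a set of publication records. This is an illustration under stated assumptions: a record is a hypothetical (author, conference, year) tuple, and membership is interpreted as having published in the five years ending with the given year, inclusive (the slides do not say whether that year itself counts). The function names and the toy records are assumptions.

```python
def is_member(pubs, author, conf, year, window=5):
    """True if `author` published at `conf` in the `window` years ending at `year`.

    Assumption: the window includes `year` itself.
    """
    return any(
        (author, conf, y) in pubs for y in range(year - window + 1, year + 1)
    )

def movement_fraction(pubs, conf_b, conf_c, year):
    """Fraction of authors at conf_c in `year` who were members of conf_b in year - 1."""
    authors_at_c = {a for (a, c, y) in pubs if c == conf_c and y == year}
    if not authors_at_c:
        return 0.0
    movers = {a for a in authors_at_c if is_member(pubs, a, conf_b, year - 1)}
    return len(movers) / len(authors_at_c)

# Toy records (hypothetical): five authors publish at C in 2002,
# and two of them (Brown and Dill) published at B in 2001.
pubs = {
    ("Brown", "B", 2001),
    ("Dill", "B", 2001),
    ("Brown", "C", 2002),
    ("Dill", "C", 2002),
    ("Lee", "C", 2002),
    ("Kim", "C", 2002),
    ("Park", "C", 2002),
}
m = movement_fraction(pubs, "B", "C", 2002)
```

With these records, `m` is 2/5. The definition is asymmetric: computing the fraction for C → B uses authors at B and membership in C, and generally gives a different value.
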

42
Goal of the Experiment 1
  • Identify how word burst and movement burst
    intervals are aligned in time
  • Word burst intervals identify hot terms
  • Movement burst intervals identify conference
    pairs B,C during which there was significant
    movement

43
Experiment 1: Papers contributing to Movement
Bursts
  • Characteristics of papers associated with some
    movement burst into a conference C
  • Do they exhibit different properties from
    arbitrary papers at C?
  • Use of terms currently hot at C
  • Use of terms that will be hot at C in the
    future
  • A paper at C in year y contributes to some
    movement burst at C if
  • one of its authors is moving B → C in y, and
  • y is part of a B → C movement burst

  • [Figure: Smith publishes "Micro-pattern
    Evolution" at OOPSLA03 after ICPC02, inside a
    movement burst spanning 2002 to 2004]

44
Papers contributing to Movement Bursts
  • A paper uses a hot term
  • if one of the words in its title is hot for the
    conference and year in which it appears
  • Question: do papers contributing to movement
    bursts differ from arbitrary papers in the way
    they use hot terms?
  • Papers contributing to a movement burst contain
    elevated frequencies of current and expired hot
    terms, but lower frequencies of future hot terms
  • Authors in a burst of movement into C from B are
    drawn to topics currently hot at C

45
Experiment 2: Alignment between different
conferences
  • Conferences B and C are topically aligned in a
    year y
  • if some word is hot at both B and C in year y
  • Property of two conferences and a specific year
  • Hypothesis: two conferences are more likely to be
    topically aligned in a given year if there is
    also a movement burst going between them

  • [Figure: "Micro-pattern" is hot at both OOPSLA03
    and ICSM03]

46
Results
  • 56.34% of all triples (B,C,y) such that there is
    a B → C movement burst containing year y have the
    property that B and C are topically aligned in
    year y
  • 16.2% of all triples (B,C,y) have the property
    that B and C are topically aligned in year y
  • The presence of a movement burst between 2
    conferences greatly increases the chance that
    they share a hot term
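
The size of this effect can be made explicit by dividing the two reported percentages. The ratio below is simple arithmetic on the figures above, not a number stated in the presentation.

```latex
\[
\frac{P(\text{aligned} \mid \text{movement burst})}{P(\text{aligned})}
  = \frac{0.5634}{0.162} \approx 3.48
\]
```

That is, a pair of conferences with a movement burst between them is roughly three and a half times as likely to share a hot term as an arbitrary pair in an arbitrary year.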

47
Do movement bursts or term bursts come first?
  • Consider a B → C movement burst and a hot term w
    such that B and C are topically aligned via w in
    some year y inside the movement burst
  • 3 events of interest
  • The start of the burst for w at conference B
  • The start of the burst for w at conference C
  • The start of the B → C movement burst
48
Four patterns of author movement and topical
alignment
  • [Figure: four orderings of the term burst
    intervals at B and C relative to the B → C
    movement burst; counts shown: 32, 194, 35, 61]
  • Shared interest is 50% more frequent than the
    others
  • It is much more frequent for B and C to have a
    shared burst term that is already underway
    before the increase in author movement takes
    place

49
Conclusions
  • The ways in which communities in social networks
    grow over time were considered
  • At the level of individuals and their decision to
    join communities
  • At a more global level, in which a community can
    evolve in membership and content

50
Thank you!