Title: Group Formation in Large Social Networks: Membership, Growth, and Evolution
1Group Formation in Large Social Networks
Membership, Growth, and Evolution
- Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, Xiangyang Lan
- Presented by Natalia Dragan
- Data Mining Techniques(CS6/73015-001) Fall06
- Kent State University
2Outline
- Introduction
- Motivation
- Present work contributions
- Related work
- Membership, Growth, Evolution
- Conclusions
3Introduction
- Structure of society: people tend to come together and form groups
- Why it is important to study
- To better understand decision-making behavior
- Tracking the early stages of an epidemic
- Following the popularity of new ideas and technologies
- Significant growth in the scale and richness of on-line communities and social media (MySpace, Facebook, LiveJournal, Flickr)
4Problems with studying social processes
- Easy to build theoretical models (a subgraph branching out rapidly over links in the network, or a collection of small disconnected components growing in a speckled fashion)
- Hard to make concrete empirical statements about these types of processes
- Lack of a reasonable vocabulary for studying group evolution
5Present work
- Principles by which groups develop and evolve in large-scale social networks
- Crucial point: focusing on networks where the members have explicitly identified themselves as a group or community
- (vs. an unsupervised graph clustering problem of inferring community structures in a network)
- 3 main types of questions: Membership, Growth, Change
6Membership, Growth, Change
- Membership
- Structural features that influence whether a given individual will join a particular group
- Growth
- Structural features that influence whether a given group will grow significantly over time
- Change
- How the focus of interest changes over time
- How these changes are correlated with changes in the set of group members
7Related work
- Identifying tightly-connected clusters within a given graph
- Dill et al. consider implicitly identified communities
- For a set of features (ZIP code, particular keyword, etc.) they consider the subgraph of the Web consisting of all pages containing this feature
- Use of online social networks for data mining
- The structure of the communities is not exploited
- Diffusion of innovation studies
- The property that is diffusing in the present work is membership in a given group
8Diffusion of Innovations
- Diffusion of Innovations is a theory that analyzes, and helps explain, the adoption of a new innovation. In other words, it helps to explain the process of social change.
- An innovation is an idea, practice, or object that is perceived as new by an individual or other unit of adoption. The perceived newness of the idea for the individual determines his/her reaction to it (Rogers, 1995).
- In addition, diffusion is the process by which an innovation is communicated through certain channels over time among the members of a social system. Thus, the four main elements of the theory are the innovation, communication channels, time, and the social system.
- http://hsc.usf.edu/kmbrown/Diffusion_of_Innovations_Overview.htm
9Related work on Diffusion of Innovation
- How a social network evolves as its members' attributes change (Sarkar and Moore, Holme and Newman)
- Social network evolution in a university setting (Kossinets and Watts)
- Evolution of topics over time (Wang and McCallum)
- The property that is diffusing in the present work is membership in a given group
10Road map
- Introduction
- Motivation
- Present work contributions
- Related work
- Membership, Growth, Evolution
- Conclusions
11Sources of data
- LiveJournal
- Free on-line community with 10 million members
- 300,000 update their content in a 24-hour period
- Members maintain journals and individual and group blogs
- Members declare who their friends are and which communities they belong to
- DBLP
- On-line database of computer science publications (about 400,000 papers)
- Friendship network: co-authors of a paper
- Community: conference
12Community Membership
- Study of the processes by which individuals join communities in a social network
- Fundamental question about the evolution of communities: who will join in the future?
- Membership in a community is a behavior that spreads through the network
- The diffusion-of-innovation perspective is used to study this question
13Dependence on number of friends: start towards membership prediction
- Underlying premise in diffusion studies: an individual's probability of adopting a new behavior increases with the number of friends (K) already engaging in the behavior
- Theoretical models concentrate on the effect of K, while structural properties are more influential in determining membership
14Approach: hypothesis
- For moderate values of K, an individual with K friends in a group is significantly more likely to join if these K friends are themselves mutual friends than if they are not
15Dependence on number of friends
[Figure: 1st and 2nd snapshots of a community C, marking users (u) and their friends in C; example with K = 3 and P(k) = 2/3]
Probability P(k) of joining a community: the fraction of triples (u, C, k) in which u, who had k friends in C at the 1st snapshot, belongs to C at the 2nd snapshot
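The estimate of P(k) can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout `(user, community, k, joined)` and the function name `join_probability` are hypothetical, and the toy data simply mirrors the K = 3, P(k) = 2/3 example from the figure.

```python
from collections import defaultdict

def join_probability(triples):
    """Estimate P(k) from (user, community, k, joined) records.

    Each record describes a user who was not in the community at the first
    snapshot, had k friends in it, and either joined (True) or did not
    (False) by the second snapshot. The record layout is an assumption.
    """
    joined = defaultdict(int)
    total = defaultdict(int)
    for _user, _community, k, did_join in triples:
        total[k] += 1
        if did_join:
            joined[k] += 1
    return {k: joined[k] / total[k] for k in total}

# Hypothetical toy data: three users with k = 3, two of whom join.
toy = [
    ("u1", "C", 3, True),
    ("u2", "C", 3, True),
    ("u3", "C", 3, False),
    ("u4", "C", 1, False),
]
p = join_probability(toy)
```

Here `p[3]` is 2/3, matching the figure's example.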
16Law of Diminishing returns
- In economics, diminishing returns is the short form of diminishing marginal returns. In a production system with fixed and variable inputs, keeping the fixed inputs constant, as more of a variable input is applied, each additional unit of input yields less and less additional output. This concept is also known as the law of increasing opportunity cost or the law of diminishing returns.
- http://en.wikipedia.org/wiki/Diminishing_returns
17Dependence on number of friends: LiveJournal
18Dependence on number of friends: DBLP
19Discussion of results
- The plots for LJ and DBLP have similar shapes, dominated by the diminishing-returns property (the curve continues increasing, but more and more slowly, even for large K)
- P(2) > 2P(1): the benefit of having a second friend is particularly strong (S-shaped behavior)
- The curve for LJ is quite smooth (1/2 billion triples vs. 7.8 million for DBLP)
20A broader range of features
- Features related to the community C (11)
- Number of members (|C|)
- Number of individuals with a friend in C (the fringe of C)
- Number of edges with both ends in the community (|E_C|)
- etc.
- Features related to an individual u and her set S of friends in community C (8)
- Number of friends in the community (|S|)
- Number of adjacent pairs in S
- Number of pairs in S connected via a path in E_C
- etc.
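The three individual-level features named above can be computed roughly as follows. This is a minimal sketch, not the authors' code: the function name `friend_set_features`, the edge-list input format, and the use of union-find for path connectivity are all assumptions made for illustration.

```python
from itertools import combinations

def friend_set_features(friends_in_c, community_edges):
    """Return (|S|, adjacent pairs in S, pairs in S connected via a path in E_C).

    friends_in_c: set of u's friends that belong to community C (the set S).
    community_edges: iterable of (a, b) edges with both ends in C (E_C).
    """
    parent = {}  # union-find forest over nodes seen so far

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edge_set = set()
    for a, b in community_edges:
        edge_set.add(frozenset((a, b)))
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb  # merge the two components

    adjacent = 0
    connected = 0
    for a, b in combinations(sorted(friends_in_c), 2):
        if frozenset((a, b)) in edge_set:
            adjacent += 1
        if find(a) == find(b):
            connected += 1
    return len(friends_in_c), adjacent, connected

# Hypothetical toy community: a-b-d-c form one component, e-f another.
edges = [("a", "b"), ("b", "d"), ("d", "c"), ("e", "f")]
features = friend_set_features({"a", "b", "c"}, edges)
```

In the toy graph only a and b are directly linked, but all three friends sit in one connected component, so the path-based count is larger than the adjacency count.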
21Estimating probability on a broader range of
features
- Decision-tree techniques were applied to these features to improve estimates of the probability of an individual joining a community
- The technique incorporates
- The individual's position in the network (structural features)
- The level of activity among members (group features)
22Predictions for LJ and DBLP
[Figure: 1st and 2nd snapshots of a community C and its fringe; a data point is a pair (u, C) with u in the fringe of C, and the quantity estimated is the probability that u joins C]
- LJ: 17,076,344 data points, 875 communities; 14,448 joined a community
- DBLP: 7,651,013 data points; 71,618 joined a community
- 20 decision trees were built to estimate the probability of joining
23Splitting process for LJ
- Each of the 875 communities has half of its fringe members included in the training set (each chosen independently with probability 0.5)
- At each node in the decision tree, every possible feature and every binary split threshold for that feature were examined
- Of all such pairs, the split that produces the largest decrease in entropy was chosen
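The exhaustive split search described on this slide can be illustrated with a small sketch. Several details are assumptions rather than anything stated in the slides: candidate thresholds are taken as midpoints between consecutive distinct feature values, the entropy after a split is the size-weighted average over the two children, and the names `entropy` and `best_split` are hypothetical.

```python
import math

def entropy(pos, total):
    """Binary entropy (in bits) of a node with `pos` positives out of `total`."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_split(rows, labels):
    """Examine every (feature, threshold) pair and keep the largest entropy decrease.

    rows: list of equal-length numeric feature vectors; labels: list of 0/1.
    Returns (gain, feature_index, threshold), or None if no split reduces entropy.
    """
    n = len(rows)
    base = entropy(sum(labels), n)
    best = None
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        # Assumed candidate thresholds: midpoints between distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            after = (len(left) * entropy(sum(left), len(left))
                     + len(right) * entropy(sum(right), len(right))) / n
            gain = base - after
            if gain > 0 and (best is None or gain > best[0]):
                best = (gain, f, t)
    return best

# Hypothetical toy data: feature 0 separates the classes perfectly at 2.5.
rows = [[1, 5], [2, 1], [3, 4], [4, 2]]
labels = [0, 0, 1, 1]
gain, feature, threshold = best_split(rows, labels)
```

A full tree would apply `best_split` recursively and stop according to the deck's stopping rule; that recursion is omitted here to keep the sketch short.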
24Splitting process for LJ
- New splits were installed until there were fewer than 100 positive cases at the node
- Leaf nodes predict the ratio of positive to total cases for that node
- Averaging technique
- For every case, predictions are taken from the decision trees whose training sets do not include that case
- The average of these predictions is the prediction for the case
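The averaging step can be sketched as below. The representation of a tree as a `(training_ids, predict)` pair is a hypothetical stand-in for a fitted decision tree, chosen only to show how each case is scored by the trees that never saw it during training.

```python
def held_out_average(case_id, trees):
    """Average the predictions of the trees that did not train on `case_id`.

    trees: list of (training_ids, predict) pairs, where training_ids is a set
    of case ids and predict maps a case id to a probability (both assumed).
    Returns None if every tree saw the case during training.
    """
    preds = [predict(case_id) for training_ids, predict in trees
             if case_id not in training_ids]
    if not preds:
        return None
    return sum(preds) / len(preds)

# Hypothetical stand-ins for three fitted trees with constant predictions.
trees = [
    ({"a", "b"}, lambda case: 0.2),
    ({"b", "c"}, lambda case: 0.4),
    ({"c", "d"}, lambda case: 0.6),
]
score_a = held_out_average("a", trees)  # uses the 2nd and 3rd trees only
```

Because each case is scored only by trees trained without it, the averaged prediction is not inflated by a tree having memorized that case.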
25Averaging model (Simple description)
- When selecting a model to explain the data from all the possible models, the one that best fits the data is usually selected.
- But sometimes more than one model explains the data well, creating model selection uncertainty, which is usually ignored.
- BMA (Bayesian Model Averaging) provides a coherent mechanism for accounting for this model uncertainty, combining predictions from the different models according to their probability.
- J. A. Hoeting, D. Madigan, A. E. Raftery, and C. T. Volinsky. Bayesian model averaging: A tutorial (with discussion). Statistical Science, 14(4):382--417, 1999
26Averaging model (Simple description)
- Example: we have evidence D and 3 possible hypotheses h1, h2 and h3. The posterior probabilities for those hypotheses are P(h1 | D) = 0.4, P(h2 | D) = 0.3 and P(h3 | D) = 0.3
- Given a new observation, h1 classifies it as true and h2 and h3 classify it as false; the result of the global classifier (BMA) is then obtained by weighting each hypothesis's prediction by its posterior probability
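The calculation for this example can be worked through as a posterior-weighted vote. The sketch below uses the slide's numbers; the function name `bma_vote` and the dictionary layout are illustrative assumptions.

```python
def bma_vote(posteriors, predictions):
    """Sum the posterior weight behind each predicted label."""
    totals = {}
    for hypothesis, weight in posteriors.items():
        label = predictions[hypothesis]
        totals[label] = totals.get(label, 0.0) + weight
    return totals

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}         # P(h | D) from the slide
predictions = {"h1": True, "h2": False, "h3": False}  # each hypothesis's vote
totals = bma_vote(posteriors, predictions)
decision = max(totals, key=totals.get)
```

The label "true" collects weight 0.4 and "false" collects 0.3 + 0.3 = 0.6, so the averaged classifier outputs false even though h1, the single most probable hypothesis, says true.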
27Top two level splits for predicting single
individuals joining communities in LJ
28Performance achieved with the decision trees
Prediction performance for single individuals
joining communities in LJ
Prediction performance for single individuals
joining communities in DBLP
29Internal connectedness of friends
Individuals whose friends in community are linked
to one another are significantly more likely to
join the community
30Road map
- Introduction
- Motivation
- Present work contributions
- Related work
- Membership, Growth, Evolution
- Conclusions
31Community Growth
- Prediction task: identifying which communities will grow significantly over a given period of time
- Binary classification problem
- Training set
[Figure: community growth axis with thresholds at 9% and 18%; Class 1 (49.4%), Class 0 (50.6%)]
- Data set: 13,570 communities
- To make predictions, 100 decision trees were built on 100 independent samples using the community features
- Binary splits are installed until a node has fewer than 50 data points
32Top two levels of decision tree splits for
predicting community growth in LJ
- The features and splits varied depending on the
sample, but the top 2 splits were stable
33Solution to the problem
- The tree-averaging technique was used
- Three baselines, each with a single feature, were considered
- Size of the community
- Number of people in the fringe of the community
- Ratio of these two features
- A combination of all three features was also considered
34Results
Predicting community growth: baselines based on three different features, and performance using all features
By including the full set of features, predictions with reasonably good performance were obtained
35Road map
- Introduction
- Motivation
- Present work contributions
- Related work
- Membership, Growth, Evolution
- Conclusions
36Movement between communities
- How people and topics move between communities
- Fundamental question: given a set of overlapping communities
- do topics tend to follow people
- or do people tend to follow topics?
- Experiment setup: 87 conferences for which there is DBLP data over at least a 15-year period
- The cumulative set of words in titles is a proxy for top-level topics
37Time Series and Detected Bursts: Term Bursts
[Figure: time series Tw,C(y) over the years 2000-2005 with its term bursts marked; example: the titles "Micro-Pattern Evolution" and "Micro-Pattern in Java" at OOPSLA'03 give Tw,C(y) = 2/6, so "Micro-Pattern" is hot at OOPSLA in 2003]
38Time Series and Detected Bursts: Term Bursts
- Tw,C(y): fraction of paper titles at conference C in year y that contain the word w
- Bursts in the usage of w
- For each time series Tw,C, a burst is an interval in which Tw,C(y) is twice the average rate (burst rate)
- A burst detection technique exploiting a stochastic model for term generation is used
- Burst intervals serve to identify the hot topics (focus of interest at a conference)
- Word w is hot at conference C in year y if the year y is contained in a burst interval of the time series Tw,C
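A rough sketch of the term time series and a simplified burst rule follows. Note the substitution: the slides use a burst detector built on a stochastic model for term generation, whereas this sketch just flags runs of years whose value is at least twice the series mean. The function names, whitespace tokenization, and toy titles (apart from the two Micro-Pattern titles taken from the figure) are assumptions.

```python
def term_fraction(titles_by_year, word):
    """Tw,C(y): fraction of titles in each year that contain `word`."""
    w = word.lower()
    series = {}
    for year, titles in titles_by_year.items():
        hits = sum(1 for title in titles if w in title.lower().split())
        series[year] = hits / len(titles) if titles else 0.0
    return series

def threshold_bursts(series, factor=2.0):
    """Maximal runs of consecutive years with value >= factor * mean.

    A plain threshold rule, used here in place of the stochastic-model
    burst detector mentioned in the slides.
    """
    mean = sum(series.values()) / len(series)
    bursts, current = [], []
    for year in sorted(series):
        hot = mean > 0 and series[year] >= factor * mean
        if hot and current and year == current[-1] + 1:
            current.append(year)
        else:
            if current:
                bursts.append((current[0], current[-1]))
            current = [year] if hot else []
    if current:
        bursts.append((current[0], current[-1]))
    return bursts

# Hypothetical titles for one conference; two of six 2003 titles use the term.
toy = {
    2002: ["Garbage Collection Revisited", "Aspect Weaving"],
    2003: ["Micro-Pattern Evolution", "Micro-Pattern in Java",
           "Refactoring Tools", "Type Inference", "Traits", "Generic Ownership"],
    2004: ["Aspect Mining", "Program Slicing"],
}
series = term_fraction(toy, "micro-pattern")
bursts = threshold_bursts(series)
```

Under this rule the word is hot in a year exactly when that year falls inside one of the returned intervals.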
39Time Series and Detected Bursts: Movement Bursts
- Author movement
- Authors do not publish every year
- Movement is asymmetric
- Member of a conference C in year y
- Has published there in the 5 years leading up to y
- Author a moves into C from B in year y (B -> C)
- a has a paper in conference C in year y and
- a is a member of B in year y-1
- Movement is a property of two conferences and a year
[Figure: example of author Smith moving from B (2002) to C (2003) with the paper "Micro-Pattern Evolution"]
40Time Series and Detected Bursts: Movement Bursts
[Figure: time series MB,C(y) over the years 2000-2005 with its movement bursts marked; example: authors Brown and Dill move from B (2001, 2002) into C, giving MB,C(y) = 2/5]
41Time Series and Detected Bursts: Movement Bursts
- MB,C(y): fraction of authors at conference C in year y with the property that they are moving into C from B (B -> C)
- MB,C: time series representing author movement
- B -> C movement burst
- An interval of years y in which the value MB,C(y) exceeds the overall average by an absolute difference of 0.10
- Burst detection is used to find burst intervals
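The membership and movement definitions can be combined into a short sketch of MB,C(y). Two things are assumptions here: publications are stored as a set of `(author, conference, year)` tuples, and "the 5 years leading up to y" is read as the window y-4 through y inclusive. The authors Lee, Kim and Park are hypothetical fillers; Brown and Dill come from the figure.

```python
def is_member(author, conf, year, pubs, window=5):
    """True if `author` published at `conf` in the `window` years ending at `year`.

    Assumes the window includes `year` itself (years year-4 .. year for window=5).
    """
    return any((author, conf, y) in pubs
               for y in range(year - window + 1, year + 1))

def movement_fraction(b, c, year, pubs, window=5):
    """MB,C(y): fraction of authors at C in `year` who are moving in from B."""
    authors_at_c = {a for (a, conf, y) in pubs if conf == c and y == year}
    if not authors_at_c:
        return 0.0
    movers = {a for a in authors_at_c
              if is_member(a, b, year - 1, pubs, window)}
    return len(movers) / len(authors_at_c)

# Toy data: Brown and Dill were members of B before publishing at C in 2003.
pubs = {
    ("Brown", "B", 2001), ("Dill", "B", 2002),
    ("Brown", "C", 2003), ("Dill", "C", 2003),
    ("Lee", "C", 2003), ("Kim", "C", 2003), ("Park", "C", 2003),
}
m = movement_fraction("B", "C", 2003, pubs)
```

Here two of the five authors at C in 2003 were members of B in 2002, so `m` is 2/5. The reverse direction gives 0, which reflects the asymmetry of movement.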
42Goal of the Experiment 1
- Identify how word burst and movement burst intervals are aligned in time
- Word burst intervals identify hot terms
- Movement burst intervals identify conference pairs B, C and the years during which there was significant movement between them
43Experiment 1: Papers contributing to Movement Bursts
- Characteristics of papers associated with some movement burst into a conference C
- They exhibit different properties from arbitrary papers at C
- Use of terms currently hot at C
- Use of terms that will be hot at C in the future
- A paper at C in year y contributes to some movement burst at C
- If one of its authors is moving B -> C in y
- and y is part of a B -> C movement burst
[Figure: example: Smith, with a paper at ICPC'02, publishes "Micro-pattern Evolution" at OOPSLA'03 during a movement burst (timeline 2002-2004)]
44Papers contributing to Movement Bursts
- A paper uses a hot term
- If one of the words in its title is hot for the conference and year in which it appears
- Question: do papers contributing to movement bursts differ from arbitrary papers in the way they use hot terms?
Papers contributing to a movement burst contain elevated frequencies of currently hot and expired hot terms, but lower frequencies of future hot terms
Authors in a burst of movement into C from B are drawn to topics currently hot at C
45Experiment 2: Alignment between different conferences
- Conferences B and C are topically aligned in a year y
- If some word is hot at both B and C in year y
- Property of two conferences and a specific year
- Hypothesis: two conferences are more likely to be topically aligned in a given year if there is also a movement burst going between them
[Figure: example: "Micro-pattern" is hot at both OOPSLA'03 and ICSM'03]
46Results
- 56.34% of all triples (B, C, y) such that there is a B -> C movement burst containing year y have the property that B and C are topically aligned in year y
- 16.2% of all triples (B, C, y) have the property that B and C are topically aligned in year y
- The presence of a movement burst between 2 conferences makes it roughly 3.5 times as likely that they share a hot term
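The comparison behind these two percentages can be sketched as follows. The data structures are assumptions: `hot` maps a (conference, year) pair to its set of hot words, and `movement_bursts` is a set of (B, C, y) triples whose year lies inside a B -> C movement burst. The toy values are hypothetical apart from the OOPSLA/ICSM "micro-pattern" example.

```python
def aligned(hot_b, hot_c):
    """B and C are topically aligned in a year if they share a hot word."""
    return bool(set(hot_b) & set(hot_c))

def alignment_rates(triples, hot, movement_bursts):
    """Return (alignment rate within movement bursts, alignment rate overall)."""
    triples = list(triples)

    def rate(items):
        items = list(items)
        if not items:
            return 0.0
        hits = sum(aligned(hot.get((b, y), ()), hot.get((c, y), ()))
                   for b, c, y in items)
        return hits / len(items)

    in_burst = [t for t in triples if t in movement_bursts]
    return rate(in_burst), rate(triples)

# Hypothetical toy data built around the slide's OOPSLA/ICSM example.
hot = {
    ("OOPSLA", 2003): {"micro-pattern"},
    ("ICSM", 2003): {"micro-pattern"},
    ("ICPC", 2003): {"slicing"},
}
triples = [
    ("ICSM", "OOPSLA", 2003), ("ICPC", "OOPSLA", 2003),
    ("ICPC", "ICSM", 2003), ("ICSM", "ICPC", 2003),
]
movement_bursts = {("ICSM", "OOPSLA", 2003)}
burst_rate, overall_rate = alignment_rates(triples, hot, movement_bursts)
```

On the real data the first rate corresponds to the 56.34% figure and the second to the 16.2% baseline.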
47Do movement bursts or term bursts come first?
- Suppose there is a B -> C movement burst and a hot term w such that B and C are topically aligned via w in some year y inside the movement burst
- 3 events of interest
- The start of the burst for w at conference B
- The start of the burst for w at conference C
- The start of the B -> C movement burst
48Four patterns of author movement and topical alignment
[Figure: term burst intervals at B and C relative to the B -> C movement burst for the four patterns; counts shown: 32, 194, 35, 61]
- Shared interest is 50% more frequent than the others
- It is much more frequent for B and C to have a shared burst term that is already underway before the increase in author movement takes place
49Conclusions
- The ways in which communities in social networks grow over time were considered
- At the level of individuals and their decisions to join communities
- At a more global level, in which a community can evolve in membership and content
50 Thank you!