Group Formation in Large Social Networks: Membership, Growth, and Evolution

Slides: 51

Provided by: csK4

Learn more at: https://www.cs.kent.edu


1
Group Formation in Large Social Networks
Membership, Growth, and Evolution

Lars Backstrom, Dan Huttenlocher, Jon Kleinberg,
Xiangyang Lan
Presented by Natalia Dragan
Data Mining Techniques (CS6/73015-001), Fall 2006
Kent State University

2
Outline

Introduction
Motivation
Present work contributions
Related work
Membership, Growth, Evolution
Conclusions

3
Introduction

Structure of society: people tend to come
together and form groups
Why it is important to study:
To better understand decision-making behavior
Tracking the early stages of an epidemic
Following the popularity of new ideas and
technologies
Significant growth in the scale and richness of
on-line communities and social media (MySpace,
Facebook, LiveJournal, Flickr)

4
Problems with studying social processes

Easy to build theoretical models (subgraph
branching out rapidly over links in the network
or collection of small disconnected components
growing in a speckled fashion)
Hard to make concrete empirical statements about
these types of processes
Lack of reasonable vocabulary for studying group
evolution

5
Present work

Principles by which groups develop and evolve in
large-scale social networks
Crucial point: focusing on networks where the
members have explicitly identified themselves as
a group or community
(vs. an unsupervised graph clustering problem of
inferring community structures in a network)
3 main types of questions: Membership, Growth,
Change

6
Membership, Growth, Change

Membership
Structural features that influence whether a
given individual will join a particular group
Growth
Structural features that influence whether a
given group will grow significantly over time
Change
How focus of interest changes over time
How these changes are correlated with changes in
the set of group members

7
Related work

Identifying tightly-connected clusters within a
given graph
Dill et al. consider implicitly identified
communities
For a set of features (ZIP code, particular
keyword, etc.) they consider a subgraph of the
Web consisting of all pages containing this
feature
Use of online social networks for data mining
The structure of the communities is not exploited
Diffusion of innovation studies
Property which is diffusing in the present work:
membership in a given group

8
Diffusion of Innovations

Diffusion of Innovations is a theory that
analyzes, as well as helps explain, the
adaptation of a new innovation. In other words it
helps to explain the process of social change.
An innovation is an idea, practice, or object
that is perceived as new by an individual or
other unit of adoption. The perceived newness of
the idea for the individual determines his/her
reaction to it (Rogers, 1995).
In addition, diffusion is the process by which an
innovation is communicated through certain
channels over time among the members of a social
system. Thus, the four main elements of the
theory are the innovation, communication
channels, time, and the social system.
http://hsc.usf.edu/kmbrown/Diffusion_of_Innovations_Overview.htm

9
Related work on Diffusion of Innovation

How a social network evolves as its members'
attributes change (Sarkar and Moore, Holme and
Newman)
Social network evolution in a university setting
(Kossinets and Watts)
Evolution of topics over time (Wang and McCallum)
Property which is diffusing in the present work:
membership in a given group

10
Road map

Introduction
Motivation
Present work contributions
Related work
Membership, Growth, Evolution
Conclusions

11
Sources of data

LiveJournal
Free on-line community with 10 million members
300,000 members update their content in a 24-hour period
Members maintain journals, individual and group blogs
Members declare who their friends are and to which
communities they belong
DBLP
On-line database of computer science publications
(about 400,000 papers)
Friendship network: co-authors of a paper
Community: conference

12
Community Membership

Study of processes by which individuals join
communities in a social network
Fundamental question about the evolution of
communities: who will join in the future?
Membership in a community: a behavior that
spreads through the network
Diffusion of innovation studies: a perspective on
this question

13
Dependence on number of friends: a start towards
membership prediction

Underlying premise in diffusion studies: an
individual's probability of adopting a new behavior
increases with the number of friends (K) already
engaging in the behavior
Theoretical models concentrate on the effect of
K, while the structural properties are more
influential in determining membership

14
Approach: hypothesis

For moderate values of K, an individual with K
friends in a group is significantly more likely
to join if these K friends are themselves mutual
friends than if they are not

15
Dependence on number of friends

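[Figure: a community C at the 1st and 2nd snapshots, showing a user u and u's friends in C; in the example, K = 3 and P(k) = 2/3]
Probability P(k) of joining a community: the fraction
of triples (u, C, k), with u outside C and having k friends
in C at the 1st snapshot, for which u belongs to C at the 2nd snapshot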
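The definition of P(k) can be made concrete with a small sketch. It assumes the friendship graph and the community memberships at both snapshots are available as plain dictionaries; the function name `join_probability` and the toy data are illustrative and were chosen only to reproduce the slide's example value P(3) = 2/3.

```python
from collections import defaultdict

def join_probability(friends, members_t1, members_t2):
    """Estimate P(k) from two snapshots.

    friends:     dict user -> set of friends (first snapshot)
    members_t1:  dict community -> set of members at the first snapshot
    members_t2:  dict community -> set of members at the second snapshot
    """
    eligible = defaultdict(int)  # triples (u, C, k) with u outside C at t1
    joined = defaultdict(int)    # those triples where u is in C at t2
    for community, members in members_t1.items():
        later = members_t2.get(community, set())
        for user, user_friends in friends.items():
            if user in members:
                continue  # already a member, cannot join
            k = len(user_friends & members)
            if k == 0:
                continue  # not on the fringe of this community
            eligible[k] += 1
            if user in later:
                joined[k] += 1
    return {k: joined[k] / eligible[k] for k in eligible}

# Hypothetical toy data: three users each have k = 3 friends in C,
# and two of them have joined C by the second snapshot.
friends = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "c"},
    "u3": {"a", "b", "c"},
    "a": set(), "b": set(), "c": set(),
}
members_t1 = {"C": {"a", "b", "c"}}
members_t2 = {"C": {"a", "b", "c", "u1", "u2"}}
p = join_probability(friends, members_t1, members_t2)
```

Here `p[3]` is the fraction of eligible triples with k = 3 that ended in a join.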
16
Law of Diminishing returns

In economics, diminishing returns is the short
form of diminishing marginal returns. In a
production system, having fixed and variable
inputs, keeping the fixed inputs constant, as
more of a variable input is applied, each
additional unit of input yields less and less
additional output. This concept is also known as
the law of increasing opportunity cost or the law
of diminishing returns.
http://en.wikipedia.org/wiki/Diminishing_returns

17
Dependence on number of friends: LiveJournal
18
Dependence on number of friends: DBLP
19
Discussion of results

The plots for LJ and DBLP have similar shapes,
dominated by the diminishing returns property
(the curve continues increasing, but more and more
slowly, even for large K)
P(2) > 2P(1): the benefit of having a second friend is
particularly strong (S-shaped behavior)
The curve for LJ is quite smooth (1/2 billion triples
vs. 7.8 million for DBLP)

20
A broader range of features

Features related to the community C (11)
Number of members (|C|)
Number of individuals with a friend in C (the fringe
of C)
Number of edges with both ends in the community
(E_C)
etc.
Features related to an individual u and her set S
of friends in community C (8)
Number of friends in the community (|S|)
Number of adjacent pairs in S
Number of pairs in S connected via a path in E_C
etc.
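Three of the individual-level features listed above can be sketched as follows. This is only an illustration, assuming an undirected friendship graph stored as a dictionary of sets; the function name `individual_features` and the toy graph are hypothetical.

```python
from itertools import combinations

def individual_features(u, community, friends):
    """Compute |S|, adjacent pairs in S, and pairs in S joined by a path in E_C.

    friends:   dict user -> set of friends (undirected graph)
    community: set of members of C
    """
    s = friends[u] & community  # S: u's friends inside C

    # Pairs of friends in S that are directly linked to each other.
    adjacent_pairs = sum(
        1 for a, b in combinations(sorted(s), 2) if b in friends[a]
    )

    # Connected components of C using only edges with both ends in C (E_C).
    component = {}
    for start in community:
        if start in component:
            continue
        component[start] = start
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in friends.get(node, set()) & community:
                if nxt not in component:
                    component[nxt] = start
                    stack.append(nxt)

    # Pairs of friends in S that lie in the same component of E_C.
    connected_pairs = sum(
        1 for a, b in combinations(sorted(s), 2)
        if component[a] == component[b]
    )
    return {
        "friends_in_community": len(s),
        "adjacent_pairs": adjacent_pairs,
        "connected_pairs": connected_pairs,
    }

# Hypothetical toy graph: C = {a, b, c, d} with internal edges a-b and b-c;
# u is outside C and is friends with a, c, and d.
friends = {
    "u": {"a", "c", "d"},
    "a": {"u", "b"},
    "b": {"a", "c"},
    "c": {"b", "u"},
    "d": {"u"},
}
features = individual_features("u", {"a", "b", "c", "d"}, friends)
```

In the toy graph no two of u's friends are directly linked, but a and c are connected through b, which is the distinction between the last two features.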

21
Estimating probability on a broader range of
features

Decision-tree techniques were applied to these
features to make advances in estimating the
probability of an individual joining a community
The technique incorporates
Individual's position in the network (structural
features)
Level of activities among members (group
features)

22
Predictions for LJ and DBLP

[Figure: a community C and its fringe at the 1st and 2nd snapshots; each data point is a pair (u, C) with u in the fringe of C, and the quantity to estimate is the probability that u joins C]
LJ: 17,076,344 data points, 875 communities; 14,448 joined a community
DBLP: 7,651,013 data points; 71,618 joined a community
20 decision trees were built to estimate the probability of
joining
23
Splitting process for LJ

Each of the 875 communities has half of its fringe
members included in the training set (each with
independent probability 0.5)
At each node in the decision tree,
every possible feature and
every binary split threshold for that feature
were examined
Of all such pairs, the split which produces the
largest decrease in entropy was chosen
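The split selection described above can be sketched as an exhaustive search over (feature, threshold) pairs. This is a minimal illustration of entropy-based splitting, not the presenters' code; the function names and the toy rows are assumptions.

```python
import math

def entropy(pos, total):
    """Binary entropy (in bits) of a node with `pos` positives out of `total`."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_split(rows, labels):
    """Return (feature_index, threshold, entropy_decrease) of the best binary split.

    rows:   list of numeric feature tuples
    labels: list of 0/1 class labels
    """
    n = len(rows)
    base = entropy(sum(labels), n)
    best = (None, None, 0.0)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            weighted = (
                len(left) * entropy(sum(left), len(left))
                + len(right) * entropy(sum(right), len(right))
            ) / n
            decrease = base - weighted
            if decrease > best[2]:
                best = (f, threshold, decrease)
    return best

# Hypothetical toy data: feature 0 separates the classes perfectly.
rows = [(1, 5), (2, 1), (3, 4), (4, 2)]
labels = [0, 0, 1, 1]
split = best_split(rows, labels)
```

On the toy data the search picks feature 0 with threshold 2.5, which removes all of the node's entropy.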

24
Splitting process for LJ

New splits were installed until there were fewer
than 100 positive cases at the node
Leaf nodes predict the ratio of positive to total
cases for that node
Averaging technique
For every case, predictions are taken from the
decision trees whose training set does not
include that case
The average of these predictions is the prediction
for the case
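The averaging step can be sketched independently of how each tree is grown. The example below assumes each case enters a model's training set with probability 0.5 and uses a trivial stand-in model (a single leaf predicting the ratio of positives) in place of a real decision tree; `out_of_sample_average` and `leaf_ratio_model` are hypothetical names.

```python
import random

def leaf_ratio_model(training_cases):
    """Stand-in for a decision tree: one leaf predicting the positive ratio."""
    positives = sum(label for _, label in training_cases)
    ratio = positives / len(training_cases) if training_cases else 0.0
    return lambda case: ratio

def out_of_sample_average(cases, train_model, n_models=20, p_include=0.5, seed=0):
    """Average, for each case, the predictions of models not trained on it."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Each case is included in this model's training set independently.
        train_idx = {i for i in range(len(cases)) if rng.random() < p_include}
        model = train_model([cases[i] for i in sorted(train_idx)])
        models.append((train_idx, model))

    predictions = []
    for i, case in enumerate(cases):
        held_out = [model(case) for idx, model in models if i not in idx]
        # None marks the unlikely event that every model trained on this case.
        predictions.append(sum(held_out) / len(held_out) if held_out else None)
    return predictions

# Hypothetical cases: (feature tuple, label) pairs.
cases = [((k,), 1 if k % 3 == 0 else 0) for k in range(30)]
preds = out_of_sample_average(cases, leaf_ratio_model)
```

Because a case is only scored by models that never saw it, the averaged value behaves like a held-out estimate rather than a fit to the training data.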

25
Averaging model (Simple description)

When selecting a model to explain the data from all
the possible models, the one which best fits
the data is usually selected.
But sometimes more than one model explains
the data well, creating model selection
uncertainty, which is usually ignored.
BMA (Bayesian Model Averaging) provides a
coherent mechanism for accounting for this model
uncertainty, combining predictions from the
different models according to their probability.
Hoeting, J. A., Madigan, D., Raftery, A. E., and
Volinsky, C. T. Bayesian model averaging: A
tutorial (with discussion). Statistical Science,
14(4):382-417, 1999

26
Averaging model (Simple description)

Example: we have evidence D and 3 possible
hypotheses h1, h2 and h3. The posterior
probabilities for those hypotheses are P(h1 | D)
= 0.4, P(h2 | D) = 0.3 and P(h3 | D) = 0.3
Given a new observation, h1 classifies it as
true and h2 and h3 classify it as false; then the
result of the global classifier (BMA) would be
calculated as follows
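The calculation itself did not survive in the transcript. A minimal sketch of the standard posterior-weighted vote, using only the numbers given in the example, would look like this (the variable names are illustrative):

```python
# Posterior probability of each hypothesis given the evidence D.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
# Classification of the new observation by each hypothesis.
votes = {"h1": True, "h2": False, "h3": False}

# Each hypothesis contributes its posterior probability to the class it predicts.
p_true = sum(p for h, p in posteriors.items() if votes[h])
p_false = sum(p for h, p in posteriors.items() if not votes[h])

# The averaged classifier picks the class with the larger total weight.
bma_label = p_true > p_false
```

The weight for "true" is 0.4 and the weight for "false" is 0.3 + 0.3 = 0.6, so the averaged classifier answers false even though h1, the single most probable hypothesis, answers true.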

27
Top two level splits for predicting single
individuals joining communities in LJ
28
Performance achieved with the decision trees
Prediction performance for single individuals
joining communities in LJ
Prediction performance for single individuals
joining communities in DBLP
29
Internal connectedness of friends
Individuals whose friends in community are linked
to one another are significantly more likely to
join the community
30
Road map

Introduction
Motivation
Present work contributions
Related work
Membership, Growth, Evolution
Conclusions

31
Community Growth

Prediction task: identifying which communities
will grow significantly over a given period of
time
Binary classification problem
Training set

[Figure: community growth axis labeled "> 9%" and "< 18%", with Class 1 (49.4%) and Class 0 (50.6%)]
Data set: 13,570 communities
To make predictions, 100 decision trees were built
on 100 independent samples, using the community features
Binary splits are installed until a node has fewer than 50
data points
32
Top two levels of decision tree splits for
predicting community growth in LJ

The features and splits varied depending on the
sample, but the top 2 splits were stable

33
Solution to the problem

An averaging-tree technique was used
Three baselines, each with a single feature, were
considered:
Size of the community
Number of people in the fringe of the community
Ratio of these two features
A combination of all three features was also considered

34
Results
Predicting community growth: baselines based on
three different features, and performance using
all features
By including the full set of features, predictions
with reasonably good performance were obtained
35
Road map

Introduction
Motivation
Present work contributions
Related work
Membership, Growth, Evolution
Conclusions

36
Movement between communities

How people and topics move between communities
Fundamental question: given a set of overlapping
communities,
do topics tend to follow people,
or do people tend to follow topics?
Experiment setup: 87 conferences for which there
is DBLP data over at least a 15-year period
The cumulative set of words in titles is a proxy for
top-level topics

37
Time Series and Detected Bursts: Term Bursts

[Figure: time series T_{w,C}(y) over the years 2000-2005 with term bursts marked; example: the titles "Micro-Pattern Evolution" and "Micro-Pattern in Java" at OOPSLA03 give T_{w,C}(y) = 2/6, and "Micro-Pattern" is hot at OOPSLA in 2003]
38
Time Series and Detected Bursts: Term Bursts

T_{w,C}(y): fraction of paper titles at conference
C in year y that contain the word w
Bursts in the usage of w:
for each time series T_{w,C}, a burst is an interval in
which T_{w,C}(y) is twice the average rate (the burst
rate)
A burst detection technique exploiting a stochastic
model for term generation is used
Burst intervals serve to identify the hot
topics (focus of interest at a conference)
Word w is hot at conference C in year y if the
year y is contained in a burst interval of the
time series T_{w,C}
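The time series and the notion of a hot year can be illustrated with a simplified sketch. Note the substitution: the slides say a burst detection technique based on a stochastic model is used, whereas the code below uses a plain threshold (a year is hot when its rate is at least twice the series average). The function names and all titles other than the two quoted on the slides are hypothetical.

```python
def term_series(titles_by_year, word):
    """T_{w,C}(y): fraction of titles in each year that contain `word`.

    titles_by_year: dict year -> list of paper titles at one conference C
    """
    series = {}
    for year, titles in titles_by_year.items():
        hits = sum(1 for t in titles if word.lower() in t.lower().split())
        series[year] = hits / len(titles) if titles else 0.0
    return series

def simple_bursts(series, factor=2.0):
    """Maximal runs of years whose rate is at least `factor` times the mean.

    Simplified stand-in for the stochastic-model burst detector.
    """
    mean = sum(series.values()) / len(series)
    years = sorted(series)
    bursts, start, prev = [], None, None
    for year in years:
        hot = mean > 0 and series[year] >= factor * mean
        if hot and start is None:
            start = year            # a burst interval opens
        if not hot and start is not None:
            bursts.append((start, prev))  # the interval closed last year
            start = None
        prev = year
    if start is not None:
        bursts.append((start, years[-1]))
    return bursts

# Toy data: 2 of 6 titles in 2003 contain the word, as in the slide example.
titles_by_year = {
    2002: ["Paper A", "Paper B"],
    2003: ["Micro-Pattern Evolution", "Micro-Pattern in Java",
           "Paper C", "Paper D", "Paper E", "Paper F"],
    2004: ["Paper G", "Paper H", "Paper I"],
}
series = term_series(titles_by_year, "micro-pattern")
bursts = simple_bursts(series)
```

With this threshold rule the only burst interval in the toy series is the single year 2003, so "micro-pattern" counts as hot at the conference in 2003.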

39
Time Series and Detected Bursts: Movement Bursts

Author movement
Authors do not publish every year
Movement is asymmetric
An author is a member of a conference C in year y if the author
has published there in the 5 years leading up to
y
Author a moves into C from B in year y (B -> C) if
a has a paper in conference C in year y and
a is a member of B in year y-1
Movement is a property of two conferences and a year
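The membership and movement definitions translate almost directly into code. The sketch below assumes that "the 5 years leading up to y" means the years y-4 through y inclusive, which is one possible reading; the function names and the publication records are illustrative.

```python
def is_member(author, conference, year, papers, window=5):
    """True if `author` published at `conference` in the `window` years up to `year`.

    papers: set of (author, conference, year) publication records.
    Assumption: the window covers year-window+1 through year, inclusive.
    """
    return any(
        (author, conference, y) in papers
        for y in range(year - window + 1, year + 1)
    )

def moves_into(author, b, c, year, papers):
    """True if `author` moves B -> C in `year`."""
    has_paper_at_c = (author, c, year) in papers
    return has_paper_at_c and is_member(author, b, year - 1, papers)

# Hypothetical records: Smith publishes at ICPC in 2002 and at OOPSLA in 2003.
papers = {("Smith", "ICPC", 2002), ("Smith", "OOPSLA", 2003)}
forward = moves_into("Smith", "ICPC", "OOPSLA", 2003, papers)
backward = moves_into("Smith", "OOPSLA", "ICPC", 2003, papers)
```

The toy records show the asymmetry: Smith moves ICPC -> OOPSLA in 2003, but not OOPSLA -> ICPC.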

[Figure: example of movement from conference B to conference C: the paper "Micro-Pattern Evolution" by Smith, 2002 and 2003]
40
Time Series and Detected Bursts: Movement Bursts

[Figure: time series M_{B,C}(y) over the years 2000-2005 with movement bursts marked; example: authors Brown and Dill, conferences B and C, years 2001 and 2002, giving M_{B,C}(y) = 2/5]
41
Time Series and Detected Bursts: Movement Bursts

M_{B,C}(y): fraction of authors at conference C in
year y with the property that they are moving
into C from B (B -> C)
M_{B,C}: time series representing author movement
B -> C movement burst:
an interval of years y in which the value M_{B,C}(y)
exceeds the overall average by an absolute
difference of 0.10
Burst detection is used to find burst intervals
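A small sketch of the movement series and the 0.10 rule follows. It assumes membership means having published at B within the five years ending in the given year (inclusive), and it flags individual years rather than running a full burst detector; the function names and publication records are hypothetical.

```python
def movement_series(papers, b, c, years, window=5):
    """M_{B,C}(y): fraction of authors at C in year y who are moving B -> C.

    papers: set of (author, conference, year) publication records.
    """
    def member(author, conf, year):
        # Assumption: the membership window is year-window+1 .. year, inclusive.
        return any(
            (author, conf, y) in papers
            for y in range(year - window + 1, year + 1)
        )

    series = {}
    for y in years:
        authors = {a for (a, conf, yr) in papers if conf == c and yr == y}
        movers = {a for a in authors if member(a, b, y - 1)}
        series[y] = len(movers) / len(authors) if authors else 0.0
    return series

def movement_burst_years(series, margin=0.10):
    """Years whose value exceeds the overall average by at least `margin`."""
    mean = sum(series.values()) / len(series)
    return [y for y in sorted(series) if series[y] - mean >= margin]

# Toy data: in 2002, 2 of the 5 authors at C (Brown, Dill) published at B in 2001.
papers = {
    ("Brown", "B", 2001), ("Dill", "B", 2001),
    ("Brown", "C", 2002), ("Dill", "C", 2002),
    ("X1", "C", 2002), ("X2", "C", 2002), ("X3", "C", 2002),
    ("X1", "C", 2001), ("X2", "C", 2001),
    ("X1", "C", 2003), ("X3", "C", 2003),
}
series = movement_series(papers, "B", "C", [2001, 2002, 2003])
burst_years = movement_burst_years(series)
```

In the toy data M_{B,C}(2002) = 2/5 while the other years are 0, so 2002 is the only year that clears the average by the 0.10 margin.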

42
Goal of the Experiment 1

Identify how word burst and movement burst
intervals are aligned in time
Word burst intervals identify hot terms
Movement burst intervals identify conference
pairs (B, C) and periods during which there was significant
movement

43
Experiment 1: Papers contributing to Movement
Bursts

Characteristics of papers associated with some
movement burst into a conference C
They exhibit different properties from arbitrary
papers at C
Use of terms currently hot at C
Use of terms that will be hot at C in the
future
A paper at C in year y contributes to some movement
burst at C if
one of its authors is moving B -> C in y and
y is part of a B -> C movement burst

[Figure: example with the paper "Micro-pattern Evolution" by Smith, ICPC02 and OOPSLA03, and a movement burst marked on a 2002-2004 timeline]
44
Papers contributing to Movement Bursts

A paper uses a hot term
if one of the words in its title is hot for the
conference and year in which it appears
Question: do papers contributing to movement
bursts differ from arbitrary papers in the way
they use hot terms?

Papers contributing to a movement burst contain
elevated frequencies of current and expired
hot terms, but lower frequencies of future hot
terms
Authors in a burst of movement into C from B
are drawn to topics currently hot at C
45
Experiment 2: Alignment between different
conferences

Conferences B and C are topically aligned in a
year y
if some word is hot at both B and C in year y
Alignment is a property of two conferences and a specific year
Hypothesis: two conferences are more likely to be
topically aligned in a given year if there is
also a movement burst going between them
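Topical alignment reduces to a set intersection once the hot terms per conference and year are known. The sketch below assumes those sets have already been computed; the function name and every entry other than "micro-pattern" at OOPSLA and ICSM in 2003 are hypothetical.

```python
def topically_aligned(hot_terms, b, c, year):
    """True if some word is hot at both B and C in `year`.

    hot_terms: dict (conference, year) -> set of hot words
    """
    hot_b = hot_terms.get((b, year), set())
    hot_c = hot_terms.get((c, year), set())
    return bool(hot_b & hot_c)

# Toy data based on the slide example, plus one made-up entry for contrast.
hot_terms = {
    ("OOPSLA", 2003): {"micro-pattern"},
    ("ICSM", 2003): {"micro-pattern", "refactoring"},
    ("ICSM", 2004): {"refactoring"},
}
aligned_2003 = topically_aligned(hot_terms, "OOPSLA", "ICSM", 2003)
aligned_2004 = topically_aligned(hot_terms, "OOPSLA", "ICSM", 2004)
```

The two conferences are aligned in 2003 through the shared hot term, and not aligned in 2004.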

[Figure: example of topical alignment: "Micro-pattern" is hot at both OOPSLA03 and ICSM03]
46
Results

56.34% of all triples (B,C,y) such that there is a
B -> C movement burst containing year y have the
property that B and C are topically aligned in
year y
16.2% of all triples (B,C,y) have the property
that B and C are topically aligned in year y
The presence of a movement burst between 2
conferences enormously increases the chance they
share a hot term
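The size of the effect can be read off as a ratio of the two reported percentages. This is plain arithmetic on the figures above, with illustrative variable names:

```python
# Fraction of triples (B, C, y) that are topically aligned,
# with and without conditioning on a B -> C movement burst containing y.
p_aligned_given_burst = 0.5634
p_aligned_overall = 0.162

# How many times more likely alignment is when a movement burst is present.
lift = p_aligned_given_burst / p_aligned_overall
```

The ratio comes out at roughly 3.5, so alignment is about three and a half times as likely when a movement burst is present.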

47
Do movement bursts or term bursts come first?

Suppose there is a B -> C movement burst, and a hot term w
such that B and C are topically aligned via w in
some year y inside the movement burst
3 events of interest:
The start of the burst for w at conference B
The start of the burst for w at conference C
The start of the B -> C movement burst

48
Four patterns of author movement and topical
alignment
[Figure: the four orderings of the term burst intervals relative to the B -> C movement burst, with counts 32, 194, 35, and 61]
Shared interest is 50% more frequent than the
other patterns combined
It is much more frequent for
B and C to have a shared burst term that is
already underway before the increase in author
movement takes place
49
Conclusions

The ways in which communities in social networks
grow over time were considered
At the level of individuals and their decision to
join communities
At a more global level, in which a community can
evolve in membership and content

50
Thank you!
