Trust, Influence and Bias in Social Media presentation

Transcript and Presenter's Notes

1
Trust, Influence and Bias in Social Media
Anupam Joshi
Joint work with Tim Finin and several students
Ebiquity Group, UMBC
joshi_at_cs.umbc.edu
http://ebiquity.umbc.edu/
2
Knowing and Influencing your Audience

Your goal is to campaign for a presidential candidate
How can you track the buzz about him/her?
What are the relevant communities and blogs?
Which communities are supporters, which are
skeptical, which are put off by the hype?
Is your campaign having an effect? The desired
effect?
Which bloggers are influential with a political
audience? Of these, which are already on board and
which are lost causes?
Whom should you send details to or talk to?

3
Knowing and Influencing your Market

Your goal is to market Zune
How can you track the buzz about it?
What are the relevant communities and blogs?
Which communities are fans, which are suspicious,
which are put off by the hype?
Is your advertising having an effect? The desired
effect?
Which bloggers are influential in this market? Of
these, which are already on board and which are
lost causes?
To whom should you send details or evaluation
samples?

4
What is Influence?

The act or power of producing an effect without
apparent exertion of force or direct exercise of
command
Measurable Influence
The ability of a blogger to persuade another blogger to:
Take action by creating a new post about the topic and commenting on the original (text and graph mining).
Quote the blogger's views in her post (text mining).
Link to the original post via trackbacks or comments (graph mining).
Link to the blogger through other means like del.icio.us, digg, CiteULike, Connotea, etc. (graph mining).
Subscribe to the blog feed (graph mining).

5
What is a Community?
Political Blogs

A community in the real world is represented in a
graph as a set of nodes that have more links
within the set than outside it.
Graph
Citation Network
Affiliation Network
Sentiment Information
Shared Resource (tags, videos..)

Twitter Network
Facebook Network
6
Finding Communities (and Feeds) That Matter
Analysis of Bloglines feeds
83K publicly listed subscribers
2.8M feeds, of which 500K are unique
26K users (35%) use folders to organize subscriptions
Data collected in May 2006
Before Merge

Top Advertising Feeds
1. Adrants Marketing and Advertising News With Attitude
2. Adverblog advertising and new media marketing
3. http://ad-rag.com
4. adfreak
5. AdJab
6. MIT Advertising Lab future of advertising and advertising technology
7. AdPulp Daily Juice from the Ad Biz
8. Advertising/Design Goodness
Related Tags: advertising, marketing, media, news, design

After Merge
http://ftm.umbc.edu
7
Feeds That Matter

Top Feeds for Politics
Merged folders: political, political blogs
Talking Points Memo by Joshua Micah Marshall
Daily Kos State of the Nation
Eschaton
The Washington Monthly
Wonkette, Politics for People with Dirty Minds
http://instapundit.com/
Informed Comment
Power Line
AMERICAblog Because a great nation deserves the
truth
Crooks and Liars

Top Feeds for Knitting
Merged folders: knitting blogs, knitting
Yarn Harlot
Wendy Knits!
See Eunny Knit!
the blue blog
Grumperina goes to local yarn shops and Home
Depot
You Knit What??
Mason-Dixon Knitting
knit and tonic
Crazy Aunt Purl
http://www.lollygirl.com/blog/

8
Special Properties of Social Datasets

Long Tail
80/20 Rule or Pareto distribution
Few blogs get most attention/links
Most are sparsely connected
Motivation
Web graphs are large, but sparse
Expensive to compute community structure over the
entire graph
Goal
Approximate the membership of the nodes using
only a small portion of the entire graph.

9
Special Properties of Social Datasets

Intuition
Communities defined by the core (A)
Membership of the rest (B) approximated by how they
link to the core
Direct Method
NCut (Baseline)
Approximation
Singular value decomposition (SVD)
Sampling
Heuristic

10
Approximating Communities
ICWSM 08

SVD (low rank)
Sampling-based approach
Communities can be extracted by sampling only
columns from the head (Drineas et al.)
Heuristic: cluster the head to find initial
communities, then assign each tail node to the
cluster it most frequently links to.
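
The second step of the heuristic can be sketched as below. This is a minimal illustration and not the authors' implementation: it assumes the head nodes have already been clustered by some other method (the `head_labels` argument and the function name are hypothetical), and it only shows how each tail node takes the label of the head cluster it links to most often.

```python
from collections import Counter


def assign_tail_nodes(links, head_labels):
    """Assign each tail node to the head cluster it links to most often.

    links: dict mapping a node to the list of nodes it links to.
    head_labels: dict mapping each head node to its cluster id
                 (assumed to come from a separate clustering step).
    Returns a dict with a cluster id for every node that can be placed.
    """
    labels = dict(head_labels)
    for node, targets in links.items():
        if node in head_labels:
            continue  # head nodes keep their initial cluster
        # Count links from this tail node into each head cluster.
        votes = Counter(head_labels[t] for t in targets if t in head_labels)
        if votes:
            labels[node] = votes.most_common(1)[0][0]
    return labels


# Toy example: two head blogs per cluster, three tail blogs.
head = {"a": 0, "b": 0, "c": 1, "d": 1}
links = {
    "t1": ["a", "b", "c"],  # two links into cluster 0, one into cluster 1
    "t2": ["c", "d"],       # links only into cluster 1
    "t3": ["t1"],           # no links into the head, so left unassigned
}
result = assign_tail_nodes(links, head)
```

Tail nodes with no links into the head stay unlabeled in this sketch; how such nodes should be handled is not stated in the slides.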

[Figure: nodes ordered by degree; r columns sampled from the head]
11
Approximating Communities
ICWSM 08

Dataset: a blog dataset of 6000 blogs.

[Figure: original adjacency matrix compared with the heuristic approximation; modularity 0.51]
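
For readers unfamiliar with the modularity score quoted here, the sketch below computes the standard Newman-Girvan modularity for an undirected, unweighted graph. It is an illustration of the measure only; the function name and the toy graph are assumptions, and the slides do not say which implementation was used.

```python
def modularity(edges, community):
    """Newman-Girvan modularity of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping node -> community id.
    """
    m = len(edges)
    internal = {}    # number of edges with both ends in the community
    degree_sum = {}  # total degree of the nodes in the community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    # Q = sum over communities of (fraction of internal edges)
    #     minus (expected fraction under random rewiring).
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "left", 1: "left", 2: "left", 3: "right", 4: "right", 5: "right"}
q_split = modularity(edges, split)
q_single = modularity(edges, {n: "all" for n in range(6)})
```

Putting every node in one community gives a modularity of zero, so higher values indicate a partition with more internal links than chance would predict.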
12
Approximating Communities
ICWSM 08
[Figure: modularity vs. running time for the four methods; annotations: similar modularity, lower time, more time, low modularity]

Advantages: faster detection using a small portion of the graph, less memory
Complexity: SVD O(n^3), NCut O(nk), Sampling O(r^3), Heuristic O(rk), where n = number of blogs, k = number of clusters, r = number of columns

13
Approximating Communities
ICWSM 08
Additional evaluations using Variation of
Information score
14

Tags are free meta-data!
Other semantic features
Sentiments
Named Entities
Readership information
Geolocation information
etc.
How to combine this for detecting communities?

15
Social Media Graphs
Links Between Nodes and Tags
Links Between Nodes
Simultaneous Cuts
16
Communities in Social Media
A community in the real world is identified in a
graph as a set of nodes that have more links
within the set than outside it and share similar
tags.
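
The definition above has two parts that can each be checked on a candidate set of nodes. The sketch below counts links inside and leaving the set and measures tag overlap between two nodes. Jaccard overlap is an assumption chosen for illustration; the function names are hypothetical.

```python
def link_counts(edges, members):
    """Return (internal, external) link counts for a candidate community."""
    members = set(members)
    internal = sum(1 for u, v in edges if u in members and v in members)
    # An external link has exactly one endpoint inside the set.
    external = sum(1 for u, v in edges if (u in members) != (v in members))
    return internal, external


def jaccard(tags_a, tags_b):
    """Tag overlap between two nodes, from 0 (disjoint) to 1 (identical)."""
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


edges = [("x", "y"), ("y", "z"), ("x", "z"), ("z", "w")]
counts = link_counts(edges, {"x", "y", "z"})
overlap = jaccard({"politics", "news"}, {"politics", "election"})
```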
17
SimCUT: Clustering Tags and Graphs
WebKDD 08
[Figure: block matrix whose rows and columns are labeled Nodes and Tags]
Fiedler Vector Polarity
β = 0: entirely ignore link information; β = 1: equal importance to blog-blog and blog-tag links; β >> 1: NCut
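
A rough sketch of how a single scaling parameter can trade off link and tag information is given below. The block layout `[[beta * G, C], [C.T, 0]]`, the function name, and the toy matrices are assumptions made for illustration, since the slide's matrix figure did not survive in the transcript. The split uses the sign of the Fiedler vector (the eigenvector for the second-smallest eigenvalue of the normalized Laplacian), and the sketch requires every blog and tag to have at least one connection.

```python
import numpy as np


def simcut_bipartition(G, C, beta):
    """Split blogs and tags into two groups using the Fiedler vector.

    G: (n x n) symmetric blog-blog link matrix.
    C: (n x t) blog-tag matrix.
    beta: weight on the link information.
    Assumption: the joint matrix is [[beta * G, C], [C.T, 0]].
    """
    n, t = C.shape
    W = np.block([[beta * G, C], [C.T, np.zeros((t, t))]])
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L = np.eye(n + t) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)        # eigenvalues come back ascending
    fiedler = d_inv_sqrt * eigvecs[:, 1]  # second-smallest eigenvector
    side = fiedler >= 0                   # polarity of each entry
    return side[:n], side[n:]


# Four blogs in a chain; blogs 0-1 share tag 0 and blogs 2-3 share tag 1.
G = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
C = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]], dtype=float)
blog_side, tag_side = simcut_bipartition(G, C, beta=1.0)
```

Which group is labeled `True` is arbitrary, because an eigenvector's overall sign is not fixed; only the grouping is meaningful.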
18
SimCUT: Clustering Tags and Graphs
WebKDD 08
Clustering Only Links
Clustering Links and Tags
19
Datasets

Citeseer (Getoor et al.)
Agents, AI, DB, HCI, IR, ML
Words used in place of tags
Blog data
derived from the WWE/Buzzmetrics dataset
Tags associated with Blogs derived from
del.icio.us
For dimensionality reduction, 100 topics derived
from blog homepages using LDA (Latent Dirichlet
Allocation)
Pairwise similarity computed
RBF Kernel for Citeseer
Cosine for blogs

20
Clustering Tags and Graphs
Clustering Only Links
Clustering Links and Tags
21
Varying Scaling Parameter β
β >> 1 (only graph): accuracy 36%
β = 1 (graph + tags): accuracy 62%
β = 0 (only tags): accuracy 39%
Higher accuracy by adding tag information
Simple k-means: 23% (content only, binary)
Content only: 52% (Getoor et al. 2004)
22
Effect of Number of Tags, Clusters

Mutual Information
Measures the dependence between two random
variables.
Compares results with ground truth
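
As an illustration of the measure, the sketch below computes the mutual information, in bits, between a clustering and ground-truth labels over the same items. The function name and the toy labelings are hypothetical.

```python
import math
from collections import Counter


def mutual_information(labels_a, labels_b):
    """Mutual information (in bits) between two labelings of the same items."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), n_ab in joint.items():
        # p(a, b) * log2( p(a, b) / (p(a) * p(b)) )
        mi += (n_ab / n) * math.log2(n_ab * n / (count_a[a] * count_b[b]))
    return mi


truth = [0, 0, 1, 1]
mi_perfect = mutual_information(truth, [1, 1, 0, 0])    # same split, renamed
mi_unrelated = mutual_information(truth, [0, 1, 0, 1])  # independent split
```

The score does not depend on how clusters are numbered, which is why it suits comparing a clustering against ground truth.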

Link only has lower MI
More Semantics helps
Citeseer
Similar results for real, blog datasets
23
Influence in Communities
http://michellemalkin.com/
http://instapundit.com
http://dailykos.com
http://volokh.com
http://crooksandliars.com
http://rightwingnews.com
Communities detected using "Fast algorithm for
detecting community structure in networks", M. E.
J. Newman
24
Authority and Popularity

Authority
contributes to influence
Influence may be subjective.
A source, authoritative in one community could
influence another community negatively.
Within a community, an authoritative source is
influential.

Popularity
Authority and popularity often treated equally
On blog search engines, authority is measured
using inlinks, which is at best popularity
Popularity doesn't mean influence
Dilbert is extremely popular but not influential

25
Link Polarity and Sentiment
26
Link Polarity and Bias

Linking alone is not an indicator of influence
Polarity (+/- sentiment) indicates type of
influence
Consistent negative/positive opinion indicates
bias
Link polarity/citation signal helps determine
trust

Strong Negative Opinion
Strongly Positive opinion
Mildly Negative opinion
Democrat Blog
Republican Blog
27
Propagating Influence

Based on the work of Guha et al. [1] for modeling
propagation of trust and distrust.
Framework
$M_{ij}$ represents influence/bias from user i to j ($0 < M_{ij} < 1$)
$M_{ij}$ is initialized to the polarity from i to j.
Belief matrix M (sparse) represents the initial set
of known beliefs
Goal is to compute all unknown values in M
Belief matrix after the i-th atomic propagation:
$M_{i+1} = M_i C_i$
Combined operator:
$C_i = a_1 M + a_2 M^T M + a_3 M^T + a_4 M M^T$
$a = \{0.4, 0.4, 0.1, 0.1\}$ represents the weighting factors
[1] Guha R, Kumar R, Raghavan P, Tomkins A.
Propagation of trust and distrust. In
Proceedings of the Thirteenth International World
Wide Web Conference, New York, NY, USA, May 2004.
ACM Press, 2004.
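
The propagation step can be sketched with NumPy as below. This is a reading of the slide rather than the authors' code: the slide does not say whether the combined operator is rebuilt at every step or held fixed, so this sketch rebuilds it from the current belief matrix, and it applies no normalization or rounding afterwards. The function names and the toy matrix are hypothetical.

```python
import numpy as np

WEIGHTS = (0.4, 0.4, 0.1, 0.1)  # weighting factors quoted on the slide


def combined_operator(M, a=WEIGHTS):
    """C = a1*M + a2*M^T M + a3*M^T + a4*M M^T."""
    return a[0] * M + a[1] * (M.T @ M) + a[2] * M.T + a[3] * (M @ M.T)


def propagate(M, steps=1, a=WEIGHTS):
    """Apply M_{i+1} = M_i C_i for the given number of steps.

    Assumption: C_i is rebuilt from the current belief matrix each step.
    """
    M = np.asarray(M, dtype=float)
    for _ in range(steps):
        M = M @ combined_operator(M, a)
    return M


# Chain of beliefs: user 0 -> user 1 -> user 2, nothing known from 0 to 2.
M0 = np.array([[0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0],
               [0.0, 0.0, 0.0]])
M1 = propagate(M0, steps=1)
```

After one step the previously unknown entry from user 0 to user 2 becomes non-zero, which is the point of propagation: beliefs are inferred for pairs with no direct link.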

28
Recognizing subjectivity and sentiment

We've developed ΔTFIDF as a simple
feature-engineering technique to increase the
accuracy of subjectivity detection and sentiment
analysis
Our preliminary analysis shows that ΔTFIDF
Works well in different subject domains
Improves accuracy for documents of varying sizes:
sentence fragments, sentences, paragraphs and
multi-paragraph documents
Helps on text classification tasks other than
sentiment analysis

29
Feature Engineering for Text Classification

Typical features: words and/or phrases along with
term frequency or (better) TF-IDF scores
ΔTFIDF amplifies the training set signals by
using the ratio of the IDF for the negative and
positive collections
Results in a significant boost in accuracy

Text: The quick brown fox jumped over the lazy
white dog.
Features: the 2, quick 1, brown 1, fox 1, jumped 1,
over 1, lazy 1, white 1, dog 1, the quick 1,
quick brown 1, brown fox 1, fox jumped 1,
jumped over 1, over the 1, the lazy 1, lazy white 1,
white dog 1
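
The unigram and bigram counts in this example can be reproduced with a few lines of Python. The tokenization used here (lowercasing and dropping punctuation) is an assumption; the slides do not specify one, and the function name is hypothetical.

```python
import re
from collections import Counter


def unigram_bigram_counts(text):
    """Count single words and adjacent word pairs in a piece of text."""
    # Assumed tokenization: lowercase, keep letters and apostrophes only.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts


features = unigram_bigram_counts(
    "The quick brown fox jumped over the lazy white dog."
)
```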
30
ΔTFIDF BoW Feature Set

Value of feature t in document d is
Where
$C_{t,d}$: count of term t in document d
$N_t$: number of negative labeled training docs
with term t
$P_t$: number of positive labeled training docs
with term t
Normalize to avoid bias towards longer documents
Gives greater weight to rare (significant) words
Downplays very common words
Similar to Unigram + Bigram BoW in other aspects
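
The formula for the feature value is missing from the transcript. One form consistent with the definitions on this slide, and with the earlier description of the technique as a ratio of negative and positive document frequencies, is value(t, d) = C(t, d) * log2(N(t) / P(t)), which assumes equally sized negative and positive training sets. The sketch below uses that assumed form. The add-one smoothing is a hypothetical addition to avoid division by zero, the function names are illustrative, and length normalization is left out.

```python
import math
from collections import Counter


def doc_frequencies(docs):
    """Number of documents in which each term occurs at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def delta_tfidf(doc_tokens, neg_docs, pos_docs, smoothing=1.0):
    """Assumed form: C(t, d) * log2((N_t + s) / (P_t + s)).

    doc_tokens: tokens of the document being scored.
    neg_docs, pos_docs: token lists of the labeled training documents.
    smoothing: hypothetical constant s that keeps the ratio finite.
    """
    n_df = doc_frequencies(neg_docs)
    p_df = doc_frequencies(pos_docs)
    counts = Counter(doc_tokens)
    return {
        term: count * math.log2((n_df[term] + smoothing) / (p_df[term] + smoothing))
        for term, count in counts.items()
    }


neg_docs = [["bad", "plot"], ["bad"], ["bad", "film"]]
pos_docs = [["good", "film"], ["bad", "good"], ["good"]]
scores = delta_tfidf(["bad", "bad", "film", "good"], neg_docs, pos_docs)
```

Under this sign convention, terms that are more common in negative training documents get positive scores, terms more common in positive documents get negative scores, and terms spread evenly across both score zero.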

31
Example: ΔTFIDF vs TFIDF vs TF

ΔTFIDF            TFIDF          TF
, city            angels         ,
cage is           angels is      the
mediocrity        , city         .
criticized        of angels      to
exhilarating      maggie ,       of
well worth        city of        a
out well          maggie         and
should know       angel who      is
really enjoyed    movie goers    that
maggie ,          cage is        it
it's nice         seth ,         who
is beautifully    goers          in
wonderfully       angels ,       more
of angels         us with        you
Underneath the    city           but

15 features with highest values for a review of
City of Angels
32
Improvement over TFIDF (Uni- + Bi-grams)

Movie Reviews: 88.1% accuracy vs. 84.65% at a 95% confidence interval
Subjectivity Detection (opinionated or not): 91.26% vs. 89.4% at a 99.9% confidence interval
Congressional Support for Bill (voted for/against): 72.47% vs. 66.84% at a 99.9% confidence interval
Enron Email Spam Detection (spam or not): 98.917% vs. 96.6168% at a 99.995% confidence interval
All tests used 10-fold cross-validation
At least as good as mincuts subjectivity
detectors on movie reviews (87.2%)

33
Link Polarity Experiments

Domain
Political Blogosphere
Dataset from Buzzmetrics [2] provides post-post
link structure over 14 million posts
Few off-the-topic posts help aggregation
Potential business value
Reference Dataset
Hand-labeled dataset from Lada Adamic et al. [3]
classifying political blogs into right- and
left-leaning bloggers
Timeframe: 2004 presidential elections, over
1500 blogs analyzed
Overlap of 300 blogs between Buzzmetrics and
reference dataset
Goal
Classify the blogs in the Buzzmetrics dataset as
Democrat or Republican and compare with the
reference dataset

[2] Buzzmetrics, www.buzzmetrics.com
[3] Lada A. Adamic and Natalie Glance, "The
political blogosphere and the 2004 US Election",
in Proceedings of the WWW-2005 Workshop

34
Evaluation of Link Polarity
Polarity improves classification by almost 26%
Confusion Matrix

Accuracy: 73%
True positive (Recall): 78%
False positive (FP): 31%
True negative (Recall): 69%
False negative (FN): 21%
Precision (R): 75%
Precision (D): 72%

35
Trust Propagation Sample Data

Compensates for initial incorrect polarity
(DK-AT)
Doesn't change correct polarity (AT-DK)
Assigns correct polarity for non-existent direct
links (AT-IP)
Numbers in italics are problematic (MM-AT)
Improve sentiment detection?

36
MSM Classification Results
37
Interesting Observations

24 of 27 sources correctly classified
guardian, foxnews, humaneventsonline,
mediamatters
Outliers: The Nation, Boston Globe
Left- and right-leaning blogs talk negatively
about NY Times and ABC News and positively
about Raw Story and Examiner

38
Identifying Bias using KL Divergence
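
Only the title of this slide survives in the transcript, so how the divergence was applied is not known. As background, the sketch below computes the Kullback-Leibler divergence between two discrete distributions, for example the word distributions of two sources. The function name and the toy distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in bits for discrete distributions given as dicts.

    Assumes q[k] > 0 wherever p[k] > 0; otherwise the divergence is infinite.
    """
    return sum(
        p_k * math.log2(p_k / q[k])
        for k, p_k in p.items()
        if p_k > 0
    )


p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
d_pq = kl_divergence(p, q)
d_pp = kl_divergence(p, p)
```

The measure is asymmetric: D_KL(P || Q) generally differs from D_KL(Q || P), so the direction of comparison matters.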
39
Conclusion
40
Conclusion

Using topic, social structure and opinions we can
develop a model for influence, bias and trust in
social media
We apply this framework on real-world data and
describe techniques for identifying influence
Splogs are a big issue; we have developed
efficient techniques to detect them in near real
time
Does the game-theoretic nature of this system
raise fundamental new challenges for data mining?

41
Assets: Good, Bad and Wanted

How were the assets (data, APIs) helpful?
Where did these assets fail to be helpful, and why?
Since we go beyond search, search data is not that
useful?
Which research questions would you like to
address if you had unlimited access to assets?
Unlimited livespaces link and content data to
validate some of our approaches.
Use to place ads on social media sites

42
http://ebiquity.umbc.edu
