Trust, Influence and Bias in Social Media


1
Trust, Influence and Bias in Social Media
Anupam Joshi
Joint work with Tim Finin and several students
Ebiquity Group, UMBC
joshi_at_cs.umbc.edu
http://ebiquity.umbc.edu/
2
Knowing and Influencing your Audience
  • Your goal is to campaign for a presidential
    candidate
  • How can you track the buzz about him/her?
  • What are the relevant communities and blogs?
  • Which communities are supporters, which are
    skeptical, which are put off by the hype?
  • Is your campaign having an effect? The desired
    effect?
  • Which bloggers are influential with political
    audience? Of these, which are already onboard and
    which are lost causes?
  • To whom should you send details or talk to?

3
Knowing and Influencing your Market
  • Your goal is to market Zune
  • How can you track the buzz about it?
  • What are the relevant communities and blogs?
  • Which communities are fans, which are suspicious,
    which are put off by the hype?
  • Is your advertising having an effect? The desired
    effect?
  • Which bloggers are influential in this market? Of
    these, which are already onboard and which are
    lost causes?
  • To whom should you send details or evaluation
    samples?

4
What is Influence?
  • The act or power of producing an effect without
    apparent exertion of force or direct exercise of
    command
  • Measurable Influence
  • The ability of a blogger to persuade another
    blogger to:
  • Take action by creating a new post about the
    topic and commenting on the original (text and
    graph mining)
  • Quote the blogger's views in her post (text
    mining)
  • Link to the original post via trackbacks or
    comments (graph mining)
  • Link to the blogger through other means like
    del.icio.us, digg, citeULike, Connotea, etc.
    (graph mining)
  • Subscribe to the blog feed (graph mining)

5
What is a Community
Political Blogs
  • A community in the real world is represented in a
    graph as a set of nodes that have more links
    within the set than outside it.
  • Graph
  • Citation Network
  • Affiliation Network
  • Sentiment Information
  • Shared Resource (tags, videos..)
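
The graph definition of a community above can be checked mechanically. The sketch below is only an illustration: the function names and the toy edge list are hypothetical, and the graph is assumed to be undirected.

```python
def link_counts(edges, members):
    """Count links inside a node set and links crossing its boundary."""
    members = set(members)
    inside = outside = 0
    for u, v in edges:
        if u in members and v in members:
            inside += 1
        elif u in members or v in members:
            outside += 1
    return inside, outside


def looks_like_community(edges, members):
    """True when the set has more links within it than leaving it."""
    inside, outside = link_counts(edges, members)
    return inside > outside


# Hypothetical blog graph: two triangles joined by a single link (c-d).
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("c", "d"),
             ("d", "e"), ("e", "f"), ("d", "f")]

print(link_counts(toy_edges, {"a", "b", "c"}))
print(looks_like_community(toy_edges, {"a", "b", "c"}))
print(looks_like_community(toy_edges, {"c", "d"}))
```

In this toy graph each triangle qualifies as a community, while the pair {c, d} does not, because most of its links leave the set.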

Twitter Network
Facebook Network
6
Finding Communities (and Feeds) That Matter
Analysis of Bloglines Feeds
  • 83K publicly listed subscribers
  • 2.8M feeds, 500K are unique
  • 26K users (35%) use folders to organize
    subscriptions
  • Data collected in May 2006
Before Merge
  • Top Advertising Feeds
  • 1. Adrants: Marketing and Advertising News With
    Attitude
  • 2. Adverblog: advertising and new media marketing
  • 3. http://ad-rag.com
  • 4. adfreak
  • 5. AdJab
  • 6. MIT Advertising Lab: future of advertising and
    advertising technology
  • 7. AdPulp: Daily Juice from the Ad Biz
  • 8. Advertising/Design Goodness
  • Related Tags: advertising, marketing, media,
    news, design

After Merge
http://ftm.umbc.edu
7
Feeds That Matter
  • Top Feeds for Politics
  • Merged folders: political, political blogs
  • Talking Points Memo by Joshua Micah Marshall
  • Daily Kos: State of the Nation
  • Eschaton
  • The Washington Monthly
  • Wonkette, Politics for People with Dirty Minds
  • http://instapundit.com/
  • Informed Comment
  • Power Line
  • AMERICAblog: Because a great nation deserves the
    truth
  • Crooks and Liars
  • Top Feeds for Knitting
  • Merged folders: knitting blogs
  • Yarn Harlot knitting
  • Wendy Knits!
  • See Eunny Knit!
  • the blue blog
  • Grumperina goes to local yarn shops and Home
    Depot
  • You Knit What??
  • Mason-Dixon Knitting
  • knit and tonic
  • Crazy Aunt Purl
  • http://www.lollygirl.com/blog/

8
Special Properties of Social Datasets
  • Long Tail
  • 80/20 Rule or Pareto distribution
  • Few blogs get most attention/links
  • Most are sparsely connected
  • Motivation
  • Web graphs are large, but sparse
  • Expensive to compute community structure over the
    entire graph
  • Goal
  • Approximate the membership of the nodes using
    only a small portion of the entire graph.

9
Special Properties of Social Datasets
  • Intuition
  • Communities defined by the core (A)
  • Membership of the rest (B) approximated by how
    they link to the core
  • Direct Method
  • NCut (Baseline)
  • Approximation
  • Singular value decomposition (SVD)
  • sampling
  • Heuristic

10
Approximating Communities
ICWSM 08
  • SVD (low rank)
  • Sampling based Approach
  • Communities can be extracted by sampling only
    columns from the head (Drineas et al.)
  • Heuristic: cluster the head to find initial
    communities, then assign each tail node to the
    cluster it most frequently links to.
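
A minimal sketch of the heuristic in the last bullet, under stated assumptions: the slide does not say how the head is clustered, so connected components of the head-induced subgraph stand in as a placeholder, and the names `heuristic_communities` and `toy_edges` are hypothetical.

```python
from collections import Counter, defaultdict


def heuristic_communities(edges, r):
    """Cluster the r highest-degree nodes, then attach tail nodes by majority link."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Nodes ordered by degree (ties broken by name); the first r form the head.
    ordered = sorted(adj, key=lambda n: (-len(adj[n]), n))
    head = set(ordered[:r])

    # Placeholder head clustering (an assumption, not the authors' method):
    # connected components of the subgraph induced by the head nodes.
    label = {}
    next_label = 0
    for start in ordered[:r]:
        if start in label:
            continue
        label[start] = next_label
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in adj[node] & head:
                if nb not in label:
                    label[nb] = next_label
                    stack.append(nb)
        next_label += 1

    # Each tail node joins the head cluster it links to most often
    # (None when it has no links into the head).
    for node in ordered[r:]:
        votes = Counter(label[nb] for nb in adj[node] if nb in head)
        label[node] = votes.most_common(1)[0][0] if votes else None
    return label


# Hypothetical graph: two hub pairs (h1/h1b and h2/h2b) plus tail nodes.
toy_edges = [("h1", "h1b"), ("h2", "h2b"),
             ("t1", "h1"), ("t1", "h1b"),
             ("t2", "h2"), ("t2", "h2b"),
             ("t3", "h1"), ("t3", "h1b"), ("t3", "h2"),
             ("t4", "h2"), ("t4", "h2b"),
             ("t5", "h1"), ("t5", "h1b"),
             ("t6", "h2b")]
labels = heuristic_communities(toy_edges, r=4)
print(labels)
```

Only the head-to-head and tail-to-head links are ever inspected, which is where the savings over clustering the full graph would come from.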

[Figure: nodes ordered by degree; label r]
11
Approximating Communities
ICWSM 08
  • Dataset: a blog dataset of 6000 blogs.

Original Adjacency
Heuristic Approximation
Modularity 0.51
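
The slide reports a modularity score without defining it. Assuming it refers to Newman's modularity for an undirected, unweighted graph (an assumption, not stated in the deck), the score can be sketched as follows; the toy graph and names are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity: sum over communities of e_c/m - (d_c/(2m))^2."""
    m = len(edges)
    inside = defaultdict(int)   # e_c: edges with both ends in community c
    degree = defaultdict(int)   # d_c: total degree of nodes in community c
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            inside[community[u]] += 1
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Hypothetical graph: two triangles joined by one edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("c", "d"),
             ("d", "e"), ("e", "f"), ("d", "f")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {n: 0 for n in two_groups}

print(modularity(toy_edges, two_groups))
print(modularity(toy_edges, one_group))
```

Putting every node in one community scores zero, so a positive value indicates more within-community links than a random graph with the same degrees would have.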
12
Approximating Communities
ICWSM 08
[Figure labels: Similar Modularity, Lower Time, More Time, Low Modularity]
  • Advantages: faster detection using a small portion
    of the graph, less memory
  • Complexity: SVD O(n^3), NCut O(nk), Sampling
    O(r^3), Heuristic O(rk), where n = blogs, k =
    clusters, r = columns

13
Approximating Communities
ICWSM 08
Additional evaluations using Variation of
Information score
14
  • Tags are free meta-data!
  • Other semantic features
  • Sentiments
  • Named Entities
  • Readership information
  • Geolocation information
  • etc.
  • How to combine this for detecting communities?

15
Social Media Graphs
Links Between Nodes and Tags
Links Between Nodes
Simultaneous Cuts
16
Communities in Social Media
A community in the real world is identified in a
graph as a set of nodes that have more links
within the set than outside it and share similar
tags.
17
SimCUT: Clustering Tags and Graphs
WebKDD 08
[Figure: block matrix with rows and columns labeled Nodes and Tags]
Fiedler Vector Polarity
β = 0: entirely ignore link information
β = 1: equal importance to blog-blog and blog-tag links
β >> 1: NCut
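
The exact SimCUT matrix is not recoverable from this transcript, so the following is only a sketch of the idea. It assumes a joint graph whose weight matrix has a scaled node-node link block and an unscaled node-tag block, and it splits the nodes by the sign of the Fiedler vector of the normalized Laplacian. The function name, the block layout, and the toy data are all assumptions.

```python
import numpy as np


def joint_fiedler_split(links, tags, beta):
    """Two-way split of nodes from the Fiedler vector of a joint node/tag graph."""
    n, t = tags.shape
    # Assumed block layout: [[beta * links, tags], [tags.T, 0]].
    W = np.zeros((n + t, n + t))
    W[:n, :n] = beta * links
    W[:n, n:] = tags
    W[n:, :n] = tags.T

    degree = W.sum(axis=1)              # every node and tag needs degree > 0
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalized Laplacian, as used for normalized cuts.
    L = np.eye(n + t) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    eigenvalues, eigenvectors = np.linalg.eigh(L)   # ascending eigenvalues
    fiedler = eigenvectors[:, 1] * d_inv_sqrt       # second-smallest eigenvalue
    return fiedler[:n] >= 0                         # polarity of each node


# Hypothetical data: a 4-blog chain 0-1-2-3; blogs 0,1 share tag 0, blogs 2,3 share tag 1.
toy_links = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
toy_tags = np.array([[1, 0],
                     [1, 0],
                     [0, 1],
                     [0, 1]], dtype=float)
split = joint_fiedler_split(toy_links, toy_tags, beta=1.0)
print(split)
```

Under this layout, beta = 0 removes the link block entirely and a very large beta lets the link block dominate, which matches the three regimes listed above.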
18
SimCUT: Clustering Tags and Graphs
WebKDD 08
Clustering Only Links
Clustering Links + Tags
β = 0: entirely ignore link information
β = 1: equal importance to blog-blog and blog-tag links
β >> 1: NCut
19
Datasets
  • Citeseer (Getoor et al.)
  • Agents, AI, DB, HCI, IR, ML
  • Words used in place of tags
  • Blog data
  • derived from the WWE/Buzzmetrics dataset
  • Tags associated with Blogs derived from
    del.icio.us
  • For dimensionality reduction, 100 topics derived
    from blog homepages using LDA (Latent Dirichlet
    Allocation)
  • Pairwise similarity computed
  • RBF Kernel for Citeseer
  • Cosine for blogs
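
The two pairwise similarity measures named above can be sketched with NumPy. The kernel width `gamma` and the toy feature matrix are illustrative assumptions; the slides do not give the parameters actually used.

```python
import numpy as np


def rbf_similarity(X, gamma=1.0):
    """Pairwise RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def cosine_similarity(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.where(norms == 0, 1.0, norms)   # leave all-zero rows as zeros
    return unit @ unit.T


# Hypothetical feature rows (e.g. word or topic weights per document).
toy_X = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [2.0, 0.0]])
print(rbf_similarity(toy_X))
print(cosine_similarity(toy_X))
```

Cosine similarity ignores vector length, so rows 0 and 2 count as identical, while the RBF kernel penalizes their distance.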

20
Clustering Tags and Graphs
Clustering Only Links
Clustering Links + Tags
21
Varying Scaling Parameter β
β >> 1: accuracy 36%
β = 1: accuracy 62%
β = 0: accuracy 39%
Panels: Only Graph, Only Tags, Graph + Tags
Higher accuracy by adding tag information
Simple K-means: 23%; content only, binary; content only: 52% (Getoor et al. 2004)
22
Effect of Number of Tags, Clusters
  • Mutual Information
  • Measures the dependence between two random
    variables.
  • Compares results with ground truth
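
As a concrete reading of the bullets above, mutual information between a clustering and ground-truth labels can be computed from label co-occurrence counts. This is a generic sketch in bits; the slides do not say whether a normalized variant was used, and the label lists are made up.

```python
from collections import Counter
from math import log2


def mutual_information(labels_a, labels_b):
    """Mutual information (in bits) between two labelings of the same items."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts.
        mi += (c / n) * log2(c * n / (count_a[a] * count_b[b]))
    return mi


truth = [0, 0, 1, 1]
same_partition = [1, 1, 0, 0]   # same grouping, different cluster ids
unrelated = [0, 1, 0, 1]        # independent of the ground truth

print(mutual_information(truth, same_partition))
print(mutual_information(truth, unrelated))
```

The score depends only on how items are grouped, not on the cluster ids, which is why it suits comparing clusterings against labeled data.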

Link only has lower MI
More Semantics helps
Citeseer
Similar results for real blog datasets
23
Influence in Communities
http://michellemalkin.com/
http://instapundit.com
http://dailykos.com
http://volokh.com
http://crooksandliars.com
http://rightwingnews.com
Communities detected using "Fast algorithm for
detecting community structure in networks", M. E.
J. Newman
24
Authority and Popularity
  • Authority
  • contributes to influence
  • Influence may be subjective.
  • A source that is authoritative in one community
    could influence another community negatively.
  • Within a community, an authoritative source is
    influential.
  • Popularity
  • Authority and popularity often treated equally
  • On blog search engines, authority is measured
    using inlinks, which is at best popularity
  • Popularity doesn't mean influence
  • Dilbert is extremely popular but not influential

25
Link Polarity Sentiment
26
Link Polarity and Bias
  • Linking alone is not an indicator of influence
  • Polarity (+/- sentiment) indicates type of
    influence
  • Consistent negative/positive opinion indicates
    bias
  • Link polarity/citation signal helps determine
    trust

Strong Negative Opinion
Strongly Positive opinion
Mildly Negative opinion
Democrat Blog
Republican Blog
27
Propagating Influence
  • Based on the work of Guha et al. [1] for modeling
    propagation of trust and distrust. Framework:
  • $M_{ij}$ represents influence/bias from user i to
    j ($0 < M_{ij} < 1$)
  • $M_{ij}$ is initialized to the polarity from i to j
  • Belief matrix $M$ (sparse) represents the initial
    set of known beliefs
  • Goal is to compute all unknown values in $M$
  • Belief matrix after the ith atomic propagation:
  • $M_{i+1} = M_i C_i$
  • Combined operator:
  • $C_i = \alpha_1 M + \alpha_2 M^T M + \alpha_3 M^T + \alpha_4 M M^T$
  • $\alpha = (0.4, 0.4, 0.1, 0.1)$ represents the
    weighting factors
  • [1] Guha R, Kumar R, Raghavan P, Tomkins A.
    Propagation of trust and distrust. In
    Proceedings of the Thirteenth International World
    Wide Web Conference, New York, NY, USA, May 2004.
    ACM Press, 2004.
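
The propagation step above can be sketched in a few lines of NumPy. This is an illustration under one assumption the slide leaves open: the combined operator is recomputed from the current belief matrix at every step. The function name and the three-blog example are hypothetical.

```python
import numpy as np

# Weighting factors from the slide: direct link, M^T M, transpose, M M^T.
ALPHA = (0.4, 0.4, 0.1, 0.1)


def propagate(M, steps, alpha=ALPHA):
    """Apply the atomic propagation M <- M @ C for the given number of steps."""
    a1, a2, a3, a4 = alpha
    M = np.asarray(M, dtype=float)
    for _ in range(steps):
        # Assumption: C is rebuilt from the current M on each iteration.
        C = a1 * M + a2 * (M.T @ M) + a3 * M.T + a4 * (M @ M.T)
        M = M @ C
    return M


# Hypothetical polarities: blog 0 -> blog 1 (0.8) and blog 1 -> blog 2 (0.9);
# there is no direct link from blog 0 to blog 2.
initial = np.array([[0.0, 0.8, 0.0],
                    [0.0, 0.0, 0.9],
                    [0.0, 0.0, 0.0]])
after_one_step = propagate(initial, steps=1)
print(after_one_step[0, 2])
```

After one step the previously unknown entry from blog 0 to blog 2 becomes positive, because the direct-link term chains the two known polarities.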

28
Recognizing subjectivity and sentiment
  • We've developed ΔTFIDF as a simple
    feature-engineering technique to increase the
    accuracy of subjectivity detection and sentiment
    analysis
  • Our preliminary analysis shows that ΔTFIDF
  • Works well in different subject domains
  • Improves accuracy for documents of varying sizes:
    sentence fragments, sentences, paragraphs and
    multi-paragraph documents
  • Helps on text classification tasks other than
    sentiment analysis

29
Feature Engineering for Text Classification
  • Typical features: words and/or phrases along with
    term frequency or (better) TF-IDF scores
  • ΔTFIDF amplifies the training set signals by
    using the ratio of the IDF for the negative and
    positive collections
  • Results in a significant boost in accuracy

Text: The quick brown fox jumped over the lazy
white dog.
Features: the 2, quick 1, brown 1, fox 1, jumped
1, over 1, lazy 1, white 1, dog 1, the quick 1,
quick brown 1, brown fox 1, fox jumped 1, jumped
over 1, over the 1, lazy white 1, white dog 1
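
The unigram and bigram counts in this example can be reproduced with a short extractor. The tokenization (lowercasing, letters and apostrophes only) is an assumption chosen to match the "the 2" count; note that a plain extractor like this also emits the bigram "the lazy", which the list above does not show.

```python
import re
from collections import Counter


def unigram_bigram_counts(text):
    """Bag-of-words counts over unigrams and adjacent-word bigrams."""
    tokens = re.findall(r"[a-z']+", text.lower())
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


features = unigram_bigram_counts(
    "The quick brown fox jumped over the lazy white dog.")
print(features["the"], features["quick brown"], features["white dog"])
```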
30
ΔTFIDF BoW Feature Set
  • Value of feature t in document d is
  • Where
  • $C_{t,d}$: count of term t in document d
  • $N_t$: number of negative labeled training docs
    with term t
  • $P_t$: number of positive labeled training docs
    with term t
  • Normalize to avoid bias towards longer documents
  • Gives greater weight to rare (significant) words
  • Downplays very common words
  • Similar to Unigram + Bigram BoW in other aspects
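
The formula itself did not survive in this transcript. The sketch below assumes the commonly cited Delta TFIDF form for equally sized positive and negative training sets, value = C_{t,d} * log2(N_t / P_t), using the three quantities defined above. The add-one smoothing (to avoid dividing by zero) and all names and data are assumptions of this sketch, and the length normalization mentioned above is omitted.

```python
from collections import Counter
from math import log2


def doc_frequency(docs):
    """Number of documents containing each term."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def delta_tfidf(doc, pos_docs, neg_docs, smooth=1.0):
    """Assumed form: C_{t,d} * log2((N_t + smooth) / (P_t + smooth))."""
    p = doc_frequency(pos_docs)   # P_t
    n = doc_frequency(neg_docs)   # N_t
    counts = Counter(doc)         # C_{t,d}
    return {t: c * log2((n[t] + smooth) / (p[t] + smooth))
            for t, c in counts.items()}


# Hypothetical, pre-tokenized training reviews.
pos_docs = [["really", "enjoyed", "the", "film"],
            ["the", "film", "was", "wonderful"]]
neg_docs = [["the", "film", "was", "mediocre"],
            ["mediocre", "and", "dull", "the", "end"]]
scores = delta_tfidf(["the", "mediocre", "film", "really"], pos_docs, neg_docs)
print(scores)
```

Under this assumed form, a word that is equally common in both collections scores zero, and the sign of a score shows which collection the word favors.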

31
Example: ΔTFIDF vs TFIDF vs TF
  • Δtfidf | tfidf | tf
  • , city | angels | ,
  • cage is | angels is | the
  • mediocrity | , city | .
  • criticized | of angels | to
  • exhilarating | maggie , | of
  • well worth | city of | a
  • out well | maggie | and
  • should know | angel who | is
  • really enjoyed | movie goers | that
  • maggie , | cage is | it
  • it's nice | seth , | who
  • is beautifully | goers | in
  • wonderfully | angels , | more
  • of angels | us with | you
  • Underneath the | city | but

15 features with highest values for a review of
City of Angels
32
Improvement over TFIDF (Uni- + Bi-grams)
  • Movie Reviews: 88.1% accuracy vs. 84.65% at 95%
    confidence interval
  • Subjectivity Detection (opinionated or not):
    91.26% vs. 89.4% at 99.9% confidence interval
  • Congressional Support for Bill (voted for/
    against): 72.47% vs. 66.84% at 99.9% confidence
    interval
  • Enron Email Spam Detection (spam or not):
    98.917% vs. 96.6168% at 99.995% confidence
    interval
  • All tests used 10-fold cross-validation
  • At least as good as mincuts subjectivity
    detectors on movie reviews (87.2%)

33
Link Polarity Experiments
  • Domain
  • Political Blogosphere
  • Dataset from Buzzmetrics [2] provides post-post
    link structure over 14 million posts
  • Few off-the-topic posts help aggregation
  • Potential business value
  • Reference Dataset
  • Hand-labeled dataset from Lada Adamic et al. [3]
    classifying political blogs into right- and
    left-leaning bloggers
  • Timeframe: 2004 presidential elections, over
    1500 blogs analyzed
  • Overlap of 300 blogs between Buzzmetrics and
    reference dataset
  • Goal
  • Classify the blogs in the Buzzmetrics dataset as
    Democrat or Republican and compare with the
    reference dataset
  • [2] Buzzmetrics: www.buzzmetrics.com
  • [3] Lada A. Adamic and Natalie Glance, "The
    political blogosphere and the 2004 US Election",
    in Proceedings of the WWW-2005 Workshop

34
Evaluation of Link Polarity
Polarity improves classification by almost 26%
Confusion Matrix
  • Accuracy: 73%
  • True positive (Recall): 78%
  • False positive (FP): 31%
  • True negative (Recall): 69%
  • False negative (FN): 21%
  • Precision (R): 75%
  • Precision (D): 72%
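
For reference, the rates listed above all derive from the four cells of a two-class confusion matrix. The sketch below shows the arithmetic; the counts are hypothetical and chosen only for illustration, not taken from the experiment.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard rates computed from a two-class confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "true_positive_rate": tp / (tp + fn),    # recall of the positive class
        "false_positive_rate": fp / (fp + tn),
        "true_negative_rate": tn / (tn + fp),    # recall of the negative class
        "false_negative_rate": fn / (fn + tp),
        "precision_positive": tp / (tp + fp),
        "precision_negative": tn / (tn + fn),
    }


# Hypothetical counts: 50 blogs per class.
metrics = classification_metrics(tp=40, fp=15, tn=35, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The true positive and false negative rates sum to one, as do the true negative and false positive rates, which is a quick sanity check on any reported confusion matrix.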

35
Trust Propagation Sample Data
  • Compensates for initial incorrect polarity
    (DK-AT)
  • Doesn't change correct polarity (AT-DK)
  • Assigns correct polarity for non-existent direct
    links (AT-IP)
  • Numbers in italics are problematic (MM-AT)
  • Improve sentiment detection?

36
MSM Classification Results
37
Interesting Observations
  • 24 of 27 sources correctly classified
  • guardian, foxnews, humaneventsonline,
    mediamatters
  • Outliers: The Nation, Boston Globe
  • Left- and right-leaning blogs talk negatively
    about NY Times and ABC News, and positively
    about Raw Story and Examiner

38
Identifying Bias using KL Divergence
39
Conclusion
40
Conclusion
  • Using topic, social structure and opinions we can
    develop a model for influence, bias and trust in
    social media
  • We apply this framework on real-world data and
    describe techniques for identifying influence
  • Splogs are a big issue; we have developed
    efficient techniques to detect them in near real
    time
  • Does the game-theoretic nature of this system
    raise fundamental new challenges for data mining?

41
Assets: Good, Bad and Wanted
  • How were the assets (data, APIs) helpful?
  • Where did these assets fail to be helpful, and why?
  • Since we go beyond search, search data is not that
    useful?
  • Which research questions would you like to
    address if you had unlimited access to assets?
  • Unlimited livespaces link and content data to
    validate some of our approaches.
  • Use to place ads on social media sites

42
http://ebiquity.umbc.edu