Adversarial Information Retrieval

About This Presentation


Slides: 24

Provided by: Ryan172


Transcript and Presenter's Notes

1
Adversarial Information Retrieval

The Manipulation of Web Content

2
Introduction

Examples
TrustRank and Other Methods

3
What is Adversarial IR?

Gathering, Indexing, Retrieving and Ranking
Information
Subset of the information has been manipulated
maliciously
Financial Gain

4
What is the Goal of AIR?

Detect the bad sites or communities
Improve precision on search engines by
eliminating the bad guys

5
Simplest form

First-generation engines relied heavily on tf-idf
The top-ranked pages for the query "maui resort"
were the ones containing the most occurrences of
"maui" and "resort"
SEOs responded with dense repetitions of the chosen
terms
e.g., maui resort maui resort maui resort
Often, the repetitions were in the same color
as the background of the web page
Repeated terms got indexed by crawlers
But were not visible to humans in browsers

Pure word density cannot be trusted as an IR
signal
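
A minimal sketch of why raw term counts are easy to game, assuming a toy scorer that simply counts query-term occurrences (the function name and the sample pages are illustrative, not from the slides):

```python
from collections import Counter

def term_count_score(query: str, document: str) -> int:
    """Score a document by raw occurrences of the query terms.

    Illustrative only: no idf weighting and no length normalization,
    which is roughly the weakness keyword stuffing exploits.
    """
    counts = Counter(document.lower().split())
    return sum(counts[term] for term in query.lower().split())

# Hypothetical pages: one honest, one stuffed with repeated terms.
honest_page = "our maui resort offers ocean views and a quiet beach"
stuffed_page = "maui resort " * 20 + "cheap pills"

print(term_count_score("maui resort", honest_page))
print(term_count_score("maui resort", stuffed_page))
```

The stuffed page wins by a wide margin even though it says nothing useful about Maui resorts.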
6
Search Engine Spamming

Link-spam
Link-bombing
Spam Blogs
Comment Spam
Keyword Spam
Malicious Tagging

7
Spamming

Online tutorials for search engine persuasion
techniques
How to boost your PageRank
Artificial links and Web communities
Latest trend: Google bombing
A community of people creates (genuine) links with
a specific anchor text pointing to a specific page,
usually to make a political point

8
Google Bombing
9
Our Focus

Link Manipulation

10
Trust Rank

Observation
Good pages tend to link to good pages.
A human is the best spam detector
Algorithm
Select a small subset of pages and let a human
classify them
Propagate goodness of pages

11
Propagation

Trust function T
T(p) returns the probability that p is a good
page
Initial values
T(p) = 1, if p was found to be a good page
T(p) = 0, if p was found to be a spam page
Iterations
Propagate trust along out-links
for only a fixed number of iterations M.

12
Propagation (2)

Problem with propagation
Pages reachable from good seeds might not be good
The further away we are from good seed pages, the
less certain we are that a page is good.

Solution: reduce trust as we move further away
from the good seed pages (trust attenuation).

13
Trust attenuation: dampening

Propagate a dampened trust score β < 1 at the first
step
At the n-th step, propagate a trust of β^n
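
A minimal sketch of dampening, assuming the Web graph is given as a dictionary of out-links and that a page reachable by several paths keeps the score from its shortest path (that tie-breaking rule, the function name and the `max_steps` parameter are assumptions for illustration):

```python
from collections import deque

def dampened_trust(out_links, good_seeds, beta=0.85, max_steps=3):
    """Assign trust beta**n to pages first reached n links away from a good seed."""
    trust = {seed: 1.0 for seed in good_seeds}
    frontier = deque((seed, 0) for seed in good_seeds)
    while frontier:
        page, dist = frontier.popleft()
        if dist == max_steps:
            continue  # stop propagating after a fixed number of steps
        for child in out_links.get(page, []):
            if child not in trust:
                # Breadth-first order means the first visit is the shortest path.
                trust[child] = beta ** (dist + 1)
                frontier.append((child, dist + 1))
    return trust

# Hypothetical chain: a -> b -> c, with "a" as the only good seed.
chain = {"a": ["b"], "b": ["c"], "c": []}
print(dampened_trust(chain, ["a"], beta=0.5))
```

Pages that are never reached from a seed simply receive no entry, which corresponds to a trust of zero.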

14
Trust attenuation: splitting

The parent's trust value is split among its child nodes
Observation: the more links a page has, the less care
went into choosing them
Mix dampening and splitting? β^n · (split trust)
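
A minimal sketch of one splitting step, assuming each parent divides its trust equally among its out-links and a child sums what it receives; passing `beta` below 1 mixes in dampening (the function name and the sample graph are illustrative):

```python
def split_trust_step(out_links, trust, beta=1.0):
    """Propagate trust one step: every child gets beta * T(parent) / out-degree(parent)."""
    received = {}
    for parent, children in out_links.items():
        if not children:
            continue  # a page without out-links passes nothing on
        share = beta * trust.get(parent, 0.0) / len(children)
        for child in children:
            received[child] = received.get(child, 0.0) + share
    return received

# Hypothetical graph: "seed" links to two pages, "hub" links to one.
links = {"seed": ["x", "y"], "hub": ["y"]}
start = {"seed": 1.0, "hub": 1.0}
print(split_trust_step(links, start))            # splitting only
print(split_trust_step(links, start, beta=0.5))  # dampening and splitting combined
```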

15
Selection: Inverse PageRank

The seed set S should
be as small as possible
cover a large part of the Web
Covering is related to out-links in the very same
way PageRank is related to in-links
Inverse PageRank!
Perform PageRank on a graph with inverted links
G' = (V, E') where (p, q) ∈ E' ⇔ (q, p) ∈ E.
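
A minimal sketch of inverse PageRank, assuming a plain power iteration with a uniform jump vector and with the rank of dangling pages simply dropped (these simplifications and all function names are assumptions, not taken from the slides):

```python
def pagerank(out_links, beta=0.85, iterations=20):
    """Basic PageRank by power iteration over a dict of out-links."""
    pages = sorted(set(out_links) | {q for targets in out_links.values() for q in targets})
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1.0 - beta) / n for p in pages}
        for p in pages:
            targets = out_links.get(p, [])
            if not targets:
                continue  # dangling page: its mass is dropped in this sketch
            share = beta * score[p] / len(targets)
            for q in targets:
                nxt[q] += share
        score = nxt
    return score

def invert(out_links):
    """Build G' by reversing every edge: (p, q) in E' iff (q, p) in E."""
    inverted = {}
    for p, targets in out_links.items():
        inverted.setdefault(p, [])
        for q in targets:
            inverted.setdefault(q, []).append(p)
    return inverted

def inverse_pagerank(out_links, beta=0.85, iterations=20):
    """Run PageRank on the inverted graph to favor pages with broad out-link coverage."""
    return pagerank(invert(out_links), beta, iterations)

# Hypothetical graph: "hub" links out to three pages, so it covers the most.
scores = inverse_pagerank({"hub": ["a", "b", "c"]})
print(max(scores, key=scores.get))
```

Pages with the highest inverse PageRank are the preferred seed candidates, since trust planted there reaches many other pages.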

16
Algorithm

Select seeds (scores s) and order them by preference
Invoke the oracle (human) on the first L seeds
Initialize d from the oracle responses and normalize it
Compute the TrustRank score (as in the PageRank
formula): t = β·T·t + (1 − β)·d
T is the transition matrix of the Web graph (the
adjacency matrix normalized by out-degree).
β is the dampening factor (usually 0.85).
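
The steps above can be sketched as follows, assuming the graph is a dictionary of out-links, the oracle is a callable returning True for good pages, and the iteration starts from t = d (the function names and the toy graph are illustrative):

```python
def seed_distribution(pages, ordered_candidates, oracle, L):
    """Ask the oracle about the first L candidates and normalize the good ones into d."""
    d = {p: 0.0 for p in pages}
    for p in ordered_candidates[:L]:
        if oracle(p):
            d[p] = 1.0
    total = sum(d.values())
    return {p: v / total for p, v in d.items()} if total else d

def trustrank(out_links, pages, d, beta=0.85, iterations=20):
    """Iterate t = beta * T * t + (1 - beta) * d for a fixed number of steps."""
    t = dict(d)
    for _ in range(iterations):
        nxt = {p: (1.0 - beta) * d[p] for p in pages}
        for p in pages:
            targets = out_links.get(p, [])
            if targets:
                # Trust is split evenly over out-links and dampened by beta.
                share = beta * t[p] / len(targets)
                for q in targets:
                    nxt[q] += share
        t = nxt
    return t

# Hypothetical three-page Web: "spam" links to "good", but nothing links to "spam".
pages = ["good", "ok", "spam"]
links = {"good": ["ok"], "ok": ["good"], "spam": ["good"]}
d = seed_distribution(pages, ["good", "spam"], lambda p: p == "good", L=2)
t = trustrank(links, pages, d)
print(t)
```

Because trust only flows along out-links from good seeds, the spam page ends with a score of zero no matter how many links it adds to good pages.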

17
Algorithm - example

s = [0.08, 0.13, 0.08, 0.10, 0.09, 0.06, 0.02]
Ordering: 2, 4, 5, 1, 3, 6, 7
L = 3: {2, 4, 5}; d = [0, 0.5, 0, 0.5, 0, 0, 0]
β = 0.85, M = 20
t = [0, 0.18, 0.12, 0.15, 0.13, 0.05, 0.05]
NB: max = 0.18
Issues with pages 1 and 5
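
A minimal sketch reproducing the ordering and the vector d from the example. It assumes the oracle judged pages 2 and 4 good and page 5 not good, which is consistent with the d shown but is not stated on the slide:

```python
# Seed-desirability scores for pages 1..7, as given in the example.
s = [0.08, 0.13, 0.08, 0.10, 0.09, 0.06, 0.02]

# Order page numbers by decreasing score; ties keep their original order.
order = sorted(range(1, len(s) + 1), key=lambda page: -s[page - 1])

L = 3
judged = order[:L]

# Assumption: the oracle's verdicts on the judged pages.
oracle_good = {2, 4}
good = [page for page in judged if page in oracle_good]

# Normalize so the entries of d sum to 1.
d = [1.0 / len(good) if page in good else 0.0 for page in range(1, len(s) + 1)]

print(order)
print(d)
```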

18
Issues with TrustRank

Coverage of the seed set may not be broad enough
Many different topics exist, each with good pages
TrustRank has a bias towards communities that are
heavily represented in the seed set
inadvertently helps spammers that fool these
communities

19
Bias towards larger partitions

Divide the seed set into n partitions; partition i
has m_i nodes
t_i: TrustRank score calculated by using
partition i as the seed set
t: TrustRank score calculated by using all the
partitions as one combined seed set
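
The bias can be made explicit. As a sketch, assume the partitions are disjoint, every seed carries equal weight within its seed set, and d_i denotes the normalized seed vector of partition i (notation introduced here, not on the slide). Since the TrustRank formula is linear in d, the combined seed vector and the combined score decompose the same way:

```latex
d = \sum_{i=1}^{n} \frac{m_i}{\sum_{j=1}^{n} m_j}\, d_i
\quad\Longrightarrow\quad
t = \sum_{i=1}^{n} \frac{m_i}{\sum_{j=1}^{n} m_j}\, t_i
```

Under these assumptions, a partition with more seeds contributes proportionally more to t, regardless of how trustworthy its community actually is.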

20
Basic ideas

Use pages labeled with topics as seed pages
Pages listed in highly regarded topic directories
Trust should be propagated by topics
A link between two pages is usually created in a
topic-specific context

21
Topical TrustRank

Topical TrustRank
Partition the seed set into topically coherent
groups
TrustRank is calculated for each topic
The final ranking is generated by a combination of
these topic-specific trust scores
Note:
TrustRank is essentially biased PageRank
Topical TrustRank is fundamentally the same as
Topic-Sensitive PageRank, but used for demoting spam
22
Combination of trust scores

Simple summation
The default mechanism just seen
Quality bias
Each topic is weighted by a bias factor
Summation of these weighted topic scores
One possible bias: the average PageRank value of the
topic's seed pages
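
Both combination schemes can be sketched with one function, assuming per-topic trust scores are stored as dictionaries and that omitting the weights means simple summation (the function name, topic names and numbers are illustrative):

```python
def combine_topic_scores(topic_scores, topic_weights=None):
    """Sum per-topic trust scores, optionally weighting each topic by a bias factor."""
    if topic_weights is None:
        # Simple summation: every topic counts equally.
        topic_weights = {topic: 1.0 for topic in topic_scores}
    combined = {}
    for topic, scores in topic_scores.items():
        for page, score in scores.items():
            combined[page] = combined.get(page, 0.0) + topic_weights[topic] * score
    return combined

# Hypothetical per-topic trust scores for three pages.
topic_scores = {
    "travel": {"a": 0.5, "b": 0.25},
    "health": {"a": 0.25, "c": 0.5},
}
print(combine_topic_scores(topic_scores))
# Quality bias, e.g. a weight derived from the average PageRank of each topic's seeds.
print(combine_topic_scores(topic_scores, {"travel": 2.0, "health": 1.0}))
```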

23
Further Improvements

Seed Weighting
Instead of assigning an equal weight to each seed
page, assign a weight proportional to its quality
/ importance
Seed Filtering
Filtering out low quality pages that may exist in
topic directories
Finer topics
Lower layers of the topic directory
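
Seed weighting and seed filtering can be sketched together, assuming each seed comes with some numeric quality estimate and that filtering means dropping seeds at or below a threshold (the function name, the `min_quality` parameter and the sample values are hypothetical):

```python
def weighted_seed_distribution(seed_quality, min_quality=0.0):
    """Build a seed vector proportional to quality, after dropping low-quality seeds."""
    kept = {page: q for page, q in seed_quality.items() if q > min_quality}
    total = sum(kept.values())
    # Normalize so the weights sum to 1; return an empty dict if nothing survives.
    return {page: q / total for page, q in kept.items()} if total else {}

# Hypothetical quality estimates for three directory pages.
quality = {"a": 3.0, "b": 1.0, "c": 0.2}
print(weighted_seed_distribution(quality, min_quality=0.5))
```

The resulting dictionary would take the place of the uniform seed vector d when computing TrustRank for a topic.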
