Improving Revenue by System Integration and Cooperative Optimization

Provided by: SvenGroth2

1
(No Transcript)
2
Introduction
Why Peer-to-Peer Web Search?
Vision: a self-organizing P2P Web search engine with Google-or-better functionality
  • Proof of concept for scalable self-organizing data structures and algorithms (e.g., DHTs, randomized overlay networks, epidemic spreading)
  • Testbed for CS models, algorithms, technologies, and experimental platform
  • Better search result quality (precision, recall, etc.)
  • Powerful search methods for each peer (concept-based search, query expansion, personalization, etc.)
  • Leverage intellectual input at each peer (bookmarks, feedback, query logs, click streams, evolving Web, etc.)
  • Collaboration among peers (query routing, incentives, fairness, anonymity, etc.)
  • Breaking information monopolies

3
Introduction
What Google Can't Do
Killer queries (disregarding NLP QA, multilingual, multimedia), e.g.:
"drama with three women making a prophecy to a British nobleman that he will become king"
4
Introduction
Outline
Vision ✓
Demo
Efficient Top-k Search
Ontology-based Query Expansion
Exploiting User Behavior
Isolating Selfish Peers
5
Introduction
Outline
Vision ✓
Demo ✓
Efficient Top-k Search
Ontology-based Query Expansion
Exploiting User Behavior
Isolating Selfish Peers
6
Efficient Top-k Search
Efficient Top-k Search
TA: efficient, principled top-k query processing with monotonic score aggregation.
TA with sorted access only (NRA) (Fagin 01, Güntzer/Kießling/Balke 01): scan index lists Li
consider d at position pos_i in L_i:
  E(d) := E(d) ∪ {i}; high_i := s(t_i, d)
  worstscore(d) := aggr{ s(t_ν, d) | ν ∈ E(d) }
  bestscore(d) := aggr( worstscore(d), aggr{ high_ν | ν ∉ E(d) } )
  if worstscore(d) > min-k then
    add d to top-k; min-k := min{ worstscore(d') | d' ∈ top-k }
  else if bestscore(d) > min-k then
    cand := cand ∪ {d}
  threshold := max{ bestscore(d') | d' ∈ cand }
  if threshold ≤ min-k then exit
Data items d1, ..., dn; e.g. s(t1,d1) = 0.7, ..., s(tm,d1) = 0.2
Query q = (t1, t2, t3), k = 1
  • Ex. Google: > 10 mio. terms, > 8 bio. docs, > 4 TB index

Index lists (sorted by score):
t1: d78: 0.9, d23: 0.8, d10: 0.8, d1: 0.7, d88: 0.2
t2: d64: 0.8, d23: 0.6, d10: 0.6, ..., d78: 0.1
t3: d10: 0.7, d78: 0.5, d64: 0.4, d99: 0.2, d34: 0.1

Scan depth 1:
Rank  Doc  Worst-score  Best-score
1     d78  0.9          2.4
2     d64  0.8          2.4
3     d10  0.7          2.4

Scan depth 2:
Rank  Doc  Worst-score  Best-score
1     d78  1.4          2.0
2     d23  1.4          1.9
3     d64  0.8          2.1
4     d10  0.7          2.1

Scan depth 3:
Rank  Doc  Worst-score  Best-score
1     d10  2.1          2.1
2     d78  1.4          2.0
3     d23  1.4          1.8
4     d64  1.2          2.0
→ threshold ≤ min-k: STOP!
7
Probabilistic Pruning
Probabilistic Pruning of Top-k Candidates
TA family of algorithms based on the invariant (with sum as aggr):
  worstscore(d) ≤ score(d) ≤ bestscore(d)
  • Add d to top-k result, if worstscore(d) > min-k
  • Drop d only if bestscore(d) < min-k, otherwise keep in PQ
→ often overly conservative (deep scans, high memory for PQ)
[Figure: worstscore(d) and bestscore(d) bounds narrowing with scan depth; d is dropped from the priority queue once bestscore(d) < min-k]
  • Approximate top-k with probabilistic guarantees:
    discard candidates d from the queue if p(d) = P[score(d) > min-k] ≤ ε
    → E[rel. precision@k] ≥ 1 − ε
8
Experiments with TREC-12 Web Track
Experiments with TREC-12 Web-Track Benchmark
on .GOV corpus from TREC-12 Web track: 1.25 mio. docs (html, pdf, etc.)
  • 50 keyword queries, e.g.:
    • "Lewis Clark expedition"
    • "juvenile delinquency"
    • "legalization Marihuana"
    • "air bag safety reducing injuries death facts"

                     TA-sorted   Prob-sorted (smart)
sorted accesses      2,263,652   527,980
elapsed time [s]     148.7       15.9
max queue size       10,849      400
relative precision   1           0.87
rank distance        0           39.5
score error          0           0.031
9
Introduction
Outline
Vision ✓
Demo ✓
Efficient Top-k Search ✓
Ontology-based Query Expansion
Exploiting User Behavior
Isolating Selfish Peers
10
Query Expansion
Query Expansion
Threshold-based query expansion: substitute w by (c1 ... ck) with all ci for which sim(w, ci) ≥ θ
Old hat in IR; highly disputed for the danger of topic dilution.
  • Approach to careful expansion:
    • determine phrases from the query or the best initial query results (e.g., forming 3-grams and looking up ontology/thesaurus entries)
    • if uniquely mapped to one concept, then expand with synonyms and weighted hyponyms
    • alternatively use statistical learning methods for word sense disambiguation
Problem: choice of threshold θ
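The substitution rule above can be sketched in a few lines; the similarity table here is hypothetical toy data standing in for an ontology/thesaurus lookup:

```python
def expand_query(query_terms, sim, threshold):
    """Threshold-based expansion: replace each term w by the disjunction
    (c1 OR ... OR ck) of all related concepts ci with sim(w, ci) >= threshold.
    `sim` is assumed to map (term, concept) pairs to similarities, as an
    ontology/thesaurus would provide."""
    expanded = {}
    for w in query_terms:
        related = [c for (u, c), s in sim.items() if u == w and s >= threshold]
        expanded[w] = sorted(set([w] + related))   # w itself always survives
    return expanded
```

Raising the threshold shrinks the disjunctions; the hard part, as noted above, is choosing θ well.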
11
Query Expansion Example
Query Expansion Example
From TREC 2004 Robust Track
Title: International Organized Crime
Description: Identify organizations that participate in international criminal activity, the activity, and, if possible, collaborating organizations and the countries involved.
Query (each term annotated as (weight | sim)): international (0.145|1.00), META (1.00|1.00) {gangdom (1.00|1.00), gangland (0.742|1.00), "organ (0.213|1.00) crime (0.312|1.00)", camorra (0.254|1.00), maffia (0.318|1.00), mafia (0.154|1.00), "sicilian (0.201|1.00) mafia (0.154|1.00)", "black (0.066|1.00) hand (0.053|1.00)", mob (0.123|1.00), syndicate (0.093|1.00), organ (0.213|1.00), crime (0.312|1.00), collabor (0.415|0.20), columbian (0.686|0.20), cartel (0.466|0.20), ...}
  • 135,530 sorted accesses in 11.073 s.
  • Results
  • Interpol Chief on Fight Against Narcotics
  • Economic Counterintelligence Tasks Viewed
  • Dresden Conference Views Growth of Organized
    Crime in Europe
  • Report on Drug, Weapons Seizures in Southwest
    Border Region
  • SWITZERLAND CALLED SOFT ON CRIME
  • ...

12
Top-k with Query Expansion
Top-k Query Processing with Query Expansion
consider the expandable query (algorithm, performance) with score
score(d) = Σ_{i ∈ q} max_{j ∈ onto(i)} { sim(i,j) · s_j(d) }
dynamic query expansion with incremental on-demand merging of additional index lists
[Figure: B+ tree index on terms plus thesaurus/meta-index; the meta-index entry for "performance" lists response time (0.7), throughput (0.6), queueing (0.3), delay (0.25), ...; the index lists for the query terms and their expansions are merged incrementally on demand]
→ much more efficient than threshold-based expansion: no threshold tuning, no topic drift
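The expanded scoring function can be written out directly; the ontology mapping and similarity values below are illustrative assumptions (each term's onto-set is assumed to contain the term itself with similarity 1):

```python
def expanded_score(query, onto, sim, doc_scores):
    """score(d) = sum over query terms i of
                  max over j in onto(i) of sim(i, j) * s_j(d).
    doc_scores[j] gives s_j(d) for one document d; terms absent from
    the document contribute a score of 0."""
    total = 0.0
    for i in query:
        total += max(sim[(i, j)] * doc_scores.get(j, 0.0) for j in onto[i])
    return total
```

Because only the best-matching expansion term per query term contributes, adding weakly similar terms cannot dilute the topic, which is what makes untuned expansion safe here.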
13
Experiments with TREC-13 Robust Track
Experiments with TREC-13 Robust-Track Benchmark
on AQUAINT corpus (news articles): 528,000 docs, 2 GB raw data, 8 GB for all indexes
50 most difficult queries, e.g.:
  • "transportation tunnel disasters"
  • "Hubble telescope achievements"
potentially expanded into:
  • earthquake, flood, wind, seismology, accident, car, auto, train, ...
  • astronomical, electromagnetic radiation, cosmic source, nebulae, ...

                   no exp.    static exp.      static exp.      incr. merge
                   (ε=0.1)    (θ=0.3, ε=0.0)   (θ=0.3, ε=0.1)   (ε=0.1)
sorted acc.        1,333,756  10,586,175       3,622,686        5,671,493
random acc.        0          555,176          49,783           34,895
elapsed time [s]   9.3        156.6            79.6             43.8
max #terms         4          59               59               59
relative prec.     0.934      1.0              0.541            0.786
precision@10       0.248      0.286            0.238            0.298
MAP                0.091      0.111            0.086            0.110
with Okapi BM25 probabilistic scoring model
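The experiments rely on the Okapi BM25 probabilistic scoring model; a standard per-term weight can be sketched as below (k1 = 1.2 and b = 0.75 are the usual textbook defaults, not values reported for this experiment):

```python
import math

def bm25_weight(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
    """Standard Okapi BM25 weight of one term in one document:
    tf = term frequency in the document, df = document frequency,
    N = collection size, dl/avgdl = document length normalization."""
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
```

The document score is then the sum of these weights over the (possibly expanded) query terms.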
14
Introduction
Outline
Vision ✓
Demo ✓
Efficient Top-k Search ✓
Ontology-based Query Expansion ✓
Exploiting User Behavior
Isolating Selfish Peers
15
Exploiting User Behavior
Exploiting Query Logs and Click Streams
from PageRank: uniformly random choice of links + random jumps
Authority(page q) = stationary probability of visiting q
16
Exploiting User Behavior
Exploiting Query Logs and Click Streams
from PageRank (uniformly random choice of links + random jumps)
to QRank: query-doc transitions, query-query transitions, and doc-doc transitions on implicit links (w/ thesaurus), with probabilities estimated from log statistics
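The QRank idea amounts to running the PageRank random walk over a combined graph of documents and queries; a minimal power-iteration sketch is shown below, where the transition probabilities and jump parameter are illustrative assumptions (the real ones come from log statistics):

```python
def qrank(nodes, transitions, jump=0.15, iters=50):
    """Stationary distribution of a random walk over a combined graph of
    docs and queries: with prob. (1 - jump) follow an outgoing edge
    (hyperlink, query->doc click, query->query refinement, or implicit
    doc->doc edge) with its estimated probability, else jump uniformly.
    PageRank is the special case with hyperlink edges only."""
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: jump / n for v in nodes}
        for v in nodes:
            out = transitions.get(v, {})
            if not out:                      # dangling node: spread uniformly
                for u in nodes:
                    nxt[u] += (1 - jump) * rank[v] / n
            else:
                for u, p in out.items():     # p: log-estimated transition prob.
                    nxt[u] += (1 - jump) * rank[v] * p
        rank = nxt
    return rank
```

Adding click and refinement edges shifts probability mass toward pages users actually favor, which is what the blind-test comparison below measures.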
17
Exploiting User Behavior
Preliminary Experiments
Setup: 70,000 Wikipedia docs, 18 volunteers posing Trivial-Pursuit queries; ca. 500 queries, ca. 300 refinements, ca. 1,000 positive clicks; ca. 15,000 implicit links based on doc-doc similarity
  • Results (assessment by blind-test users):
    • QRank top-10 result preferred over PageRank in 81% of all cases
    • QRank has 50.3% precision@10, PageRank has 33.9%

Untrained example query: "philosophy"

Rank  PageRank                   QRank
1.    Philosophy                 Philosophy
2.    GNU free doc. license      GNU free doc. license
3.    Free software foundation   Early modern philosophy
4.    Richard Stallman           Mysticism
5.    Debian                     Aristotle
18
Introduction
Outline
Vision ✓
Demo ✓
Efficient Top-k Search ✓
Ontology-based Query Expansion ✓
Exploiting User Behavior ✓
Isolating Selfish Peers
19
Self-Organization for Isolating Selfish Peers
Collaborative P2P Search
20
Self-Organization for Isolating Selfish Peers
Self-Organization for Isolating Selfish Peers
  • Rationale:
    • mimic evolution in biological / social networks
    • tag selfish vs. altruistic peers and bias interactions towards similar peers

Algorithm (periodically do):
  each peer compares its utility with a random peer
  if the other peer has higher utility
    then copy that peer's strategy and links (reproduction)
  with small probability, mutate (change behavior, change links)
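The evolutionary step above can be sketched as a small simulation; the peer representation (a dict with a selfishness parameter P, links, and a hit-count utility) and the numeric parameters are illustrative assumptions, not the talk's actual simulator:

```python
import random

def evolve(peers, rounds, mutate_p=0.05, rng=random):
    """Evolutionary step from the slide: each peer compares its utility
    with a random peer, copies the better peer's strategy and links
    (reproduction), and mutates with small probability.  Each peer is a
    dict with 'P' (prob. of ignoring a query: 1.0 = selfish,
    0.0 = altruistic) and 'links'; utility = hits (queries answered)."""
    for _ in range(rounds):
        for p in peers:
            # utility: answered queries out of 10 incoming ones
            p["utility"] = sum(1 for _ in range(10) if rng.random() > p["P"])
        for p in peers:
            other = rng.choice(peers)
            if other["utility"] > p["utility"]:
                p["P"] = other["P"]                 # copy strategy
                p["links"] = list(other["links"])   # copy links
            if rng.random() < mutate_p:
                p["P"] = rng.random()               # mutate behavior
    return peers
```

Because altruistic peers answer more queries and are therefore imitated, the population's average selfishness P drifts downward, matching the simulation result on the next slide.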
21
Self-Organization for Isolating Selfish Peers
Simulation Results for P2P File Sharing
  • peers generate queries and answer queries based on P ∈ [0,1]
  • with extreme behaviors: selfish (P = 1.0) and altruistic (P = 0.0)
  • peer utility = hits (queries answered)
  • mutation: change P randomly
[Plot: typical run for 10^4 peers; average queries generated and hits per node (y-axis 0-60) over cycles 0-100; selfishness is reduced while average performance increases]

22
The End
Thank you!