Opportunities and Challenges of Web Search and Mining

Original file title: Ongoing Research. Author: Lee-Feng Chien. Last modified by: wkd. Created: 4/24/2002 1:15:34 PM. Format: PowerPoint presentation.

1
Opportunities and Challenges of Web Search and
Mining

Lee-Feng Chien

Academia Sinica / National Taiwan University
2
Outline

Web SE
Inside SE
Google's Business Models
Google's Impacts
Recent Development
Next-Generation WSE
Web Mining

3
WSE Google
Globalization!
4
WSE Google
5
Problems of WSE
Inside WSE: Fast, Coverage, Accuracy
6
Problems of WSE
Inside WSE: Fast, Coverage, Accuracy
Business: Profitable, Models, Competition
7
Problems of WSE
Business: Profitable, Models, Competition
Inside WSE: Fast, Coverage, Accuracy
Impacts: Web Computing, Knowledge Windows, New Paradigm of Civilization
8
I. Some Must-Know Statistics
9
Online Language Populations

Source: Global Reach (global-reach.biz/globstats)

10
Top Ten Languages in the Web
TOP TEN LANGUAGES IN THE INTERNET

Language                Internet users   Average           World population   Share of total
                        by language      penetration (%)   estimate           Internet users (%)
English                 287,369,520      26.2              1,098,654,265      35.9
Chinese                 105,484,112       8.0              1,321,669,200      13.2
Japanese                 66,548,060      52.1                127,853,600       8.3
German                   54,035,201      56.3                 95,893,300       6.8
Spanish                  53,670,063      13.9                386,413,200       6.7
French                   35,034,269       9.3                375,164,185       4.4
Korean                   30,670,000      41.0                 74,730,000       3.8
Italian                  28,610,000      49.3                 57,987,100       3.6
Portuguese               23,058,254      10.3                224,664,100       2.9
Dutch                    13,657,170      56.6                 24,125,950       1.7
TOP TEN LANGUAGES       698,353,773      18.4              3,787,154,900      87.3
Rest of the languages   101,686,725       3.9              2,602,992,587      12.7
WORLD TOTAL             800,040,498      12.5              6,390,147,487     100.0
More and more non-English users!

Source: Internet World Stats

11
Web Content
More and more non-English pages
Source: Network Wizards, Jan 99 Internet Domain Survey
12
Web Users and Pages (5 years ago)
Challenge of scalability!
Chinese users: 110M, including 87M (CN), 4.9M (HK), 8.8M (TW), 2.14M (SG), and others.
Source: Global Reach, 2004
13
Number of Chinese Web Pages
573,000,000 pages
Scalability Problem !
14
Number of Web Pages
Billions of textual documents indexed, as of Sept. 2, 2003
The world's largest search engine?

4,285,199,774 pages (Google)
4.28 billion Web pages, 880 million images, and
other documents

Key: GG = Google, ATW = AllTheWeb, INK = Inktomi, TMA = Teoma, AV = AltaVista.
Source: Search Engine Watch
15
The top 10 Internet trends 2004 predicted by
eOneNet.com

1. World Internet population will continue to
grow at an exponential rate, with China taking
the lead in Asia having more than 100 million
Internet users.
2. Broadband Internet penetration will
continue to grow with China and the US in the lead
with an expected growth rate exceeding 30% each.
3. Online retail sales will still be led by
the US with an expected revenue exceeding US$80
billion.
4. Paid search will account for the biggest
online ad spending. With the successful paid
search business models of Google and Overture,
more search engines will offer paid search
advertising.

16
The top 10 Internet trends 2004 predicted by
eOneNet.com

5. Spam will increase at least 20% despite
the new US anti-spam law. The US legislators will
be forced to consider amending the anti-spam law
from an opt-out law to an opt-in law.
6. Ads placed in opt-in email newsletters will
increase 25% as legitimate marketers find this is
the easier way to comply with the anti-spam law
and a better way of targeting customers.
7. Rich media will continue to be hot. More
than 25% of online ads served will contain rich
media content.

17
The top 10 Internet trends 2004 predicted by
eOneNet.com

8. 20% more small businesses will develop
their own websites or use the Internet as a sales
and marketing channel.
9. Online entertainment will grow at a
rapid pace, with more sites offering videos and
digital music download services.
10. The Internet boom will revive with more
Internet companies going for IPO both in the US
and in Asia, in particular kicked off by the most
anticipated Google IPO in Spring.

18
II. Inside WSE
19
Components

Crawler/Spider
Index Server
Query Server
Document Delivery

20
Architecture
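[Figure: search engine architecture. Labeled components: Web, Spider, Archive, Indexer, Index, SE (replicated search servers, each with its own Index), Browser, Log. Annotations (1)-(5): 1B queries/day, 5B pages, quality results, freshness, spam, scalable.]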
21
Spider

Get all Pages from the Web
Web Traverse
Challenges
Performance, e.g., pages per PC
Coverage
Currency
Spam Filtering
Hidden Web
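
The "Web Traverse" step can be pictured as a breadth-first walk over the link graph. The following is only a minimal sketch under stated assumptions: `crawl`, `fetch_links`, and the toy in-memory `toy_web` are hypothetical names introduced for illustration, and real crawlers add politeness, re-crawl scheduling (currency), and spam filtering on top of this.

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=1000):
    """Breadth-first traversal of a link graph.

    fetch_links(url) is a caller-supplied (hypothetical) function that
    returns the list of URLs linked from `url`.
    """
    seen = set(seed_urls)          # URLs already queued or visited
    frontier = deque(seed_urls)    # FIFO queue gives breadth-first order
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:   # never queue the same page twice
                seen.add(link)
                frontier.append(link)
    return order

# Toy in-memory "Web" standing in for real HTTP fetches.
toy_web = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
visited = crawl(["a"], lambda u: toy_web.get(u, []))
```

The `max_pages` cap is a stand-in for the coverage/performance trade-off listed above: a real spider has to decide which part of the frontier is worth fetching.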

22
Index Server

Index occurrences of all words in the pages
Data Cleanness
Challenges
Space overhead, pages per PC
Incremental
Scalability: distributed processing
Multiple Languages
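
As an illustration of "index occurrences of all words in the pages", here is a small sketch of a positional inverted index. It assumes whitespace tokenization and lowercase folding, which is a simplification (it would not work for unsegmented languages such as Chinese); `build_inverted_index` is a name made up for this example.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each word to {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in pages.items():
        for pos, word in enumerate(text.lower().split()):
            # Record every occurrence, not just presence, so that
            # phrase and proximity search remain possible later.
            index[word].setdefault(doc_id, []).append(pos)
    return index

pages = {1: "web search engine", 2: "web mining and web search"}
index = build_inverted_index(pages)
```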

23
System Anatomy
24
Data Structure
Lexicon: fits in memory; stored in two different forms
Hit list: accounts for most of the space; each hit uses 2 bytes to save space
Forward index: barrels are sorted by wordID; inside a barrel, entries are sorted by docID
Inverted index: same content as the forward index, but sorted by wordID; each doc list is sorted by docID
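
To make the "2 bytes per hit" point concrete, the sketch below packs one hit into a 16-bit integer. The slide only states the 2-byte budget; the 1-bit capitalization / 3-bit font size / 12-bit position split is taken from the plain-hit layout in Brin and Page's published description of the early Google system, and the clamping of long positions is an assumption of this example.

```python
def pack_hit(capitalized, font_size, position):
    """Pack one word occurrence into 16 bits: 1 + 3 + 12."""
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    position = min(position, 4095)  # assumption: clamp positions beyond 12 bits
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_hit(hit):
    """Inverse of pack_hit: returns (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF
```

Every packed value stays below 65536, so a hit list can be stored as a flat array of unsigned 16-bit integers.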
25
Query Server

Search Relevant URLs for queries via looking up
indices
Challenges
Speed, e.g., queries per second
Functions supported
Localization
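
Looking up indices for a multi-word query usually comes down to intersecting docID lists. A minimal sketch, assuming each posting list is already sorted by docID (as the data-structure slide describes) and that all query words must match; `intersect`, `and_query`, and `toy_index` are illustrative names only.

```python
def intersect(postings_a, postings_b):
    """Merge-intersect two docID lists sorted in ascending order."""
    i = j = 0
    out = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            out.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(index, terms):
    """Return docIDs containing every term; shortest lists are merged first."""
    lists = sorted((index.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result

toy_index = {"web": [1, 2, 4, 7], "search": [2, 3, 4], "mining": [4, 7]}
```

Starting from the shortest list keeps intermediate results small, which matters for the queries-per-second challenge.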

26
PageRank
27
PageRank (Cont.)

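Let $u$ be a web page and let $B_u$ be the set of pages that point to $u$.
Let $N_u$ be the number of links from $u$ and let $c$ be a factor used for
normalization; then a simplified version of PageRank is

$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$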
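
The simplified PageRank can be computed by repeated propagation: each page v passes R(v)/N_v along each of its out-links, and the result is rescaled so the ranks sum to 1 (this rescaling plays the role of c). The sketch below assumes every page has at least one out-link and that every link target is itself a key of `out_links`; it does not handle rank sinks or the damping used in the full algorithm.

```python
def simplified_pagerank(out_links, iterations=50):
    """Iterate R(u) = c * sum over v in B_u of R(v) / N_v on a small graph."""
    pages = list(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in out_links.items():
            for u in targets:
                new_rank[u] += rank[v] / len(targets)  # R(v) / N_v
        total = sum(new_rank.values())
        rank = {p: r / total for p, r in new_rank.items()}  # c = 1 / total
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = simplified_pagerank(graph)
```

On this toy graph the ranks settle near a = 0.4, b = 0.2, c = 0.4: page b receives only half of a's rank, while c collects from both a and b.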

28
Search Functions

Phrase search, e.g. "petite galerie"
Truncation, e.g. librar*, wom*n
Constraining search, e.g. title:"The Wall Street Journal"
Proximity search, e.g. gold near silver
Boolean, e.g. noir film -"pinot noir"
Parentheses and Nested Boolean, e.g. silver and
not (gold or platinum)
Limit search, e.g. limit by date range
Capitalization, e.g. turkey vs. Turkey
Ranking fields and refine search
LiveTopics
Translate Service
Other

29
Document Delivery

Bottleneck of Bandwidth
Presentation
Caching
Queries, Search Results
Akamai Model

30
III. Business
31
What is Google?

Specialized web search engine
Founded in 1998 by 2 graduate students at
Stanford University (Larry Page and Sergey Brin)
Provides a comprehensive, relevant, and
easy-to-use web search and browsing service (free)

Google's features: fast, unbiased, and accurate
results; access to over 4 billion web
pages and over 800 million images (most
important valid web pages)

32
Company Facts
Employees: 1,300
Languages spoken: 34
Worldwide offices: 21 (mostly in US and Europe)
Annual revenues: $900m
33
Google Revenue

Revenue (an e-business)
½ from selling relevant text-based ads
(sponsored links near search results)
½ from licensing its search technology to
companies like Yahoo

Source: Eric Schmidt interview, PCWorld.com (January 30, 2002)

34
Sources of Revenue

AdWords (150,000 advertisers): sponsored-link ads
Cost-per-click pricing: advertisers pay only when people click
on the link
-- Advertising is extremely cheap and
effective
e.g., Edmunds.com spent $250,000 a month on
advertising because every $1 spent generated $1.70.
Google Search Appliance
An integrated hardware/software solution that
extends the power of Google to corporate
intranets and web servers
-- Customers include Cisco Systems, Sony,
Procter & Gamble, Sun Microsystems, etc.

35
Challenges (cont.)

Easy entry into the Search Engine Industry
Lack of customer lock-in (vs. Microsoft)
Google will focus on creating services to
voluntarily draw in customers
Large, well-known competitors are focusing on
in-house search technology (Yahoo, Microsoft,
AOL, eBay, Amazon)
Customers are becoming competitors (Yahoo, AOL)

36
Competitors: eBay and Amazon

eBay (www.ebay.com): E-commerce
Web-based marketplace in which a community of
buyers and sellers is brought together to
browse, buy, and sell various items
-- Business revenue: charges fees on proceeds:
5% on $0.01-$25, 2.5% on $25-$1,000, 1.25% over
$1,000
Amazon (www.amazon.com): E-commerce
A customer-centric company that sells a range
of products that it purchases from manufacturers
and distributors

37
Competitors: Microsoft and Yahoo

Microsoft is developing its own search engine
-- Can lasso users into its search engine
through its operating system
-- Has the brainiacs to implement top-of-the-line
search engine technology
Yahoo was a customer of Google (may now become
Google's biggest competitor)
-- Offers placement under sponsored links and
within actual results (unethical)

38
IV. Impacts
39
Impacts

Web Computing
Knowledge Windows
New Web OS

40
Web Computing

Faster than local search
Very-large scale of computing systems
Understand global users' behaviors
Acquire global information sources

41
Web Computing

Local disc or global disc?
Personal information management?
Gmail
Photo search

42
Knowledge Windows

Windows of Information Search
Alliance with online databases
Windows of Personal Knowledge Management
Knowledge Windows

43
New Web OS

Merged with Linux OS
Software download from end-users
Information Service OS

44
V. New Gen. of WSE
45
Advanced Google

Is Google good enough?
Takano
Takano NII
Takano NII Japan
More about Google Services
http://www.google.com/options/

46
New Features in Google

Google Labs: http://labs.google.com/
Google Desktop Search
Searching text, Web, Word, Excel, PowerPoint,
Outlook, AOL Instant Messenger
Google SMS
Searching phone book, dictionary, product prices,
Google Print
Searching books

48
Other Search Tools

A9.com (by Amazon)
Bookmark, history, discover, diary
Books, movies,
Clusty.com (by Vivisimo)
Clustering engine
Snap.com (by Idealab)
Sorting by popularity, satisfaction, Web
popularity, Web satisfaction, domain,
Alexa.com (by Amazon)
Average user review ratings,
Others: Yahoo, AskJeeves, AOL Search, HotBot,
MSN, Netscape, Lycos, Altavista, LookSmart,
Gigablast, Overture, About, FindWhat, Teoma,
InformSearch,

49
Clusty.com
50
Example on Vivisimo
51
Vivisimo (cont.)
52
New Directions

Personalization
Photo search, email search filtering
Information Extraction
Ex.: Scholar search
Information Agent
Deep Web Search

53
VI. Web Mining
54
Web Search/Information Retrieval
Millions of Users
55
Improving Search via Mining
Millions of Users
56
Valuable Web Resources
Knowledge Discovery
Hyperlinks, anchor texts, search-result pages, query logs, query session logs, click-stream logs, deep Web, ...
Web logs, texts, images,
Millions of Users
57
Discovered Knowledge
Knowledge Discovery
Users' preferences/needs: topic, location, timing, ...
Authority/popularity: site, file, people, company, product, ...
Clusters/associations/relations: site, page, people, company, product, query, ...
Web logs, texts, images,
Millions of Users
58
Web Mining for IR
Knowledge Discovery
Search, classification, clustering, cross-language IR, information extraction, text mining, filtering
Web logs, texts, images,
Millions of Users
59

CS 276 / LING 239I: Information Retrieval and Web
Mining
Prabhakar Raghavan and Hinrich Schütze
Course Description
Basic and advanced techniques for text-based
information systems: efficient text indexing;
Boolean, vector space, and probabilistic
retrieval models; evaluation and interface
issues; Web search including crawling, link-based
algorithms, and Web metadata; text/Web
clustering, classification; wrapper, information
extraction, and collaborative filtering systems;
text mining. Projects can be chosen from diverse
topics in information retrieval.

60
Computational Linguistics, Volume 29, Issue 3,
September 2003.
61
Research at Web Knowledge Discovery Lab
62
Research at Web Knowledge Discovery Lab

Live series
LiveTrans
SIGIR04, ACL04, JCDL04
ACM Trans. on Information Systems, 2004
Online Translation of unknown queries via Web
LiveClassifier
WWW04, IJCNLP04
ACM Trans. on Asian Language Information Processing (TALIP), 2004
Training classifiers and classifying short text
via Web

63
Research at Web Knowledge Discovery Lab

LiveCluster
CIKM04
ACM Trans. on Information Systems, 2004
Generating taxonomy from terms or documents

64
LiveTrans: Cross-language Web Search
65
LiveClassifier: Classifying search results into a
user-defined classification tree
66
LiveClassifier: Paper Title Categorization
Note: no labeled training data
67
LiveCluster: Taxonomy Generation
68
Terms Clustering
69
Query Clustering
71
Outline

Translating Unknown Queries (SIGIR04)
Training Text Classifiers (WWW04)
Generating Taxonomy/Topic Hierarchies (TOIS04)

72
Translating Unknown Queries

Anchor Text Mining
Probabilistic Modeling (ACM TALIP02)
Transitive Translation (ACM TOIS04)
Search-Result Page Mining
Translation Extraction & Selection (JCDL04)
CLIR & Other Applications (SIGIR04, ACL04)

Note: First work dealing with online translation
73
Introduction (cont.)

Bottleneck of CLIR service
Real queries are often short
Out-of-dictionary terms
and might have local variations
Ex.: proper nouns, new terminologies, ...
Need for a powerful query translation engine
Up-to-date dictionary

English Term      Chinese Translation
Digital library   ?????/?????
Banff             ??/??
Ishikawa          ???
NII Japan         ????????
louvre museum     ???
SARS              ??????????/??/??
Clinton           ???/???
Bill Gates        ????
74
Web Mining of Query Translations
[Figure: an out-of-dictionary (OOD) source term is mapped to target translations (term translation, e.g., Yahoo <-> ??) through Web mining, using anchor-text mining and search-result mining.]

Different problems for different resources

75
Anchor Text (Yahoo <-> ??)

Applies to most languages
Translation candidates are likely to appear in
the same anchor-text-set

76
Search Result Page (National Palace Museum vs.
?????)

Mixed-language characteristic in Chinese pages

77
Problems

Term extraction
Translation selection and noise reduction
Language pairs with limited corpora
Processing speed
Data cleanness (language identification)
Language independence

78
Term Extraction: SCPCD
79
Term Selection: Probabilistic Inference Model

Integrating anchor texts and link structures into
probabilistic inference model
Based on co-occurrence and page authority

Page Authority
Co-occurrence

Page Rank
80
Observation of Anchor Text
81
Observation of Anchor Text
www.yahoo.com
www.yahoo.com.tw
Source Query
Taiwan -
Yahoo
- in USA
Yahoo
82
Observation of Anchor Text
www.yahoo.com
www.yahoo.com.tw
Translation Candidates

????
??
Taiwan -
Yahoo
- in USA
?? -
??
Yahoo
Anchor-Text Set
83
Observation of Anchor Text
www.yahoo.com
www.yahoo.com.tw
Page Authority
Co-occurrence
(in-link 187)
(in-link 21)

????
??
Taiwan -
Yahoo
- in USA
?? -
??
Yahoo
84
Search Result Mining
PAT-tree-based term extraction method [Chien, SIGIR 97]
[Figure: source query -> search engine -> Web pages -> term extraction -> term selection -> target translations]
85
Term Selection

How to decide the ranking?
S and Ti frequently co-occur in the same pages
Not necessarily true for synonyms and antonyms
S and Ti: the result pages contain similar
co-occurring context terms, used as feature vectors

Query S
86
Chi-Square Test

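Chi-square test: a statistical method for
co-occurrence analysis [Gale & Church 91]

The standard 2x2 chi-square statistic over the page counts is

$$\chi^2(s,t) = \frac{N\,(ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)}$$

a = number of pages containing both terms s and t
b = number of pages containing term s but not t
c = number of pages containing term t but not s
d = number of pages containing neither term s nor t
N = the total number of pages, i.e., N = a + b + c + d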
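
A short sketch of the co-occurrence score, using the standard 2x2 chi-square statistic N(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)) over the four page counts defined on this slide. How the counts are obtained (for example, from search-engine hit counts) is not shown; the function name and the zero-denominator guard are choices made for this example.

```python
def chi_square(a, b, c, d):
    """2x2 chi-square association score between a source term s and a candidate t.

    a: pages with both s and t    b: pages with s but not t
    c: pages with t but not s     d: pages with neither
    """
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:          # a term that never (or always) occurs carries no signal
        return 0.0
    return n * (a * d - b * c) ** 2 / denom
```

For instance, `chi_square(10, 0, 0, 10)` gives 20.0 (the terms always appear together), while `chi_square(5, 5, 5, 5)` gives 0.0 (no association).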
87
Context Vector Analysis

Context vector analysis: co-occurring context
terms as feature vectors
Similarity measure: cosine measure
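
The idea can be sketched as follows: collect the words that co-occur with a term in its result snippets, then compare two terms by the cosine of their context vectors. This sketch assumes raw co-occurrence counts and whitespace tokenization; the slides do not say which weighting was used at this step, and `context_vector` and `cosine` are names invented here.

```python
import math
from collections import Counter

def context_vector(snippets, term):
    """Count the words that co-occur with `term` in its result snippets."""
    vec = Counter()
    for snippet in snippets:
        for word in snippet.lower().split():
            if word != term.lower():   # the term itself is not a context feature
                vec[word] += 1
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0
```

Unlike direct co-occurrence, two terms can score highly here without ever appearing on the same page, as long as they share context words.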

88
Indirect Association Problem
89
Competitive Linking Algorithm
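[Figure: a bipartite graph with source-side nodes (s, St1, St2; e.g., "Cisco", "system") linked to target-side candidates (t1, t2) and context terms glossed as Cisco, system, information, network, computer.]
Fig. 6. An illustration showing a bipartite graph
generated by using Algorithm 2.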
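
The transcript does not reproduce Algorithm 2, so the following is only a generic sketch of greedy competitive linking: candidate pairs are visited from the highest score down, and a pair is linked only if neither side is already linked. The scores and the `T_` target labels are made-up placeholders.

```python
def competitive_linking(scores):
    """Greedy one-to-one linking over {(source_term, target_term): score}."""
    linked_sources, linked_targets = set(), set()
    links = {}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for (s, t), _score in ranked:
        # A strong direct link "uses up" both endpoints, so weaker
        # indirect associations with either term are skipped.
        if s not in linked_sources and t not in linked_targets:
            links[s] = t
            linked_sources.add(s)
            linked_targets.add(t)
    return links

scores = {
    ("Cisco", "T_cisco"): 0.9,
    ("Cisco", "T_system"): 0.6,   # indirect association
    ("system", "T_system"): 0.8,
    ("system", "T_cisco"): 0.5,   # indirect association
}
links = competitive_linking(scores)
```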

90
Combined Method

To take advantage of both methods
Anchor-text-based: higher precision
Search-result-based: higher coverage

$R_m(s,t)$: the rank of the score of $(s,t)$ under method $m$
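
One way to combine methods by rank rather than by raw score is a weighted sum of reciprocal ranks. The combination formula itself did not survive in the transcript, so the form used below, the sum over methods m of w_m / R_m(s,t), is an assumption, as are the uniform default weights and the method names in the example.

```python
def combine_by_rank(rankings, weights=None):
    """rankings: {method: [candidates, best first]} -> candidates, best first."""
    weights = weights or {m: 1.0 for m in rankings}
    combined = {}
    for method, ranked in rankings.items():
        for rank, cand in enumerate(ranked, start=1):
            # Assumed form: each method contributes weight / rank.
            combined[cand] = combined.get(cand, 0.0) + weights[method] / rank
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(combined, key=lambda c: (-combined[c], c))

rankings = {
    "anchor_text": ["t1", "t2"],
    "chi_square": ["t2", "t1", "t3"],
    "context_vector": ["t2", "t3"],
}
best_first = combine_by_rank(rankings)
```

Working on ranks sidesteps the fact that chi-square scores, cosine values, and anchor-text probabilities live on different scales; a candidate missing from one method simply receives no contribution from it.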
91
Experiments

Performance on Query Translation
Test bed: real query terms from the Dreamer
search engine log in Taiwan
228,566 unique terms, during a period of 3 months
in 1998
Random-query test set
50 query terms in Chinese, randomly selected from
the top 20,000 queries in the log
40 of them were out-of-dictionary

92
Random Query Test Set
Table 2. Coverage and top 1-5 inclusion rates (%) obtained with the four different methods for the random-query set.

Method     Top-1   Top-3   Top-5   Coverage
CV         40.0    54.0    54.0    68
X2         36.0    50.0    52.0    68
AT         20.0    32.0    32.0    32
Combined   44.0    64.0    66.0    72

(CV: context vector; X2: chi-square; AT: anchor text)

Many query terms didn't appear in anchor-text
sets (coverage)

93
Other Experiments

430 popular Chinese queries: 67.4% top-1
inclusion rate
Common terms: randomly selected 100 common nouns
and 100 common verbs from a general-purpose Chinese
dictionary

94
Transitive Translation
95
Transitive Translation Model
96
Chinese-Japanese Translation
Table VII. Selected source query terms in traditional Chinese and their translations in English, simplified Chinese and Japanese respectively, which were extracted by our transitive translation model.

Source term             Extracted target translations
(Traditional Chinese)   English       Simplified Chinese   Japanese
??                      Sony          ??                   ???
??                      Nike          ??                   ???
???                     Stanford      ???                  ???????
??                      Sydney        ??                   ????
????                    internet      ???                  ???????
??                      network       ??                   ??????
??                      homepage      ??                   ??????
??                      computer      ???                  ???????
???                     database      ???                  ??????
??                      information   ??                   ?????????

Table VI. Top-n inclusion rates (%) obtained with different models for translating traditional Chinese queries into Japanese.

Model        Top1   Top2   Top3   Top4   Top5
Direct       10.5   12.8   14.3   15.1   15.1
Indirect     40.2   49.4   56.6   58.6   59.6
Transitive   42.9   51.4   58.6   61.3   61.9

97
Translation Lexicons with Regional Variations
(a) Taiwan
(b) Mainland China
(c) Hong Kong
Figure 1. Examples of search-result pages in different
Chinese regions that were obtained via the
English query words "George Bush" from Google.
98
Summary

A work dealing with live translation of unknown
queries
Anchor-text-based
High precision for high-frequency terms
Effective for proper nouns in multiple languages
Not applicable if the anchor-text set is too
small
Search-result-based
Exploit rich Web resources
High coverage for English-Chinese language pair

99
LiveCluster
Generating Taxonomy from terms or documents

100

Taxonomy Generation from Terms

101
Hierarchical Query Clustering
102
The Steps

Feature Extraction
Use co-occurred seed terms extracted from
retrieved top pages
Term Vector
Each query term is assigned a term vector
Record the co-occurred feature terms and their
frequency values in the retrieved documents.
Term Similarity
TF-IDF-based cosine measure
Hierarchical Term Clustering
Cluster popular query terms in the log into
initial categories
Query terms with similar features are grouped
into clusters.
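
The feature-extraction and term-vector steps above can be sketched as: count each feature term in the snippets retrieved for a query term (tf, df), then turn the counts into weighted vectors. The weighting tf * log(1 + n/df) used here is one common TF-IDF variant chosen for illustration, not necessarily the one on the Term Weighting slide; single-word features and whitespace tokenization are further simplifications, and both function names are hypothetical.

```python
import math

def tf_df(snippets, feature_terms):
    """Return {feature: (tf, df)}: total occurrences, and snippets containing it."""
    stats = {}
    for feat in feature_terms:
        counts = [snippet.lower().split().count(feat) for snippet in snippets]
        stats[feat] = (sum(counts), sum(1 for c in counts if c > 0))
    return stats

def tfidf_vectors(term_snippets, feature_terms):
    """term_snippets: {query_term: [snippets]} -> {query_term: {feature: weight}}."""
    tf = {q: {f: s[0] for f, s in tf_df(snips, feature_terms).items()}
          for q, snips in term_snippets.items()}
    n = len(tf)  # number of query terms being described
    vectors = {}
    for q, counts in tf.items():
        vectors[q] = {}
        for f, count in counts.items():
            # df here: how many query terms have this feature at all
            df = sum(1 for other in tf.values() if other[f] > 0)
            if count > 0:
                vectors[q][f] = count * math.log(1 + n / df)
    return vectors

vectors = tfidf_vectors(
    {"monet": ["art painting", "art museum"], "jazz": ["music art"]},
    ["art", "music", "painting"],
)
```

The resulting sparse vectors can then be compared with a cosine measure and fed to the hierarchical clustering step.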

103
Feature Extraction

Use co-occurred seed terms extracted from
retrieved top pages

nude
Co-occurred feature terms
Creative Nude Photography Network -- Fine Art
Nude and ... ... The Creative Nude and Erotic
Photography Network is the number one net portal
to the best in fine art nude and erotic
photography! Over 100 CNPN Member Sites ...
Nude Places... to be naked. Walking in the
forest, cruising the lake in open boats,
swimming, picnicking and nude photography are all
enjoyed in the nude. 60 minutes 39.95. ... A
Brave Nude World... A Brave Nude World! Warning
This site contains links to fine art nude
erotic photography. If you are under 18 or do not
wish to view this material, You can ...
tf/df   term
3/2     erotic photography
1/1     naked
2/2     photography
3/2     art

104
Term Weighting
105
Extraction of Basic Feature Terms

Performance of different features: randomly
selected, high-frequency, and seed terms
Popular queries not affected by ephemeral trends,
e.g., movie, basketball, mutual fund, etc.
More expressive and distinguishable in describing
a particular category
Two logs were compared, and 9,709 overlapping
top query terms were extracted as feature terms

106
Task I Query Clustering (Cont.)

Feature Extraction
Use co-occurred seed terms extracted from
retrieved top pages
Term Vector
Each query term is assigned a term vector
Record the co-occurred feature terms and their
frequency values in the retrieved documents.
Term Similarity
TF-IDF-based cosine measure
Hierarchical Term Clustering
Cluster popular query terms in the log into
initial categories
Query terms with similar features are grouped
into clusters.

107
Term Similarity
108
Hierarchical Term Clustering

Agglomerative hierarchical clustering (AHC)
1. Compute the similarity between all pairs of clusters
   - Estimate similarity between all pairs of composed terms
   - Use the lowest term similarity value as the cluster similarity value
2. Merge the most similar (closest) two clusters
   - Complete linkage method
3. Update the cluster vector of the new cluster
4. Repeat steps 2 and 3 until only a single cluster remains
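
A compact sketch of complete-linkage AHC as outlined above. It assumes a caller-supplied pairwise `similarity(x, y)` function and recomputes cluster similarity from member pairs on every pass instead of maintaining a cluster vector, which is simpler but slower than what a production system would do. The toy similarity table is invented for the example.

```python
def complete_linkage_ahc(items, similarity):
    """Return the merge history [(cluster_a, cluster_b, sim), ...]."""
    clusters = [(item,) for item in items]
    merges = []

    def cluster_sim(a, b):
        # Complete linkage: the lowest pairwise term similarity
        # is taken as the similarity of the two clusters.
        return min(similarity(x, y) for x in a for y in b)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cluster_sim(clusters[i], clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        merges.append((clusters[i], clusters[j], s))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Toy symmetric similarity table (made-up values).
sim = {
    ("nba", "basketball"): 0.9, ("stock", "fund"): 0.8,
    ("nba", "stock"): 0.1, ("nba", "fund"): 0.1,
    ("basketball", "stock"): 0.2, ("basketball", "fund"): 0.15,
}

def similarity(x, y):
    return sim.get((x, y), sim.get((y, x)))

merges = complete_linkage_ahc(["nba", "basketball", "stock", "fund"], similarity)
```

The merge history is effectively a binary tree; cutting it at a similarity threshold yields the flat cluster partition discussed on the following slides.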

110
Clustering Results
111
Cluster Partition
112
Quality Function
113
Quality Function (Cont.)
114
Quality Function (Cont.)
115
Preliminary Experiment