Opportunities and Challenges of Web Search and Mining - PowerPoint PPT Presentation

About This Presentation
Title:

Opportunities and Challenges of Web Search and Mining

Description:

Title: Ongoing Research Author: Lee-Feng Chien Last modified by: wkd Created Date: 4/24/2002 1:15:34 PM Document presentation format: – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: Opportunities and Challenges of Web Search and Mining


1
Opportunities and Challenges of Web Search and
Mining
  • Lee-Feng Chien

Academia Sinica, National Taiwan University
2
Outline
  • Web SE
  • Inside SE
  • Google's Business Models
  • Google's Impacts
  • Recent Development
  • Next-Generation WSE
  • Web Mining

3
WSE: Google
Globalization!
4
WSE: Google
5
Problems of WSE
Inside WSE: Fast, Coverage, Accuracy
6
Problems of WSE
Inside WSE: Fast, Coverage, Accuracy
Business: Profitable, Models, Competition
7
Problems of WSE
Business: Profitable, Models, Competition
Inside WSE: Fast, Coverage, Accuracy
Impacts: Web Computing, Knowledge Windows, New Paradigm of Civilization
8
I. Some Must-Know Statistics
9
Online Language Populations
  • Source: Global Reach (global-reach.biz/globstats)

10
Top Ten Languages in the Web
| Top ten languages in the Internet | Internet users, by language | Average penetration (%) | World population estimate for language | Language as % of total Internet users |
|---|---|---|---|---|
| English | 287,369,520 | 26.2 | 1,098,654,265 | 35.9 |
| Chinese | 105,484,112 | 8.0 | 1,321,669,200 | 13.2 |
| Japanese | 66,548,060 | 52.1 | 127,853,600 | 8.3 |
| German | 54,035,201 | 56.3 | 95,893,300 | 6.8 |
| Spanish | 53,670,063 | 13.9 | 386,413,200 | 6.7 |
| French | 35,034,269 | 9.3 | 375,164,185 | 4.4 |
| Korean | 30,670,000 | 41.0 | 74,730,000 | 3.8 |
| Italian | 28,610,000 | 49.3 | 57,987,100 | 3.6 |
| Portuguese | 23,058,254 | 10.3 | 224,664,100 | 2.9 |
| Dutch | 13,657,170 | 56.6 | 24,125,950 | 1.7 |
| Top ten languages | 698,353,773 | 18.4 | 3,787,154,900 | 87.3 |
| Rest of the languages | 101,686,725 | 3.9 | 2,602,992,587 | 12.7 |
| World total | 800,040,498 | 12.5 | 6,390,147,487 | 100.0 |
More and more non-English users!
  • Source: Internet World Stats

11
Web Content
More and more non-English pages
Source: Network Wizards, Jan 99 Internet Domain
Survey
12
Web Users and Pages (5 years ago)
Challenge of Scalability!
Chinese users: 110M, including 87M (CN), 4.9M
(HK), 8.8M (TW), 2.14M (SG), and others. Source:
Global Reach, 2004
13
Number of Chinese Web Pages
573,000,000 pages
Scalability Problem !
14
Number of Web Pages
Billions of Textual Documents Indexed, as of Sept
2, 2003
The world's largest search engine?
  • 4,285,199,774 pages (Google)
  • 4.28 billion Web pages, 880 million images, and
    other documents

KEY: GG=Google, ATW=AllTheWeb, INK=Inktomi,
TMA=Teoma, AV=AltaVista. Source: Search Engine
Watch
15
The top 10 Internet trends 2004 predicted by
eOneNet.com
  • 1.    World Internet population will continue to
    grow at an exponential rate, with China taking
    the lead in Asia having more than 100 million
    Internet users.
  • 2.    Broadband Internet penetration will
    continue to grow with China and US in the lead
    with an expected growth rate exceeding 30% each.
  • 3.    Online retail sales will still be led by
    the US with an expected revenue exceeding US$80
    billion.
  • 4.    Paid search will account for the biggest
    online ad spending. With the successful paid
    search business models of Google and Overture,
    more search engines will offer paid search
    advertising.

16
The top 10 Internet trends 2004 predicted by
eOneNet.com
  • 5.    Spam will increase at least 20% despite
    the new US anti-spam law. The US legislators will
    be forced to consider amending the anti-spam law
    from an opt-out law to an opt-in law.
  • 6.    Ads placed in opt-in email newsletters will
    increase 25% as legitimate marketers find this is
    the easier way to comply with the anti-spam law
    and a better way of targeting customers.
  • 7.    Rich media will continue to be hot. More
    than 25% of online ads served will contain rich
    media content.

17
The top 10 Internet trends 2004 predicted by
eOneNet.com
  • 8.    20% more small businesses will develop
    their own websites or use the Internet as a sales
    and marketing channel.
  • 9.    Online entertainment will grow at a
    rapid pace, with more sites offering videos and
    digital music download services.
  • 10.    The Internet boom will revive with more
    Internet companies going for IPO both in the US
    and in Asia, in particular kicked off by the most
    anticipated Google IPO in Spring.

18
II. Inside WSE
19
Components
  • Crawler/Spider
  • Index Server
  • Query Server
  • Document Delivery

20
Architecture
[Figure: search engine architecture. Components: Browser, SE servers with Index replicas, Spider, Web, Archive, Indexer, Log. Annotations: 1B queries/day, 5B pages, quality results, freshness, spam, scalable.]
21
Spider
  • Get all Pages from the Web
  • Web Traverse
  • Challenges
  • Performance, e.g., pages per PC
  • Coverage
  • Currency
  • Spam Filtering
  • Hidden Web
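
The "Web Traverse" step can be sketched as a breadth-first walk over the link graph. This is only a minimal illustration, not the spider of any particular engine: `crawl`, `fetch_links`, and the toy `.example` hosts are hypothetical names, and a real spider would also need politeness rules, re-crawl scheduling for currency, and spam filtering.

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=1000):
    """Breadth-first traversal of a link graph starting from seed_urls.

    fetch_links(url) is a caller-supplied (hypothetical) function returning
    the URLs linked from `url`; a real spider would download and parse the
    page at this point.
    """
    seen = set(seed_urls)
    frontier = deque(seed_urls)
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:      # never queue the same page twice
                seen.add(link)
                frontier.append(link)
    return order

# Toy in-memory "Web" standing in for real HTTP fetches (assumption).
toy_web = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
}
visited = crawl(["a.example"], lambda url: toy_web.get(url, []))
print(visited)
```

The `max_pages` cap is one simple way to bound the work done per machine, which relates to the pages-per-PC concern listed above.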

22
Index Server
  • Index occurrences of all words in the pages
  • Data Cleanness
  • Challenges
  • Space overhead, pages per PC
  • Incremental
  • Scalability: Distributed Processing
  • Multiple Languages

23
System Anatomy
24
Data Structure
  • Lexicon: fits in memory; kept in two different forms
  • Hit lists: account for most of the space; each hit uses 2 bytes to save
    space
  • Forward index: barrels are sorted by wordID; inside a barrel, entries are
    sorted by docID
  • Inverted index: same content as the forward index, but sorted by wordID;
    each doc list is sorted by docID
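
The relation between the forward index and the inverted index can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it keys on words rather than numeric wordIDs, stores hits as plain position lists rather than 2-byte encodings, and ignores barrels; the function names are hypothetical.

```python
from collections import defaultdict

def build_forward_index(docs):
    """docs: {doc_id: text}. Returns {doc_id: {word: [positions]}} (the hits)."""
    forward = {}
    for doc_id, text in docs.items():
        hits = defaultdict(list)
        for pos, word in enumerate(text.lower().split()):
            hits[word].append(pos)
        forward[doc_id] = dict(hits)
    return forward

def invert(forward):
    """Regroup the same hits by word; each doc list is sorted by doc_id."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word, positions in forward[doc_id].items():
            inverted[word].append((doc_id, positions))
    return dict(inverted)

docs = {2: "web search and web mining", 1: "web mining"}
inverted = invert(build_forward_index(docs))
print(inverted["web"])   # [(1, [0]), (2, [0, 3])]
```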
25
Query Server
  • Search relevant URLs for queries by looking up the
    indices
  • Challenges
  • Speed, e.g., queries per second
  • Functions supported
  • Localization

26
PageRank
27
PageRank (Cont.)
  • Let u be a web page and let $B_u$ be the set of pages that point to u.
  • Let $N_u$ be the number of links from u and let c be a factor used for
    normalization, then
  • $$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$
  • a simplified version of PageRank
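
A minimal Python sketch of the simplified PageRank iteration, in which the rank of a page u is the normalized sum, over the pages v that point to u, of the rank of v divided by the number of links from v. The function name, the fixed iteration count, and the toy graph are assumptions for illustration; the sketch also assumes every page has at least one outgoing link and that every link target is a key of `links`, so it does not handle dangling pages or rank sinks.

```python
def simplified_pagerank(links, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: rank}; ranks sum to 1."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, outlinks in links.items():
            for u in outlinks:
                # v passes an equal share of its rank to each page it links to
                new_rank[u] += rank[v] / len(outlinks)
        total = sum(new_rank.values())          # normalization factor c = 1 / total
        rank = {p: r / total for p, r in new_rank.items()}
    return rank

toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = simplified_pagerank(toy_graph)
print({page: round(value, 3) for page, value in ranks.items()})
```

In this toy graph, pages a and c converge to the same rank and b to half of it, because b receives only half of a's rank while c receives the other half plus all of b's.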

28
Search Functions
  • Phrase search, e.g. "petite galerie"
  • Truncation, e.g. librar*, wom*n
  • Constraining search, e.g. title:"The Wall Street
    Journal"
  • Proximity search, e.g. gold near silver
  • Boolean, e.g. noir film -"pinot noir"
  • Parentheses and Nested Boolean, e.g. silver and
    not (gold or platinum)
  • Limit search, e.g. limit by date range
  • Capitalization, e.g. turkey vs. Turkey
  • Ranking fields and refine search
  • LiveTopics
  • Translate Service
  • Other

29
Document Delivery
  • Bottleneck of Bandwidth
  • Presentation
  • Caching
  • Queries, Search Results
  • Akamai Model
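
Caching of queries and search results can be sketched as a least-recently-used (LRU) cache placed in front of the query server. The eviction policy, the class name `ResultCache`, and the `search_backend` callable are assumptions for illustration; the slide does not say which policy is used.

```python
from collections import OrderedDict

class ResultCache:
    """Keeps the results of the most recently asked queries, up to `capacity`."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query, search_backend):
        if query in self._entries:
            self._entries.move_to_end(query)     # mark as most recently used
            self.hits += 1
            return self._entries[query]
        self.misses += 1
        results = search_backend(query)          # the expensive index lookup
        self._entries[query] = results
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)    # evict the least recently used
        return results

cache = ResultCache(capacity=2)
backend = lambda q: [q + " result"]              # stand-in for the query server
for q in ["google", "web mining", "google", "pagerank", "web mining"]:
    cache.get(q, backend)
print(cache.hits, cache.misses)
```

Because real query logs are dominated by a small set of popular queries, even a small cache of this kind can absorb a noticeable share of the load, which is why it appears next to the bandwidth bottleneck.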

30
III. Business
31
What is Google?
  • Specialized web search engine
  • Founded in 1998 by 2 graduate students at
    Stanford University (Larry Page and Sergey Brin)
  • Provides a comprehensive, relevant, and
    easy-to-use web search and browsing service (free)
  • Google's features: fast, unbiased, and accurate
    results; access to over 4 billion web
    pages and over 800 million images (the most
    important valid web pages)

32
Company Facts
Employees: 1,300
Languages spoken: 34
Worldwide offices: 21 (mostly in the US and Europe)
Annual revenues: $900m
33
Google Revenue
  • Revenue (an e-business)
  • ½ from selling relevant text-based ads
    (sponsored links near search results)
  • ½ from licensing its search technology to
    companies like Yahoo
  • Source: Eric Schmidt interview,
    PCWorld.com (January 30, 2002)

34
Sources of Revenue
  • AdWords (150,000 advertisers): sponsored-link
    ads
  • Cost-per-click pricing: advertisers pay only when people click
    on the link
  • -- Advertisement is extremely cheap and
    effective
  • e.g. Edmunds.com spent $250,000 a month on
    advertising because every $1 spent generated $1.70.
  • Google Search Appliance
  • an integrated hardware/software solution that
    extends the power of Google to corporate
    intranets and web servers
  • -- Customers include Cisco Systems, Sony,
    Procter & Gamble, Sun Microsystems, etc.

35
Challenges (cont.)
  • Easy entry into the Search Engine Industry
  • Lack of customer lock-in (vs. Microsoft)
  • Google will focus on creating services to
    voluntarily draw in customers
  • Large, well-known competitors are focusing on
    in-house search technology (Yahoo, Microsoft,
    AOL, eBay, Amazon)
  • Customers are becoming competitors (Yahoo, AOL)

36
Competitors: eBay and Amazon
  • eBay (www.ebay.com): E-commerce
  • Web-based marketplace in which a community of
    buyers and sellers are brought together to
    browse, buy and sell various items
  • -- Business revenue: charges on proceeds (fees)
  • 5% on $0.01-$25, 2.5% on $25-$1,000, 1.25% over
    $1,000
  • Amazon (www.amazon.com): E-commerce
  • a customer-centric company that sells a range
    of products that it purchases from manufacturers
    and distributors

37
Competitors: Microsoft and Yahoo
  • Microsoft is developing its own search engine
  • -- Can lasso users into its search engine
    through its operating system
  • -- Has the brainiacs to implement top-of-the-line
    search engine technology
  • Yahoo was a customer of Google (may now become
    Google's biggest competitor)
  • -- Offers placement under sponsored links and
    within actual results (unethical)

38
IV. Impacts
39
Impacts
  • Web Computing
  • Knowledge Windows
  • New Web OS

40
Web Computing
  • Faster than local search
  • Very-large scale of computing systems
  • Realize global users behaviors
  • Acquire global information sources

41
Web Computing
  • Local disc or global disc?
  • Personal information management?
  • Gmail
  • Photo search

42
Knowledge Windows
  • Windows of Information Search
  • Alliance with online databases
  • Windows of Personal Knowledge Management
  • Knowledge Windows

43
New Web OS
  • Merged with Linux OS
  • Software download from end-users
  • Information Service OS

44
V. New Gen. of WSE
45
Advanced Google
  • Is Google good enough?
  • Takano
  • Takano NII
  • Takano NII Japan
  • More about Google Services
  • http://www.google.com/options/

46
New Features in Google
  • Google Labs: http://labs.google.com/
  • Google Desktop Search
  • Searching text, Web, Word, Excel, PowerPoint,
    Outlook, AOL Instant Messenger
  • Google SMS
  • Searching phone book, dictionary, product prices,
  • Google Print
  • Searching books

47
(No Transcript)
48
Other Search Tools
  • A9.com (by Amazon)
  • Bookmark, history, discover, diary
  • Books, movies,
  • Clusty.com (by Vivisimo)
  • Clustering engine
  • Snap.com (by Idealab)
  • Sorting by popularity, satisfaction, Web
    popularity, Web satisfaction, domain,
  • Alexa.com (by Amazon)
  • Average user review ratings,
  • Others: Yahoo, AskJeeves, AOL Search, HotBot,
    MSN, Netscape, Lycos, Altavista, LookSmart,
    Gigablast, Overture, About, FindWhat, Teoma,
    InformSearch,

49
Clusty.com
50
Example on Vivisimo
51
Vivisimo (cont.)
52
New Directions
  • Personalization
  • Photo search, email search filtering
  • Information Extraction
  • Ex.: Scholar search
  • Information Agent
  • Deep Web Search

53
VI. Web Mining
54
Web Search/Information Retrieval
Millions of Users
55
Improving Search via Mining
Millions of Users
56
Valuable Web Resources
Knowledge Discovery
Hyperlinks, anchor texts, search-result pages, query logs, query session logs, clickstream logs, deep Web, ...
Web logs, texts, images, ...
Millions of Users
57
Discovered Knowledge
Knowledge Discovery
  • User preferences/needs: topic, location, timing
  • Authority/popularity: site, file, people, company, product
  • Clusters/associations/relations: site, page, people, company, product, query
Web logs, texts, images, ...
Millions of Users
58
Web Mining for IR
Knowledge Discovery
Search, classification, clustering, cross-language IR, information extraction, text mining, filtering
Web logs, texts, images, ...
Millions of Users
59
  • CS 276 / LING 239I: Information Retrieval and Web
    Mining
  • Prabhakar Raghavan and Hinrich Schütze
  • Course Description
  • Basic and advanced techniques for text-based
    information systems: efficient text indexing;
    Boolean, vector space, and probabilistic
    retrieval models; evaluation and interface
    issues; Web search including crawling, link-based
    algorithms, and Web metadata; text/Web
    clustering, classification; wrapper, information
    extraction, and collaborative filtering systems;
    text mining. Projects can be chosen from diverse
    topics in information retrieval.

60
Computational Linguistics, Volume 29, Issue 3,
September 2003.
61
Research at Web Knowledge Discovery Lab
62
Research at Web Knowledge Discovery Lab
  • Live series
  • LiveTrans
  • SIGIR04, ACL04, JCDL04
  • ACM Trans. on Information Systems, 2004
  • Online Translation of unknown queries via Web
  • LiveClassifier
  • WWW04, IJCNLP04
  • ACM Trans. on ALIP, 2004
  • Training classifiers and classifying short text
    via Web

63
Research at Web Knowledge Discovery Lab
  • LiveCluster
  • CIKM04
  • ACM Trans. on Information Systems, 2004
  • Generating taxonomy from terms or documents

64
LiveTrans: Cross-language Web Search
65
LiveClassifier: Classifying search results into a
user-defined classification tree
66
LiveClassifier: Paper Title Categorization
Note: no labeled training data
67
LiveCluster: Taxonomy Generation
68
Terms Clustering
69
Query Clustering
70
(No Transcript)
71
Outline
  • Translating Unknown Queries (SIGIR04)
  • Training Text Classifiers (WWW04)
  • Generating Taxonomy/Topic Hierarchies (TOIS04)

72
Translating Unknown Queries
  1. Anchor Text Mining
  2. Probabilistic Modeling (ACM TALIP02)
  3. Transitive Translation (ACM TOIS04)
  4. Search-Result Page Mining
  5. Translation Extraction & Selection (JCDL04)
  6. CLIR & Other Applications (SIGIR04, ACL04)

Note: First work dealing with online translation
73
Introduction (cont.)
  • Bottleneck of CLIR service
  • Real queries are often short
  • Out-of-dictionary terms
  • and might have local variations
  • Ex.: proper nouns, new terminologies, ...
  • Need for a powerful query translation engine
  • Up-to-date dictionary

| English terminology | Chinese translation |
|---|---|
| Digital library | ?????/????? |
| Banff | ??/?? |
| Ishikawa | ??? |
| NII Japan | ???????? |
| louvre museum | ??? |
| SARS | ??????????/??/?? |
| Clinton | ???/??? |
| Bill Gates | ???? |
74
Web Mining of Query Translations
[Figure: term translation of an out-of-dictionary (OOD) source term into target translations (example: Yahoo <-> ??) via Web mining: anchor-text mining and search-result mining.]
  • Different problems for different resources

75
Anchor Text (Yahoo <-> ??)
  • Applies to most languages
  • Translation candidates are likely to appear in
    the same anchor-text-set

76
Search Result Page (National Palace Museum vs.
?????)
  • Mixed-language characteristic in Chinese pages

77
Problems
  • Term extraction
  • Translation selection and noise reduction
  • Language pairs with limited corpora
  • Processing speed
  • Data cleanness (language identification)
  • Language independence

78
Term Extraction: SCPCD
79
Term Selection: Probabilistic Inference Model
  • Integrating anchor texts and link structures into
    a probabilistic inference model
  • Based on co-occurrence and page authority

Labeled components: page authority, co-occurrence, PageRank
80
Observation of Anchor Text
81
Observation of Anchor Text
[Figure: anchor texts (containing "Yahoo", "Taiwan -", "- in USA") on links pointing to www.yahoo.com and www.yahoo.com.tw; "Yahoo" is marked as the source query.]
82
Observation of Anchor Text
[Figure: the links to www.yahoo.com and www.yahoo.com.tw with their anchor texts, which together form the anchor-text set; the Chinese anchor texts (???? and ??) are marked as translation candidates for "Yahoo".]
83
Observation of Anchor Text
[Figure: the same anchor-text set for www.yahoo.com and www.yahoo.com.tw, annotated with co-occurrence and page authority; the in-link counts shown are 187 and 21.]
84
Search Result Mining
PAT-tree-based term extraction method [Chien, SIGIR 97]
[Figure: Source Query -> Search Engine -> Web Pages -> Term Extraction -> Term Selection -> Target Translations]
85
Term Selection
  • How to decide the ranking?
  • S, Ti frequently co-occur in the same pages
  • Not necessarily true for synonyms and antonyms
  • S, Ti the result pages containing similar
    co-occurring context terms as feature vectors

Query S
86
Chi-Square Test
  • Chi-Square Test: a statistical method for
    co-occurrence analysis [Gale & Church 91]

  • a: number of pages containing both terms s and t
  • b: number of pages containing term s but not t
  • c: number of pages containing term t but not s
  • d: number of pages containing neither term s nor t
  • N: the total number of pages, i.e., N = a + b + c + d
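
With the page counts a, b, c, and d defined above, the standard chi-square statistic for a 2x2 contingency table is N(ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)). The sketch below computes it; the slide's own formula is not in the transcript, so this is the textbook form, and the function name and example counts are made up for illustration.

```python
def chi_square(a, b, c, d):
    """Chi-square association score for terms s and t from 2x2 page counts.

    a: pages with both s and t     b: pages with s but not t
    c: pages with t but not s      d: pages with neither
    """
    n = a + b + c + d
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        return 0.0                     # a row or column is empty: no evidence
    return n * (a * d - b * c) ** 2 / denominator

# Hypothetical counts: s and t1 appear together often; s and t2 look independent.
strong = chi_square(a=30, b=10, c=20, d=940)
independent = chi_square(a=10, b=40, c=10, d=40)
print(round(strong, 2), independent)
```

Candidates t can then be ranked by their score against the source term s, with higher scores indicating stronger co-occurrence than chance would predict.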
87
Context Vector Analysis
  • Context Vector Analysis: co-occurring context
    terms as feature vectors
  • Similarity measure: cosine measure
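
A minimal sketch of context-vector comparison: each term is represented by counts of the words that co-occur with it in its result snippets, and two terms are compared by the cosine of their vectors. The raw-count weighting, the whitespace tokenizer, and the toy English snippets are assumptions; in the cross-language setting the source term and its candidates would need vectors over a shared set of context terms.

```python
import math
from collections import Counter

def context_vector(snippets):
    """Count the context terms found in the snippets retrieved for a term."""
    counts = Counter()
    for snippet in snippets:
        counts.update(snippet.lower().split())
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical snippets for a source term and two candidate translations.
source_vec = context_vector(["national palace museum taipei art",
                             "museum art collection"])
candidate_a = context_vector(["taipei museum art exhibition"])
candidate_b = context_vector(["stock market price"])
sim_a = cosine(source_vec, candidate_a)
sim_b = cosine(source_vec, candidate_b)
print(sim_a > sim_b)
```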

88
Indirect Association Problem
89
Competitive Linking Algorithm
[Figure: bipartite graph linking the source terms s ("system", "Cisco") to candidate translations t1, t2, glossed as (Cisco), (system), (information), (network), and (computer).]
Fig. 6. An illustration showing a bipartite graph
generated by using Algorithm 2.
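
The general idea of competitive linking can be sketched as a greedy one-to-one matching: take candidate pairs in decreasing order of association score, and link a pair only if neither term has been linked already. This is the generic form of the technique, not necessarily the "Algorithm 2" referred to in the figure caption; the scores and the `zh:` placeholder labels (standing in for the Chinese terms lost in the transcript) are hypothetical.

```python
def competitive_linking(scores):
    """scores: {(source_term, target_term): association score}.

    Returns a one-to-one mapping {source_term: target_term}.
    """
    links = {}
    used_targets = set()
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (source, target), _score in ranked:
        if source not in links and target not in used_targets:
            links[source] = target
            used_targets.add(target)
    return links

scores = {
    ("Cisco", "zh:Cisco"): 0.90,
    ("system", "zh:Cisco"): 0.85,   # indirect association via "Cisco Systems"
    ("system", "zh:system"): 0.80,
    ("Cisco", "zh:system"): 0.60,
}
links = competitive_linking(scores)
print(links)
```

Picking the best candidate for each source term independently would map "system" to `zh:Cisco` here; because "Cisco" claims that target first, the indirect association is blocked and "system" falls back to `zh:system`.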

90
Combined Method
  • To take advantage of both methods
  • Anchor-text-based: higher precision
  • Search-result-based: higher coverage

R_m(s,t): the rank of the score of (s, t) under method m
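
One plausible way to combine methods through their rankings is a weighted sum of reciprocal ranks, so that a candidate ranked highly by several methods comes out on top. The combination formula itself is not in the transcript, so the reciprocal-rank form, the equal default weights, and all names below are assumptions.

```python
def combine_by_rank(rankings, weights=None):
    """rankings: {method: [candidates, best first]}.

    Scores each candidate by the sum over methods of weight / rank and
    returns the candidates sorted by that combined score, best first.
    """
    weights = weights or {method: 1.0 for method in rankings}
    combined = {}
    for method, candidates in rankings.items():
        for rank, candidate in enumerate(candidates, start=1):
            combined[candidate] = combined.get(candidate, 0.0) + weights[method] / rank
    return sorted(combined, key=lambda c: combined[c], reverse=True)

# Hypothetical ranked outputs; the anchor-text list is shorter (lower coverage).
rankings = {
    "anchor_text": ["t1", "t3"],
    "chi_square": ["t2", "t1", "t3"],
    "context_vector": ["t1", "t2", "t4"],
}
merged = combine_by_rank(rankings)
print(merged)
```

A candidate missing from one method's list simply receives no contribution from that method, which lets the high-coverage methods fill in where anchor texts have nothing to offer.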
91
Experiments
  • Performance on Query Translation
  • Test bed: real query terms from the Dreamer
    search engine log in Taiwan
  • 228,566 unique terms, collected over a period of 3 months
    in 1998
  • Random-query test set
  • 50 query terms in Chinese, randomly selected from
    the top 20,000 queries in the log
  • 40 of them were out-of-dictionary

92
Random Query Test Set
Table 2. Coverage and top 1-5 inclusion rates obtained with the four different methods for the random-query set.
| Method | Top-1 (%) | Top-3 (%) | Top-5 (%) | Coverage (%) |
|---|---|---|---|---|
| CV | 40.0 | 54.0 | 54.0 | 68 |
| χ² | 36.0 | 50.0 | 52.0 | 68 |
| AT | 20.0 | 32.0 | 32.0 | 32 |
| Combined | 44.0 | 64.0 | 66.0 | 72 |
  • Many query terms did not appear in anchor-text
    sets (coverage)
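
Assuming the top-n inclusion rate means the percentage of test queries whose correct translation appears among the first n extracted candidates (the slides do not spell the definition out), it can be computed as follows. The queries, candidates, and gold translations are hypothetical.

```python
def top_n_inclusion_rate(candidates_by_query, gold_by_query, n):
    """Percentage of queries with a correct translation in the top n candidates."""
    hits = 0
    for query, gold in gold_by_query.items():
        top = candidates_by_query.get(query, [])[:n]
        if any(candidate in gold for candidate in top):
            hits += 1
    return 100.0 * hits / len(gold_by_query)

candidates = {
    "q1": ["a", "b", "c"],
    "q2": ["x", "y"],
    "q3": ["m", "n", "o"],
    "q4": [],                 # no candidate extracted at all (not covered)
}
gold = {"q1": {"a"}, "q2": {"y"}, "q3": {"z"}, "q4": {"k"}}
top1 = top_n_inclusion_rate(candidates, gold, n=1)
top3 = top_n_inclusion_rate(candidates, gold, n=3)
print(top1, top3)
```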

93
Other Experiments
  • 430 popular Chinese queries: 67.4% top-1
    inclusion rate
  • Common terms: 100 common nouns
    and 100 common verbs randomly selected from a general-purpose Chinese
    dictionary

94
Transitive Translation
95
Transitive Translation Model
96
Chinese-Japanese Translation
Table VII. Selected source query terms in traditional Chinese and their translations in English, simplified Chinese and Japanese respectively, which were extracted by our transitive translation model.
| Source term (Traditional Chinese) | English | Simplified Chinese | Japanese |
|---|---|---|---|
| ?? | Sony | ?? | ??? |
| ?? | Nike | ?? | ??? |
| ??? | Stanford | ??? | ??????? |
| ?? | Sydney | ?? | ???? |
| ???? | internet | ??? | ??????? |
| ?? | network | ?? | ?????? |
| ?? | homepage | ?? | ?????? |
| ?? | computer | ??? | ??????? |
| ??? | database | ??? | ?????? |
| ?? | information | ?? | ????????? |

Table VI. Top-n inclusion rates obtained with different models for translating traditional Chinese queries into Japanese.
| Model | Top-1 (%) | Top-2 (%) | Top-3 (%) | Top-4 (%) | Top-5 (%) |
|---|---|---|---|---|---|
| Direct | 10.5 | 12.8 | 14.3 | 15.1 | 15.1 |
| Indirect | 40.2 | 49.4 | 56.6 | 58.6 | 59.6 |
| Transitive | 42.9 | 51.4 | 58.6 | 61.3 | 61.9 |

97
Translation Lexicons with Regional Variations
(a) Taiwan (b) Mainland China (c) Hong Kong
Figure 1. Examples of search-result pages in different
Chinese regions that were obtained via the
English query words "George Bush" from Google.
98
Summary
  • A work dealing with live translation of unknown
    queries
  • Anchor-text-based
  • High precision for high-frequency terms
  • Effective for proper nouns in multiple languages
  • Not applicable when the anchor-text set is too
    small
  • Search-result-based
  • Exploit rich Web resources
  • High coverage for English-Chinese language pair

99
  • LiveCluster
  • Generating Taxonomy from terms or documents

100
  • Taxonomy Generation from Terms

101
Hierarchical Query Clustering
102
The Steps
  • Feature Extraction
  • Use co-occurred seed terms extracted from
    retrieved top pages
  • Term Vector
  • Each query term is assigned a term vector
  • Record the co-occurred feature terms and their
    frequency values in the retrieved documents.
  • Term Similarity
  • TF-IDF-based cosine measure
  • Hierarchical Term Clustering
  • Cluster popular query terms in the log into
    initial categories
  • Query terms with similar features are grouped
    into clusters.
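
The term-vector and term-similarity steps can be sketched as follows: each query term gets a vector of co-occurring feature terms, the raw frequencies are reweighted by TF-IDF, and terms are compared by cosine. The exact weighting used in the talk is not in the transcript, so the `tf * log(N / df)` form, the function names, and the toy data are assumptions.

```python
import math

def tfidf_vectors(term_features):
    """term_features: {query_term: {feature_term: tf}} -> TF-IDF weighted vectors."""
    n = len(term_features)
    df = {}
    for features in term_features.values():
        for feature in features:
            df[feature] = df.get(feature, 0) + 1
    return {
        query: {f: tf * math.log(n / df[f]) for f, tf in features.items()}
        for query, features in term_features.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse weighted vectors."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical feature-term frequencies taken from each query's top pages.
term_features = {
    "nba": {"basketball": 3, "game": 2},
    "lakers": {"basketball": 2, "game": 1, "team": 1},
    "mutual fund": {"stock": 3, "investment": 2},
}
vectors = tfidf_vectors(term_features)
sim_sports = cosine(vectors["nba"], vectors["lakers"])
sim_mixed = cosine(vectors["nba"], vectors["mutual fund"])
print(sim_sports > sim_mixed)
```

With this weighting, a feature term that co-occurs with every query term gets weight zero, so only features that help tell categories apart contribute to the similarity.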

103
Feature Extraction
  • Use co-occurred seed terms extracted from
    retrieved top pages

Query term: nude
Retrieved top pages (snippets):
  • Creative Nude Photography Network -- Fine Art
    Nude and ... ... The Creative Nude and Erotic
    Photography Network is the number one net portal
    to the best in fine art nude and erotic
    photography! Over 100 CNPN Member Sites ...
  • Nude Places ... to be naked. Walking in the
    forest, cruising the lake in open boats,
    swimming, picnicking and nude photography are all
    enjoyed in the nude. 60 minutes 39.95. ...
  • A Brave Nude World ... A Brave Nude World! Warning:
    This site contains links to fine art nude
    erotic photography. If you are under 18 or do not
    wish to view this material, You can ...

Co-occurred feature terms
| term | tf/df |
|---|---|
| erotic photography | 3/2 |
| naked | 1/1 |
| photography | 2/2 |
| art | 3/2 |
art



104
Term Weighting
105
Extraction of Basic Feature Terms
  • Performance of different features: randomly
    selected, high-frequency, and seed terms
  • Popular queries not affected by ephemeral trends,
    e.g., movie, basketball, mutual fund, etc.
  • More expressive and distinguishable in describing
    a particular category
  • Two logs were compared and 9,709 overlapping
    top query terms were extracted as feature terms

106
Task I Query Clustering (Cont.)
  • Feature Extraction
  • Use co-occurred seed terms extracted from
    retrieved top pages
  • Term Vector
  • Each query term is assigned a term vector
  • Record the co-occurred feature terms and their
    frequency values in the retrieved documents.
  • Term Similarity
  • TF-IDF-based cosine measure
  • Hierarchical Term Clustering
  • Cluster popular query terms in the log into
    initial categories
  • Query terms with similar features are grouped
    into clusters.

107
Term Similarity
108
Hierarchical Term Clustering
  • Agglomerative hierarchical clustering (AHC)
  1. Compute the similarity between all pairs of
     clusters
     • Estimate similarity between all pairs of composed
       terms
     • Use the lowest term similarity value as the
       cluster similarity value
  2. Merge the most similar (closest) two clusters
     • Complete linkage method
  3. Update the cluster vector of the new cluster
  4. Repeat steps 2 and 3 until only a single cluster
     remains
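
A minimal sketch of agglomerative clustering with complete linkage, where the similarity of two clusters is the lowest similarity between any pair of their member terms. To stay short, it recomputes cluster similarities from the members on every pass instead of maintaining a cluster vector, which is a simplification; the function names and the toy similarity table are assumptions.

```python
def complete_linkage_ahc(items, similarity):
    """Cluster `items` bottom-up; similarity(x, y) is higher for closer terms.

    Returns the merge history as (cluster_a, cluster_b, similarity) tuples.
    """
    clusters = [(item,) for item in items]
    merges = []

    def cluster_similarity(c1, c2):
        # Complete linkage: the lowest pairwise similarity between members.
        return min(similarity(x, y) for x in c1 for y in c2)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cluster_similarity(clusters[i], clusters[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        sim, i, j = best
        merges.append((clusters[i], clusters[j], sim))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical pairwise term similarities (e.g. from a cosine measure).
sims = {
    frozenset(("nba", "lakers")): 0.9,
    frozenset(("fund", "stock")): 0.8,
    frozenset(("lakers", "fund")): 0.2,
    frozenset(("lakers", "stock")): 0.15,
    frozenset(("nba", "fund")): 0.1,
    frozenset(("nba", "stock")): 0.1,
}
merges = complete_linkage_ahc(["nba", "lakers", "fund", "stock"],
                              lambda x, y: sims[frozenset((x, y))])
for merge in merges:
    print(merge)
```

The merge history is the hierarchy: early merges with high similarity form the leaves' parents, and the final low-similarity merge is the root.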

109
(No Transcript)
110
Clustering Results
111
Cluster Partition
112
Quality Function
113
Quality Function (Cont.)
114
Quality Function (Cont.)
115
Preliminary Experiment
  • Test queries
  • Two sets: top 1k queries and random 1k queries
  • Each of the test queries has been manually
    assigned to its corresponding class
  • Evaluation metrics
  • F-Measure
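
The formula on the F-measure slide is not in the transcript. A commonly used definition for evaluating clusters against manually assigned classes takes, for each class, the best F-score over all clusters and averages these by class size; the sketch below implements that definition as an assumption, with hypothetical labels.

```python
from collections import Counter

def clustering_f_measure(true_classes, cluster_ids):
    """Class-size-weighted average of each class's best F-score over clusters."""
    n = len(true_classes)
    class_sizes = Counter(true_classes)
    cluster_sizes = Counter(cluster_ids)
    overlap = Counter(zip(true_classes, cluster_ids))
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j      # share of the cluster that is in the class
            recall = n_ij / n_i         # share of the class captured by the cluster
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total

classes = ["sports", "sports", "finance", "finance"]
perfect = clustering_f_measure(classes, [0, 0, 1, 1])
imperfect = clustering_f_measure(classes, [0, 0, 0, 1])
print(perfect, round(imperfect, 4))
```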

116
Evaluation F-Measure
117
Obtained F-Measures
118
(No Transcript)
119
Results of Hierarchical Structure Generation