EECS 395/495: Web Information Retrieval and Extraction

1
EECS 395/495 Web Information Retrieval and
Extraction
  • Spring 2008

2
Outline
  • Introductions
  • Course goals and logistics
  • Why take this class?
  • What is Web Search?
  • What will it be in five years? Ten years?

3
Introductions
  • Professor Doug Downey
  • TA Junsong Yuan

4
Goals
  • How does Web search work?
  • First half of course
  • What's the future of Web search?
  • Second half of course

5
Logistics
  • First half of the class: lectures
  • Second half: presentations of research papers
  • Groups lead discussions
  • Each individual presents a handful of slides
  • See Web page and sign up (e-mail TA) this week

6
Logistics
  • Grading
  • Participation (30%)
  • During lectures/discussions (10%)
  • Leading a discussion (15%)
  • Search feature pitch (5%)
  • Projects (70%)
  • Project grade, based on individual contribution
    (60%)
  • Review of another team's project (10%)

7
Projects
  • Groups of 2-3 (not necessarily discussion
    groups)
  • Examples
  • Read 3-5 recent research papers and summarize the
    state of the art in some area of IR/IE
  • Implement an IR/IE system and report on the
    results
  • Answer theoretical questions
  • Etc.

8
Specific Examples
  • Systems
  • Automatically place blogs on liberal/conservative
    continuum
  • Create a search knob for specifying reading
    level
  • Execute relevant background searches as I write
    an e-mail
  • Theoretical questions
  • Read a paper on search ad auctions or PageRank
    and attempt to extend the results
  • Summary of Research
  • Read 5 papers on automated question answering,
    summarize the field, and suggest future directions

9
Project Milestones
  • April 8: Proposal (1 page)
  • Meetings with me to finalize: April 9/10
  • lots of progress
  • May 5: Report of preliminary results (3 pages)
  • May 7: Review group provides feedback (1 page)
  • June 2: Final Report (4 pages)
  • June 4/June 8 (finals week): Final Presentations
    (10 min + 5 min for Q&A)

10
Search Feature pitch
  • This Thursday (!)
  • Individually deliver a 2.5 minute pitch of a
    new search engine feature
  • Plausible, not necessarily possible (see example
    later)
  • Send me your ppt/pdf slides by Thursday at 9AM
  • Class discusses for one minute while next
    presenter sets up

11
Outline
  • Introductions
  • Course goals and logistics
  • Why take this class?
  • What is Web Search?
  • What will it be in five years? Ten years?

12
Who cares about search?
  • "The most important application for the
    foreseeable future...is search."
  • Steve Ballmer, CEO of Microsoft, Financial Times
    http://news.cnet.com/8301-10784_3-9973650-7.html
  • Why?
  • Search's utility scales as the Web scales
  • People use it all the time
  • Control
  • Profit

13
Why should you care about search?
  • Opportunity
  • Fascinating and important enabling technologies
  • Scaling
  • Machine Learning/Data Mining
  • Graph-based algorithms
  • Language Understanding
  • Auction theory
  • User Interfaces

14
Graph-based Algorithms
  • Example: PageRank
  • Google's original claim to fame
  • Idea: quality of p is proportional to the
    aggregate quality of the pages linking to p
  • PageRank(p) = probability that a random surfer
    lands on p
  • Pick a starting page at random
  • Follow links uniformly at random
  • Every now and then, jump to a random page
  • PageRank(p) = proportion of visits to p
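
The random-surfer recipe above can be sketched directly as a simulation. This is a minimal sketch, not the course's implementation: the toy graph `toy_links`, the function name, the step count, and the rule that a page with no out-links forces a random jump are all assumptions made for illustration. The 0.15 jump probability mirrors the figure used in the PageRank example slide.

```python
import random

def simulate_random_surfer(links, steps=100_000, jump_prob=0.15, seed=0):
    """Estimate PageRank as the fraction of visits a random surfer pays to each page."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {p: 0 for p in pages}
    current = rng.choice(pages)              # pick a starting page at random
    for _ in range(steps):
        visits[current] += 1
        out_links = links[current]
        # Assumption: a page without out-links always triggers a random jump.
        if not out_links or rng.random() < jump_prob:
            current = rng.choice(pages)      # every now and then, jump to a random page
        else:
            current = rng.choice(out_links)  # follow links uniformly at random
    return {p: visits[p] / steps for p in pages}

# Hypothetical four-page graph: page -> pages it links to.
toy_links = {"A": ["B"], "B": ["C"], "C": ["A", "B"], "D": ["A"]}
ranks = simulate_random_surfer(toy_links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page D has no incoming links, so the surfer only reaches it through random jumps and it should end up with the smallest share of visits.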

15
PageRank example
[Figure: example link graph with pages labeled A-F; 15% probability of a random jump]

16
  • Wikipedia.com

17
How to compute PageRank
  • Simulate a random surfer?
  • On 20 billion pages and 400 billion hyperlinks
  • It can be done
  • You'll learn how
  • What about:
  • Personalized PageRank? Link spam?
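
The slide does not say which method the course teaches. One standard alternative to simulating a surfer is power iteration, which repeatedly pushes each page's current score along its out-links. The sketch below assumes a small in-memory graph (`toy_links` is hypothetical), a fixed 50 iterations, and uniform redistribution of rank from pages with no out-links. It does not address Web-scale computation, personalization, or link spam.

```python
def pagerank_power_iteration(links, jump_prob=0.15, iterations=50):
    """Approximate PageRank by repeatedly redistributing rank along out-links."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}       # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new_rank = {p: jump_prob / n for p in pages}
        for page, out_links in links.items():
            follow_mass = (1.0 - jump_prob) * rank[page]
            if out_links:
                for target in out_links:
                    new_rank[target] += follow_mass / len(out_links)
            else:
                # Assumption: a page with no out-links spreads its rank uniformly.
                for target in pages:
                    new_rank[target] += follow_mass / n
        rank = new_rank
    return rank

# Hypothetical four-page graph: page -> pages it links to.
toy_links = {"A": ["B"], "B": ["C"], "C": ["A", "B"], "D": ["A"]}
pr = pagerank_power_iteration(toy_links)
for page in sorted(pr, key=pr.get, reverse=True):
    print(page, round(pr[page], 4))
```

Each iteration touches every link once, so the cost per pass grows with the number of hyperlinks rather than with the number of simulated surfer steps.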

18
What do you need to take this class?
  • EECS 311
  • basic understanding of algorithms and data
    structures
  • Basics of linear algebra and probability theory
  • Tolerance for non-linearity
  • Willingness to participate

19
Outline
  • Introductions
  • Course goals and logistics
  • Why take this class?
  • What is Web Search?
  • What will it be in five years? Ten years?

20
Web Search today
  • Performs an easy task
  • Extremely quickly
  • At massive scale
  • Relatively well

21
Easy?
  • Belief: Web search engines have to understand my
    query and find a needle in a haystack of 15
    billion documents!
  • Reality: most search queries are
  • short (avg. 2.5 words, 2005)
  • satisfied by pages from a small subset of the Web
  • Millions rather than billions (Mei et al., 2008)
  • => More like finding a pencil in a haystack!

22
Extremely Quickly
  • Results returned in < 1 sec
  • For almost any query
  • For any engine
  • It was not always thus!
  • 3-4 seconds in early days (Chu & Rosenthal,
    1996; Garratt et al., 2001)
  • How? Inverted indices, enormous data centers,
    clever algorithms
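
To make the first of these ingredients concrete, here is a minimal sketch of an inverted index: a map from each term to the set of documents containing it, so a query only touches the posting sets of its own terms instead of scanning every document. The three toy documents, the whitespace tokenizer, and the function names are assumptions for illustration. Production engines add compression, ranking, and sharding across machines.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_all_terms(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical toy collection.
toy_docs = {
    1: "florida gators football depth chart",
    2: "gators football media guide 2002",
    3: "florida weather chart",
}
toy_index = build_inverted_index(toy_docs)
print(search_all_terms(toy_index, "gators football"))
print(search_all_terms(toy_index, "florida chart"))
```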

23
Users are quick too
(Downey, Dumais & Horvitz, 2007)
24
At Massive Scale
  • (in millions)
  • How? Inverted indices, enormous data centers,
    clever algorithms

http://blog.searchenginewatch.com/comscoresearchsharefeb2009_0309.jpg
25
Relatively Well
26
  • Note, for images missing from this online
    version see links at bottom of page

http://www.seoresearcher.com/distribution-of-clicks-on-googles-serps-and-eye-tracking-analysis.htm
27
Web Search Engines today
  • For the most part
  • Perform an easy task
  • Extremely quickly
  • At massive scale
  • Relatively well
  • The future of Web search is in more difficult
    tasks

28
Trend toward more difficult queries
  • Search queries are getting longer.

http://www.readwriteweb.com/archives/hitwise_search_queries_are_getting_longer.php
29
Things you can't do with search today
  • Query by description rather than content
  • Humorous anecdotes about Perry Farrell
  • Extracting and synthesizing over multiple pages
  • Nanotechnology companies hiring on the West Coast
  • Substances the FDA has banned
  • Organizing bodies of documents
  • Show me the most compelling cases for and against
    the recent stimulus package
  • Who says drinking from aluminum soda cans
    increases my Alzheimer's risk? Should I believe
    them?

30
Image Search for famous pink building
31
Search Feature Pitch
  • Guidelines
  • Introduce yourself (name, major, degree, year)
  • What's the problem?
  • What's your solution?
  • What makes it feasible?
  • 2 1/2 minutes isn't a lot of time!!

32
Mechanism for over-specified queries
  • "Florida gators football depth chart 2002"
  • 0 results
  • Florida gators football depth chart 2002
  • Many results -- none correct in top 10
  • After protracted manual reformulation process
  • 2002 gator football media guide
  • 8 results, top several correct

33
Proposed Solution
  • User submits over-specified query
  • one with zero hits, e.g. "Florida gators football
    depth chart 2002"
  • Search engine
  • Tries moving quotes, substituting phrases, etc.
  • Florida gators football depth chart 2002
  • Florida gators depth chart 2002 pigskin
  • After a few minutes, returns summary of results
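
The reformulation step could look roughly like the sketch below. It takes the query already split into phrases and emits variants: the terms with no quotes, each multi-word phrase quoted separately, and one phrase substituted at a time. The phrase segmentation, the `substitutions` table, and both function names are hypothetical. The slide only says the engine tries "moving quotes, substituting phrases, etc.", so this is one possible reading, not the proposed system.

```python
def quote(phrase):
    """Wrap multi-word phrases in double quotes; leave single words bare."""
    return f'"{phrase}"' if " " in phrase else phrase

def candidate_queries(phrases, substitutions):
    """Generate relaxed variants of an over-specified query, without duplicates."""
    seen = set()
    candidates = []

    def add(query):
        if query not in seen:
            seen.add(query)
            candidates.append(query)

    add(" ".join(phrases))                       # no quotes at all
    add(" ".join(quote(p) for p in phrases))     # quote each phrase separately
    for i, phrase in enumerate(phrases):         # substitute one phrase at a time
        for alternative in substitutions.get(phrase, []):
            variant = phrases[:i] + [alternative] + phrases[i + 1:]
            add(" ".join(quote(p) for p in variant))
    return candidates

# Hypothetical segmentation and substitution table for the running example.
phrases = ["Florida gators", "football", "depth chart", "2002"]
substitutions = {
    "football": ["pigskin"],
    "depth chart": ["football media guide"],
}
candidates = candidate_queries(phrases, substitutions)
for query in candidates:
    print(query)
```

Each candidate would then be submitted to the engine, and the results summarized for the user.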

34
Why plausible?
  • Substitutions can be obtained from thesauri or
    statistical co-occurrence, e.g.
  • P(depth chart | football media guide) is
    large => try substituting football media guide
    for depth chart in query
  • Given a few minutes, we can scalably execute
    several hundred candidate queries
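
As a rough illustration of the statistical co-occurrence idea, the sketch below estimates the conditional probability that one phrase appears in a document given that another phrase does, using plain document counts. The four-document corpus, the substring matching, and the function name are assumptions made for the example. A real system would estimate this from a large crawl or query log.

```python
def conditional_cooccurrence(docs, phrase, given):
    """Estimate P(phrase appears in a doc | given appears in that doc)."""
    phrase, given = phrase.lower(), given.lower()
    docs_with_given = [d.lower() for d in docs if given in d.lower()]
    if not docs_with_given:
        return 0.0
    both = sum(1 for d in docs_with_given if phrase in d)
    return both / len(docs_with_given)

# Hypothetical toy corpus; the contents are invented for illustration.
toy_corpus = [
    "2002 gator football media guide with roster and depth chart",
    "football media guide: schedule, depth chart, coaching staff",
    "football media guide order form",
    "florida gators depth chart rumors",
]
p = conditional_cooccurrence(toy_corpus, "depth chart", "football media guide")
print(p)
```

In this toy corpus, two of the three documents mentioning a football media guide also mention a depth chart, so the estimate is high enough to make the substitution worth trying.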

35
Reminder
  • Your search engine feature pitch
  • Presented this Thursday in class
  • Send ppt or pdf to Junsong and me by 9AM Thurs
  • See course Web page
  • Also
  • Start forming project groups
  • Sign up for team mtg with me next Thurs/Fri
  • Look for discussion papers and dates later today
