Information Extraction Research @ Yahoo! Labs Bangalore

Slides: 31
Provided by: Chuck152
Learn more at: http://www.isical.ac.in
Transcript and Presenter's Notes


1
Information Extraction Research @ Yahoo! Labs Bangalore
Rajeev Rastogi, Yahoo! Labs Bangalore
2
The most visited site on the internet
  • 600 million users per month
  • Super popular properties
    • News, finance, sports
    • Answers, flickr, del.icio.us
    • Mail, messaging
    • Search

3
Unparalleled scale
  • 25 terabytes of data collected each day
  • Over 4 billion clicks every day
  • Over 4 billion emails per day
  • Over 6 billion instant messages per day
  • Over 20 billion web documents indexed
  • Over 4 billion images searchable

No other company on the planet processes as much
data as we do!
4
Yahoo! Labs Bangalore
  • Focus is on basic and applied research
    • Search
    • Advertising
    • Cloud computing
  • University relations
    • Faculty research grants
    • Summer internships
    • Sharing data/computing infrastructure
    • Conference sponsorships
    • PhD co-op program

5
What does search look like today?
6
Search results of the future: Structured abstracts
Example sites: yelp.com, Gawker, babycenter, New York Times, epicurious,
LinkedIn, answers.com, webmd
7
Search results of the future: Intelligent ranking
8
A key technology for enabling search
transformation
  • Information extraction (IE)

9
Information extraction (IE)
  • Goal: Extract structured records from Web pages

Example record attributes: Name, Category, Address, Map, Phone, Price, Reviews
10
Multiple verticals
  • Business, social networking, video, ...

11
One schema per vertical
12
IE on the Web is a hard problem
  • Web pages are noisy
  • Pages belonging to different Web sites have
    different layouts

Noise
13
Web page types
  • Template-based
  • Hand-crafted
14
Template-based pages
  • Pages within a Web site are generated using scripts and
    have very similar structure
    • Can be leveraged for extraction
  • 30% of crawled Web pages
  • Information rich, frequently appear in the top
    results of search queries
    • E.g. search query: Chinese Mirch New York
    • 9 template-based pages in the top 10 results

15
Wrapper Induction
  • Enables extraction from template-based pages

[Figure: wrapper induction workflow. Learn: sample Website pages, annotate
the sample pages, learn wrappers (XPath rules). Extract: apply wrappers to
Website pages to extract records.]
16
Example
XPath: /html/body/div/div/div/div/div/div/span
Generalized XPath: /html/body//div//span
17
Filters
  • Apply filters to prune from multiple candidates
    that match XPath expression

XPath: /html/body//div//span
Regex filter (Phone): \([0-9]{3}\) [0-9]{3}-[0-9]{4}
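
A minimal sketch of the two steps on slides 16 and 17: apply the generalized XPath to collect candidate nodes, then prune them with the phone regex. The sample page, the phone number in it, and the helper name `extract` are hypothetical; the standard-library `xml.etree.ElementTree` is used here only because it supports the small XPath subset needed, and it requires well-formed markup, which real Web pages often are not.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical template-based page (well-formed so ElementTree can parse it).
PAGE = """<html><body>
  <div><div><span>Chinese Mirch</span></div>
  <div><span>Chinese, Indian</span><span>(212) 555-0100</span></div></div>
</body></html>"""

# Phone filter from the slide: "(ddd) ddd-dddd".
PHONE = re.compile(r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}")

def extract(page, xpath, pattern):
    """Return the text of nodes matching `xpath` whose text fully matches `pattern`."""
    root = ET.fromstring(page)  # root is the <html> element
    # Nested <div>s can make ElementTree yield the same node more than once,
    # so de-duplicate while keeping document order.
    nodes = dict.fromkeys(root.findall(xpath))
    texts = [(node.text or "").strip() for node in nodes]
    return [t for t in texts if pattern.fullmatch(t)]

# The generalized XPath alone matches every <span>; the filter keeps one.
candidates = extract(PAGE, "./body//div//span", re.compile(".*"))
phones = extract(PAGE, "./body//div//span", PHONE)
print(candidates)
print(phones)
```

The XPath is written relative to the root element (`./body//div//span`) because ElementTree evaluates paths from the parsed `<html>` node rather than from the document.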
18
Limitations of wrappers
  • Won't work across Web sites due to different page
    layouts
  • Scaling to thousands of sites can be a challenge
    • Need to learn a separate wrapper for each site
    • Annotating example pages from thousands of sites
      can be time-consuming and expensive

19
Research challenge
  • Unsupervised IE: Extract attribute values from
    pages of a new Web site without annotating a
    single page from the site
  • Only annotate pages from a few sites initially as
    training data

20
Conditional Random Fields (CRFs)
  • Models conditional probability distribution of
    label sequence y = y1, ..., yn given input sequence
    x = x1, ..., xn
  • fk: features, λk: weights
  • Choose λk to maximize log-likelihood of training
    data
  • Use Viterbi algorithm to compute label sequence y
    with highest probability
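
The Viterbi step can be sketched as follows, assuming the weighted feature sums have already been collapsed into two hypothetical score tables: `obs_scores[i][y]` for label y at position i, and `trans_scores[(y_prev, y)]` for adjacent label pairs. The function name, the table layout, and the numbers are illustrative assumptions, not taken from the talk.

```python
def viterbi(obs_scores, trans_scores, labels):
    """Return the highest-scoring label sequence under additive scores."""
    # best[i][y]: score of the best labeling of positions 0..i ending in y.
    best = [{y: obs_scores[0][y] for y in labels}]
    back = []  # back[i-1][y]: best previous label when position i is y
    for i in range(1, len(obs_scores)):
        scores, ptrs = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + trans_scores[(p, y)])
            scores[y] = best[-1][prev] + trans_scores[(prev, y)] + obs_scores[i][y]
            ptrs[y] = prev
        best.append(scores)
        back.append(ptrs)
    # Follow back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return path[::-1]

# Hypothetical scores: two consecutive "Name" labels are heavily penalized.
labels = ["Name", "Other"]
obs = [{"Name": 2.0, "Other": 0.0},
       {"Name": 1.0, "Other": 0.5},
       {"Name": 0.0, "Other": 1.0}]
trans = {("Name", "Name"): -5.0, ("Name", "Other"): 0.0,
         ("Other", "Name"): 0.0, ("Other", "Other"): 0.0}
print(viterbi(obs, trans, labels))
```

Dynamic programming keeps the cost linear in sequence length and quadratic in the number of labels, instead of enumerating every possible label sequence.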

21
CRFs-based IE
  • Web pages can be viewed as labeled sequences
  • Train CRF using pages from few Web sites
  • Then use trained CRF to extract from remaining
    sites

22
Drawbacks of CRFs
  • Require too many training examples
  • Have been used previously to segment short
    strings with similar structure
  • However, may not work too well across Web sites
    that
    • contain long pages with lots of noise
    • have very different structure

23
An alternate approach that exploits site knowledge
  • Build attribute classifiers for each attribute
    • Use pages from a few initial Web sites
  • For each page from a new Web site
    • Segment page into sequence of fields (using
      static repeating text)
    • Use attribute classifiers to assign attribute
      labels to fields
  • Use constraints to disambiguate labels
    • Uniqueness: an attribute occurs at most once in a
      page
    • Proximity: attribute values appear close together
      in a page
    • Structural: relative positions of attributes are
      identical across pages of a Web site
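
One way to read the segmentation step is sketched below, under the assumption that "static repeating text" means text nodes that occur on every page of the site. Each page is represented as a hypothetical list of text nodes; the function name `segment_fields` and the sample data are illustrative, and the talk does not specify the actual procedure.

```python
def segment_fields(pages):
    """Split each page into fields, using text shared by all pages as separators."""
    # Text that appears on every page is treated as static template text.
    static = set(pages[0]).intersection(*pages[1:])
    segmented = []
    for page in pages:
        fields, current = [], []
        for node in page:
            if node in static:
                # A static node closes the field collected so far.
                if current:
                    fields.append(" ".join(current))
                    current = []
            else:
                current.append(node)
        if current:
            fields.append(" ".join(current))
        segmented.append(fields)
    return segmented

# Hypothetical text nodes from two pages of the same site.
pages = [
    ["Home", "Chinese Mirch", "Cuisine:", "Chinese, Indian", "Phone:", "(212) 532 3663"],
    ["Home", "Jewel of India", "Cuisine:", "Indian", "Phone:", "(212) 869 5544"],
]
for fields in segment_fields(pages):
    print(fields)
```

With at least two pages from the site, the shared labels ("Home", "Cuisine:", "Phone:") drop out and only the page-specific values remain as candidate fields for the attribute classifiers.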

24
Attribute classifiers + constraints: example
Page1
  Name: Chinese Mirch
  Category: Chinese, Indian
  Address: 120 Lexington Avenue, New York, NY 10016
  Phone: (212) 532 3663
Page2
  Name: Jewel of India
  Category: Indian
  Address: 15 W 44th St, New York, NY 10016
  Phone: (212) 869 5544
Page3 (classifier output, before constraints)
  Name, Noise: 21 Club
  Category, Name: American
  Address: 21 W 52nd St, New York, NY 10019
  Phone: (212) 582 7200
Constraints
  Uniqueness constraint: Name
  Precedence constraint: Name < Category
Page3 (after constraints)
  Name: 21 Club
  Category: American
  Address: 21 W 52nd St, New York, NY 10019
  Phone: (212) 582 7200
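
The Page3 disambiguation can be sketched as a small search: enumerate every combination of candidate labels, discard those that violate the uniqueness or precedence constraint, and keep the highest-scoring survivor. The classifier scores below are invented for illustration, and brute-force enumeration is an assumption made for brevity, not the method used in the talk.

```python
from itertools import product

def satisfies(labels):
    """Check uniqueness (non-Noise labels occur once) and Name < Category."""
    real = [label for label in labels if label != "Noise"]
    if len(real) != len(set(real)):
        return False
    if "Name" in labels and "Category" in labels:
        return labels.index("Name") < labels.index("Category")
    return True

def best_labeling(candidates):
    """Pick the highest-scoring label assignment that satisfies the constraints."""
    best, best_score = None, float("-inf")
    for combo in product(*(c.items() for c in candidates)):
        labels = [label for label, _ in combo]
        score = sum(s for _, s in combo)
        if satisfies(labels) and score > best_score:
            best, best_score = labels, score
    return best

# Hypothetical classifier scores for the four fields of Page3, in page order.
page3 = [
    {"Name": 0.7, "Noise": 0.3},     # "21 Club"
    {"Name": 0.6, "Category": 0.4},  # "American"
    {"Address": 1.0},                # "21 W 52nd St, New York, NY 10019"
    {"Phone": 1.0},                  # "(212) 582 7200"
]
print(best_labeling(page3))
```

Without the constraints, taking the top label per field would assign Name to both "21 Club" and "American"; the uniqueness check rules that out.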
25
Performance evaluation Datasets
  • 100 pages from 5 restaurant Web sites with very
    different structure
    • www.citysearch.com
    • www.fromers.com
    • www.nymag.com
    • www.superpages.com
    • www.yelp.com
  • Extract attributes: Name, Address, Phone num,
    Hours of operation, Description

26
Methods considered
  • CRFs, attribute classifiers + constraints
  • Features
    • Lexicon: Words in the training Web pages
    • Regex: isAlpha, isAllCaps, isNum, is5DigitNum,
      isDay, ...
    • Attribute-level: Num of words, Overlap with
      title, ...
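
The regex features named above could look roughly like this for a single token. Only the feature names come from the slide; their exact definitions are not given, so the ones below (for example, treating isDay as a match against English day names and abbreviations) are assumptions.

```python
import re

# Assumed vocabulary for isDay: full English day names and 3-letter abbreviations.
DAYS = {"monday", "tuesday", "wednesday", "thursday", "friday", "saturday",
        "sunday", "mon", "tue", "wed", "thu", "fri", "sat", "sun"}

def regex_features(token):
    """Return boolean surface features for one token."""
    return {
        "isAlpha": token.isalpha(),
        "isAllCaps": token.isalpha() and token.isupper(),
        "isNum": token.isdigit(),
        "is5DigitNum": re.fullmatch(r"[0-9]{5}", token) is not None,
        "isDay": token.lower().rstrip(".") in DAYS,
    }

for token in ["NY", "10016", "Mon"]:
    print(token, regex_features(token))
```

A feature such as is5DigitNum hints at a US ZIP code and therefore at the Address attribute, while isDay hints at Hours of operation.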

27
Evaluation methodology
  • Metrics
    • Precision, recall, F1 for attributes
  • Test on one site, use pages from remaining 4
    sites as training data
  • Average measures over all 5 sites
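
The methodology above can be sketched with two small helpers: one that computes precision, recall, and F1 from sets of extracted and gold attribute values, and one that generates the leave-one-site-out splits. The tuple layout (page, attribute, value), the function names, and the sample values are assumptions for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 from two sets of extractions."""
    tp = len(predicted & gold)  # extractions that exactly match the gold set
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def leave_one_site_out(sites):
    """Yield (test_site, training_sites) pairs, holding out each site once."""
    for held_out in sites:
        yield held_out, [s for s in sites if s != held_out]

# Hypothetical extractions for one page: one correct value, one wrong value.
gold = {("p1", "Name", "21 Club"),
        ("p1", "Phone", "(212) 582 7200"),
        ("p1", "Address", "21 W 52nd St")}
predicted = {("p1", "Name", "21 Club"),
             ("p1", "Phone", "582 7200")}
p, r, f1 = precision_recall_f1(predicted, gold)
print(p, r, f1)

sites = ["citysearch", "fromers", "nymag", "superpages", "yelp"]
splits = list(leave_one_site_out(sites))
print(splits[0])
```

Averaging the per-site measures over the five splits gives the overall numbers; each split trains on four sites and tests on the fifth, so the test site's layout is never seen during training.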

28
Experimental results
            Precision            Recall
Attribute   CRF    Constraint    CRF    Constraint
Name        .39    1             .34    1
Phone       .02    1             .2     .99
Address     .01    .81           .16    .83
Hours       .22    1             .36    1
Desc        .13    .25           0      .15
Overall     .15    .81           .21    .76
29
Other IE scenarios: Browse page extraction
Similar-structured records
30
IE big picture/taxonomy
  • Things to extract from
    • Template-based, browse, hand-crafted pages, text
  • Things to extract
    • Records, tables, lists, named entities
  • Techniques used
    • Structure-based (HTML tags, DOM tree paths)
      e.g. Wrappers
    • Content-based (attribute values/models) e.g.
      dictionaries
    • Structure + Content (sequential/hierarchical
      relationships among attribute values) e.g.
      hierarchical CRFs
  • Level of automation
    • Manual, supervised, unsupervised