Text Analysis VI - PowerPoint PPT Presentation

Title:

Text Analysis VI

Description:

Automatically extract unstructured text data from Web pages ... Michael Jordan will start Monday's tutorial, followed by Raymond J. Mooney and Tom Mitchell ... – PowerPoint PPT presentation

Slides: 15
Provided by: anna207
1
Text Analysis (VI)
2
Outline
  • Indexing
  • Lexical processing
  • Content-based ranking
  • Probabilistic retrieval
  • Latent semantic analysis
  • Document clustering
  • Text categorization
  • Information extraction

3
Information Extraction
  • Automatically extract unstructured text data from
    Web pages
  • Represent extracted information in some
    well-defined schema
  • Examples

4
Information Extraction
  • Crawl the Web searching for information about
    certain technologies or products of interest
  • Extract information on authors and books from
    various online bookstore and publisher pages
  • Extract job information: title, duties,
    requirements, location, ...

5
Info Extraction
  • As a task: filling slots in a database from
    sub-segments of unstructured text/web documents
  • As a family of techniques: IE = segmentation +
    classification + association + clustering

6
Single Entity Extraction
  • Michael Jordan will start Monday's tutorial,
    followed by Raymond J. Mooney and Tom Mitchell
  • Precision = 2/6, recall = 2/3, F1 = 4/9
    (i.e. 6 entities extracted, 2 of them correct,
    out of 3 true entities)
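
The figures on this slide can be checked with a few lines of Python. This is a minimal sketch: the function name and its arguments are illustrative, and the counts (6 extracted, 2 correct, 3 true entities) are read off the fractions on the slide rather than stated in it.

```python
from fractions import Fraction

def precision_recall_f1(num_extracted, num_correct, num_true):
    """Return (precision, recall, F1) as exact fractions.

    Assumes num_extracted > 0 and num_true > 0.
    """
    precision = Fraction(num_correct, num_extracted)
    recall = Fraction(num_correct, num_true)
    if precision + recall == 0:
        return precision, recall, Fraction(0)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts inferred from the slide: 2/6 precision, 2/3 recall.
p, r, f1 = precision_recall_f1(num_extracted=6, num_correct=2, num_true=3)
print(p, r, f1)  # 1/3 2/3 4/9
```

Note that Fraction reduces 2/6 to 1/3; the F1 value of 4/9 matches the slide.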

7
Info Extraction as Classification
  • Represent each document as a sequence of words
  • Use a sliding window of width k as input to a
    classifier
  • Each of the k inputs is a word in a specific
    position within the window
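
The windowing step might look like the sketch below. The slide does not say how document boundaries are handled or which word in the window gets the label; padding with a hypothetical "<PAD>" token and labeling the center word are assumptions made here for illustration.

```python
def sliding_windows(tokens, k, pad="<PAD>"):
    """Yield (center_index, window) pairs; every window has exactly k tokens.

    Assumption: k is odd and the classifier labels the center word.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so the window has a center word")
    half = k // 2
    # Pad both ends so that the first and last words also get full windows.
    padded = [pad] * half + list(tokens) + [pad] * half
    for i in range(len(tokens)):
        yield i, padded[i:i + k]

tokens = "Michael Jordan will start Monday's tutorial".split()
windows = list(sliding_windows(tokens, k=3))
for i, window in windows:
    # Each window would be turned into k positional features for a classifier.
    print(tokens[i], window)
```

Each window is what the classifier sees when deciding, for example, whether the center word is part of a person name.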

8
Info Extraction as Classification
  • The system is trained on positive and negative
    examples (typically manually labeled)
  • Limitation: no account is taken of sequential
    constraints
  • e.g. the author field usually precedes the
    address field in the header of a research paper
  • This can be fixed by using stochastic
    finite-state models

9
Hidden Markov Models
Example: classify short segments of text according
to whether they correspond to the title, author
names, addresses, affiliations, etc.
10
Hidden Markov Model
  • Each state corresponds to one of the fields that
    we wish to extract
  • e.g. paper title, author name, etc.
  • Each state has a characteristic probability
    distribution over the set of all possible words
  • e.g. a specific distribution over words for the
    state "title"

11
Hidden Markov Model
  • The true sequence of Markov states is unknown at
    parse time
  • We can only see noisy observations from each
    state: the sequence of words from the document

12
Parsing with HMM
  • Given a sequence of words and an HMM
  • Parse the observed sequence into a corresponding
    sequence of inferred states
  • Done with the Viterbi algorithm
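
A minimal sketch of Viterbi decoding for this setting follows. The two states, all probabilities, and the `unk` fallback for unseen words are toy assumptions chosen for illustration; none of them come from the slides.

```python
import math

def viterbi(words, states, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable state sequence for `words` (log-space Viterbi)."""
    if not words:
        return []

    def log_emit(state, word):
        # Unseen words get a small fallback probability (an assumption).
        return math.log(emit_p[state].get(word, unk))

    # best[t][s]: log-probability of the best path that ends in state s at t.
    best = [{s: math.log(start_p[s]) + log_emit(s, words[0]) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log_emit(s, words[t])
            back[t][s] = prev

    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy model with two fields of a paper header (all numbers are made up).
states = ["title", "author"]
start_p = {"title": 0.9, "author": 0.1}
trans_p = {"title": {"title": 0.7, "author": 0.3},
           "author": {"title": 0.1, "author": 0.9}}
emit_p = {"title": {"hidden": 0.3, "markov": 0.3, "models": 0.3,
                    "john": 0.05, "smith": 0.05},
          "author": {"john": 0.45, "smith": 0.45, "models": 0.1}}

words = ["hidden", "markov", "models", "john", "smith"]
path = viterbi(words, states, start_p, trans_p, emit_p)
print(list(zip(words, path)))
# [('hidden', 'title'), ('markov', 'title'), ('models', 'title'),
#  ('john', 'author'), ('smith', 'author')]
```

Working in log space avoids numerical underflow on long documents; the sketch assumes all start and transition probabilities are non-zero.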

13
Training HMM
  • An HMM can be trained
  • in a supervised manner, with manually labeled data
  • or bootstrapped, using a combination of labeled
    and unlabeled data
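
For the supervised case, training can be as simple as counting. The sketch below estimates the parameters from (word, state) pairs with add-alpha smoothing; the smoothing, the function name, and the toy data are assumptions for illustration, and the bootstrapped case with unlabeled data is not covered.

```python
from collections import Counter, defaultdict

def train_hmm_supervised(labeled_sequences, alpha=1.0):
    """Estimate HMM parameters from labeled data by smoothed counting.

    labeled_sequences: list of sequences of (word, state) pairs.
    alpha: add-alpha smoothing constant, assumed > 0.
    """
    start_c = Counter()
    trans_c = defaultdict(Counter)
    emit_c = defaultdict(Counter)
    states, vocab = set(), set()
    for seq in labeled_sequences:
        prev = None
        for word, state in seq:
            states.add(state)
            vocab.add(word)
            emit_c[state][word] += 1
            if prev is None:
                start_c[state] += 1
            else:
                trans_c[prev][state] += 1
            prev = state
    states, vocab = sorted(states), sorted(vocab)

    def normalize(counts, keys):
        # Smoothed relative frequencies over the given outcomes.
        total = sum(counts[k] for k in keys) + alpha * len(keys)
        return {k: (counts[k] + alpha) / total for k in keys}

    start_p = normalize(start_c, states)
    trans_p = {s: normalize(trans_c[s], states) for s in states}
    emit_p = {s: normalize(emit_c[s], vocab) for s in states}
    return states, start_p, trans_p, emit_p

# Toy labeled paper headers (made up for illustration).
data = [
    [("hidden", "title"), ("markov", "title"), ("models", "title"),
     ("john", "author"), ("smith", "author")],
    [("text", "title"), ("analysis", "title"),
     ("jane", "author"), ("doe", "author")],
]
states, start_p, trans_p, emit_p = train_hmm_supervised(data)
print(start_p)
print(trans_p["title"])
```

Smoothing keeps every probability non-zero, so unseen transitions or words in the training vocabulary do not rule out a parse outright.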

14
More advanced methods
  • MEMM: Maximum Entropy Markov Models
  • CRF: Conditional Random Fields
  • Cohen, W. W. and McCallum, A. (2002). Information
    Extraction from the World Wide Web. Tutorial
    presented at NIPS-15.
    http://www.cs.umass.edu/~mccallum/papers/nips-ie-tutorial.ps