Text Analysis VI

Slides: 15

Provided by: anna207

Transcript and Presenter's Notes

1
Text Analysis (VI)
2
Outline

Indexing
Lexical processing
Content-based ranking
Probabilistic retrieval

Latent semantic analysis
Document clustering
Text categorization
Information extraction

3
Information Extraction

Automatically extract unstructured text data from
Web pages
Represent extracted information in some
well-defined schema
Examples

4
Information Extraction

Crawl the Web searching for information about
certain technologies or products of interest
Extract information on authors and books from
various online bookstore and publisher pages
Extract job information: title, duties,
requirements, location, ...
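
To make the idea of a well-defined schema concrete, the job example could be represented roughly as below. This is only a sketch: the class name `JobPosting`, the field types, and the sample values are assumptions made for illustration, and only the slot names (title, duties, requirements, location) come from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JobPosting:
    """Hypothetical target schema; slot names follow the slide."""
    title: Optional[str] = None
    duties: List[str] = field(default_factory=list)
    requirements: List[str] = field(default_factory=list)
    location: Optional[str] = None

    def filled_slots(self) -> List[str]:
        """Names of the slots the extractor managed to fill."""
        return [name for name, value in vars(self).items() if value]


# Made-up values standing in for the output of an extractor.
posting = JobPosting(title="Data Engineer", location="Irvine, CA")
posting.requirements.append("3+ years of SQL")
print(posting.filled_slots())
```

Slots that the extractor could not fill simply stay empty, which keeps partially extracted records usable.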

5
Info Extraction

As a task: filling slots in a database from
sub-segments of unstructured text/web documents
As a family of techniques: IE = segmentation +
classification + association + clustering

6
Single Entity Extraction

Michael Jordan will start Monday's tutorial,
followed by Raymond J. Mooney and Tom Mitchell
Precision = 2/6, recall = 2/3, F1 = 4/9
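
The figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes the usual reading of the fractions: 6 entities extracted, 2 of them correct, and 3 true entities in the sentence. The function name `precision_recall_f1` is illustrative.

```python
from fractions import Fraction


def precision_recall_f1(n_correct, n_extracted, n_true):
    """Exact precision, recall and F1 from raw counts."""
    precision = Fraction(n_correct, n_extracted) if n_extracted else Fraction(0)
    recall = Fraction(n_correct, n_true) if n_true else Fraction(0)
    if precision + recall == 0:
        return precision, recall, Fraction(0)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = precision_recall_f1(n_correct=2, n_extracted=6, n_true=3)
print(p, r, f1)  # 1/3 2/3 4/9
```

Using `Fraction` keeps the result exact, so 2/6 prints in its reduced form 1/3 and F1 comes out as 4/9, matching the slide.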

7
Info Extraction as Classification

Represent each document as a sequence of words
Use a sliding window of width k as input to a
classifier
each of the k inputs is a word in a specific
position
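
A minimal sketch of how the width-k windows might be produced before they are handed to a classifier. The slide does not say how the window is aligned, so centring it on the word being classified, the padding token `PAD`, and the function name `sliding_windows` are all assumptions.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical filler for positions beyond the document edges


def sliding_windows(words: List[str], k: int) -> List[Tuple[str, ...]]:
    """Return one window of width k centred on each word (k must be odd)."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd number")
    half = k // 2
    padded = [PAD] * half + list(words) + [PAD] * half
    # Window i covers padded[i : i + k], whose middle element is words[i].
    return [tuple(padded[i:i + k]) for i in range(len(words))]


words = "Michael Jordan will start".split()
for window in sliding_windows(words, 3):
    print(window)
```

Each tuple position would become one input to the classifier, so the same word contributes a different feature depending on where it sits in the window.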

8
Info Extraction as Classification

The system is trained on positive and negative
examples (typically manually labeled)
Limitation: no account of sequential constraints
e.g. the author field usually precedes the
address field in the header of a research paper
This can be fixed by using stochastic finite-state
models

9
Hidden Markov Models
Example: classify short segments of text in terms
of whether they correspond to the title, author
names, addresses, affiliations, etc.
10
Hidden Markov Model

Each state corresponds to one of the fields that
we wish to extract
e.g. paper title, author name, etc.
Each state has a characteristic probability
distribution over the set of all possible words
e.g. a specific distribution over words for the
state "title"
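
As a rough illustration of states with their own word distributions, the sketch below defines a two-state model and scores one labeling of a word sequence. All numbers, the tiny vocabulary, and the names `start`, `trans`, `emit` and `joint_log_prob` are made up for the example; they are not taken from the presentation.

```python
import math

# Toy parameters (assumed). Each state has its own distribution over words.
start = {"title": 0.8, "author": 0.2}
trans = {
    "title": {"title": 0.7, "author": 0.3},
    "author": {"title": 0.1, "author": 0.9},
}
emit = {
    "title": {"learning": 0.5, "models": 0.4, "smith": 0.1},
    "author": {"learning": 0.1, "models": 0.1, "smith": 0.8},
}


def joint_log_prob(words, path):
    """Log-probability of emitting `words` while following state `path`."""
    logp = math.log(start[path[0]]) + math.log(emit[path[0]][words[0]])
    for prev, cur, word in zip(path, path[1:], words[1:]):
        logp += math.log(trans[prev][cur]) + math.log(emit[cur][word])
    return logp


words = ["learning", "models", "smith"]
good = joint_log_prob(words, ["title", "title", "author"])
bad = joint_log_prob(words, ["author", "author", "author"])
print(math.exp(good), math.exp(bad))
```

The labeling that assigns "smith" to the author state scores far higher than the one that labels every word as author, because each state prefers different words.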

11
Hidden Markov Model

The true Markov state sequence is unknown at
parse time
We can see only noisy observations from each
state: the sequence of words from the document

12
Training HMM

Given a sequence of words and an HMM,
parse the observed sequence into a corresponding
sequence of inferred states
using the Viterbi algorithm

13
Training HMM

An HMM can be trained
in a supervised manner with manually labeled data
or bootstrapped using a combination of labeled and
unlabeled data
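
For the supervised case, one common approach is to estimate the parameters by counting over the labeled sequences. The sketch below does this with additive smoothing. The function name `train_supervised`, the smoothing constant, and the one-sentence training set are assumptions, and the bootstrapped case with unlabeled data is not covered here.

```python
from collections import Counter, defaultdict


def train_supervised(labeled, smoothing=1.0):
    """Estimate HMM parameters from sequences of (word, state) pairs."""
    start_counts = Counter()
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    vocab, states = set(), set()

    for seq in labeled:
        if not seq:
            continue
        start_counts[seq[0][1]] += 1
        for word, state in seq:
            emit_counts[state][word] += 1
            vocab.add(word)
            states.add(state)
        for (_, prev), (_, cur) in zip(seq, seq[1:]):
            trans_counts[prev][cur] += 1

    def normalise(counts, support):
        # Additive smoothing so unseen events keep non-zero probability.
        total = sum(counts[x] for x in support) + smoothing * len(support)
        return {x: (counts[x] + smoothing) / total for x in sorted(support)}

    start = normalise(start_counts, states)
    trans = {s: normalise(trans_counts[s], states) for s in sorted(states)}
    emit = {s: normalise(emit_counts[s], vocab) for s in sorted(states)}
    return start, trans, emit


# Made-up labeled example: one header with two title words and one author word.
labeled = [[("learning", "title"), ("models", "title"), ("smith", "author")]]
start, trans, emit = train_supervised(labeled)
print(start)
```

The resulting dictionaries have the same shape a decoder would need: a start distribution, one transition row per state, and one word distribution per state.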

14
More advanced methods

MEMM: Maximum Entropy Markov Models
CRF: Conditional Random Fields
Cohen, W. W. and McCallum, A. (2002). Information
Extraction from the World Wide Web. Tutorial
presented at NIPS-15.
http://www.cs.umass.edu/~mccallum/papers/nips-ie-tutorial.ps
