Extracting Personal Names from Email: Applying Named Entity Recognition to Informal Text

1
Extracting Personal Names from Email: Applying
Named Entity Recognition to Informal Text
  • Einat Minkov, Richard C. Wang
    Language Technologies Institute
  • William W. Cohen
    Center for Automated Learning and Discovery

School of Computer Science, Carnegie Mellon
University
2
What is an informal text?
  • A text that is
      • Written for a narrow audience
          • Group/task-specific abbreviations often used
          • Not self-contained (context shared by related
            people)
      • Not carefully prepared
          • Contains grammatical and spelling errors
          • Does not follow capitalization conventions
  • Some examples are
      • Instant messages
      • Newsgroup postings
      • Email messages

3
Objective / Outline
  • Investigate named entity recognition (NER) for
    informal text
  • Conduct experiments on recognizing personal names
    in email
  • Examine indicative features in email and newswire
  • Suggest specialized features for email
  • Evaluate performance of a state-of-the-art
    extractor (CRF)
  • Analyze repetition of names in email and newswire
  • Suggest and evaluate a recall-enhancing method
    that is effective for email

4
Corpora
  • Mgmt corpora: Emails from a management course at
    CMU in which students form teams to run simulated
    companies
      • Teams: Each set (train/tune/test) formed by
        different simulation teams
      • Game: Each set formed by different days during
        the simulation period
  • Enron corpora: Emails from Enron Corporation
      • Meetings: Each set formed by randomly selected
        meeting-related emails
      • Random: Each set formed by repeatedly sampling a
        user, then sampling an email from that user, both
        at random

5
Extraction Method
  • Train Conditional Random Fields (CRF) to label
    and extract personal names
      • A machine-learning-based probabilistic approach
        to labeling sequences of examples
  • Learning reduces NER to the task of tagging, or
    classifying, each word using a set of five tags
      • Unique: A one-token entity
      • Begin: The first token of a multi-token entity
      • End: The last token of a multi-token entity
      • Inside: Any other token of a multi-token entity
      • Outside: A token that is not part of an entity
  • Example

    Token      Tag
    Einat      Unique
    and        Outside
    Richard    Begin
    Wang       End
    met        Outside
    William    Begin
    W.         Inside
    Cohen      End
    today      Outside
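
The five-tag scheme above can be illustrated with a minimal sketch. The function name `tag_tokens` and the representation of entities as (start, end) index pairs are assumptions made for this illustration; the slides do not specify how the training labels were produced.

```python
def tag_tokens(tokens, entity_spans):
    """Assign one of the five tags to every token.

    entity_spans is a list of (start, end) index pairs, end exclusive.
    This span representation is an assumption of the sketch.
    """
    tags = ["Outside"] * len(tokens)
    for start, end in entity_spans:
        if end - start == 1:
            tags[start] = "Unique"          # one-token entity
        else:
            tags[start] = "Begin"           # first token of the entity
            tags[end - 1] = "End"           # last token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = "Inside"          # any token in between
    return tags


tokens = "Einat and Richard Wang met William W. Cohen today".split()
spans = [(0, 1), (2, 4), (5, 8)]
for token, tag in zip(tokens, tag_tokens(tokens, spans)):
    print(f"{token:10s}{tag}")
```

Running it on the example sentence reproduces the token/tag pairing shown on the slide.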
6
Top Learned Features
  • Features most indicative of a token being part
    of a name in a Conditional Random Fields (CRF)
    extractor

[Table: top learned features in two columns, Newswire
(MUC-6) and Email (Mgmt-Game); only the fragment
"2 reporter" survives]
Results show that email and newswire text have
very different characteristics
Note: A feature is denoted by its direction
(left/right) relative to the focus word, its offset,
and its lexical value
7
Our Proposed Features
Note: All features are instantiated for the focus
word t and for the 3 tokens to the left and right of t
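
A minimal sketch of instantiating features over a window of 3 tokens on each side of the focus word follows. The two feature functions (`lower`, `cap`) and the `direction.offset.name=value` string format are hypothetical stand-ins chosen for illustration; the actual feature set proposed on this slide is not reproduced here.

```python
def window_features(tokens, t, feature_fns, width=3):
    """Instantiate every feature function for the focus word at index t
    and for up to `width` tokens on each side of it."""
    feats = []
    for offset in range(-width, width + 1):
        i = t + offset
        if not 0 <= i < len(tokens):
            continue  # the window runs past the start or end of the message
        if offset == 0:
            side = "focus"
        else:
            side = "left" if offset < 0 else "right"
        for name, fn in feature_fns.items():
            feats.append(f"{side}.{abs(offset)}.{name}={fn(tokens[i])}")
    return feats


# Hypothetical feature functions, not the ones from the slides.
feature_fns = {
    "lower": str.lower,
    "cap": lambda tok: tok[:1].isupper(),
}

tokens = ["I", "met", "Richard", "Wang", "today"]
for feat in window_features(tokens, 2, feature_fns):
    print(feat)
```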
8
Feature Evaluation
  • Entity-level F1 of learned extractor (CRF) using
      • Basic features (B)
      • Basic and Email features (BE)
      • Basic and Dictionary features (BD)
      • All features (BDE)

Results show that
  1) Dictionary and Email features are useful (best
     when combined)
  2) Precision is generally high but recall is low
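
For reference, entity-level precision, recall, and F1 can be computed as in the sketch below. It assumes that an extracted entity counts as correct only when its document and exact boundaries match a gold entity, and that entities are given as (doc_id, start, end) triples; both are assumptions of this illustration rather than details stated on the slide.

```python
def entity_prf(gold, predicted):
    """Entity-level precision, recall and F1 under exact-match scoring."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: every prediction is correct, but half the gold names are missed.
gold = [(0, 0, 1), (0, 2, 4), (0, 5, 8), (1, 3, 5)]
predicted = [(0, 2, 4), (0, 5, 8)]
p, r, f1 = entity_prf(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

The toy input shows the high-precision, low-recall pattern: precision is 1.0 while recall is 0.5.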
9
What's Next?
  • Previous experiments show high precision but low
    recall
  • Next goal: Improve recall
  • One recall-enhancing method
      • Look for multiple occurrences of names in a
        corpus
  • We conduct experimental studies
      • Examine repetition patterns of names in email and
        newswire text
      • Examine occurrences of names within a single
        document and across multiple documents

10
Doc. Frequency of Names
  • Percentage of person-name tokens that appear in
    at most K distinct documents as a function of K

[Figure: cumulative percentage of person-name tokens
(y-axis: Percentage) against Document Frequency (x-axis)]
Results show that repetition of names across
multiple documents is more common in email corpora
  • Nearly 80% of names in MUC-6 appear in only one
    document
  • 30% of names in Mgmt-Game appear in only one
    document
  • Only 1.3% of names in MUC-6 appear in 10 or more
    documents
  • About 20% of names in Mgmt-Game appear in 10 or
    more documents
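
The cumulative statistic plotted on this slide can be sketched as follows. The sketch assumes the percentage is taken over distinct (lowercased) name tokens and that each document is given as a list of its person-name tokens; the slides do not say whether types or occurrences were counted, so this is only one reading.

```python
from collections import defaultdict


def cumulative_doc_frequency(docs, max_k):
    """Percentage of distinct name tokens that appear in at most K
    documents, for K = 1..max_k.

    docs is a list of documents, each a list of person-name tokens.
    """
    doc_ids = defaultdict(set)
    for doc_id, names in enumerate(docs):
        for w in names:
            doc_ids[w.lower()].add(doc_id)
    total = len(doc_ids)
    if total == 0:
        return {k: 0.0 for k in range(1, max_k + 1)}
    return {
        k: 100.0 * sum(1 for ids in doc_ids.values() if len(ids) <= k) / total
        for k in range(1, max_k + 1)
    }


# Toy corpus of four "documents".
docs = [["Einat", "Richard"], ["Richard", "William"], ["Richard"], ["Cohen"]]
print(cumulative_doc_frequency(docs, 3))
```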
11
Single vs. Multiple Documents
  • We define the following extractors
      • CRF: baseline trained with all features
      • SDR (Single Document Repetition)
          • Rules that extract person-name tokens that
            appear more than once within a single
            document; hence an upper bound on recall using
            only name repetition within a single document
      • MDR (Multiple Document Repetition)
          • Rules that extract person-name tokens that
            appear in more than one document; hence an
            upper bound on recall using only name
            repetition across multiple documents
      • SDR+CRF
          • Union of extractions by SDR and CRF; hence an
            upper bound on recall using CRF and name
            repetition within a single document
      • MDR+CRF
          • Union of extractions by MDR and CRF; hence an
            upper bound on recall using CRF and name
            repetition across multiple documents
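
The extractors defined above can be sketched as token-level recall computations. The sketch assumes gold annotations are available as (token, is_name) pairs, that repetition is counted over gold name tokens only, and that CRF output is a set of (doc_id, position) pairs; the function name and data layout are hypothetical.

```python
from collections import Counter


def recall_upper_bounds(docs, crf_hits):
    """Token-level recall of CRF, SDR, MDR and their unions with CRF.

    docs: list of documents, each a list of (token, is_name) pairs.
    crf_hits: set of (doc_id, position) pairs predicted as name tokens.
    """
    gold = [(d, i, tok.lower())
            for d, doc in enumerate(docs)
            for i, (tok, is_name) in enumerate(doc) if is_name]
    gold_positions = {(d, i) for d, i, _ in gold}
    per_doc = Counter((d, w) for d, _, w in gold)   # occurrences within a doc
    doc_freq = Counter(w for _, w in per_doc)       # number of distinct docs
    sdr = {(d, i) for d, i, w in gold if per_doc[(d, w)] > 1}
    mdr = {(d, i) for d, i, w in gold if doc_freq[w] > 1}
    crf = set(crf_hits) & gold_positions            # correct CRF tokens only
    total = len(gold_positions) or 1                # avoid division by zero
    return {
        "CRF": len(crf) / total,
        "SDR": len(sdr) / total,
        "MDR": len(mdr) / total,
        "SDR+CRF": len(sdr | crf) / total,
        "MDR+CRF": len(mdr | crf) / total,
    }


docs = [
    [("Einat", True), ("met", False), ("Einat", True), ("Wang", True)],
    [("Wang", True), ("Cohen", True)],
]
print(recall_upper_bounds(docs, {(1, 1)}))
```

Because SDR and MDR consult the gold labels, the numbers are upper bounds on what repetition-based rules could recover, not the recall of a deployable extractor.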

12
Single vs. Multiple Documents
  • Token-level upper bounds on recall and potential
    recall-gains associated with methods that look
    for name tokens that re-occur within a single
    document or across multiple documents

Results show that higher recall and potential
recall gains can be obtained for email corpora
by exploiting MDR
13
What's Next?
  • Our studies show the potential of exploiting
    repetition of names over multiple documents for
    improving recall in email corpora
  • We suggest a recall-enhancing method
      • Auto-construct a dictionary of predicted names
        and their variants from the test set
      • Statistically filter out noisy names from the
        dictionary
      • Match names globally from the inferred dictionary
        onto the test set, exploiting repetition of names

Note: A dictionary is simply a list of one or
more tokens
14
Name Dictionary Construction
  • Every name in the test set predicted by the
    learned extractor (CRF), trained with all
    features, is transformed into a set of name
    variants and inserted into a dictionary

Transformation example: name variants of "Benjamin
Brown Smith"
Note: The original name is included by default
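
A rough sketch of a variant-generating transformation is given below. The exact transformation rules are not preserved in this transcript, so the rules used here (keep the full name or only first and last name, abbreviate any token to its initial, join tokens with a space or a hyphen, and add the first and last token on their own) are assumptions; the sketch may produce variants the authors' method does not, and vice versa.

```python
from itertools import product


def name_variants(name):
    """Generate a set of lowercased variants for a predicted name.

    The rules below are assumptions made for illustration only.
    """
    tokens = name.lower().split()
    if not tokens:
        return set()
    variants = {" ".join(tokens)}            # original name, by default
    sequences = {tuple(tokens)}
    if len(tokens) > 2:
        sequences.add((tokens[0], tokens[-1]))   # drop middle names
    sequences.add((tokens[0],))
    sequences.add((tokens[-1],))
    for seq in sequences:
        if len(seq) == 1:
            variants.add(seq[0])             # single tokens are kept whole
            continue
        # Each token is used either in full or as an initial ("b.").
        forms = [(tok, tok[0] + ".") for tok in seq]
        for choice in product(*forms):
            # Each gap is filled with a space or a hyphen.
            for seps in product((" ", "-"), repeat=len(choice) - 1):
                variant = choice[0]
                for sep, tok in zip(seps, choice[1:]):
                    variant += sep + tok
                variants.add(variant)
    return variants


for v in sorted(name_variants("Benjamin Brown Smith")):
    print(v)
```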
15
Name Dictionary Filtering
  • The previously constructed dictionary contains
    noisy names
      • e.g. "brown" can also refer to a color
  • Next goal: Filter out noisy names
  • We suggest a filtering scheme that removes every
    single-token name w from the dictionary when
    PF.IDF(w) < T

PF.IDF: Predicted Frequency, Inverse Document Frequency
Words that get low PF.IDF scores are either
highly ambiguous names or very common words in the
corpus
  • cpf(w): number of times w is predicted as a
    name token in the corpus
  • ctf(w): number of occurrences of w in the corpus
  • df(w): document frequency of w in the corpus
  • N: number of documents in the corpus
T = 0.16 optimizes entity-level F1 on the tune sets;
thus, we apply the same threshold to our test
sets
Note: "Corpus" here refers to the test
set in our experiments
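
The filtering step can be sketched as below. The PF.IDF formula itself is not preserved in this transcript, so the form used here is an assumption: PF is taken as cpf(w)/ctf(w), and IDF as a log-scaled inverse document frequency normalized by log(N + 1). Only the quantities cpf, ctf, df, N and the threshold T = 0.16 come from the slide; the function names and the `stats` layout are hypothetical.

```python
import math


def pf_idf(cpf, ctf, df, n_docs):
    """Assumed form of PF.IDF: predicted frequency times a normalized IDF."""
    pf = cpf / ctf                                         # share of occurrences predicted as a name
    idf = math.log((n_docs + 0.5) / df) / math.log(n_docs + 1)
    return pf * idf


def filter_dictionary(names, stats, n_docs, threshold=0.16):
    """Drop single-token names whose PF.IDF score is below the threshold.

    stats maps a word to its (cpf, ctf, df) counts; multi-token names
    are always kept, since the filter applies to single tokens only.
    """
    kept = set()
    for name in names:
        if len(name.split()) == 1:
            cpf, ctf, df = stats[name]
            if pf_idf(cpf, ctf, df, n_docs) < threshold:
                continue
        kept.add(name)
    return kept


# Toy counts: "brown" is frequent and rarely predicted as a name.
stats = {"brown": (2, 40, 30), "minkov": (9, 10, 5)}
names = {"brown", "minkov", "benjamin brown"}
print(filter_dictionary(names, stats, n_docs=100))
```

In this toy setting "brown" is removed while "minkov" and the multi-token entry are kept.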
16
Name Matching
  • A window slides through every token in the test
    set
  • A match occurs when the tokens in a window start
    with the longest possible name variant in the
    dictionary
  • All matched names are marked for evaluation

Filtered Dictionary:
benjamin brown smith, benjamin-brown smith,
benjamin brown-smith, benjamin-brown-smith,
benjamin brown s., benjamin-b. smith,
benjamin b. smith, benjamin brown-s.,
benjamin-brown s., benjamin-brown-s, benjamin-b. s.,
benjamin-smith, benjamin smith, b. brown smith,
benjamin b. s., b. brown-smith, benjamin-s.,
benjamin s., b. brown s., b. b. smith, b. brown-s.,
benjamin, b. smith, b. b. s., smith, b. s.

[Figure: Name Matching Example on an e-mail message]
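
The matching step can be sketched as a greedy longest-match scan. The sketch assumes whitespace tokenization, lowercased comparison, and that the window jumps past a matched name before continuing; these details, and the function name, are assumptions not given on the slide.

```python
def match_names(tokens, dictionary):
    """Return (start, end) spans, end exclusive, of dictionary matches.

    At each position the longest variant that starts there wins.
    """
    entries = {entry.lower() for entry in dictionary}
    max_len = max((len(entry.split()) for entry in entries), default=0)
    lowered = [tok.lower() for tok in tokens]
    spans = []
    i = 0
    while i < len(lowered):
        matched = False
        longest = min(max_len, len(lowered) - i)
        for size in range(longest, 0, -1):       # try the longest window first
            if " ".join(lowered[i:i + size]) in entries:
                spans.append((i, i + size))
                i += size                         # skip past the matched name
                matched = True
                break
        if not matched:
            i += 1
    return spans


dictionary = {"benjamin b. smith", "b. smith", "smith", "benjamin"}
tokens = "Hi Benjamin B. Smith and Brown met Smith".split()
print(match_names(tokens, dictionary))
```

Here "Benjamin B. Smith" is matched as a single three-token name instead of three shorter overlapping entries, and "Brown" is left unmatched because it is not in the filtered dictionary.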
17
Experimental Results
  • Entity-level relative improvements (and final
    scores) after applying our recall-enhancing
    method on test sets
  • Baseline: learned extractor (CRF) trained with
    all features

Results show that
  1) Recall improved significantly with a small
     sacrifice in precision
  2) F1 scores improved in all cases
18
Conclusion
  • Email and newswire text have different
    characteristics
  • We suggested a set of specialized features for
    name extraction in email that exploit structural
    regularities in email
  • Exploiting name repetition over multiple
    documents is important for improving recall in
    email corpora
  • We presented the PF.IDF recall-enhancing method,
    which improves recall significantly with a small
    sacrifice in precision

19
Thank You!