Named Entity Recognition - PowerPoint PPT Presentation

About This Presentation

Named Entity Recognition


Named Entity Recognition Sobha Lalitha Devi AU-KBC Research Centre Chennai – PowerPoint PPT presentation

Slides: 61
Provided by: sob76


Transcript and Presenter's Notes

Title: Named Entity Recognition

Named Entity Recognition
  • Sobha Lalitha Devi
  • AU-KBC Research Centre
  • Chennai

Named Entity(NE) Recognition
  • What is NE and What is not an NE
  • How to identify NE
  • Tagset and Annotation Guidelines
  • Methods Used in developing NER

Why do NER?
  • Key part of Information Extraction system
  • Robust handling of proper names is essential for
    many applications such as Summarization, IR,
  • Pre-processing for different classification
  • Information filtering
  • Information linking

What is NER ?
  • NER involves identification of proper names in
    texts, and classification into a set of
    predefined categories of interest.
  • Three universally accepted categories
  • Person, location and organisation
  • Other common tasks: recognition of date/time
    expressions, measures (percent, money, weight
    etc.), email addresses etc.
  • Other domain-specific entities: names of drugs,
    genes, medical conditions, names of ships,
    bibliographic references etc.

NER Definition
  • Named entity recognition (NER), also known as
    entity identification (EI) and entity extraction,
    is the task of locating and classifying atomic
    elements in text into predefined categories such
    as the names of persons, organizations and
    locations, expressions of time, quantities,
    monetary values, percentages, etc.
  • John sold 5 companies in 2002.
    <ENAMEX TYPE="PERSON">John</ENAMEX> sold <NUMEX
    TYPE="QUANTITY">5</NUMEX> companies in <TIMEX
    TYPE="DATE">2002</TIMEX>.
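
Inline annotations of this kind can be read back mechanically. The following is a minimal sketch, assuming MUC-style tags (ENAMEX, NUMEX, TIMEX) with a single TYPE attribute; the function names extract_entities and strip_markup and the DATE type are illustrative assumptions, not part of the presentation.

```python
import re

# Matches one inline annotation such as <NUMEX TYPE="QUANTITY">5</NUMEX>.
# The tag names and the single TYPE attribute are assumed for this sketch.
ANNOTATION = re.compile(
    r'<(?P<tag>ENAMEX|NUMEX|TIMEX)\s+TYPE="(?P<type>[A-Z_]+)">'
    r'(?P<text>.*?)</(?P=tag)>'
)

def extract_entities(annotated):
    """Return (surface text, tag, type) for every annotation found."""
    return [(m.group("text"), m.group("tag"), m.group("type"))
            for m in ANNOTATION.finditer(annotated)]

def strip_markup(annotated):
    """Recover the plain sentence by dropping the annotation tags."""
    return ANNOTATION.sub(lambda m: m.group("text"), annotated)

sentence = ('<ENAMEX TYPE="PERSON">John</ENAMEX> sold '
            '<NUMEX TYPE="QUANTITY">5</NUMEX> companies in '
            '<TIMEX TYPE="DATE">2002</TIMEX>.')
print(extract_entities(sentence))
print(strip_markup(sentence))
```

Keeping the annotation inline means the plain text and the entity list can both be derived from one string, which is convenient when comparing system output against a hand-annotated corpus.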

What is not NER?
  • NER is not event recognition.
  • NER does not create templates,
  • NER does not perform co-reference or entity
  • though these processes are often implemented
    alongside NER as part of a larger IE system.
  • NER is not just matching text strings with
    pre-defined lists of names.
  • It recognises entities which are being used as
    entities in a given context.
  • NER is not an easy task!

Named Entity and Philosophy of Language
  • Proper Names are defined by
  • Descriptivist's theory of Names
  • Frege, Russell, Ludwig Wittgenstein and John
  • Causal theory of Reference
  • Saul Kripke

  • Descriptivist's theory of Names
  • Proper names either are synonymous with
    descriptions, or have their reference determined
    by virtue of the name's being associated with a
    description or cluster of descriptions that an
    object uniquely satisfies.
  • Causal theory of Reference
  • Proper names refer to an object by virtue of a
    causal connection with the object, as mediated
    through communities of speakers. That is, proper
    names, in contrast to descriptions, are rigid
    designators.
  • Rigid designators: A proper name refers to the
    named object in every possible world in which the
    object exists.
  • Descriptions, by contrast, may designate
    different objects in different possible worlds.

Proper Names and Definite Descriptions
  • In a sentence involving a proper name, the name
    could be substituted by a contextually
    appropriate description.
  • E.g. Otto von Bismarck can be known or described
    as the first Chancellor of the German Empire
  • Kripke argues that definite descriptions cannot
    be rigid designators, because a definite
    description need not pick out the same object in
    all possible worlds
  • More on Kripke's account of proper names in
    Naming and Necessity (1980)

What is Named Entity
  • Named Entities are
  • A Noun Phrase
  • Rigid Designators: a rigid designator denotes the
    same thing in all possible worlds in which that
    thing exists, and does not designate anything
    else in those possible worlds in which that
    thing does not exist

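EXAMPLES of what is not a Named Entity and what is a Named Entity
  • Hotel (not an NE) vs. Taj Hotel (NE)
  • Flower (not an NE) vs. Rose Flower (NE)
  • Beach (not an NE) vs. Kovalam Beach (NE)
  • Airport (not an NE) vs. Indira Gandhi International Airport (NE)
  • The School (not an NE) vs. Good Shepherd School (NE)
  • Prime Minister (not an NE) vs. Mr. Manmohan Singh (NE)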

Some problems in identifying NE
  • Variation of NEs.
  • Manmohan Singh, Manmohan, Dr. Manmohan Singh
  • Ambiguity of NE types
  • 1945 (date vs. time)
  • Washington (location vs. person)
  • May (person vs. month)
  • Tata (person vs. organization)

Ambiguity Examples
  • Person vs Location
  • Sir C. P. Ramaswamy was the Divan of Travancore (Per)
  • Sir C. P. Ramaswamy Road is in Chennai (Loc)
  • Person vs Organization
  • Anil Ambani opened Reliance Fresh (Per)
  • Reliance Fresh is under Anil Ambani Group Ltd (Org)

More complex problems in NER
  • Issues of style, structure, domain, genre etc.
  • Punctuation, spelling, spacing and formatting all
    have an impact
  • Dept. of Computing and Information Science
  • Manchester Metropolitan University
  • Manchester
  • United Kingdom
  • > Tell me more about Leonardo
  • > Da Vinci

Problems in NE Task Definition
  • Category definitions are intuitively quite clear,
    but there are many grey areas.
  • Many of these grey areas are caused by metonymy.
  • Person vs. Artefact
  • Organisation vs. Location
  • Company vs. Artefact
  • Location vs. Organisation

Tagset for Named Entity
  • The ACE tagset is hierarchical
  • ACE: Automatic Content Extraction
  • The tagset
  • CLIA: hierarchical, similar to ACE
  • Developed for two domains
  • Tourism and Health

  • Manmade
  • Religious Places
  • Roads/Highways
  • Museum
  • Theme parks/Parks/Gardens
  • Monuments
  • Facilities
  • Hospitals
  • Institutes
  • Library
  • Hotel/Restaurants/Lodges
  • Plant/Factories
  • Police Station/Fire Services
  • Public Comfort Stations
  • Airports
  • Ports
  • Bus-Stations
  • Locomotives
  • Artifacts
  • Person
  • Individual
  • Family name
  • Title
  • Group
  • Organization
  • Government
  • Public/private company
  • Religious
  • Non-government
  • Political Party
  • Para military
  • Charitable
  • Association
  • GPE (Geo-political Social Entity)
  • Media
  • Location

Tagset Continued
Tagset counts: first-level tags - 3, second-level
tags - 43, third-level tags - 40, total -
  • Distance
  • Money
  • Quantity
  • Count
  • Time
  • Date
  • Day
  • Period

How to Annotate
  • 1.ENAMEX
  • 1.1 Person
  • 1.1.1 Individual
  • These refer to the names of individual persons,
    including the names of fictional characters found
    in stories, novels etc.
  • Tag Structure
    <ENAMEX TYPE="INDIVIDUAL"> abc </ENAMEX>
  • Examples
  • English
    <ENAMEX TYPE="INDIVIDUAL">Abdul Kalam</ENAMEX>

Annotation continued
  • Family name
  • In general, a person name includes a family
    name. Whenever an individual name occurs
    together with a family name, the part of the
    name that refers to the family name must be
    tagged specifically with the subtag
    FAMILYNAME, as shown below.
  • Tag Structure
  • Examples
  • English

NE Types

The Named Entity hierarchy is divided into three
major classes: Entity Name, Time and Numerical
Entity Types
Entity Name Types
  • Persons are entities limited to humans. A
    person may be a single individual or a group.
    Individual refers to the name of an individual
    person. Group refers to set of individual
  • Location entities are limited to geographical
    entities such as geographical areas like names of
    countries, cities, continents and landmasses,
    bodies of water, and geological formations.
  • Organization entities are limited to
    corporations, agencies, and other groups of
    people defined by an established organizational

Examples for Entity Name Types
  • En: Sita [PERSON] is working at HCL [ORGANIZATION],
    which is in Chennai [LOCATION]
  • Ta: Seetha [PERSON] chennaiyilrukkira [LOCATION]
    velaiseikirAl.
    (gloss: Sita, Chennai, working)
  • Ml: Seetha [PERSON] chennaiyillula [LOCATION]
    jolicheyyunnu.
    (gloss: Sita, Chennai, working)
  • Hi: Seetha [PERSON] HCL [ORGANIZATION] main kaam
    kar raha hai, jo chennai [LOCATION] main hain.
    (gloss: Sita, HCL, work, is; Chennai, in)

Entity Name Types
  • Facilities are limited to buildings and other
    permanent man-made structures and real estate
    improvements like hospitals, airports, colleges,
    libraries etc.
  • En: Appolo Hospital [FACILITY] is in Chennai [LOCATION]
  • Ta: Appallo maruthuvamanAi [FACILITY] Chennaiyil [LOCATION]
    irukkirathu
  • Ml: Appolo Asupathri [FACILITY] chennaiyil [LOCATION] aaN
  • Hi: Appolo aspathaal [FACILITY] chennai [LOCATION] mein

Entity Name Types
  • A locomotive entity is a physical device
    primarily designed to move an object from one
    location to another, by carrying, pulling, or
    pushing the transported object.
  • En: Ananthapuri Express [LOCOMOTIVE] departs from
    Chennai [LOCATION] at 7.30pm [TIME].
  • Hi: Ananthapuri express [LOCOMOTIVE] Chennai [LOCATION]
    se rAth 7.30 [TIME] ko ravana hoga
  • Ml: Ananthapuri eksprass [LOCOMOTIVE] chennaiyilninn
    [LOCATION] raathri 7.30 maNikk [TIME] puRappetum.
  • Ta: Ananthapuri viraivu rayil [LOCOMOTIVE]
    chennaiyilirunthu [LOCATION] iRavu 7.30 maNikku
    [TIME] puRappatukirathu

Entity Name Types
  • Artifact entities are objects or things produced
    or shaped by human craft, such as tools,
    weapons/ammunition, art paintings, clothes,
    ornaments and medicines.
  • En: Vinayaga Statue [ARTIFACT] is looking beautiful
  • Ta: Vinayakarin Silai [ARTIFACT] pArpatharkku
    alakAkAkairukkirathu
  • Ml: ganapathi vigraham [ARTIFACT] baMgiyaayi irikkunnu.
  • Hi: Vinayaka moorthi [ARTIFACT] achi lagh rahi haim.

Entity Name Types
  • Entertainment entities denote activities which
    are diverting and hold human attention or
    interest, giving pleasure, happiness or amusement,
    especially performances of some kind such as
    dance, music, sports and events.
  • En: Flower Exhibition [ENTERTAINMENT] is held at
    Hyderabad [LOCATION]
  • Ta: Malar kankAtchi Nadaiperukirathu
  • Ml: pushpa pradarshanam natakkunnu
  • Hi: phool pradarshnii [ENTERTAINMENT] hyderabad
    [LOCATION] meN Ayojith kiyaa jAthA hai

Entity Name Types
  • Materials refer to the names of food items,
    cuisines, chemicals and cosmetics.
  • En: Honey [MATERIALS] is good for face
  • Ta: ThEn [MATERIALS] mukaththiRku nallathu
  • Ml: Madhu [MATERIALS] mukaththinu nallathAN
  • Hi: Shahad [MATERIALS] chehare ke liye achcha hai.

Entity Name Types
  • Organisms: These are the names of different
    animal species, including birds, reptiles,
    viruses and bacteria, and the names of herbs,
    medicinal plants, shrubs, trees, fruits, flowers etc.
  • En: Peacock [ORGANISM] is the national bird of
  • Ta: InthiyAvin [LOCATION] thEciyappaRavai Akum.
  • Ml: raashtrapakshi AN.
  • Hi: Mor [ORGANISM] bhaarath [LOCATION] kaa
    raashtrIya pakshi hai.

Entity Name Types
  • Disease: Names of diseases, symptoms, diagnoses
    and treatments come under this type.
  • En: Smoking causes Cancer [DISEASE]
  • Ta: PukaippithithalAl puRRuNoi [DISEASE] varukiRathu
  • Ml: pukavali aRbhudham [DISEASE] uNtAkkunnu
  • Hi: dhumrapan kaansar [DISEASE] ka kaaraN banaatha hai.

Numerical Expressions
Numerical Expressions
  • Distance refers to distance measures such as
    kilometers, centimeters, meters, acres, feet etc.
  • Example: 10 cm., twenty feet, 15 hectares
  • Money specifies currency values such
    as rupee, euro, dinar, dollar etc.
  • Example: Rs. 1000, 250 Euro, 160
  • Count denotes the number (or count) of items/
    articles/things etc.
  • Example: 5 subjects, 12 students, 20 books
  • Quantity: measurements like liters, tons, grams,
    volts etc. come under this category.
  • Example: 20 litres, 22 kg, 50g, 100 volts
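
The four numerical categories above lend themselves to simple pattern matching. The sketch below is only an illustration under stated assumptions: the unit lists are hypothetical and far from complete, only digit-based numbers are handled (so "twenty feet" is missed), and COUNT is treated as a fallback for "number followed by a lowercase word".

```python
import re

NUMBER = r"\d+(?:\.\d+)?"

# Hypothetical unit lists, kept tiny on purpose.
UNITS = {
    "DISTANCE": ["kilometers", "centimeters", "meters", "hectares",
                 "acres", "feet", "km", "cm"],
    "QUANTITY": ["litres", "liters", "grams", "volts", "tons", "kg", "g"],
}

def unit_pattern(units):
    # Try longer unit names first so that "grams" wins over "g".
    alternation = "|".join(sorted(map(re.escape, units), key=len, reverse=True))
    return re.compile(rf"\b{NUMBER}\s?(?:{alternation})\b")

# Patterns are tried in this order; COUNT is the fallback.
PATTERNS = [
    ("MONEY", re.compile(
        rf"\bRs\.\s?{NUMBER}|\b{NUMBER}\s(?:Euro|dollars?|rupees?)\b")),
    ("DISTANCE", unit_pattern(UNITS["DISTANCE"])),
    ("QUANTITY", unit_pattern(UNITS["QUANTITY"])),
    ("COUNT", re.compile(rf"\b{NUMBER}\s[a-z]+\b")),
]

def tag_numex(text):
    """Return (expression, label) pairs in order of appearance."""
    claimed = []
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            # Skip a match that overlaps a span claimed by an earlier pattern.
            overlaps = any(m.start() < end and start < m.end()
                           for start, end, _, _ in claimed)
            if not overlaps:
                claimed.append((m.start(), m.end(), m.group(), label))
    return [(expr, label) for _, _, expr, label in sorted(claimed)]

print(tag_numex("Rs. 1000 for 20 litres, 10 cm of wire and 12 students"))
```

Ordering the patterns from specific to general is what keeps "20 litres" from being labelled as a count of "litres".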

Time Expressions
Temporal Expressions
  • Temporal expressions are entities that refer to
    time, date, year, month and day
  • Time: These refer to expressions of time in
    their different forms, including hours,
    minutes and seconds.
  • Example
  • 5 o'clock in the morning
  • 9.30 a.m.
  • Evening 6.30 p.m.
  • Date: This refers to expressions of date, such as
    13/12/2001, in their different forms. This also
    includes month, date and year
  • Example
  • August 15 1947
  • 1956
  • September 11

Temporal Expressions
  • Day: These are expressions which convey days in
    a year. They can also include days occurring
    weekly/fortnightly/monthly/quarterly/biennially etc.
  • Example
  • Sunday
  • Tomorrow
  • Today
  • Yesterday
  • Special Day: refers to special days in a year
  • Example
  • Gandhi Jayanthi
  • Rama Navami

Temporal Expressions
  • Period: refers to expressions which express a
    duration of time, a time period or a time
    interval.
  • Example
  • 17th century
  • 10 minutes
  • 10 a.m. to 12 p.m.
  • One year
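
The Time, Date, Day and Period categories can be sketched with ordered patterns as well. This is a rough illustration, not the presenter's system: the patterns cover only a few English surface forms from the examples, special days and bare years are left out, and the function name tag_timex is hypothetical.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DAYS = ("Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday|"
        "Today|Tomorrow|Yesterday")
CLOCK = r"\b\d{1,2}(?:[.:]\d{2})?\s?[ap]\.m\."

# Most specific first: a clock range is one PERIOD, not two TIMEs.
PATTERNS = [
    ("PERIOD", re.compile(
        CLOCK + r" to " + CLOCK
        + r"|\b\d+ (?:minutes?|hours?|days?|years?)\b"
        + r"|\b\d+ ?th century\b")),
    ("DATE", re.compile(
        r"\b\d{1,2}/\d{1,2}/\d{4}\b"
        r"|\b(?:" + MONTHS + r") \d{1,2}(?:,? \d{4})?\b")),
    ("TIME", re.compile(CLOCK)),
    ("DAY", re.compile(r"\b(?:" + DAYS + r")\b")),
]

def tag_timex(text):
    """Return (expression, label) pairs in order of appearance."""
    claimed = []
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            # A span already claimed by a more specific pattern is skipped.
            overlaps = any(m.start() < end and start < m.end()
                           for start, end, _, _ in claimed)
            if not overlaps:
                claimed.append((m.start(), m.end(), m.group(), label))
    return [(expr, label) for _, _, expr, label in sorted(claimed)]

print(tag_timex("Left on August 15 1947 at 9.30 a.m.; "
                "open 10 a.m. to 12 p.m. every Sunday"))
```

Note that a word such as "May" would be matched as a month here even when it is a person name, which is exactly the ambiguity discussed earlier.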

  • Methods
  • Rule Based
  • Machine Learning
  • Hidden Markov Model (HMM)
  • Naïve Bayes Classifier
  • Maximum Entropy Markov Model (MEMM)
  • Conditional random Fields (CRF)
  • Hybrid Approach

Challenges of NER in Indian Languages
  • The following are the major challenges
    encountered in Indian languages.
  • Agglutination
  • Ambiguity
  • Between Proper and common nouns
  • Between named entities
  • Lack of Capitalization

Challenges of NER in Indian Languages
  • Agglutination: In Dravidian languages, words
    consist of a lexical root to which one or more
    affixes are attached.
  • Examples in Tamil
  • 1) Ta: Ramanaiththavira (other than Raman)
  • 2) Ta: Cevvaiyandru (on Tuesday)
  • 3) Ta: Inthiyavilllula (in India)
  • 4) Ta: KannanaippaRRikkondu (hold onto Kannan)

Challenges of NER in Indian Languages
  • Examples in Malayalam
  • 1) Ml: hemayiluNtaayirunna (that which Hema had)
  • 2) Ml: Chennaiyilethunna (reach in Chennai)
  • 3) Ml: arabikatalinaBimukhamaayi (towards the
    Arabian Sea)
  • 4) Ml: kaaSiyilekkozhukunna (flowing towards kaaSi)
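
Because the entity root is buried inside a longer word form, an exact dictionary match fails on such words. One crude workaround, sketched here purely as an illustration, is to look for the longest known root that the word starts with. The mini-gazetteer and the function name match_root are hypothetical, and a real system would use a morphological analyser, since prefix matching also fires on unrelated words that merely begin with a name.

```python
# Hypothetical mini-gazetteer of romanized root forms.
GAZETTEER = {
    "raman": "PERSON",
    "kannan": "PERSON",
    "hema": "PERSON",
    "chennai": "LOCATION",
    "inthiya": "LOCATION",
}

def match_root(token, min_len=4):
    """Return (root, tag) for the longest gazetteer entry that the
    token starts with, or None if no entry matches."""
    lowered = token.lower()
    # Shrink the candidate prefix from the full word down to min_len.
    for end in range(len(lowered), min_len - 1, -1):
        tag = GAZETTEER.get(lowered[:end])
        if tag:
            return lowered[:end], tag
    return None

for word in ["Ramanaiththavira", "Inthiyavilllula",
             "Chennaiyilethunna", "hemayiluNtaayirunna"]:
    print(word, "->", match_root(word))
```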
Challenges of NER in Indian Languages
  • Ambiguity
  • Indian languages suffer comparatively more from
    the ambiguity that exists between common and
    proper nouns and between named entities
    themselves. In some cases the same word can refer
    to different named entity types. Such instances
    can be resolved using contextual information.
  • Examples
  • Hi: Akash - person name and sky
  • Hi: Sooraj - person name and sun
  • Hi: Chaanth - moon and silver
  • Hi: Aam - mango and common
  • Ml: Roopa - person name and rupee
  • Ml: Madhu - person name and honey
  • Ml: Mala - person name and garland

Challenges of NER in Indian Languages
  • Ta: Thinkal - day and month
  • Ta: Malar - person name and flower
  • Ta: Chevvai - day and planet
  • Ta: Shakthi - person name and power
  • Ta: MAlai - evening and garland
  • Ta, Ml: Velli - silver, planet, day

Challenges of NER in Indian Languages
  • Spelling Variation: Due to different writing
    styles, the same entity is represented in various
    word forms. In Tamil, Sanskrit letters such as ja,
    sha, sri and ha are replaced by sa, ciri and ka.
  • Examples
  • Roja can be written as Rosa
  • Srimathi - cirimathi
  • Raja - rasa
  • ShajahAn - sajakAn

Challenges of NER in Indian Languages
  • Lack of Capitalization
  • In English and some other European languages,
    capitalization is an important feature for
    identifying proper nouns.
  • It plays a major role in NE identification.
  • Unlike English, Indian languages have no
    concept of capitalization.

Nested Entities
  • Nested entities are named entities which occur
    within other named entities. They are also
    called embedded entities.
  • Ta: Mathurai Koyil [RELPLACE]
    En: Mathurai Meenatchi Amman Temple
  • Ml: Nittoor [PERSON] Srinivasa rao [PERSON]
    En: Nitoor Srinivasa rao
  • Hi: Rajeev [PERSON] MArg [ROAD]
    En: Rajeev

Approaches in Named Entity Resolution
  • Dictionary Look-up
  • Rule based ( Using lexical, contextual and
    morphological information)
  • Maximum entropy theory based
  • Hidden Markov Model
  • Conditional Random Fields
  • Hybrid methods (Statistical Linguistics)

Dictionary (Gazetteers) Look-up Approach
  • Uses Dictionaries for identifying NERs (
  • Gazetteer contains NEs from all domains
  • Advantage
  • Very simple approach
  • Gives very high precision

Disadvantages of Dictionary Approach
  • Preparation of exhaustive dictionary is a tedious
    and expensive process.
  • The dictionary should cover the different
    spellings of the same place.

Rule Based Approach
  • Rule Based System
  • Needs more rules to tag all kinds of NE
  • Advantages
  • Rich and expressive rules
  • Good results
  • Disadvantages
  • Requires huge experience and grammatical
  • Experts to craft rules are expensive
  • Highly domain specific ( not portable to a new

General difficulties
  • "Italy's business world was rocked by the
    announcement last Thursday that Mr. Verdi would
    leave his job as vice-president of Music Masters
    of Milan, Inc. to become operations director of
    Arthur Andersen".
  • Capitalization useless for first word
  • 's not part of name "Italy"
  • Date is "last Thursday" not "Thursday"
  • Milan is location, not organization
  • Arthur Andersen is organization, not person

Rules success and failure
  • Title Capitalized_Word => Title Person_Name
  • Correct: Mr. Jones
  • Incorrect: Mrs. Field's Cookies (corporation)
  • Month_name number_less_than_32 => Date
  • Correct: February 28
  • Incorrect: Long March 3 (a Chinese rocket)
  • From Date to Date => Date
  • Correct: from August 3 to August 9
  • Incorrect: I moved my trip from April to June
    (separate dates)
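
The first two rules can be written as regular expressions to see both the success and the failure cases. This is a sketch under assumptions: the title list and the function name apply_rules are illustrative, and the rules are deliberately as naive as the slide describes them.

```python
import re

# Rule 1: Title Capitalized_Word => Title Person_Name
PERSON_RULE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\. [A-Z][a-z]+")

# Rule 2: Month_name number_less_than_32 => Date
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE_RULE = re.compile(r"\b(?:" + MONTHS + r") (?:[12]\d|3[01]|[1-9])\b")

def apply_rules(text):
    """Apply both rules blindly and return (match, label) pairs."""
    hits = [(m.group(), "PERSON") for m in PERSON_RULE.finditer(text)]
    hits += [(m.group(), "DATE") for m in DATE_RULE.finditer(text)]
    return hits

# Cases where the rules are right
print(apply_rules("Mr. Jones arrives on February 28"))
# Cases where the same rules misfire
print(apply_rules("Mrs. Field's Cookies launched Long March 3"))
```

In the second call the rules still fire, labelling part of a company name as a person and part of a rocket name as a date, because neither rule looks at the surrounding context.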

Statistical based approach
  • Need to identify features
  • Feature selection has to be correct for all types
    of NE
  • Development of Tagged Corpus
  • The Corpus should contain all types of tags in
    appropriate number
  • Domain based corpus has to be generated.
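
As a rough illustration of feature identification, the sketch below builds a feature dictionary for one token of a tagged sentence. The particular features, the POS tag names and the function name token_features are assumptions chosen for the example; the right feature set depends on the language and the learner used.

```python
def token_features(tokens, pos_tags, i):
    """Build a feature dictionary for the token at position i."""
    word = tokens[i]
    return {
        "word": word.lower(),
        "pos": pos_tags[i],
        # Combined word and POS feature
        "word+pos": f"{word.lower()}/{pos_tags[i]}",
        # Last three characters, a cheap stand-in for suffix information
        "suffix3": word[-3:].lower(),
        # Useful for English, uninformative for Indian-language scripts
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        # Neighbouring words supply the context
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = ["Sita", "works", "in", "Chennai"]
pos_tags = ["NNP", "VBZ", "IN", "NNP"]  # hypothetical POS tags
print(token_features(tokens, pos_tags, 3))
```

Dictionaries of this shape are a common input format for sequence labellers, with one dictionary per token and one tag per token in the training corpus.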

Automated approaches
  • Address drawbacks of hand-coded systems
  • Automated training
  • Human-annotated training data (annotated to the
    desired output standards)
  • Annotation requires less effort and expertise
    than hand-coding rules
  • Annotation accuracy
  • Two annotators for checking, third annotator to
    resolve disputes
Literature Survey
  • 1) Named Entity Recognition was one of the tasks
    defined in the Sixth Message Understanding
    Conference (MUC-6).
  • 2) A survey on Named Entity Recognition was done by
    Nadeau and Sekine (2007).
  • 3) Techniques used include
  • - rule based technique by Krupka (1998)
  • - using maximum entropy by Borthwick (1998)
  • - using Hidden Markov Model by Bikel (1997)
  • - bootstrapping approach using concept based
    seeds (Niu et al., 2003)
  • - hybrid approaches such as rule based tagging
    for certain entities such as date, time and
    percentage, and a maximum entropy based approach
    for entities like location and organization
    (Srihari et al., 2000)
  • 4) The Stanford NER software (Finkel et al.,
    2005) uses linear-chain CRFs in its NER engine.
    It identifies three classes of NEs, viz. Person,
    Organization and Location.

  • Arulmozhi, P. and Sobha, L. (2006). HMM-based
    Part of Speech Tagger for Relatively Free
    Word Order Language. Advances in Natural Language
    Processing, Research in Computing Science
    Journal, Mexico, Volume 18, pp. 37-48.
  • Bikel, D. M., Miller, S., Schwartz, R. and
    Weischedel, R. (1997). Nymble: A high-performance
    learning name-finder. In Fifth Conference on
    Applied Natural Language Processing. pp. 194-201.
  • Borthwick, A., Sterling, J., Agichtein, E. and
    Grishman, R. (1998). Description of the MENE
    Named Entity System. In Seventh Message
    Understanding Conference (MUC-7).
  • Chen, W. Zhang, Y. and Isahara, H. (2006).
    Chinese Named Entity Recognition with Conditional
    Random Fields. In Fifth SIGHAN Workshop on
    Chinese Language Processing, Sydney. pp.118-121.
  • Ekbal, A. Bandyopadhyay, S. (2009). A Conditional
    Random Field Approach for Named Entity
    Recognition in Bengali and Hindi. Linguistic
    Issues in Language Technology, 2(1). pp.1-44.

  • Finkel, J. N., Grenager, T. and Manning, C.
    (2005). Incorporating Non-local Information into
    Information Extraction Systems by Gibbs Sampling.
    In 43rd Annual Meeting of the Association for
    Computational Linguistics (ACL 2005). pp.
  • Finkel, J. Dingare, S. Nguyen, H. Nissim, M.
    Sinclair, G. and Manning, C. (2004). Exploiting
    Context for Biomedical Entity Recognition from
    Syntax to the Web. In Joint Workshop on Natural
    Language Processing in Biomedicine and its
    Applications, (NLPBA), Geneva, Switzerland.
  • Gali, K. Surana, H. Vaidya, A. Shishtla, P.
    Sharma, D. M. (2008). Aggregating Machine
    Learning and Rule Based Heuristics for Named
    Entity Recognition. In Workshop on NER for South
    and South East Asian Languages, IJCNLP-08,
    Hyderabad, India.
  • Kumar, K. N. Santosh, G. S. K. Varma, V. (2011).
    A Language-Independent Approach to Identify the
    Named Entities in under-resourced languages and
    Clustering Multilingual Documents. In
    International Conference on Multilingual and
    Multimodal Information Access Evaluation,
    University of Amsterdam, Netherlands.
  • Lafferty, J. McCallum, A. Pereira, F. (2001).
    Conditional Random Fields for segmenting and
    labeling sequence data. In ICML-01, pp. 282-289.
  • Loinaz, I. A., Uriarte, O. A., Ramos, N. E. and
    Castro, M. I. F. D. (2006). Lessons from the
    Development of Named Entity Recognizer for Basque.
    Natural Language Processing, 36. pp. 25-37.
  • McCallum, A. and Li, W. (2003). Early Results for
    Named Entity Recognition with Conditional Random
    Fields, Feature Induction and Web-Enhanced
    Lexicons. In Seventh Conference on Natural
    Language Learning (CoNLL).

  • Nadeau, D. and Sekine, S. (2007). A survey of
    named entity recognition and classification.
    Lingvisticae Investigationes 30(1). pp. 3-26.
  • Niu, C. Li, W. Ding, J. Srihari, R. K. (2003).
    Bootstrapping for Named Entity Tagging using
    Concept-based Seeds. In HLT-NAACL03, Companion
    Volume, Edmonton, AT. pp.73-75.
  • Pandian, S. Lakshmana, Geetha, T. V. and Krishna.
    (2007). Named Entity Recognition in Tamil using
    Context-cues and the E-M algorithm. In the
    Proceedings of the 3rd Indian International
    Conference on Artificial Intelligence, Pune,
    India. pp. 1951 -1958.
  • Sasidhar, B., Yohan, P.M., Babu, V.A., Govarhan,
    A.(2011). A Survey on Named Entity Recognition in
    Indian Languages with particular reference to
    Telugu. J. International Journal of Computer
    Science Issues, Volume. 8, pp. 1694-0814 .
  • Sobha, L., Vijay Sundar Ram. R. (2006). "Noun
    Phrase Chunker for Tamil", In Proceedings of
    Symposium on Modeling and Shallow Parsing of
    Indian Languages, Indian Institute of Technology,
    Mumbai, pp 194-198.
  • Srihari, R.K. Niu, C. Yu, L. (2000). A Hybrid
    Approach for Named Entity Recognition in Indian
    Languages. In 6th Applied Natural Language
    Conference, pp. 247-254
  • Gupta, S. and Bhattacharyya, P. (2010). Think
    globally, apply locally using distributional
    characteristics for Hindi named entity
    identification. In 2010 Named Entities Workshop,
    Association for Computational Linguistics
    Stroudsburg, PA, USA
  • Vijayakrishna, R. and Sobha, L. (2008). Domain
    focused Named Entity for Tamil using Conditional
    Random Fields. In IJCNLP-08 workshop on NER for
    South and South East Asian Languages, Hyderabad,
    India. pp. 59-66

Literature Survey
  • Indian Languages
  • 5) Named Entity Recognition for Hindi, Bengali,
    Oriya, Telugu and Urdu (some of the major Indian
    languages) was addressed as a shared task in the
    NERSSEAL workshop of IJCNLP. The tagset used here
    consisted of 12 tags.
  • 6) Vijayakrishna and Sobha (2008) worked on a
    domain focused Tamil Named Entity Recognizer for
    the tourism domain using CRF. It handles nested
    tagging of named entities with a hierarchical
    tag set containing 106 tags. They used the root
    of words, POS, combined word and POS, and a
    dictionary of named entities as features to
    build the system.
  • 7) Pandian et al. (2007) built a Tamil NER
    system using contextual cues and the E-M
    algorithm.
  • 8) The NER system of Gali et al. (2008), built for
    the NERSSEAL-2008 shared task, combines machine
    learning techniques with language specific
    heuristics. The system was tested on five
    languages (Telugu, Hindi, Bengali, Urdu and
    Oriya) using CRF followed by post processing
    which involves some heuristics.

Thank you