Text mining and the Semantic Web
  • Dr Diana Maynard
  • NLP Group
  • Department of Computer Science
  • University of Sheffield

Structure of this lecture
  • Text Mining and the Semantic Web
  • Text Mining Components / Methods
  • Information Extraction
  • Evaluation
  • Visualisation
  • Summary

Introduction to Text Mining and the Semantic Web
What is Text Mining?
  • Text mining is about knowledge discovery from
    large collections of unstructured text.
  • It's not the same as data mining, which is more
    about discovering patterns in structured data
    stored in databases.
  • Similar techniques are sometimes used; however,
    text mining has many additional constraints
    caused by the unstructured nature of the text and
    the use of natural language.
  • Information extraction (IE) is a major component
    of text mining.
  • IE is about extracting facts and structured
    information from unstructured text.

Challenge of the Semantic Web
  • The Semantic Web requires machine-processable,
    repurposable data to complement hypertext
  • Such metadata can be divided into two types of
    information: explicit and implicit. IE is mainly
    concerned with implicit (semantic) metadata.
  • More on this later

Text mining components and methods
Text mining stages
  • Document selection and filtering (IR techniques)
  • Document pre-processing (NLP techniques)
  • Document processing (NLP / ML / statistical
    techniques)

Stages of document processing
  • Document selection involves identification and
    retrieval of potentially relevant documents from
    a large set (e.g. the web) in order to reduce the
    search space. Standard or semantically-enhanced
    IR techniques can be used for this.
  • Document pre-processing involves cleaning and
    preparing the documents, e.g. removal of
    extraneous information, error correction,
    spelling normalisation, tokenisation, POS
    tagging, etc.
  • Document processing consists mainly of
    information extraction
  • For the Semantic Web, this is realised in terms
    of metadata extraction
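
The pre-processing stage described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library; the function names and the regular expressions are assumptions made for this example, not part of the lecture, and a real system would use proper NLP components (and add POS tagging, which is omitted here).

```python
import re
import unicodedata

# Words (allowing internal hyphens/apostrophes) or single punctuation marks.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def normalise(text):
    """Clean the raw text: unify Unicode forms, drop leftover tags, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"<[^>]+>", " ", text)  # crude removal of extraneous markup
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive sentence splitter: break after . ! ? when a capital letter follows."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


def tokenise(sentence):
    """Split one sentence into word and punctuation tokens."""
    return TOKEN_RE.findall(sentence)


def preprocess(raw):
    """Run cleaning, sentence splitting and tokenisation in sequence."""
    return [tokenise(s) for s in split_sentences(normalise(raw))]


if __name__ == "__main__":
    print(preprocess("<p>Dr Head visited Sheffield.  He met the NLP group!</p>"))
```

The naive splitter will break on abbreviations such as "Dr." followed by a name, which is one reason dedicated sentence splitters exist.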

Metadata extraction
  • Metadata extraction consists of two types:
  • Explicit metadata extraction involves information
    describing the document, such as that contained
    in the header information of HTML documents
    (titles, abstracts, authors, creation date, etc.)
  • Implicit metadata extraction involves semantic
    information deduced from the material itself,
    i.e. endogenous information such as names of
    entities and relations contained in the text.
    This essentially involves Information Extraction
    techniques, often with the help of an ontology.
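
As a rough illustration of explicit metadata extraction, the sketch below reads the title and the `<meta name=... content=...>` entries from an HTML document using Python's standard `html.parser` module. The class and function names are hypothetical, and the sample page is invented for this example. Implicit metadata would need IE techniques and is not covered by this sketch.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collects the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = attrs.get("name")
            content = attrs.get("content")
            if name and content is not None:
                self.metadata[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so append rather than assign.
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data


def extract_explicit_metadata(html):
    """Return a dict of the explicit metadata found in an HTML string."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    return parser.metadata


# Invented sample page, for illustration only.
SAMPLE_PAGE = (
    "<html><head><title>Text mining</title>"
    '<meta name="Author" content="A. Author">'
    '<meta name="keywords" content="IE, Semantic Web">'
    "</head><body><p>Body text.</p></body></html>"
)

if __name__ == "__main__":
    print(extract_explicit_metadata(SAMPLE_PAGE))
```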

Information Extraction (IE)
IE is not IR
  • IR pulls documents from large text collections
    (usually the Web) in response to specific
    keywords or queries. You analyse the documents.
  • IE pulls facts and structured information from
    the content of large text collections. You
    analyse the facts.

IE for Document Access
  • With traditional query engines, getting the facts
    can be hard and slow
  • Where has the Queen visited in the last year?
  • Which places on the East Coast of the US have
    had cases of West Nile Virus?
  • Which search terms would you use to get this kind
    of information?
  • How can you specify you want someone's home page?
  • IE returns information in a structured way
  • IR returns documents containing the relevant
    information somewhere (if you're lucky)

IE as an alternative to IR
  • IE returns knowledge at a much deeper level than
    traditional IR
  • Constructing a database through IE and linking it
    back to the documents can provide a valuable
    alternative search tool.
  • Even if results are not always accurate, they can
    be valuable if linked back to the original text
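
One way to picture a database built through IE and linked back to the documents is sketched below with Python's built-in `sqlite3` module. The table layout, function names and sample facts are all invented for illustration. Each extracted fact carries the document identifier and character offsets of its source, so a user can check a possibly inaccurate result against the original text.

```python
import sqlite3


def build_fact_store(facts):
    """Create an in-memory database of extracted facts with source links."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE facts ("
        " subject TEXT, relation TEXT, object TEXT,"
        " doc_id TEXT, start_offset INTEGER, end_offset INTEGER)"
    )
    conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?, ?, ?)", facts)
    return conn


def query_facts(conn, subject, relation):
    """Return (object, doc_id, start_offset, end_offset) rows for a subject/relation pair."""
    cursor = conn.execute(
        "SELECT object, doc_id, start_offset, end_offset FROM facts"
        " WHERE subject = ? AND relation = ?"
        " ORDER BY doc_id, start_offset",
        (subject, relation),
    )
    return cursor.fetchall()


# Invented sample facts; the document ids and offsets are hypothetical.
SAMPLE_FACTS = [
    ("the Queen", "visited", "Sheffield", "news-001", 120, 129),
    ("the Queen", "visited", "Manchester", "news-002", 45, 55),
    ("West Nile Virus", "reported_in", "Manchester, CT", "news-003", 210, 224),
]

if __name__ == "__main__":
    store = build_fact_store(SAMPLE_FACTS)
    for row in query_facts(store, "the Queen", "visited"):
        print(row)
```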

Some example applications
  • HaSIE
  • KIM
  • Threat Trackers

HaSIE
  • Application developed by the University of
    Sheffield, which aims to find out how companies
    report health and safety information
  • Answers questions such as:
  • How many members of staff died or had accidents
    in the last year?
  • Is there anyone responsible for health and
    safety?
  • What measures have been put in place to improve
    health and safety in the workplace?

  • Identification of such information is too
    time-consuming and arduous to be done manually
  • IR systems can't cope with this because they
    return whole documents, which could be hundreds
    of pages
  • System identifies relevant sections of each
    document, pulls out sentences about health and
    safety issues, and populates a database with
    relevant information

  • KIM is a software platform developed by Ontotext
    for semantic annotation of text.
  • KIM performs automatic ontology population and
    semantic annotation for Semantic Web and KM
  • Indexing and retrieval (an IE-enhanced search
  • Query and exploration of formal knowledge

Ontotext's KIM query and results
Threat tracker
  • Application developed by Alias-i, which finds and
    relates information in documents
  • Intended for use by Information Analysts who use
    unstructured news feeds and standing collections
    as sources
  • Used by DARPA for tracking possible information
    about terrorists etc.
  • Identification of entities, aliases, relations
    etc. enables you to build up chains of related
    people and things

Threat tracker
What is Named Entity Recognition?
  • Identification of proper names in texts, and
    their classification into a set of predefined
    categories of interest:
  • Persons
  • Organisations (companies, government
    organisations, committees, etc)
  • Locations (cities, countries, rivers, etc)
  • Date and time expressions
  • Various other types as appropriate

Why is NE important?
  • NE provides a foundation from which to build more
    complex IE systems
  • Relations between NEs can provide tracking,
    ontological information and scenario building
  • Tracking (co-reference): Dr Head, John, he
  • Ontologies: Manchester, CT
  • Scenario: Dr Head became the new director of
    Shiny Rockets Corp

Two kinds of approaches
  • Knowledge Engineering:
  • rule based
  • developed by experienced language engineers
  • make use of human intuition
  • require only small amount of training data
  • development can be very time consuming
  • some changes may be hard to accommodate
  • Learning Systems:
  • use statistics or other machine learning
  • developers do not need LE expertise
  • require large amounts of annotated training data
  • some changes may require re-annotation of the
    entire training corpus

Typical NE pipeline
  • Pre-processing (tokenisation, sentence splitting,
    morphological analysis, POS tagging)
  • Entity finding (gazetteer lookup, NE grammars)
  • Coreference (alias finding, orthographic
    coreference etc.)
  • Export to database / XML
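
The entity-finding and coreference steps of such a pipeline can be sketched very roughly as below. This is a toy illustration, not how ANNIE or any particular system works: the gazetteer entries, function names and the alias heuristic are assumptions made for this example, and NE grammars are left out entirely.

```python
import re

# Tiny hypothetical gazetteer mapping surface strings to entity types.
GAZETTEER = {
    "Sheffield": "Location",
    "University of Sheffield": "Organisation",
    "Diana Maynard": "Person",
}


def find_entities(text, gazetteer=GAZETTEER):
    """Return (start, end, surface, type) tuples for gazetteer matches.

    Longer entries are tried first, so "University of Sheffield" is
    preferred over the shorter "Sheffield" at the same position.
    """
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    return [
        (m.start(), m.end(), m.group(), gazetteer[m.group()])
        for m in pattern.finditer(text)
    ]


def is_alias(mention, full_name):
    """Crude orthographic coreference: every token of the mention occurs in the full name."""
    return set(mention.split()) <= set(full_name.split())


if __name__ == "__main__":
    sample = "Diana Maynard works at the University of Sheffield in Sheffield."
    for entity in find_entities(sample):
        print(entity)
    print(is_alias("Maynard", "Diana Maynard"))
```

Exact string lookup cannot resolve ambiguous names or find unseen ones, which is where hand-written grammars or learned models come in.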

  • GATE (General Architecture for Text
    Engineering) is a framework for language
    engineering
  • ANNIE (A Nearly New Information Extraction
    system) is a suite of language processing tools,
    which provides NE recognition
  • GATE also includes:
  • plugins for language processing, e.g. parsers,
    machine learning tools, stemmers, IR tools, IE
    components for various languages etc.
  • tools for visualising and manipulating ontologies
  • ontology-based information extraction tools
  • evaluation and benchmarking tools

Information Extraction for the Semantic Web
  • Traditional IE is based on a flat structure, e.g.
    recognising Person, Location, Organisation, Date,
    Time etc.
  • For the Semantic Web, we need information in a
    hierarchical structure
  • The idea is to attach semantic metadata to the
    documents, pointing to concepts in an ontology
  • Information can be exported as an ontology
    annotated with instances, or as text annotated
    with links to the ontology

Richer NE Tagging
  • Attachment of instances in the text to concepts
    in the domain ontology
  • Disambiguation of instances, e.g. Cambridge, MA
    vs Cambridge, UK

MAGPIE
  • Developed by the Open University
  • Plugin for standard web browser
  • Automatically associates an ontology-based
    semantic layer to web resources, allowing
    relevant services to be linked
  • Provides means for a structured and informed
    exploration of the web resources
  • e.g. looking at a list of publications, we can
    find information about an author such as projects
    they work on, other people they work with, etc.

MAGPIE in action
Evaluation metrics and tools
  • Evaluation metrics mathematically define how to
    measure the system's performance against a
    human-annotated gold standard
  • Scoring program implements the metric and
    provides performance measures:
  • for each document and over the entire corpus
  • for each type of NE
  • may also evaluate changes over time
  • A gold standard reference set also needs to be
    provided; this may be time-consuming to produce
  • Visualisation tools show the results graphically
    and enable easy comparison

Methods of evaluation
  • Traditional IE is evaluated in terms of Precision
    and Recall
  • Precision - how accurate were the answers the
    system produced?
  • correct answers/answers produced
  • Recall - how good was the system at finding
    everything it should have found?
  • correct answers/total possible correct answers
  • There is usually a tradeoff between precision and
    recall, so a weighted harmonic mean of the two
    (F-measure) is generally also used.
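
The measures above can be computed as in the following sketch. The slide does not spell out the F-measure formula, so the standard weighted harmonic mean with a `beta` parameter is assumed here; the function name and the sample annotations are invented for illustration.

```python
def precision_recall_f(system, gold, beta=1.0):
    """Compare system answers with gold-standard answers.

    precision = correct answers / answers produced
    recall    = correct answers / total possible correct answers
    F         = (1 + beta^2) * P * R / (beta^2 * P + R)
    """
    system, gold = set(system), set(gold)
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Invented annotations as (text, type) pairs: the system gets 2 of its 3 answers right
# and finds 2 of the 4 gold-standard entities.
SYSTEM = {("Sheffield", "Location"), ("Diana Maynard", "Person"), ("GATE", "Person")}
GOLD = {
    ("Sheffield", "Location"),
    ("Diana Maynard", "Person"),
    ("GATE", "Organisation"),
    ("Ontotext", "Organisation"),
}

if __name__ == "__main__":
    p, r, f = precision_recall_f(SYSTEM, GOLD)
    print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

With `beta=1` precision and recall are weighted equally; values above 1 favour recall and values below 1 favour precision.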

GATE AnnotationDiff Tool
Metrics for Richer IE
  • Precision and Recall are not sufficient for
    ontology-based IE, because the distinction
    between right and wrong is less obvious
  • Recognising a Person as a Location is clearly
    wrong, but recognising a Research Assistant as a
    Lecturer is not so wrong
  • Similarity metrics also need to be integrated,
    so that a wrong answer scores higher the closer
    it is to the correct concept in the hierarchy
  • Also possible is a cost-based approach, where
    different weights can be given to each concept in
    the hierarchy, and to different types of error,
    and combined to form a single score
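
To make the idea of hierarchy-aware scoring concrete, the sketch below gives partial credit based on how deep the shared ancestor of two concepts lies. It uses a Wu-Palmer-style similarity (root at depth 0), which is swapped in here purely for illustration and is not necessarily the metric the lecture has in mind. The toy ontology and all names are hypothetical.

```python
# Toy ontology as a child -> parent map (hypothetical, for illustration only).
PARENT = {
    "Entity": None,
    "Person": "Entity",
    "Location": "Entity",
    "Academic": "Person",
    "Lecturer": "Academic",
    "Research Assistant": "Academic",
}


def ancestors(concept):
    """Path from a concept up to the root, starting with the concept itself."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Number of edges between the concept and the root."""
    return len(ancestors(concept)) - 1


def similarity(predicted, gold):
    """Wu-Palmer-style score: 2 * depth(lowest common ancestor) / (depth(a) + depth(b))."""
    if predicted == gold:
        return 1.0
    gold_path = ancestors(gold)
    common = next(c for c in ancestors(predicted) if c in gold_path)
    return 2.0 * depth(common) / (depth(predicted) + depth(gold))


if __name__ == "__main__":
    print(similarity("Research Assistant", "Lecturer"))  # close in the hierarchy
    print(similarity("Person", "Location"))              # only the root in common
```

A cost-based variant could multiply such scores by per-concept or per-error-type weights before summing them into a single figure.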

Visualisation of Results
  • Cluster Map example
  • Traditionally used to show documents classified
    according to topic
  • Here it shows instances classified according to
    concept
  • Enables analysis, comparison and querying of
  • Examples here created by Marta Sabou (Free
    University of Amsterdam) using Aduna software

The principle: Venn diagrams
Documents classified according to topic
Jobs by region
Instances classified by concept
Concept distribution
Shows the relative importance of different
Correct and incorrect instances attached to

Summary
  • Introduction to text mining and the Semantic Web
  • How traditional information extraction
    techniques, including visualisation and
    evaluation, can be extended to deal with the
    complexity of the Semantic Web
  • How text mining can help the progression of the
    Semantic Web

Research questions
  • Automatic annotation tools are currently mainly
    domain- and ontology-dependent, and work best on
    a small scale
  • Tools designed for large-scale applications lose
    out on accuracy
  • Ontology population works best when the ontology
    already exists, but how do we ensure accurate
    ontology generation?
  • Need large-scale evaluation programs

Some useful links
  • NaCTeM (National Centre for Text Mining)
  • http://www.nactem.ac.uk
  • GATE
  • http://gate.ac.uk
  • KIM
  • http://www.ontotext.com/kim/
  • h-TechSight
  • http://www.h-techsight.org
  • Magpie
  • http://www.kmi.open.ac.uk/projects/magpie