1
Text mining and the Semantic Web
  • Dr Diana Maynard
  • NLP Group
  • Department of Computer Science
  • University of Sheffield

2
Structure of this lecture
  • Text Mining and the Semantic Web
  • Text Mining Components / Methods
  • Information Extraction
  • Evaluation
  • Visualisation
  • Summary

3
Introduction to Text Mining and the Semantic Web
4
What is Text Mining?
  • Text mining is about knowledge discovery from
    large collections of unstructured text.
  • It's not the same as data mining, which is more
    about discovering patterns in structured data
    stored in databases.
  • Similar techniques are sometimes used; however,
    text mining has many additional constraints
    caused by the unstructured nature of the text and
    the use of natural language.
  • Information extraction (IE) is a major component
    of text mining.
  • IE is about extracting facts and structured
    information from unstructured text.
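
To make this concrete, here is a toy sketch, not part of the lecture's own tools: one kind of fact is pulled out of free text with a hand-written pattern. The sentence and the pattern are invented for illustration.

```python
import re

# Toy fact extractor: find "PERSON is the ROLE of ORG" statements.
text = "John Smith is the director of Acme Corp."

pattern = re.compile(
    r"(?P<person>[A-Z]\w+ [A-Z]\w+) is the (?P<role>\w+) of (?P<org>[A-Z][\w ]+?)\.")

match = pattern.search(text)
if match:
    # The unstructured sentence becomes a structured record.
    print(match.groupdict())
    # {'person': 'John Smith', 'role': 'director', 'org': 'Acme Corp'}
```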

5
Challenge of the Semantic Web
  • The Semantic Web requires machine-processable,
    repurposable data to complement hypertext
  • Such metadata can be divided into two types of
    information: explicit and implicit. IE is mainly
    concerned with implicit (semantic) metadata.
  • More on this later

6
Text mining components and methods
7
Text mining stages
  • Document selection and filtering (IR techniques)
  • Document pre-processing (NLP techniques)
  • Document processing (NLP / ML / statistical
    techniques)

8
Stages of document processing
  • Document selection involves identification and
    retrieval of potentially relevant documents from
    a large set (e.g. the web) in order to reduce the
    search space. Standard or semantically-enhanced
    IR techniques can be used for this.
  • Document pre-processing involves cleaning and
    preparing the documents, e.g. removal of
    extraneous information, error correction,
    spelling normalisation, tokenisation, POS
    tagging, etc. (see the sketch after this list)
  • Document processing consists mainly of
    information extraction
  • For the Semantic Web, this is realised in terms
    of metadata extraction
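
A minimal sketch of the pre-processing stage, assuming the NLTK toolkit (the slides do not prescribe one); it needs `pip install nltk` plus the 'punkt' and 'averaged_perceptron_tagger' data packages.

```python
import nltk

document = "Dr Maynard works in Sheffield. She works on IE."

# Sentence splitting, tokenisation and POS tagging: three of the
# pre-processing steps listed above.
for sentence in nltk.sent_tokenize(document):
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
```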

9
Metadata extraction
  • Metadata extraction consists of two types:
  • Explicit metadata extraction involves information
    describing the document, such as that contained
    in the header information of HTML documents
    (titles, abstracts, authors, creation date,
    etc.); see the sketch after this list.
  • Implicit metadata extraction involves semantic
    information deduced from the material itself,
    i.e. endogenous information such as names of
    entities and relations contained in the text.
    This essentially involves Information Extraction
    techniques, often with the help of an ontology.
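
As a sketch of explicit metadata extraction, the following uses only Python's standard library to read the title and meta elements from an HTML header; the example HTML is made up.

```python
from html.parser import HTMLParser

class HeaderMetadataParser(HTMLParser):
    """Collects <title> text and name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attrs = dict(attrs)
            if "name" in attrs and "content" in attrs:
                self.metadata[attrs["name"]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = data

parser = HeaderMetadataParser()
parser.feed('<html><head><title>Text mining</title>'
            '<meta name="author" content="D. Maynard"></head></html>')
print(parser.metadata)  # {'title': 'Text mining', 'author': 'D. Maynard'}
```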

10
Information Extraction (IE)
11
IE is not IR
IR pulls documents from large text collections
(usually the Web) in response to specific
keywords or queries. You analyse the documents.
IE pulls facts and structured information from
the content of large text collections. You
analyse the facts.
12
IE for Document Access
  • With traditional query engines, getting the facts
    can be hard and slow
  • Where has the Queen visited in the last year?
  • Which places on the East Coast of the US have
    had cases of West Nile Virus?
  • Which search terms would you use to get this kind
    of information?
  • How can you specify you want someone's home page?
  • IE returns information in a structured way
  • IR returns documents containing the relevant
    information somewhere (if you're lucky)

13
IE as an alternative to IR
  • IE returns knowledge at a much deeper level than
    traditional IR
  • Constructing a database through IE and linking it
    back to the documents can provide a valuable
    alternative search tool (sketched below).
  • Even if results are not always accurate, they can
    be valuable if linked back to the original text
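
One way such a fact database might look, sketched with SQLite; the schema, the facts and the document IDs are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE facts
                (entity TEXT, relation TEXT, value TEXT, doc_id TEXT)""")

# Facts as an IE system might have produced them, each one keeping a
# pointer back to its source document.
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?)",
    [("Acme Corp", "safety_officer", "J. Smith", "report_2003.html"),
     ("Acme Corp", "accidents_last_year", "2", "report_2003.html")])

# A structured query; doc_id lets the user jump to the original text.
for value, doc_id in conn.execute(
        "SELECT value, doc_id FROM facts WHERE relation = 'safety_officer'"):
    print(value, "->", doc_id)   # J. Smith -> report_2003.html
```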

14
Some example applications
  • HaSIE
  • KIM
  • Threat Trackers

15
HaSIE
  • Application developed by the University of
    Sheffield, which aims to find out how companies
    report health and safety information
  • Answers questions such as:
  • How many members of staff died or had accidents
    in the last year?
  • Is there anyone responsible for health and
    safety?
  • What measures have been put in place to improve
    health and safety in the workplace?

16
HaSIE
  • Identification of such information is too
    time-consuming and arduous to be done manually
  • IR systems can't cope with this because they
    return whole documents, which could be hundreds
    of pages
  • System identifies relevant sections of each
    document, pulls out sentences about health and
    safety issues, and populates a database with
    relevant information

17
HaSIE
18
KIM
  • KIM is a software platform developed by Ontotext
    for semantic annotation of text.
  • KIM performs automatic ontology population and
    semantic annotation for Semantic Web and KM
    applications:
  • Indexing and retrieval (an IE-enhanced search
    technology)
  • Query and exploration of formal knowledge

19
KIM
Ontotext's KIM query and results
20
Threat tracker
  • Application developed by Alias-i, which finds and
    relates information in documents
  • Intended for use by Information Analysts who use
    unstructured news feeds and standing collections
    as sources
  • Used by DARPA for tracking possible information
    about terrorists etc.
  • Identification of entities, aliases, relations
    etc. enables you to build up chains of related
    people and things

21
Threat tracker
22
What is Named Entity Recognition?
  • Identification of proper names in texts, and
    their classification into a set of predefined
    categories of interest (see the sketch after
    this list)
  • Persons
  • Organisations (companies, government
    organisations, committees, etc)
  • Locations (cities, countries, rivers, etc)
  • Date and time expressions
  • Various other types as appropriate
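
A short NER sketch using spaCy, chosen here only because it is easy to run (the lecture's own tools are GATE/ANNIE); it assumes the en_core_web_sm model has been installed.

```python
import spacy

# Requires: pip install spacy, then
# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Dr Diana Maynard works at the University of Sheffield in England.")
for ent in doc.ents:
    # Typical output: 'Diana Maynard' PERSON,
    # 'the University of Sheffield' ORG, 'England' GPE
    print(ent.text, ent.label_)
```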

23
Why is NE important?
  • NE provides a foundation from which to build more
    complex IE systems
  • Relations between NEs can provide tracking,
    ontological information and scenario building
  • Tracking (co-reference): Dr Head, John, he
  • Ontologies: Manchester, CT
  • Scenario: Dr Head became the new director of
    Shiny Rockets Corp

24
Two kinds of approaches
  • Knowledge Engineering (a toy rule-based
    recogniser is sketched after this list)
    - rule based
    - developed by experienced language engineers
    - make use of human intuition
    - require only a small amount of training data
    - development can be very time consuming
    - some changes may be hard to accommodate
  • Learning Systems
    - use statistics or other machine learning
    - developers do not need LE expertise
    - require large amounts of annotated training data
    - some changes may require re-annotation of the
      entire training corpus
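
To make the contrast concrete, here is a toy knowledge-engineering recogniser: a hand-built gazetteer plus one contextual rule. Every entry and rule is invented for illustration.

```python
# Hand-built gazetteer: every entry reflects human intuition,
# no training data needed.
GAZETTEER = {
    "Sheffield": "Location",
    "Manchester": "Location",
    "DARPA": "Organisation",
}

def find_entities(tokens):
    entities = []
    for i, token in enumerate(tokens):
        if token in GAZETTEER:
            entities.append((token, GAZETTEER[token]))
        # A contextual rule: a capitalised word after a title such as
        # "Dr" is probably a Person.
        elif i > 0 and tokens[i - 1] in ("Dr", "Mr", "Ms") and token.istitle():
            entities.append((token, "Person"))
    return entities

print(find_entities("Dr Maynard lectures in Sheffield".split()))
# [('Maynard', 'Person'), ('Sheffield', 'Location')]
```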

25
Typical NE pipeline
  • Pre-processing (tokenisation, sentence splitting,
    morphological analysis, POS tagging)
  • Entity finding (gazetteer lookup, NE grammars)
  • Coreference (alias finding, orthographic
    coreference etc.)
  • Export to database / XML
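
The pipeline can be pictured as function composition, each stage consuming the previous stage's output; the stage bodies below are stand-ins, not real components.

```python
# Stage bodies are stand-ins; a real system would plug in NLP tools.
def preprocess(text):
    return text.split()                       # tokenisation stand-in

def find_entities(tokens):
    return [(t, "Location") for t in tokens if t == "Sheffield"]

def coreference(entities):
    return entities                           # alias merging would go here

def export(entities):
    return [{"string": s, "type": t} for s, t in entities]

data = "Sheffield is in England"
for stage in (preprocess, find_entities, coreference, export):
    data = stage(data)
print(data)   # [{'string': 'Sheffield', 'type': 'Location'}]
```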

26
GATE and ANNIE
  • GATE (General Architecture for Text
    Engineering) is a framework for language
    processing
  • ANNIE (A Nearly New Information Extraction
    system) is a suite of language processing tools,
    which provides NE recognition
  • GATE also includes:
  • plugins for language processing, e.g. parsers,
    machine learning tools, stemmers, IR tools, IE
    components for various languages etc.
  • tools for visualising and manipulating ontologies
  • ontology-based information extraction tools
  • evaluation and benchmarking tools

27
GATE
28
Information Extraction for the Semantic Web
  • Traditional IE is based on a flat structure, e.g.
    recognising Person, Location, Organisation, Date,
    Time etc.
  • For the Semantic Web, we need information in a
    hierarchical structure
  • The idea is that we attach semantic metadata to
    the documents, pointing to concepts in an ontology
  • Information can be exported as an ontology
    annotated with instances, or as text annotated
    with links to the ontology (sketched below)
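
A sketch of the export idea using rdflib (an assumption: the slides name no RDF library); the namespace, instance and property names are made up.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

ONTO = Namespace("http://example.org/ontology#")   # made-up namespace
g = Graph()

# An instance found in the text, attached to a concept in the domain
# ontology and linked back to its source document.
person = ONTO["DianaMaynard"]
g.add((person, RDF.type, ONTO.Lecturer))
g.add((person, ONTO.mentionedIn, Literal("lecture_notes.html")))

print(g.serialize(format="turtle"))   # returns a str in rdflib 6+
```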

29
Richer NE Tagging
  • Attachment of instances in the text to concepts
    in the domain ontology
  • Disambiguation of instances, e.g. Cambridge, MA
    vs Cambridge, UK

30
Magpie
  • Developed by the Open University
  • Plugin for standard web browser
  • Automatically associates an ontology-based
    semantic layer to web resources, allowing
    relevant services to be linked
  • Provides a means for structured and informed
    exploration of web resources
  • e.g. looking at a list of publications, we can
    find information about an author such as projects
    they work on, other people they work with, etc.

31
MAGPIE in action
32
MAGPIE in action
33
Evaluation
34
Evaluation metrics and tools
  • Evaluation metrics mathematically define how to
    measure the system's performance against a
    human-annotated gold standard
  • A scoring program implements the metric and
    provides performance measures
  • for each document and over the entire corpus
  • for each type of NE
  • may also evaluate changes over time
  • A gold standard reference set also needs to be
    provided; this may be time-consuming to produce
  • Visualisation tools show the results graphically
    and enable easy comparison

35
Methods of evaluation
  • Traditional IE is evaluated in terms of Precision
    and Recall
  • Precision: how accurate were the answers the
    system produced?
  • Precision = correct answers / answers produced
  • Recall: how good was the system at finding
    everything it should have found?
  • Recall = correct answers / total possible correct
    answers
  • There is usually a tradeoff between precision and
    recall, so a weighted average of the two
    (F-measure) is generally also used.
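
The three measures, computed exactly as defined above; the counts are invented for illustration.

```python
correct = 80      # correct answers the system produced
produced = 100    # all answers the system produced
possible = 120    # all answers in the gold standard

precision = correct / produced                             # 0.800
recall = correct / possible                                # 0.667
f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")
```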

36
GATE AnnotationDiff Tool
37
Metrics for Richer IE
  • Precision and Recall are not sufficient for
    ontology-based IE, because the distinction
    between right and wrong is less obvious
  • Recognising a Person as a Location is clearly
    wrong, but recognising a Research Assistant as a
    Lecturer is not so wrong
  • Similarity metrics need to be integrated
    additionally, such that items closer together in
    the hierarchy are given a higher score, if wrong
    (see the sketch after this list)
  • Also possible is a cost-based approach, where
    different weights can be given to each concept in
    the hierarchy, and to different types of error,
    and combined to form a single score
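
One way such a similarity metric might be realised (a sketch, not any published metric): partial credit that decays with the distance between the predicted and the correct concept in a toy hierarchy.

```python
PARENT = {                     # toy concept hierarchy: child -> parent
    "Lecturer": "AcademicStaff",
    "ResearchAssistant": "AcademicStaff",
    "AcademicStaff": "Person",
    "Person": "Entity",
    "Location": "Entity",
}

def ancestors(concept):
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def distance(a, b):
    """Edges between two concepts via their lowest common ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    for steps_a, concept in enumerate(up_a):
        if concept in up_b:
            return steps_a + up_b.index(concept)
    return None

def credit(predicted, gold):
    d = distance(predicted, gold)
    return 0.0 if d is None else 1.0 / (1 + d)

print(credit("Lecturer", "Lecturer"))           # 1.0   exact match
print(credit("ResearchAssistant", "Lecturer"))  # 0.33  near miss
print(credit("Location", "Lecturer"))           # 0.2   far apart
```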

38
Visualisation of Results
39
Visualisation of Results
  • Cluster Map example
  • Traditionally used to show documents classified
    according to topic
  • Here shows instances classified according to
    concept
  • Enables analysis, comparison and querying of
    results
  • Examples here created by Marta Sabou (Free
    University of Amsterdam) using Aduna software

40
The principle: Venn diagrams
Documents classified according to topic
41
Jobs by region
Instances classified by concept
42
Concept distribution
Shows the relative importance of different
concepts
43
Correct and incorrect instances attached to
concepts
44
Summary
  • Introduction to text mining and the semantic web
  • How traditional information extraction
    techniques, including visualisation and
    evaluation, can be extended to deal with the
    complexity of the Semantic Web
  • How text mining can help the progression of the
    Semantic Web

45
Research questions
  • Automatic annotation tools are currently mainly
    domain- and ontology-dependent, and work best on
    a small scale
  • Tools designed for large scale applications lose
    out on accuracy
  • Ontology population works best when the ontology
    already exists, but how do we ensure accurate
    ontology generation?
  • There is a need for large-scale evaluation
    programs

46
Some useful links
  • NaCTeM (National Centre for Text Mining)
  • http://www.nactem.ac.uk
  • GATE
  • http://gate.ac.uk
  • KIM
  • http://www.ontotext.com/kim/
  • h-TechSight
  • http://www.h-techsight.org
  • Magpie
  • http://www.kmi.open.ac.uk/projects/magpie