Information Extraction in Semantic Web

1
Information Extraction in Semantic Web
  • Amreek Singh
  • (02329025)

2
Motivation
  • The WWW is growing at an exponential rate.
  • Users will be lost in a jungle of information.
  • There is a need for a proper classifier.
  • Currently we depend on humans for classification
    and maintenance.

3
Motivation (Cont.)
  • Yahoo spends millions of dollars to classify web
    sites.
  • HTML is designed for displaying data but is not
    suitable for extracting useful information.
  • This manual way of classifying the WWW is not
    scalable.

4
The semantic web approach
  • So we need machines to understand these pages and
    be able to:
  • Classify them
  • Extract information from them

5
Extracting Information
  • Information extraction can be done by any agent
    sitting between the client and the server (which
    hosts the source pages).
  • These agents can be viewed as Semantic Agents.

6
Challenges to Developing Semantic Agent
  • So we need an agent that crawls the web and
    classifies web pages to extract information on
    its own.
  • Even with natural language understanding, it is
    very difficult, if not impossible, to make a web
    agent read and understand the web.

7
One step towards next Internet
  • HTML → XML
  • HTML → XHTML → XML
  • Imagine taking the best parts of HTML and mixing
    them with the great features of XML.

8
What Is XHTML?
  • XHTML (Extensible HTML) is a standard developed to
    help efficient extraction of information from
    web pages.
  • Link: http://www.xhtml.org/

9
Issues in XHTML
  • Documents must be well formed, i.e. elements must
    have closing tags and be nested properly.
  • Element and attribute names must be in lower
    case.
  • Attribute values must be quoted.
  • For empty elements, either form can be used:
  • <br></br> or <br />
  • Inline style sheets and JavaScript should be moved
    to separate files (or wrapped in CDATA sections).
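
The rules above can be checked mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser stands in for an XHTML-aware agent; the helper names is_well_formed and has_lowercase_names are illustrative and do not come from the slides.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the markup parses as XML (closed, nested, quoted)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


def has_lowercase_names(markup):
    """Return True if every element and attribute name is lower case."""
    root = ET.fromstring(markup)
    return all(
        el.tag == el.tag.lower() and all(k == k.lower() for k in el.attrib)
        for el in root.iter()
    )


# Both empty-element forms are accepted by an XML parser.
ok_short = is_well_formed("<p>line<br /></p>")
ok_long = is_well_formed("<p>line<br></br></p>")
# HTML-style markup that breaks the rules is rejected.
bad_unclosed = is_well_formed("<p>line<br></p>")
bad_unquoted = is_well_formed("<a href=x>y</a>")
```

A generic XML parser only enforces well-formedness; the lower-case rule is specific to XHTML, which is why it needs the separate check.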

10
Benefits of XHTML
  • XHTML documents are well formed, which leads to
    quicker and smaller parsers.
  • A significant limitation of HTML is its FORM
    field.
  • Built-in functions remove the need to use
    JavaScript as heavily as in the past.
  • Voice or other input methods can be added.
  • Data is transmitted in XML format.

11
Benefits of XHTML (cont.)
  • XHTML is now a standard.
  • XHTML documents are backward compatible.
  • XHTML can utilize applications (scripts and
    applets) that rely on either the HTML DOM or the
    XML DOM.
  • The next Internet world is one of XML, and XHTML
    is a step towards it.

12
Real Power of XHTML
  • XHTML can be directly transformed into another
    XML structure, but HTML cannot.
  • XPath gives you a way to perform complex queries
    on the nodes and extract data at any level of
    complexity.
  • Traditionally, index servers use a brute-force
    method to index a site, recording the position of
    given words in their respective files.
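
As an illustration of querying nodes with XPath, here is a minimal sketch using the limited XPath subset supported by Python's standard xml.etree.ElementTree. The sample page, its class names, and the select helper are assumptions made for the example, not material from the slides.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map so the XPath expressions can name XHTML elements.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical XHTML page used only for this example.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Contact</title></head>
  <body>
    <div class="address"><p>12 Example Road</p></div>
    <div class="news"><p>Nothing new.</p></div>
    <a href="http://www.xhtml.org/">XHTML</a>
  </body>
</html>"""


def select(page, path):
    """Parse the page and return the elements matching the XPath expression."""
    return ET.fromstring(page).findall(path, XHTML_NS)


# Text of every paragraph inside a div whose class is "address".
addresses = [
    "".join(node.itertext()).strip()
    for node in select(PAGE, ".//x:div[@class='address']/x:p")
]
# Target of every link that carries an href attribute.
links = [a.get("href") for a in select(PAGE, ".//x:a[@href]")]
```

Because the page is well-formed XML, the query addresses the document structure directly instead of scanning for word positions.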

13
(Cont.)
  • User agents that access XHTML documents served as
    text/html can use the HTML DOM.
  • User agents that access XHTML documents served as
    text/xml can also use the XML DOM.

14
Two approaches to embed XHTML in HTML pages
  • Authors manually annotate their web pages w.r.t.
    one or more ontologies provided.
  • Automatic annotation of web pages with machine
    learning techniques.

15
Issues with first approach
  • The first method is of no use if authors do not
    annotate their web pages.
  • The information provided is totally dependent on
    the ontology used by the author.
  • While updating, an author may update only the
    HTML document and forget to update the
    annotations.

16
Issues with the second approach
  • Machine learning techniques have proven to be of
    great practical value in many applications, such
    as data mining and speech recognition.
  • However, it is very difficult for a program to
    guess what the author of a page was thinking.

17
Semantic Web
  • A web in which information has well-defined
    meaning, enabling computers and people to work in
    cooperation.
  • It makes machines better understand the data
    rather than just display it.

18
Approaches adopted by the Semantic Web
  • How to add logic to the web.
  • Define a language that expresses both the data
    and the rules for reasoning about the data.
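
To make "data plus rules for reasoning" concrete, here is a minimal sketch in which facts are stored as (subject, predicate, object) triples and a single rule is applied until no new facts appear. The predicate names is_a and instance_of and the sample facts are assumptions for illustration; the slides do not name a specific language.

```python
# Hypothetical facts expressed as (subject, predicate, object) triples.
FACTS = {
    ("house", "is_a", "address"),
    ("home", "is_a", "address"),
    ("12 Example Road", "instance_of", "house"),
}


def infer(facts):
    """Apply one rule to a fixed point:
    if X instance_of C and C is_a D, then X instance_of D."""
    known = set(facts)
    while True:
        new = {
            (x, "instance_of", d)
            for (x, p1, c) in known
            if p1 == "instance_of"
            for (c2, p2, d) in known
            if p2 == "is_a" and c2 == c
        }
        if new <= known:  # nothing new was derived, so stop
            return known
        known |= new


closure = infer(FACTS)
```

After inference, the data also states that "12 Example Road" is an address, a fact no author wrote down explicitly.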

19
Simulation of Semantic Web
  • The input is a set of HTML web sites that are
    annotated with special tags.
  • First, the program understands the meaning of the
    search string.
  • It then extracts the domain of the given web
    pages.
20
  • The program extracts the required information
    from the web pages in such a way that it also
    extracts related information.
  • E.g. if a search is given for address,
  • the program will search for address as well as
    location, home, house, etc.
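
The search behaviour described above can be sketched as follows. This is a minimal illustration, not the presenter's program: the related-terms table holds only the "address" example from the slide, and the annotated page with its tag names is hypothetical.

```python
import xml.etree.ElementTree as ET

# Related terms taken from the slide's single example; a real system
# would need a much larger table or an ontology.
RELATED_TERMS = {"address": ["location", "home", "house"]}

# Hypothetical annotated page; the tag names are illustrative only.
PAGE = """<page>
  <name>Example Person</name>
  <home>12 Example Road</home>
  <location>Example City</location>
  <phone>000-0000</phone>
</page>"""


def expand(term):
    """Return the search term followed by its related terms."""
    term = term.lower()
    return [term] + RELATED_TERMS.get(term, [])


def search(page, term):
    """Return (tag, text) pairs for elements matching the expanded term."""
    wanted = set(expand(term))
    root = ET.fromstring(page)
    return [
        (el.tag, (el.text or "").strip())
        for el in root.iter()
        if el.tag in wanted
    ]


results = search(PAGE, "address")
```

A search for "address" finds the home and location entries even though no element on the page is tagged as address.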

22
  • Thank You