CORPORUM-OntoExtract - PowerPoint PPT Presentation

Title:

CORPORUM-OntoExtract

Description:

CORPORUM-OntoExtract Ontology Extraction Tool. Author: Robert Engels. Company: CognIT a.s

Provided by: BYU1

Transcript and Presenter's Notes

Title: CORPORUM-OntoExtract


1
CORPORUM-OntoExtract Ontology Extraction Tool
Author: Robert Engels. Company: CognIT a.s
2
Overview
  1. On-To-Knowledge project
  2. CORPORUM
  3. CORPORUM-OntoExtract
  4. Discussion
  5. Conclusion

3
What is Knowledge Management?
Knowledge Management is the collection of
processes that govern the creation,
dissemination, and utilization of knowledge.
--- Brian Newman, 1991
4
What is the On-To-Knowledge (OTK) project?
Goals: develop tools and methods for supporting
knowledge management, relying on sharable and
reusable knowledge ontologies. The technical
backbone of On-To-Knowledge is the use of
ontologies for the various tasks of information
integration and mediation.
5
What is the On-To-Knowledge (OTK) project?
  • European project in the EU Information Society
    Technologies (IST) Program, EU-IST-10132
  • Duration: 2.5 years, January 2000 - June 2002
  • Total effort and cost: 26 person-years, 2.5 M EUR
  • Partners:
  1. CognIT a.s
  2. AIdministrator
  3. AIFB (University of Karlsruhe)
  4. BT Research
  5. Enersearch
  6. Swiss Life Information Systems Research Group

6
CognIT a.s
  • Established in Halden, Norway in 1996.
  • 20 employees - 3 with PhD
  • CORPORUM™
  • Develops technology for:
  1. intelligent search by means of agents
  2. text analysis and extraction
  3. structuring and fusing data to build knowledge
  4. knowledge bases and feedback of experience
  5. data mining and text mining

7
On-to-Knowledge workbench
  • CORPORUM-OntoExtract: extracts ontologies from
    unstructured documents and represents them in
    XML/RDF/OWL
  • CORPORUM-OntoWrapper: extracts ontologies from
    structured documents and represents them in
    XML/RDF/OWL
  • RDF-DB (Sesame)
  • RDF-Ferret: interface between users and RQL
  • OntoEdit (Ontology Editor)
  • RQL engine: queries the RDF-DB
  • DAML+OIL: representation language

8
The On-To-Knowledge system architecture
9
Introduction to CORPORUM
CORPORUM is a tool for information retrieval and
extraction developed by CognIT a.s. It can:
  • crawl the internet and intranets
  • analyze relevance and content
  • maintain a knowledge base (RDF-DB)

Features
  • focuses on the content
  • searches, cataloguing, summaries and extractions
    can be performed according to user interests
  • founded on CognIT's Mimir technology

10
The overall CORPORUM architecture
11
Introduction to CORPORUM
Core technology: MIMIR includes
  • Linguistic analysis: analyzes text at all
    linguistic levels and generates an ontology of
    interest to the user in RDF.
  • Similarity analysis: obtains the documents which
    are most pertinent to a specific analyzed text
    (information retrieval and extraction).
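
The slides do not say how Mimir measures pertinence. As a rough illustration only, the sketch below ranks documents against an analyzed text with bag-of-words cosine similarity, a standard technique swapped in here as an assumption. The names `tokenize`, `cosine` and `rank_documents`, and the sample documents, are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(analyzed_text, documents):
    """Return (name, score) pairs, most pertinent document first."""
    query = Counter(tokenize(analyzed_text))
    scored = [(name, cosine(query, Counter(tokenize(body))))
              for name, body in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data, not taken from the presentation.
docs = {
    "a": "ontologies support knowledge management",
    "b": "the weather in Halden is cold",
}
ranking = rank_documents("knowledge management with ontologies", docs)
print(ranking[0][0])
```

Document "a" shares three of its four terms with the analyzed text, while "b" shares none, so "a" is ranked first.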

12
Classical Natural Language processing
decomposed.
13
Mimir architecture
14
Introduction to CORPORUM
Information distribution
Histogram showing where the desired content in
the document can be found and to what degree it
is pertinent.
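
How CORPORUM computes this histogram is not described in the slides. A minimal sketch of the idea, assuming the document is cut into fixed-size word windows and each window is scored by the share of its words that match the desired terms, could look as follows. The functions `pertinence_histogram` and `render`, the window size and the sample text are all hypothetical.

```python
import re


def pertinence_histogram(text, desired_terms, window=5):
    """Score consecutive word windows by the share of desired terms."""
    words = re.findall(r"[a-z]+", text.lower())
    wanted = {t.lower() for t in desired_terms}
    scores = []
    for start in range(0, len(words), window):
        chunk = words[start:start + window]
        hits = sum(1 for w in chunk if w in wanted)
        scores.append(hits / len(chunk))
    return scores


def render(scores, width=20):
    """Draw one text bar per window, longer bars meaning more pertinent."""
    return ["{:>3} | {}".format(i, "#" * round(s * width))
            for i, s in enumerate(scores)]


# Hypothetical sample text: only the first sentence is on topic.
text = ("Ontology extraction builds an ontology from text. "
        "Lunch is served at noon in the canteen.")
scores = pertinence_histogram(text, {"ontology", "extraction"}, window=7)
print("\n".join(render(scores)))
```

The first window receives a non-zero bar and the later windows receive empty bars, which shows both where the desired content sits and how strongly it is represented.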
15
CORPORUM-OntoExtract
  • The web-based version of CORPORUM
  • Uses the same architecture as CORPORUM
  • Extracts ontologies from unstructured web pages
  • Represents extracted ontologies in XML/RDF/OIL

16
CORPORUM-OntoExtract
  • CMOntoBuild: takes care of overall control of the
    system and co-ordinates all information flows
  • CMWebHandler: responsible for collecting all
    (text) documents from a specific site
  • CMCogLib: analyses texts, extracts information,
    and exports a variety of formats
  • CMLexEn: language-dependent support module for
    CMCogLib
  • CMWebInteract: communication component that takes
    care of all interaction of CORPORUM-OntoExtract
    with the RDF database. Responsible for querying
    the RDF-DB, as well as submitting final analysis
    results.
  • DOMhandler: integrated in CMWebInteract, the
    OpenXML DOM handler takes care of the
    interpretation of the results which are returned
    from the RDF server

17
CORPORUM-OntoExtract performs the following tasks:
  • CMOntoBuild is invoked by the user
  • CMWebHandler is invoked by CMOntoBuild
  • CMWebHandler retrieves the specified domain
    from the intranet/internet and returns it
    to CMOntoBuild
  • CMOntoBuild passes texts to CMCogLib, which
    analyses, interprets and extracts information
    from these texts, and returns a basic RDF
    representation to CMOntoBuild
  • CMOntoBuild now analyses the generated RDF and
    queries the RDF ontology repository to try to
    find knowledge that can augment the previously
    generated RDF
  • When all querying that could be performed is
    done, and the RDF is augmented, the final RDF
    ontology for a specific document is sent to the
    RDF server together with a reference to the
    original text.
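
The flow of control described above can be sketched roughly as follows. This is not the CORPORUM code: the classes are hypothetical Python stand-ins named after the components, the "site" is an in-memory dictionary, and the "X is a Y" extraction rule is a toy assumption used only to make the sketch runnable.

```python
class WebHandler:
    """Stand-in for CMWebHandler: returns the texts of a domain."""

    def __init__(self, pages):
        self.pages = pages  # hypothetical in-memory "site": url -> text

    def retrieve(self, domain):
        return {url: text for url, text in self.pages.items()
                if url.startswith(domain)}


class CogLib:
    """Stand-in for CMCogLib: turns a text into basic triples."""

    def analyse(self, text):
        # Toy rule (an assumption): "X is a Y" yields (X, "isA", Y).
        triples = set()
        for sentence in text.split("."):
            words = sentence.split()
            if len(words) == 4 and words[1:3] == ["is", "a"]:
                triples.add((words[0], "isA", words[3]))
        return triples


class RdfRepository:
    """Stand-in for the RDF database reached through CMWebInteract."""

    def __init__(self, known):
        self.known = set(known)
        self.stored = {}

    def query(self, subject):
        return {t for t in self.known if t[0] == subject}

    def submit(self, source_url, triples):
        self.stored[source_url] = triples


class OntoBuild:
    """Stand-in for CMOntoBuild: coordinates the information flow."""

    def __init__(self, web, coglib, repo):
        self.web, self.coglib, self.repo = web, coglib, repo

    def run(self, domain):
        for url, text in self.web.retrieve(domain).items():
            basic = self.coglib.analyse(text)
            augmented = set(basic)
            # Query the repository for knowledge about each extracted class.
            for _, _, obj in basic:
                augmented |= self.repo.query(obj)
            # Submit the result with a reference to the original text.
            self.repo.submit(url, augmented)


# Hypothetical sample data.
pages = {"http://example.org/a": "Sesame is a database. It stores RDF."}
repo = RdfRepository({("database", "isA", "software")})
OntoBuild(WebHandler(pages), CogLib(), repo).run("http://example.org")
print(sorted(repo.stored["http://example.org/a"]))
```

Keeping the orchestration in one class mirrors the description of CMOntoBuild as the component in overall control, with the other components doing one job each.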

18
Client/Server based System Architecture of
CORPORUM-OntoExtract
19
The overall CORPORUM architecture
20
CORPORUM-OntoExtract output
  • Namespace definitions
  • Dublin Core based metadata
  • Property definitions
  • Ontology
  • Facts/instances
  • Cross-taxonomic relations
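
The slides do not show an actual output file. To make the six parts concrete, the sketch below builds a small RDF/XML document with Python's standard `xml.etree.ElementTree` module. Everything in it is an assumption made for illustration: the `ex` namespace, the class and instance names, and the `relatedTo` property are invented, and real OntoExtract output may be organised differently.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
DC = "http://purl.org/dc/elements/1.1/"
EX = "http://example.org/onto#"  # hypothetical namespace for the example

# 1. Namespace definitions
for prefix, uri in (("rdf", RDF), ("rdfs", RDFS), ("dc", DC), ("ex", EX)):
    ET.register_namespace(prefix, uri)


def q(ns, local):
    """Build a namespace-qualified name in ElementTree's {uri}local form."""
    return "{%s}%s" % (ns, local)


root = ET.Element(q(RDF, "RDF"))

# 2. Dublin Core based metadata about the analysed document
meta = ET.SubElement(root, q(RDF, "Description"),
                     {q(RDF, "about"): "http://example.org/doc.html"})
ET.SubElement(meta, q(DC, "title")).text = "Example page"
ET.SubElement(meta, q(DC, "language")).text = "en"

# 3. Property definitions
ET.SubElement(root, q(RDF, "Property"), {q(RDF, "ID"): "relatedTo"})

# 4. Ontology: a small class hierarchy
ET.SubElement(root, q(RDFS, "Class"), {q(RDF, "ID"): "Tool"})
sub = ET.SubElement(root, q(RDFS, "Class"), {q(RDF, "ID"): "Editor"})
ET.SubElement(sub, q(RDFS, "subClassOf"), {q(RDF, "resource"): "#Tool"})

# 5. Facts/instances
inst = ET.SubElement(root, q(EX, "Editor"), {q(RDF, "ID"): "OntoEdit"})

# 6. Cross-taxonomic relations: a link that is not a subclass link
ET.SubElement(inst, q(EX, "relatedTo"), {q(RDF, "resource"): "#Sesame"})

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

The numbered comments follow the order of the list above, so each part of the output can be located in the generated XML.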

21
Discussion on use of CORPORUM technology in
OntoExtract
Content in natural language vs. content in
structure
  • CORPORUM-OntoExtract can capture content without
    considering the layout and structure of the
    texts.
  • In some cases, the structure of texts has to be
    considered, for example in contracts and licenses.
  • CORPORUM-OntoWrapper handles such structured
    documents.

22
Discussion on use of CORPORUM technology in
OntoExtract
Diversity of web pages (unknown intention)
  • Diversity of documents on the web
  • It is difficult to analyze a text according to
    the intention of the writers
  • Combining CORPORUM-OntoExtract with
    CORPORUM-OntoWrapper might address some of these
    issues

23
Discussion on use of CORPORUM technology in
OntoExtract
Representational issues (A-box vs. T-box
reasoning)
  • TBox: a TBox consists of concept (class) inclusion
    axioms (and/or equivalence axioms), e.g., "C
    subsumes D".
  • ABox: an ABox consists of individual/tuple
    membership axioms, e.g., "x is an instance of C"
    or "<x,y> is an instance of R".
  • Most of the CORPORUM-OntoExtract generated
    knowledge is TBox knowledge.
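
A minimal sketch of the distinction, using plain Python data structures rather than any description-logic reasoner, is given below. The concept and individual names are hypothetical, and the reasoning is limited to following inclusion axioms upwards.

```python
# TBox: concept inclusion axioms, stored as child -> set of direct parents.
tbox = {
    "Editor": {"Tool"},
    "Tool": {"Software"},
}

# ABox: membership axioms for individuals and for pairs.
abox_concepts = {("OntoEdit", "Editor")}          # "x is an instance of C"
abox_roles = {(("OntoEdit", "Sesame"), "uses")}   # "<x,y> is an instance of R"


def subsumes(c, d):
    """Return True if concept c subsumes concept d under the TBox."""
    if c == d:
        return True
    seen, stack = set(), [d]
    while stack:
        current = stack.pop()
        for parent in tbox.get(current, ()):
            if parent == c:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False


def is_instance(x, c):
    """Combine ABox facts with TBox subsumption to check membership."""
    return any(ind == x and subsumes(c, asserted)
               for ind, asserted in abox_concepts)


print(subsumes("Software", "Editor"))
print(is_instance("OntoEdit", "Software"))
print(is_instance("Sesame", "Tool"))
```

TBox reasoning answers questions about concepts alone (`subsumes`), while ABox reasoning needs the asserted facts about individuals as well (`is_instance`).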

24
Discussion on use of CORPORUM technology in
OntoExtract
Domain specificity of extracted knowledge
  • Since the ontologies are extracted from specified
    domains, the extracted information is expected to
    be restricted to these domains.
  • Positive: many of the searches will also be
    rather domain specific, so knowledge about
    cross-taxonomic relations might come in very
    handy.
  • Negative: one may want to build up
    domain-independent knowledge bases.

25
Conclusion
  • CORPORUM helps the web become more semantic.
  • Semantic-based technology.
  • Enhance usability of formal knowledge
    representations for end-users
  • Decrease initial efforts when defining an
    ontology in new domains

26
Conclusion
  • Dynamicity of the analysis, i.e. ease of use in
    dynamic environments
  • Offer new ways of navigating knowledge bases and
    document sets by visualization of contents and
    by means of semantic-based, graphic structures
  • Extraction of content-based metadata from
    documents, such as important concepts, semantic
    structures, etc.
  • Ability to offer domain-specific information as
    related keywords

27
Comments
  • Description is too general. No examples and
    details.
  • Weak sentences. Complicated sentence structures.

28
Questions