Text in Oracle: The Search Platform and Ultra Search

Slides: 55
Provided by: analys4
Learn more at: http://www.nocoug.org
1
Text in Oracle: The Search Platform and Ultra Search
Omar Alonso, Senior Product Manager, Oracle Corp.
Stefan Buchta, Principal Product Manager, Oracle Corp.
NoCOUG, May 16th, 2001
2
Agenda
  • What is Oracle Text?
  • Introducing Oracle Text
  • Text in the database: Why Integration is Key
  • Performance and scalability
  • Ease of Use
  • Global Solutions
  • Search Quality
  • Specialized Indexes
  • XML
  • Document Services
  • Ultra Search
  • Summary

3
What is Oracle Text?
  • Formerly known as interMedia Text
  • Oracle Text adds powerful text search and
    intelligent text management capabilities to the
    Oracle database.
  • Oracle Text:
  • Is fully integrated with the database
  • Offers premier text search quality
  • Provides several advanced features for text
    management, document services, XML, etc.
  • Has the best internationalization set of features
    for multilingual text search applications.

4
Introducing Oracle Text: An example
    create index description_idx on
      PRODUCT_INFORMATION(PRODUCT_DESCRIPTION)
      indextype is ctxsys.context;

    select score(1), product_id, product_name
    from product_information
    where contains(product_description,
                   'monitor NEAR "high resolution"', 1) > 0
    order by score(1) desc;

    SCORE(1) PRODUCT_ID PRODUCT_NAME
    -------- ---------- ------------------------------
          29       3331 Monitor 21/HR
          27       3060 Monitor 17/HR
          14       1726 LCD Monitor 11/PM
          14       3054 Plasma Monitor 10/XGA
          14       2252 Monitor 21/HR/M
          14       2243 Monitor 17/HR/F
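
The NEAR query above matches rows in which the search terms occur close to each other. The sketch below illustrates that idea with a small positional inverted index in Python. It is not how Oracle Text is implemented; the sample descriptions, the `near` helper, and its `max_distance` parameter are assumptions made for illustration only.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each lower-cased token to {doc_id: [token positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            index[token][doc_id].append(position)
    return index


def near(index, first, second, max_distance):
    """Return {doc_id: smallest gap} for documents in which both terms
    occur no more than max_distance token positions apart."""
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    hits = {}
    for doc_id in first_postings.keys() & second_postings.keys():
        gap = min(abs(p - q)
                  for p in first_postings[doc_id]
                  for q in second_postings[doc_id])
        if gap <= max_distance:
            hits[doc_id] = gap
    return hits


# Hypothetical product descriptions, keyed by product_id.
descriptions = {
    3331: "monitor with high resolution display",
    1726: "lcd monitor for power users with low power draw and modest resolution",
}
index = build_positional_index(descriptions)
print(near(index, "monitor", "resolution", max_distance=3))
```

A smaller gap could be turned into a higher score; the scoring formula Oracle Text actually uses is not shown on the slide and is not reproduced here.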

5
Integration with the database
  • The attempt to separate text and normal business
    (structured) data fails because of:
  • Cost
  • Complexity
  • High latency of development and deployment
  • Performance

6
No Integration - Separate Everything
[Diagram: two separate stacks. Text: file system repository, inverted index, C API as the search engine interface. Structured data: Oracle Database repository, B-Tree index, SQL.]
7
Full Integration: text, index, API, optimizer
[Diagram: a single stack. Repository: Oracle Database. Index: B-Tree. Search engine (API): SQL.]
8
Integration Benefits
  • Low cost
  • Low complexity
  • High performance
  • High integrity
  • Manageability
  • Leveraging existing skills

9
Oracle Uses Oracle Text
  • Oracle Internet File System
  • Oracle Portal
  • Oracle CRM
  • Oracle E-Business Suite
  • Oracle eXchange
  • Ultra Search
  • Oracle.com
  • OTN

10
Oracle Internet File System
11
Oracle E-Business Suite
12
Performance illustration
  • Large doc set: 100 GB (20 million web pages)
  • Hardware: Enterprise SPARC
  • Task: web query
  • Web-style query syntax
  • 2-3 words
  • Return first 100 hits
  • 40 queries/second
  • 90% of requests take < 0.5 second
  • 7 hours to create index
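
The figures on this slide imply a few derived rates that are easy to check. The arithmetic below is a back-of-the-envelope sketch that uses only the slide's numbers; reading "100Gig" as decimal gigabytes is an assumption.

```python
DOC_COUNT = 20_000_000        # web pages, from the slide
CORPUS_BYTES = 100 * 10**9    # assumption: "100Gig" read as decimal gigabytes
INDEX_HOURS = 7               # index creation time, from the slide
QUERIES_PER_SECOND = 40       # query throughput, from the slide

index_seconds = INDEX_HOURS * 3600
docs_per_second = DOC_COUNT / index_seconds
megabytes_per_second = CORPUS_BYTES / index_seconds / 10**6
average_page_bytes = CORPUS_BYTES / DOC_COUNT
queries_per_hour = QUERIES_PER_SECOND * 3600

print(f"indexing rate: {docs_per_second:.0f} pages/s, "
      f"{megabytes_per_second:.2f} MB/s")
print(f"average page size: {average_page_bytes:.0f} bytes")
print(f"queries per hour: {queries_per_hour}")
```

Under that assumption the index build works out to roughly 794 pages per second (about 4 MB/s) on pages averaging 5,000 bytes, and the query rate corresponds to 144,000 queries per hour.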

13
Performance: Query throughput
  • Oracle Text vs one of the best-known specialist
    Text search engines

14
Ease of Use, Ease of Development
  • Simple SQL and PL/SQL interface
  • Can be used by any developer that knows SQL
  • Can be called by any tool that knows SQL
  • Using any language: Java, JSP, PL/SQL, C, etc.
  • Choice of datastores
  • Stored in the database
  • Stored in the file system
  • Stored on the web (URL)
  • User-defined datastore

15
Global Solutions
  • Basic indexing/search works in any NLS language
  • Special support for Japanese, Chinese, Korean
  • Theme search and services available in any
    single-byte, white space-delimited language
  • Can mix languages, character sets in a single
    column
  • Can query across languages

16
Chinese, Japanese, Korean Text
  • Character sets
  • Japanese: JA16SJIS, JA16EUC
  • Simplified Chinese: GBK, GB2312-80
  • Traditional Chinese: BIG5, EUC, TRIS
  • Korean: KO16KSC5601
  • Unicode: UTF8
  • Lexing
  • Lexical segmentation for Japanese, Chinese
  • Morphological segmentation for Korean

17
Multilingual Search
18
Cross-language queries
  • Can mix languages, character sets within a
    document collection (e.g. Chinese and English
    documents)
  • Can use English to query e.g. Chinese terms or
    vice versa.
  • Query a term which is expressed differently in
    simplified and traditional Chinese.
      select score(1), product_id, product_name
      from product_information
      where contains(product_description,
                     'TRSYN(monitor, Chinese)', 1) > 0
      order by score(1) desc
  • Find products whose description contains
    'monitor' or its Chinese equivalents.

19
Search Quality
  • Exact word
  • Boolean expression
  • Phrase
  • Proximity
  • Fuzzy
  • Stemming
  • Wildcards
  • Prefix, substring index
  • Thesaurus, multiple Thesauri
  • ABOUT search
  • Theme (concept-based) search
  • Accumulate scores
  • Term weighting
  • Advanced XML search
  • XPath support
  • Query Feedback

20
ABOUT themes and theme queries
  • "We ordered a bottle of chardonnay to go with the
    fish, and cabernet sauvignon for the steak."
  • select id from docs
    where contains(text, 'ABOUT(wine)') > 0
  • The knowledge base allows Oracle Text to
    associate words and concepts.
  • Knowledge base contains over 400,000 concepts.
  • You can extend the knowledge base to include:
  • Words and concepts from your specialist field
    e.g. medicine
  • Associations of words and spellings to guide
    novice/internet users
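
To make the word-to-concept association concrete, here is a minimal Python sketch of a theme lookup. The four-entry `KNOWLEDGE_BASE` and the helper names are hypothetical; the real knowledge base is far larger and is not structured this simply.

```python
import re

# Hypothetical miniature knowledge base: word -> concepts it belongs to.
KNOWLEDGE_BASE = {
    "chardonnay": {"wine"},
    "cabernet": {"wine"},
    "fish": {"food"},
    "steak": {"food"},
}


def extract_themes(text, knowledge_base):
    """Collect every concept associated with a word in the text."""
    themes = set()
    for token in re.findall(r"[a-z]+", text.lower()):
        themes |= knowledge_base.get(token, set())
    return themes


def about(docs, concept, knowledge_base):
    """Return ids of documents whose themes include the concept,
    even when the concept word itself never appears in the text."""
    return sorted(doc_id for doc_id, text in docs.items()
                  if concept in extract_themes(text, knowledge_base))


docs = {
    1: "We ordered a bottle of chardonnay to go with the fish, "
       "and cabernet sauvignon for the steak.",
    2: "The index was rebuilt overnight.",
}
print(about(docs, "wine", KNOWLEDGE_BASE))
```

Document 1 matches the concept "wine" although the word does not occur in it, which is the behavior the ABOUT query on the slide relies on. Extending the knowledge base corresponds to adding entries to the mapping.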

21
Catalog Index
  • Optimized for response time on small text fields
  • True transactional DML
  • Supports structured query, including range query
  • Subset of CONTEXT query language
  • No fuzzy, stemming, about
  • User-friendly web-like query syntax

22
Classification
  • CTXRULE is an index type designed for
    classification/routing applications
  • Efficiently take a document and find matching
    queries

[Diagram: Classification Application. Incoming documents are compared against rules; matched documents lead to "Perform Action". Labeled 9i.]
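
Classification inverts the usual search: the queries are stored, and each incoming document is tested against them. The Python sketch below shows that flow under simplifying assumptions. The rule format (a set of terms that must all be present) and the names `classify` and `rules` are hypothetical and much simpler than real query rules.

```python
import re


def tokenize(text):
    """Lower-case the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def classify(document, rules):
    """Return the names of all rules whose terms all appear in the document."""
    tokens = tokenize(document)
    return sorted(name for name, terms in rules.items() if terms <= tokens)


# Hypothetical routing rules: category name -> terms that must all be present.
rules = {
    "database": {"oracle", "index"},
    "wine": {"chardonnay"},
    "xml": {"xml", "xpath"},
}

incoming = "Oracle Text builds an index over XML documents"
print(classify(incoming, rules))
```

The returned category names are what a routing application would act on, for example by forwarding the document to a matching queue.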
23
Prefix, substring index
  • Prefix and Substring are flavors of the CONTEXT
    index
  • Prefix will add more tokens to the CONTEXT index
    to efficiently process prefix searches (e.g.
    'ora')
  • Substring will add an index on substrings of each
    token, to efficiently process substring searches
    (e.g. 'oxy')
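
The trade-off on this slide is extra index entries in exchange for faster wildcard lookups. The Python sketch below shows the general idea only; the functions and the `min_len`, `max_len`, and `length` parameters are assumptions for illustration and do not describe Oracle Text's storage format.

```python
from collections import defaultdict


def build_prefix_index(tokens, min_len=1, max_len=3):
    """Map every prefix of length min_len..max_len to the tokens that start with it."""
    index = defaultdict(set)
    for token in tokens:
        for n in range(min_len, min(max_len, len(token)) + 1):
            index[token[:n]].add(token)
    return index


def build_substring_index(tokens, length=3):
    """Map every substring of the given length to the tokens that contain it."""
    index = defaultdict(set)
    for token in tokens:
        for start in range(len(token) - length + 1):
            index[token[start:start + length]].add(token)
    return index


tokens = ["oracle", "oracular", "oxygen", "epoxy"]
prefix_index = build_prefix_index(tokens)
substring_index = build_substring_index(tokens)

print(sorted(prefix_index["ora"]))     # tokens starting with "ora"
print(sorted(substring_index["oxy"]))  # tokens containing "oxy"
```

With these structures a prefix or substring search becomes a single dictionary lookup instead of a scan over every token.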

24
Storing XML in Oracle
  • Decomposition
  • decompose documents into atomic elements
  • store elements in columns/rows
  • compose XML documents using SQL
  • xmltype
  • store XML as xmltype, use xmltype methods
  • Store as LOB or varchar
  • Store XML as-is, in a LOB or VARCHAR
  • Search using Oracle Text section searching or
    XPath

25
Content search and XML
  • Create index
      create index BOOKINDEX on BOOKS(text)
        indextype is ctxsys.context
  • Query by content
      select PRICE from BOOKS
      where contains(text, 'Harry Potter') > 0
      order by price desc
  • Create index to include section info
      create index BOOKINDEX on BOOKS(text)
        indextype is ctxsys.context
        parameters ('section group my_auto_section_group')
  • Limit content search to a section of text
      select price from books
      where contains(text, 'Harry Potter within title') > 0
      order by price desc

26
Advanced XML searches
  • Nested section search
      <movie><title>The Matrix</title></movie>
      <book><title>Introduction to Matrix Algebra</title></book>
      select price from media
      where contains(desc, 'matrix within title within movie') > 0
  • Search inside attribute values
      <book author="Barry Hughart">Bridge of Birds</book>
      select title from books
      where contains(text, 'Hughart within book@author') > 0
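
The semantics of the two searches on this slide can be sketched outside the database with Python's standard `xml.etree.ElementTree` module. The helpers `within` and `within_attribute` are hypothetical names; they mimic what the nested section and attribute searches match, not how Oracle Text evaluates them.

```python
import xml.etree.ElementTree as ET


def within(xml_text, term, path):
    """True if term occurs in the text of an element reached by following
    path, given outermost tag first, e.g. ["movie", "title"]."""
    root = ET.fromstring(xml_text)
    candidates = list(root.iter(path[0]))
    for tag in path[1:]:
        candidates = [child
                      for element in candidates
                      for child in element.iter(tag)
                      if child is not element]
    return any(term.lower() in "".join(element.itertext()).lower().split()
               for element in candidates)


def within_attribute(xml_text, term, tag, attribute):
    """True if term occurs in the named attribute of any element with that tag."""
    root = ET.fromstring(xml_text)
    return any(term.lower() in element.get(attribute, "").lower().split()
               for element in root.iter(tag))


movie = "<movie><title>The Matrix</title></movie>"
book = "<book><title>Introduction to Matrix Algebra</title></book>"
novel = '<book author="Barry Hughart">Bridge of Birds</book>'

print(within(movie, "matrix", ["movie", "title"]))
print(within(book, "matrix", ["movie", "title"]))
print(within_attribute(novel, "Hughart", "book", "author"))
```

Only the movie document satisfies the nested search, because in the book document the matching title is not inside a movie element.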

27
More advanced XML searches
  • map multiple tags to same name
      <H1>The Diamond Age</H1>
      <H2>or, A Young Lady's Illustrated Primer</H2>
      (map H1 and H2 to section name of headline)
      select author from articles
      where contains(text, 'Diamond within headline') > 0
  • doctype limiters to handle tag collisions
      <!DOCTYPE foo> <address>king@world.com
      <!DOCTYPE bar> <address>123 Meheula Pkwy
      map (foo)address to email, (bar)address to address

28
Document Services
  • Extract Themes (major concepts)
  • Extract hierarchical structure
  • Extract Gist
  • Generic or Point-of-View
  • Sentence- or Paragraph- level
  • View a document as HTML
  • Highlight search terms, highlight navigation
  • Return results in a table or a PL/SQL table
  • Basis for Clustering, More Like This,

29
Summary
  • Fully integrated with the database
  • Premier text search quality
  • Advanced features for text management, document
    services, and XML.
  • Best multilingual features in the market.

30
Introducing Oracle Ultra Search
31
Issues in Corporate Search
  • Information Management Crisis
  • Explosive Growth of Information flowing over
    corporate Intranets.
  • Knowledge scattered across IT repositories,
    billions of documents, and data fragments.
  • Non-Uniform Information
  • Structured in databases.
  • Unstructured - Word processing doc.,
    presentations.

32
Impacts of Bad Search
  • Customers - Turn to competitors' Web sites.
  • Employees - Waste time and money on useless
    searches.
  • Oracle Ultra Search
  • Solves problem of finding relevant information.
  • Across your company's many disparate information
    repositories.

33
Oracle Ultra Search
  • Out-of-the-Box solution that
  • Searches text across multiple repositories
  • Databases, HTML Pages, Files, Mail Servers.
  • Provides the best relevance ranking and
    globalization in the industry.
  • Provides value added Portal functionality.
  • Presents Web style interface.
  • Built onto Oracle's proven, reliable Text
    Retrieval software and Oracle9i server.

34
Oracle Ultra Search
[Screenshot: search results page with callouts for document title and relevance.]
35
Ultra Search Applications
  • Portal Search
  • Most powerful search for Oracle9iAS Portal.
  • Build your own portal.
  • Special Portlet crawls inside and outside of
    Portal Repository.
  • Canned Web Search for Oracle Text
  • Library or Archive Search
  • Content Management Platform Search

36
Search Multiple Repositories
37
Value Added Portal Functionality
  • Canned Web-Style Search
  • Aggregates Information For Indexing
  • Documents stay in their own repositories.
  • Search returns normalized results, uniformly
    ranked by relevance.
  • Organize and Categorize Content From Multiple
    Repositories
  • Extract valuable metadata.
  • Improve search by narrowing through fielded
    search.

38
Out-of-the-Box Web-Style Search
  • Oracle Text Application
  • Uses public Text interfaces.
  • Enhanced with expertise about gathering and
    indexing information for best quality search.
  • No coding against low level APIs.
  • Oracle Text Retrieval Engine
  • Highly integrated with Oracle9i server.
  • Best interoperability with dynamic data.
  • Scalability and Reliability of Oracle platform.

39
Aggregates Information
  • Gather
  • Crawls Web, corporate repositories
  • Analyze
  • Create index required for querying, filter
  • Make Queryable
  • Embed through API
  • Maintain
  • Schedule crawling
  • Easy Administration

[Diagram: cycle of Gather, Analyze, Make Queryable, Maintain.]
40
Powerful Fielded Search
  • Narrow search to parts of document - title, body,
    name of author.
  • Extract and use repository metadata
  • Word processing documents: Author, Title.
  • Databases: Identify Columns.
  • Email: Header/Body/Attachment.
  • Unify repositories in common, logical terms.
  • Uniform set of results, ranked by overall
    relevance.

41
Flexible Metadata Mapping
[Diagram: a search term is applied to metadata fields, which are mapped to the repositories.]
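
One way to picture metadata mapping is a table that translates each repository's native field names into a shared set of attributes, so that one fielded query runs over all sources. The Python sketch below is illustrative only: `FIELD_MAP`, the repository names, the native field names, and the sample records are all hypothetical and do not come from Ultra Search.

```python
# Hypothetical mapping: repository type -> {native field: common attribute}.
FIELD_MAP = {
    "word_document": {"Author": "author", "Title": "title"},
    "database_table": {"CREATED_BY": "author", "DOC_NAME": "title"},
    "email": {"From": "author", "Subject": "title"},
}


def normalize(repository, record):
    """Rename a record's native fields to the common attribute names."""
    mapping = FIELD_MAP[repository]
    return {common: record[native]
            for native, common in mapping.items()
            if native in record}


def fielded_search(records, field, term):
    """Return normalized records whose given field contains the term."""
    return [record for record in records
            if term.lower() in record.get(field, "").lower()]


records = [
    normalize("word_document", {"Author": "A. Writer", "Title": "Crawler setup"}),
    normalize("database_table", {"CREATED_BY": "B. Admin", "DOC_NAME": "Index notes"}),
    normalize("email", {"From": "A. Writer", "Subject": "Schedule change"}),
]
print(fielded_search(records, "author", "writer"))
```

The search on the common `author` attribute finds both the word-processing document and the e-mail, even though the two sources name that field differently.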
42
Ultra Search Architecture
43
Architecture
  • Simple, Robust Architecture Built on:
  • Oracle9i Server Platform
  • Oracle's Text Retrieval Engine
  • Flexible Deployment
  • Server-Tier
  • Mid-Tier

44
Ultra Search Components
  • Crawler
  • Server Component
  • Query API Application
  • Administration Tool
  • Mail API

45
Ultra Search Crawler
  • Multi-Threaded Java process.
  • Gathers documents from repositories you specify
    on a set schedule.
  • Maps and analyzes link relationships.
  • Filters (150) Non-HTML Documents, extracts
    valuable metadata.
  • Indexes documents and data fragments.
  • Flexible Configuration
  • Run on one or more machines (remote crawling)

46
Ultra Search Crawler
  • Set Inclusion/Exclusion Domains
  • Limit crawling to corporate net or specific
    sections of it.
  • Maintain Fresh Search Results
  • Set crawling schedules for each Web site or
    repository.
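
The inclusion/exclusion rule can be sketched as a host-name check applied to every URL before it is fetched. The Python below is a minimal illustration; `should_crawl`, the domain lists, and the rule that exclusions win over inclusions are assumptions, not the documented behavior of the Ultra Search crawler.

```python
from urllib.parse import urlparse


def host_matches(host, domain):
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def should_crawl(url, include_domains, exclude_domains):
    """Crawl a URL only if its host is inside an inclusion domain and
    outside every exclusion domain (exclusions take precedence here)."""
    host = (urlparse(url).hostname or "").lower()
    if any(host_matches(host, domain) for domain in exclude_domains):
        return False
    return any(host_matches(host, domain) for domain in include_domains)


# Hypothetical configuration for a corporate network.
include = ["example.com"]
exclude = ["private.example.com"]

for url in ("http://intranet.example.com/docs",
            "http://hr.private.example.com/salaries",
            "http://www.other.org/"):
    print(url, should_crawl(url, include, exclude))
```

Matching on whole domain labels, rather than on a raw string suffix, keeps a host such as `notexample.com` from slipping into the `example.com` inclusion list.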

47
Crawling Abilities
  • Web Sites (HTTP Protocol)
  • Database Tables
  • Oracle and any ODBC compliant database.
  • Local (Ultra Search instance) or remote database
  • Crawls both fulltext and fielded columns.
  • Files (file:// Protocol)
  • Ultra Search filters, extracts text and metadata.
  • Emails (IMAP Protocol)
  • Crawl and index mailing lists through IMAP.

48
Ultra Search Query API
  • Embed Ultra Search in your Portal or
    Application.
  • Customize look-and-feel to your requirements.
  • Easy integration with your application.
  • API for Java (JSP) and PL/SQL (PSP).
  • Returns data with or without HTML markup.
  • Build Basic Search Form, Search Result Form...
  • Includes Highly Functional Query Application.

49
Java Query API Illustration
50
PL/SQL Query API Illustration
51
Administration Environment
  • Browser-based, Self-Service Web Application.
  • Define Ultra Search Instances.
  • Configure and Schedule Crawler.
  • Set Query Options To Narrow Searches.
  • Document Attributes (e.g. TITLE, AUTHOR).
  • Define Data Source Groups.
  • Manage Administrative Users.

52
Administration Environment
53
Summary
  • Eliminate the Chaos Inside Your Firewalls!
  • Oracle Ultra Search
  • Crawls, Indexes, and makes searchable your
    Intranet.
  • Provides Web-style search without the need for
    coding.
  • Organizes, categorizes, and unifies content from
    multiple repositories.
  • Leverages Oracle9i platform - reliable, scalable,
    always available.

54
A
55
(No Transcript)