Title: Text in Oracle: The Search Platform and Ultra Search
1 Text in Oracle: The Search Platform and Ultra Search
Omar Alonso, Senior Product Manager, Oracle Corp.
Stefan Buchta, Principal Product Manager, Oracle Corp.
NoCOUG, May 16th 2001
2 Agenda
- What is Oracle Text?
- Introducing Oracle Text
- Text in the database: Why Integration is Key
- Performance and scalability
- Ease of Use
- Global Solutions
- Search Quality
- Specialized Indexes
- XML
- Document Services
- Ultra Search
- Summary
3 What is Oracle Text?
- Formerly known as interMedia Text
- Oracle Text adds powerful text search and intelligent text management capabilities to the Oracle database.
- Oracle Text
- Fully integrated with the database
- Offers premier text search quality
- Provides several advanced features for text management, document services, XML, etc.
- Has the best set of internationalization features for multilingual text search applications.
4 Introducing Oracle Text: An example
- create index description_idx on
- PRODUCT_INFORMATION(PRODUCT_DESCRIPTION)
- indextype is ctxsys.context
- select score(1), product_id, product_name
- from product_information
- where contains (product_description, 'monitor NEAR "high resolution"', 1) > 0
- order by score(1) desc
- SCORE(1) PRODUCT_ID PRODUCT_NAME
- -------- ---------- ------------------------------
- 29 3331 Monitor 21/HR
- 27 3060 Monitor 17/HR
- 14 1726 LCD Monitor 11/PM
- 14 3054 Plasma Monitor 10/XGA
- 14 2252 Monitor 21/HR/M
- 14 2243 Monitor 17/HR/F
5 Integration with the database
- The attempt to separate text and normal business (structured) data fails
- Cost
- Complexity
- High latency of development and deployment
- Performance
6 No Integration - Separate Everything
[Diagram labels: Repository (File System; Oracle Database), Index (Inverted; B-Tree), Search Engine (API) (C API; SQL)]
7 Full Integration: text, index, API, optimizer
[Diagram labels: Repository (Oracle Database), Index (B-Tree), Search Engine (API) (SQL)]
8 Integration Benefits
- Low cost
- Low complexity
- High performance
- High integrity
- Manageability
- Leveraging existing skills
9 Oracle Uses Oracle Text
- Oracle Internet File System
- Oracle Portal
- Oracle CRM
- Oracle E-Business Suite
- Oracle eXchange
- Ultra Search
- Oracle.com
- OTN
10 Oracle Internet File System
11 Oracle E-Business Suite
12 Performance illustration
- Large doc set: 100 GB (20 million web pages)
- Hardware: Enterprise SPARC
- Task: web query
- Web-style query syntax
- 2-3 words
- Return first 100 hits
- 40 queries/second
- 90% of requests take < 0.5 second
- 7 hours to create index
13 Performance: Query throughput
- Oracle Text vs. one of the best-known specialist text search engines
14 Ease of Use, Ease of Development
- Simple SQL and PL/SQL interface
- Can be used by any developer that knows SQL
- Can be called by any tool that knows SQL
- Using any language: Java, JSP, PL/SQL, C, etc.
- Choice of datastores
- Stored in the database
- Stored in the file system
- Stored on the web (URL)
- User-defined datastore
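The choice of datastore is expressed as a preference that is passed to the index. A minimal sketch for text stored in the file system, assuming a hypothetical table `doc_files` and a hypothetical server directory `/data/docs` (neither is from the slides):

```sql
-- Hypothetical table: each row names a file on the database server.
create table doc_files (
  id        number primary key,
  file_name varchar2(200)
);

begin
  -- FILE_DATASTORE: the indexed column holds file names, not the text itself.
  ctx_ddl.create_preference('my_file_ds', 'FILE_DATASTORE');
  -- Assumed directory; file names in the column are resolved against it.
  ctx_ddl.set_attribute('my_file_ds', 'PATH', '/data/docs');
end;
/

create index doc_files_idx on doc_files(file_name)
  indextype is ctxsys.context
  parameters ('datastore my_file_ds');

-- Queries look the same whatever the datastore is.
select id from doc_files
where contains(file_name, 'monitor') > 0;
```

The other options on the slide follow the same pattern with a different preference type: `DIRECT_DATASTORE` (the default, text in the column), `URL_DATASTORE` (the column holds URLs) and `USER_DATASTORE` (a stored procedure assembles the text).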
15 Global Solutions
- Basic indexing/search works in any NLS language
- Special support for Japanese, Chinese, Korean
- Theme search and services available in any single-byte, whitespace-delimited language
- Can mix languages, character sets in a single column
- Can query across languages
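Mixing languages in one column is typically handled with a `MULTI_LEXER` preference, which picks a lexer per row from a language column. A minimal sketch, assuming a hypothetical table `global_docs` and a database character set that can hold Japanese text (for example UTF8):

```sql
-- Hypothetical table: the lang column names the language of each row.
create table global_docs (
  id   number primary key,
  lang varchar2(30),
  text clob
);

begin
  ctx_ddl.create_preference('english_lexer', 'BASIC_LEXER');
  ctx_ddl.create_preference('japanese_lexer', 'JAPANESE_VGRAM_LEXER');

  -- MULTI_LEXER dispatches each row to a sub-lexer based on its language.
  ctx_ddl.create_preference('global_lexer', 'MULTI_LEXER');
  -- Rows with any other language value fall back to the default sub-lexer.
  ctx_ddl.add_sub_lexer('global_lexer', 'default', 'english_lexer');
  ctx_ddl.add_sub_lexer('global_lexer', 'japanese', 'japanese_lexer');
end;
/

create index global_docs_idx on global_docs(text)
  indextype is ctxsys.context
  parameters ('lexer global_lexer language column lang');
```

The preference and index names are illustrative; only the preference types and the `language column` parameter belong to Oracle Text.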
16 Chinese, Japanese, Korean Text
- Character sets
- Japanese: JA16SJIS, JA16EUC
- Simplified Chinese: GBK, GB2312-80
- Traditional Chinese: BIG5, EUC, TRIS
- Korean: KO16KSC5601
- Unicode: UTF8
- Lexing
- Lexical segmentation for Japanese, Chinese
- Morphological segmentation for Korean
17 Multilingual Search
18 Cross-language queries
- Can mix languages, character sets within a document collection (e.g. Chinese and English documents)
- Can use English to query e.g. Chinese terms or vice versa.
- Query a term which is expressed differently in simplified and traditional Chinese.
- select score(1), product_id, product_name
- from product_information
- where contains (product_description, 'TRSYN(monitor, Chinese)', 1) > 0
- order by score(1) desc
- Find products whose description contains 'monitor' or its Chinese equivalents.
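TRSYN expands a term through translations stored in a thesaurus, so the query above only finds Chinese equivalents once such entries exist. A hedged sketch of loading one translation with the `CTX_THES` package; the translation string is purely illustrative, and the sketch assumes that no thesaurus named `DEFAULT` exists yet and that the database character set can store Chinese text:

```sql
begin
  -- TRSYN with no thesaurus argument looks in the thesaurus named DEFAULT.
  ctx_thes.create_thesaurus('DEFAULT');
  -- The phrase must exist before a translation can be attached to it.
  ctx_thes.create_phrase('DEFAULT', 'monitor');
  -- Illustrative translation; the language label is free-form (max 10 chars).
  ctx_thes.create_translation('DEFAULT', 'monitor', 'CHINESE', '显示器');
end;
/

-- The language label in the query is written exactly as it was loaded.
select score(1), product_id, product_name
from product_information
where contains(product_description, 'TRSYN(monitor, CHINESE)', 1) > 0
order by score(1) desc;
```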
19 Search Quality
- Exact word
- Boolean expression
- Phrase
- Proximity
- Fuzzy
- Stemming
- Wildcards
- Prefix, substring index
- Thesaurus, multiple Thesauri
- ABOUT search
- Theme (concept-based) search
- Accumulate scores
- Term weighting
- Advanced XML search
- XPath support
- Query Feedback
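Several of these features are operators inside the CONTAINS query string. The statements below sketch a few of them against the `product_information` table from the earlier example; the search terms are made up for illustration:

```sql
-- Boolean expression combined with a phrase
select product_id from product_information
where contains(product_description, '"high resolution" AND monitor') > 0;

-- Proximity: both terms within 10 words of each other
select product_id from product_information
where contains(product_description, 'NEAR((monitor, resolution), 10)') > 0;

-- Fuzzy (?) tolerates misspellings, stemming ($) matches inflected forms
select product_id from product_information
where contains(product_description, '?moniter OR $display') > 0;

-- Wildcard
select product_id from product_information
where contains(product_description, 'reso%') > 0;

-- Accumulate scores with term weighting: LCD counts three times as much
select score(1), product_id from product_information
where contains(product_description, 'LCD*3 ACCUM plasma', 1) > 0
order by score(1) desc;

-- ABOUT (theme) search
select product_id from product_information
where contains(product_description, 'ABOUT(computer displays)') > 0;
```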
20 ABOUT: themes and theme queries
- "We ordered a bottle of chardonnay to go with the fish, and cabernet sauvignon for the steak"
- select id from docs where contains(text, 'ABOUT(wine)') > 0
- The knowledge base allows Oracle Text to associate words and concepts.
- Knowledge base contains over 400,000 concepts.
- You can extend the knowledge base to include
- Words and concepts from your specialist field, e.g. medicine
- Associations of words and spellings to guide novice/internet users
21 Catalog Index
- Optimized for response time on small text fields
- True transactional DML
- Supports structured query, including range query
- Subset of CONTEXT query language
- No fuzzy, stemming, about
- User-friendly web-like query syntax
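A catalog index is created with the `ctxsys.ctxcat` index type and queried with `CATSEARCH`. A minimal sketch, assuming a hypothetical `auction` table (not from the slides):

```sql
-- Hypothetical table with a short text field and a structured column.
create table auction (
  item_id number primary key,
  title   varchar2(100),
  price   number
);

begin
  -- The index set lists the structured columns CATSEARCH may filter or sort on.
  ctx_ddl.create_index_set('auction_iset');
  ctx_ddl.add_index('auction_iset', 'price');
end;
/

create index auction_titlex on auction(title)
  indextype is ctxsys.ctxcat
  parameters ('index set auction_iset');

-- Text condition plus a range condition and ordering on price.
select item_id, title, price
from auction
where catsearch(title, 'digital camera', 'price < 200 order by price') > 0;
```

The second argument uses the web-like syntax mentioned on the slide (adjacent words are ANDed), and the third argument carries the structured part of the query.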
22 Classification
- CTXRULE is an index type designed for classification/routing applications
- Efficiently take a document and find matching queries
[Diagram labels: Incoming documents, Classification Application (compares against rules), Matched Documents, Perform Action; marked 9i]
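With CTXRULE the index is built on a column of queries (the rules), and the `MATCHES` operator takes a document and returns the rules it satisfies. A minimal sketch, assuming a hypothetical `news_rules` table and made-up rules:

```sql
-- Hypothetical rule table: one query per category.
create table news_rules (
  rule_id  number primary key,
  category varchar2(30),
  query    varchar2(2000)
);

insert into news_rules values (1, 'wine', 'chardonnay or cabernet');
insert into news_rules values (2, 'monitors', 'monitor and "high resolution"');
commit;

create index news_rules_idx on news_rules(query)
  indextype is ctxsys.ctxrule;

-- Find every rule whose query matches the incoming document text.
select category
from news_rules
where matches(query, 'We ordered a bottle of chardonnay to go with the fish') > 0;
```

In a routing application the document text would normally come from a bind variable or a column, and the returned categories would drive the action taken for that document.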
23 Prefix, substring index
- Prefix and Substring are flavors of the CONTEXT index
- Prefix will add more tokens to the CONTEXT index to efficiently process prefix searches (e.g. 'ora%')
- Substring will add an index on substrings of each token, to efficiently process substring searches (e.g. '%oxy%')
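Both flavors are switched on through a `BASIC_WORDLIST` preference. A minimal sketch, assuming the `product_information` table from the earlier example and that no other CONTEXT index exists on the column; the prefix lengths are arbitrary illustration values:

```sql
begin
  ctx_ddl.create_preference('my_wordlist', 'BASIC_WORDLIST');
  -- Prefix index: extra tokens for prefixes of 3 to 4 characters.
  ctx_ddl.set_attribute('my_wordlist', 'PREFIX_INDEX', 'TRUE');
  ctx_ddl.set_attribute('my_wordlist', 'PREFIX_MIN_LENGTH', '3');
  ctx_ddl.set_attribute('my_wordlist', 'PREFIX_MAX_LENGTH', '4');
  -- Substring index: speeds up queries with a leading wildcard.
  ctx_ddl.set_attribute('my_wordlist', 'SUBSTRING_INDEX', 'TRUE');
end;
/

create index description_px_idx on product_information(product_description)
  indextype is ctxsys.context
  parameters ('wordlist my_wordlist');

-- Prefix search
select product_id from product_information
where contains(product_description, 'ora%') > 0;

-- Substring search
select product_id from product_information
where contains(product_description, '%oxy%') > 0;
```

Both options trade a larger index and slower indexing for faster wildcard queries.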
24 Storing XML in Oracle
- Decomposition
- decompose documents into atomic elements
- store elements in columns/rows
- compose XML documents using SQL
- xmltype
- store XML as xmltype, use xmltype methods
- Store as LOB or varchar
- Store XML as-is, in a LOB or VARCHAR
- Search using Oracle Text section searching or
XPath
25 Content search and XML
- Create index
- create index BOOKINDEX on BOOKS(text) indextype is ctxsys.context
- Query by content
- select PRICE from BOOKS
- where contains(text, 'Harry Potter') > 0 order by price desc
- Create index to include section info
- create index BOOKINDEX on BOOKS(text) indextype is ctxsys.context parameters ('section group my_auto_section_group')
- Limit content search to a section of text
- select price from books
- where contains(text, 'Harry Potter within title') > 0 order by price desc
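The section group named in the index parameters has to exist before the index is created. A minimal sketch of that step, assuming the automatic section group type is the one intended by the name on the slide:

```sql
begin
  -- AUTO_SECTION_GROUP turns every XML tag into a zone section named after
  -- the tag, so 'within title' works without declaring sections one by one.
  ctx_ddl.create_section_group('my_auto_section_group', 'AUTO_SECTION_GROUP');
end;
/
```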
26 Advanced XML searches
- Nested section search
- <movie><title>The Matrix</title></movie>
- <book><title>Introduction to Matrix Algebra</title></book>
- select price from media
- where contains(desc, 'matrix within title within movie') > 0
- Search inside attribute values
- <book author="Barry Hughart">Bridge of Birds</book>
- select title from books
- where contains(text, 'Hughart within book@author') > 0
27 More advanced XML searches
- map multiple tags to same name
- <H1>The Diamond Age</H1>
- <H2>or, A Young Lady's Illustrated Primer</H2>
- (map H1 and H2 to section name of headline)
- select author from articles
- where contains(text, 'Diamond within headline') > 0
- doctype limiters to handle tag collisions
- <!DOCTYPE foo> <address>king@world.com
- <!DOCTYPE bar> <address>123 Meheula Pkwy
- map (foo)address to email, (bar)address to address
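Explicit mappings like these are declared with `CTX_DDL.ADD_ZONE_SECTION` on a section group. A minimal sketch; the section group and index names are hypothetical, and the `articles` table is assumed to match the query on the slide:

```sql
begin
  -- HTML headings: map both H1 and H2 to one section called headline.
  ctx_ddl.create_section_group('article_sg', 'HTML_SECTION_GROUP');
  ctx_ddl.add_zone_section('article_sg', 'headline', 'H1');
  ctx_ddl.add_zone_section('article_sg', 'headline', 'H2');

  -- XML with colliding tags: the (doctype) prefix limits each mapping
  -- to documents of that document type.
  ctx_ddl.create_section_group('contact_sg', 'XML_SECTION_GROUP');
  ctx_ddl.add_zone_section('contact_sg', 'email', '(foo)address');
  ctx_ddl.add_zone_section('contact_sg', 'address', '(bar)address');
end;
/

create index articles_idx on articles(text)
  indextype is ctxsys.context
  parameters ('section group article_sg');
```

With `contact_sg` on an index, `within email` would only look inside `<address>` elements of `foo` documents, and `within address` only inside those of `bar` documents.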
28 Document Services
- Extract Themes (major concepts)
- Extract hierarchical structure
- Extract Gist
- Generic or Point-of-View
- Sentence- or Paragraph- level
- View a document as HTML
- Highlight search terms, highlight navigation
- Return results in a table or a PL/SQL table
- Basis for Clustering, More Like This,
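The document services are exposed through the `CTX_DOC` package. A minimal sketch that writes results to tables, assuming the `description_idx` index from the earlier example, a document whose primary key is 3331, and hypothetical result table names; theme and gist extraction also assume a language with knowledge base support, such as English:

```sql
-- Result tables with the column layout that CTX_DOC expects.
create table doc_themes (query_id number, theme varchar2(2000), weight number);
create table doc_gists  (query_id number, pov varchar2(80), gist clob);
create table doc_markup (query_id number, document clob);

begin
  -- Major themes of the document.
  ctx_doc.themes(index_name => 'description_idx', textkey => '3331',
                 restab => 'doc_themes', query_id => 1);

  -- Generic, paragraph-level gist of the same document.
  ctx_doc.gist(index_name => 'description_idx', textkey => '3331',
               restab => 'doc_gists', query_id => 1,
               glevel => 'P', pov => 'GENERIC');

  -- Copy of the document with the query terms highlighted.
  ctx_doc.markup(index_name => 'description_idx', textkey => '3331',
                 text_query => 'monitor', restab => 'doc_markup',
                 query_id => 1);
end;
/

select theme, weight from doc_themes
where query_id = 1
order by weight desc;
```

Passing a theme name instead of `'GENERIC'` as `pov` requests a point-of-view gist, and `glevel => 'S'` switches from paragraph-level to sentence-level.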
29 Summary
- Fully integrated with the database
- Premier text search quality
- Advanced features for text management, document services, and XML.
- Best multilingual features in the market.
30 Introducing Oracle Ultra Search
31 Issues in Corporate Search
- Information Management Crisis
- Explosive growth of information flowing over corporate intranets.
- Knowledge scattered across IT repositories, billions of documents, and data fragments.
- Non-Uniform Information
- Structured in databases.
- Unstructured: word processing documents, presentations.
32 Impacts of Bad Search
- Customers: turn to a competitor's Web site.
- Employees: waste time and money on useless searches.
- Oracle Ultra Search
- Solves the problem of finding relevant information.
- Across your company's many disparate information repositories.
33 Oracle Ultra Search
- Out-of-the-box solution that
- Searches text across multiple repositories
- Databases, HTML pages, files, mail servers.
- Provides the best relevance ranking and globalization in the industry.
- Provides value-added Portal functionality.
- Presents Web-style interface.
- Built on Oracle's proven, reliable Text Retrieval software and Oracle9i server.
34 Oracle Ultra Search
[Screenshot callouts: Document Title, Relevance]
35 Ultra Search Applications
- Portal Search
- Most powerful search for Oracle9iAS Portal.
- Build your own portal.
- Special Portlet crawls inside and outside of Portal Repository.
- Canned Web Search for Oracle Text
- Library or Archive Search
- Content Management Platform Search
36 Search Multiple Repositories
37 Value Added Portal Functionality
- Canned Web-Style Search
- Aggregates Information For Indexing
- Documents stay in their own repositories.
- Search returns normalized results, uniformly ranked by relevance.
- Organize and Categorize Content From Multiple Repositories
- Extract valuable metadata.
- Improve search by narrowing through fielded search.
38 Out-of-the-Box Web-Style Search
- Oracle Text Application
- Uses public Text interfaces.
- Enhanced with expertise about gathering and indexing information for best quality search.
- No coding against low level APIs.
- Oracle Text Retrieval Engine
- Highly integrated with Oracle9i server.
- Best interoperability with dynamic data.
- Scalability and Reliability of Oracle platform.
39 Aggregates Information
- Gather
- Crawls Web, corporate repositories
- Analyze
- Create index required for querying, filter
- Make Queryable
- Embed through API
- Maintain
- Schedule crawling
- Easy Administration
[Diagram labels: Gather, Analyze, Make Queryable, Maintain]
40 Powerful Fielded Search
- Narrow search to parts of document: title, body, name of author.
- Extract and use repository metadata
- Word processing documents: Author, Title.
- Databases: Identify Columns.
- Email: Header/Body/Attachment.
- Unify repositories in common, logical terms.
- Uniform set of results, ranked by overall relevance.
41 Flexible Metadata Mapping
[Diagram labels: Search Term, Metadata Fields, Repositories]
42 Ultra Search Architecture
43 Architecture
- Simple, Robust Architecture Built on
- Oracle9i Server Platform
- Oracle's Text Retrieval Engine
- Flexible Deployment
- Server-Tier
- Mid-Tier
44 Ultra Search Components
- Crawler
- Server Component
- Query API Application
- Administration Tool
- Mail API
45 Ultra Search Crawler
- Multi-threaded Java process.
- Gathers documents from repositories you specify on a set schedule.
- Maps and analyzes link relationships.
- Filters (150) non-HTML documents, extracts valuable metadata.
- Indexes documents and data fragments.
- Flexible Configuration
- Run on one or more machines: remote crawling
46 Ultra Search Crawler
- Set Inclusion/Exclusion Domains
- Limit crawling to corporate net or specific sections of it.
- Maintain Fresh Search Results
- Set crawling schedules for each Web site or repository.
47 Crawling Abilities
- Web Sites (HTTP Protocol)
- Database Tables
- Oracle and any ODBC compliant database.
- Local (Ultra Search instance) or remote database
- Crawls both fulltext and fielded columns.
- Files (file:// Protocol)
- Ultra Search filters, extracts text and metadata.
- Emails (IMAP Protocol)
- Crawl and index mailing lists through IMAP.
48 Ultra Search Query API
- Embed Ultra Search in your Portal or Application.
- Customize look-and-feel to your requirements.
- Easy integration with your application.
- API for Java (JSP) and PL/SQL (PSP).
- Returns data with or without HTML markup.
- Build Basic Search Form, Search Result Form...
- Includes Highly Functional Query Application.
49 Java Query API Illustration
50 PL/SQL Query API Illustration
51 Administration Environment
- Browser-based, Self-Service Web Application.
- Define Ultra Search Instances.
- Configure and Schedule Crawler.
- Set Query Options To Narrow Searches.
- Document Attributes (e.g. TITLE, AUTHOR).
- Define Data Source Groups.
- Manage Administrative Users.
52 Administration Environment
53 Summary
- Eliminate the Chaos Inside Your Firewalls!
- Oracle Ultra Search
- Crawls, indexes, and makes searchable your intranet.
- Provides Web-style search without the need for coding.
- Organizes, categorizes, and unifies content from multiple repositories.
- Leverages Oracle9i platform: reliable, scalable, always available.
54A