Oracle Database 11g New Search Features and Roadmap

This presentation contains information proprietary to Oracle Corporation. Last modified by: kiss. Created date: 9/8/2004 11:34:22 PM.

Slides: 28


2
Oracle Database 11g New Search Features and
Roadmap
  • Roger Ford
  • Senior Principal Product Manager

3
Contents
  • Oracle's Search Products
  • Oracle Text 11g New Features
  • Oracle Text 11.2.0.2 New Features
    • Entity Extraction
    • Name Search
    • Result Set Interface
  • Search Product Roadmap
    • Oracle Text
    • Secure Enterprise Search

4
Oracle's Search Products
  • Oracle Text
    • A SQL and PL/SQL based toolkit for creating
      full-text search applications
    • Free with all database versions
    • Previously known as Context Option, interMedia
      Text
  • Secure Enterprise Search
    • A complete search application based on Oracle Text
      capabilities
    • Crawlers for data sources such as web, email,
      document repositories, databases
    • End-user query application and APIs for embedding


5
Oracle Text 11g New Features
  • Composite Domain Indexes and SDATA sections
    • Allow storage of structured info (e.g. numbers,
      dates) within the text index
    • Make for much faster mixed queries
  • Auto Lexer
    • Automatic language recognition
    • Segmentation and stemming for 32 languages
    • Context-sensitive stemming for 23 of these
      languages
  • Off-line and time-limited index creation
    • Enables rebuild of indexes offline in quiet
      periods for true 24x7 operation

6
Demo: Auto Lexer
7
11.2.0.2 New Features - Summary
  • Entity Extraction
    • Find entities such as people, countries,
      cities, states, zip codes, phone numbers etc. in
      the text
    • Use the default dictionary and rules, or define your
      own dictionary and rules based on regular
      expressions
  • Name Search (NDATA sections)
    • Inexact searches; copes with misspellings,
      segmentation errors, contractions and word
      reversal
    • Useful for many searches, but particularly good for
      names
  • ResultSet Interface
    • Query request in XML and results returned as XML
    • Avoids the SQL layer and the requirement to work within
      SELECT semantics

8
Entity Extraction
  • Identify names, places, dates, times, etc.
  • Tag each occurrence with type and subtype
  • Entities are defined by DICTIONARY and RULES
  • Implemented by the CTX_ENTITY package
    • create_extract_policy: create a policy to which
      you can add extract rules
    • Choose whether or not to use the built-in rules and
      dictionary
    • add_extract_rule: create an XML-based rule to
      define an entity
    • add_stop_entity: prevent defined entities from
      being used
    • compile: build the policy with its rules
    • extract: get an XML-based list of entities for a
      doc
  • ctxload can also be used to load a user dictionary

9
Demo: Entity Extraction
10
Entities: built-in types
  • building
  • city
  • company
  • country
  • currency
  • date
  • day
  • email_address
  • geo_political
  • holiday
  • location_other
  • month
  • non_profit
  • organization_other
  • percent
  • person_jobtitle
  • person_name
  • person_other
  • phone_number
  • postal_address
  • product
  • region
  • ssn
  • state
  • time_duration
  • tod
  • url
  • zip_code

11
Entity Extraction Example 1: Defaults
  • ctx_entity.create_extract_policy('mypolicy')
  • ctx_entity.compile('mypolicy')
  • ctx_entity.extract('mypolicy', mydoc, mylang,
    myresults)
  • Output in "myresults":
  • <entities>
  • <entity id="0" offset="75" length="8"
    source="SuppliedDictionary">
  • <text>New York</text>
  • <type>city</type>
  • </entity>
  • <entity id="1" offset="55" length="16"
    source="SuppliedRule">
  • <text>Hupplewhite Inc.</text>
  • <type>company</type>
  • </entity>
  • </entities>
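
The XML returned in "myresults" can be post-processed in any client language. The following is a minimal Python sketch, assuming the output is well-formed XML with the element and attribute names shown on this slide. The sample string is copied from the slide, and the helper name parse_entities is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample output copied from the slide; real output depends on the document.
SAMPLE = """<entities>
  <entity id="0" offset="75" length="8" source="SuppliedDictionary">
    <text>New York</text>
    <type>city</type>
  </entity>
  <entity id="1" offset="55" length="16" source="SuppliedRule">
    <text>Hupplewhite Inc.</text>
    <type>company</type>
  </entity>
</entities>"""


def parse_entities(xml_text):
    """Return one dict per <entity>, sorted by offset in the source document."""
    root = ET.fromstring(xml_text)
    entities = []
    for node in root.findall("entity"):
        entities.append({
            "id": int(node.get("id")),
            "offset": int(node.get("offset")),
            "length": int(node.get("length")),
            "source": node.get("source"),
            "text": node.findtext("text"),
            "type": node.findtext("type"),
        })
    return sorted(entities, key=lambda e: e["offset"])


if __name__ == "__main__":
    for e in parse_entities(SAMPLE):
        print(e["type"], e["text"], e["offset"], e["length"])
```

Sorting by offset puts the entities in document order, which is convenient when the offsets are later used to mark up the original text.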

12
Entity Extraction Example 2: User rule
  • ctx_entity.create_extract_policy('mypolicy')
  • ctx_entity.add_extract_rule('mypolicy', 5,
    '<rule>
    <expression>((North|South)? America)</expression>
  •   <type refid="1">xContinent</type>
  • </rule>')
  • ctx_entity.compile('mypolicy')
  • ctx_entity.extract('mypolicy', mydoc, mylang,
    myresults)
  • Note the parentheses around the expression. refid="1"
    means take the first parenthesized expression, so
    "North America", "South America" or just "America".
  • User-defined types must be prefixed with an "x",
    hence "xContinent"
  • <entities>
  • <entity id="0" offset="75" length="13"
    source="UserRule">
  • <text>North America</text>
  • <type>xContinent</type>
  • </entity>
  • </entities>
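
To see how the rule's expression and refid="1" interact, the pattern can be tried outside the database. This is a sketch only: it uses Python's re module as a stand-in for Oracle's regular expression engine (an assumption, since the two may differ in details), and it reads refid="1" as "capture group 1". The function name extract_continent is hypothetical.

```python
import re

# The rule's expression; refid="1" is taken here to mean capture group 1,
# i.e. the outermost pair of parentheses.
RULE = re.compile(r"((North|South)? America)")


def extract_continent(text):
    """Return the text of capture group 1 for the first match, or None."""
    match = RULE.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_continent("Offices across North America and Europe"))
```

When neither prefix is present, the literal space before "America" is part of the match in this sketch, so group 1 carries a leading space. Whether the extracted entity is trimmed is not stated on the slide.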

13
Entity Extraction: Adding a user dictionary
  • Create file ud.xml
  • <dictionary> <entities>
  • <entity> <value>Dow Jones Industrial
    Average</value> <type>xIndex</type> </entity>
  • <entity> <value>S&amp;P 500</value>
    <type>xIndex</type> </entity>
  • </entities> </dictionary>
  • Create the policy with CTXLOAD (can add rules
    later)
  • ctxload -user scott/tiger -extract -name pol1
    -file ud.xml
  • Compile the policy
  • ctx_entity.compile('pol1')
  • Results
  • <entity id="69" offset="1010" length="7"
    source="UserDictionary">
  • <text>S&amp;P 500</text>
  • <type>xIndex</type>
  • </entity>

14
Entity Extraction: other stuff
  • Extracting only certain entity types
  • ctx_entity.extract('p1', mydoc, null, myresults,
    'city,company,xContinent')

15
Name Search
  • Searching names has many difficulties
    • Spelling (Steven, Stephen)
    • Alternate names (Fred / Alfred, Chuck / Charles)
    • Transcription (copying from spoken to written
      form)
    • Transliteration (copying from one writing system
      to another)
    • Segmentation (Mary Jane, Maryjane)
    • First, middle, and last name classification
  • Name search does intelligent matching across all
    these issues

16
Demo: Name Search
17
NDATA section type
  • Basic implementation for name search
  • Limitations
    • 511 characters
    • 255 whitespace-delimited terms
    • No offset information, therefore no highlighting /
      markup and no NEAR or phrase search with NDATA
  • Uses WORDLIST preference attributes
    • NDATA_ALTERNATE_SPELLING
    • NDATA_BASE_LETTER
    • NDATA_THESAURUS (for alternate names; default
      thesaurus provided)
    • NDATA_JOIN_PARTICLES (list such as
      'de:du:mc:mac')
  • Query Syntax
    • NDATA(fieldname, search terms [, order [,
      proximity]])

18
Result Set Interface
  • Some queries are difficult to express in SQL
    • e.g. "Give me the top 5 hits in each category"
  • Result set interface uses a simple text query and
    an XML result set descriptor
  • Hitlist is returned in XML according to the result
    set descriptor
  • Uses SDATA sections for
    • Grouping
    • Counting

19
Result Set Example Query
  • ctx_query.result_set('docidx', 'oracle',
  • '<ctx_result_set_descriptor>
  • <count/>
  • <hitlist start_hit_num="1"
    end_hit_num="2" order="pubDate desc, score desc">
  • <score/> <rowid/>
  • <sdata name="author"/>
  • <sdata name="pubDate"/>
  • </hitlist>
  • <group sdata="pubDate">
  • <count/>
  • </group>
  • <group sdata="author">
  • <count/>
  • </group>
  • </ctx_result_set_descriptor> ', rs)
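
A descriptor like this is often easier to generate than to write by hand, for instance when the page size or the grouping fields vary per request. The following Python sketch builds the same structure using only the element and attribute names that appear on this slide. The function name build_descriptor and its parameters are hypothetical, and passing the resulting string to ctx_query.result_set is left to the calling application.

```python
import xml.etree.ElementTree as ET


def build_descriptor(start, end, order, sdata_fields, group_fields):
    """Build a result set descriptor with the structure shown on the slide."""
    root = ET.Element("ctx_result_set_descriptor")
    ET.SubElement(root, "count")  # request the total hit count
    hitlist = ET.SubElement(root, "hitlist", {
        "start_hit_num": str(start),
        "end_hit_num": str(end),
        "order": order,
    })
    ET.SubElement(hitlist, "score")
    ET.SubElement(hitlist, "rowid")
    for name in sdata_fields:
        ET.SubElement(hitlist, "sdata", {"name": name})
    for name in group_fields:
        # One <group> per SDATA section, each asking for a count per value.
        group = ET.SubElement(root, "group", {"sdata": name})
        ET.SubElement(group, "count")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_descriptor(1, 2, "pubDate desc, score desc",
                           ["author", "pubDate"], ["pubDate", "author"]))
```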

20
Result Set Output
  • <ctx_result_set>
  • <hitlist>
  • <hit>
  • <score>3</score><rowid>AAAPoEAABAAAMWsAAC</rowid>
  • <sdata name="AUTHOR">John</sdata>
  • <sdata name="PUBDATE">2001-01-03
    00:00:00</sdata>
  • </hit>
  • <hit>
  • <score>3</score><rowid>AAAPoEAABAAAMWsAAG</rowid>
  • <sdata name="AUTHOR">John</sdata>
  • <sdata name="PUBDATE">2001-01-03
    00:00:00</sdata>
  • </hit>
  • </hitlist>
  • <count>100</count>

21
Result Set Output - Continued
  • <groups sdata="PUBDATE">
  • <group value="2001-01-01 00:00:00"><count>25</count></group>
  • <group value="2001-01-02 00:00:00"><count>50</count></group>
  • <group value="2001-01-03 00:00:00"><count>25</count></group>
  • </groups>
  • <groups sdata="AUTHOR">
  • <group value="John"><count>50</count></group>
  • <group value="Mike"><count>25</count></group>
  • <group value="Steve"><count>25</count></group>
  • </groups>
  • </ctx_result_set>
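
On the client side, the result set can be unpacked with any XML parser. The Python sketch below assumes well-formed XML with the element names shown on these two output slides. The sample data is taken from the slides, and the function name summarize is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample result set assembled from the two output slides.
SAMPLE = """<ctx_result_set>
  <hitlist>
    <hit>
      <score>3</score><rowid>AAAPoEAABAAAMWsAAC</rowid>
      <sdata name="AUTHOR">John</sdata>
      <sdata name="PUBDATE">2001-01-03 00:00:00</sdata>
    </hit>
    <hit>
      <score>3</score><rowid>AAAPoEAABAAAMWsAAG</rowid>
      <sdata name="AUTHOR">John</sdata>
      <sdata name="PUBDATE">2001-01-03 00:00:00</sdata>
    </hit>
  </hitlist>
  <count>100</count>
  <groups sdata="PUBDATE">
    <group value="2001-01-01 00:00:00"><count>25</count></group>
    <group value="2001-01-02 00:00:00"><count>50</count></group>
    <group value="2001-01-03 00:00:00"><count>25</count></group>
  </groups>
  <groups sdata="AUTHOR">
    <group value="John"><count>50</count></group>
    <group value="Mike"><count>25</count></group>
    <group value="Steve"><count>25</count></group>
  </groups>
</ctx_result_set>"""


def summarize(xml_text):
    """Return (total hit count, list of hits, per-section group counts)."""
    root = ET.fromstring(xml_text)
    hits = []
    for hit in root.find("hitlist").findall("hit"):
        row = {"score": int(hit.findtext("score")),
               "rowid": hit.findtext("rowid")}
        # Each <sdata> becomes a key named after its section.
        for sdata in hit.findall("sdata"):
            row[sdata.get("name")] = sdata.text
        hits.append(row)
    groups = {}
    for section in root.findall("groups"):
        groups[section.get("sdata")] = {
            g.get("value"): int(g.findtext("count"))
            for g in section.findall("group")
        }
    return int(root.findtext("count")), hits, groups


if __name__ == "__main__":
    total, hits, groups = summarize(SAMPLE)
    print(total, len(hits), groups["AUTHOR"])
```

The group counts are the facet counts a navigation panel would display, while the hitlist holds only the requested page of hits.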

22
Preview
23
Roadmap: merging Text and SES
  • Oracle Text: Full Control
    • Fine-grained Index Options
    • Data Storage Options
    • Lexer Options
    • Stoplists
    • Use existing database
    • RAC, Exadata
  • Secure Enterprise Search: Full Featured
    • Built in database and mid-tier
    • Crawlers for many sources
    • Simple Query Interface
    • End user GUI / API
    • Embedded security

24
Coming Search Features
  • Natural Language Processing enhancements
    • Ontology based classification
    • Question answering
  • Automatic Partitioning
  • Query load balancing
  • Full support for faceted navigation (MVDATA
    sections)
  • Functional completeness for Result Set Interface
  • Result Iterator streaming support
  • Parallel Query
  • Replication Support
    • GoldenGate / Logical Standby / Streams
  • Operator improvements
    • NEAR2: best query in one operator
    • MNOT: mild not, e.g. YORK mnot NEW YORK
    • Nested near
  • Substring index and query performance improvements
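
The "mild not" example (YORK mnot NEW YORK) can be read as: match documents where "York" occurs somewhere other than inside the phrase "New York". The slide gives only the example, so the following Python sketch illustrates that one reading rather than the operator's defined behaviour. The function names, the whitespace tokenizer and the case-insensitive comparison are all assumptions.

```python
def tokenize(text):
    """Lowercase and split on whitespace, stripping simple punctuation."""
    return [w.strip(".,;:!?\"'()").lower() for w in text.split()]


def mild_not(text, term, phrase):
    """True if `term` occurs at least once outside an occurrence of `phrase`."""
    words = tokenize(text)
    term = term.lower()
    phrase_words = tokenize(phrase)
    n = len(phrase_words)
    # Collect every word position that belongs to an occurrence of the phrase.
    covered = set()
    for i in range(len(words) - n + 1):
        if words[i:i + n] == phrase_words:
            covered.update(range(i, i + n))
    # A hit needs the term at some position not covered by the phrase.
    return any(w == term and i not in covered for i, w in enumerate(words))


if __name__ == "__main__":
    print(mild_not("From New York to York", "York", "New York"))
```

Under this reading, a document that mentions only "New York" is rejected, whereas a plain NOT would also reject a document that mentions both "York" and "New York".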

25
Coming Search Features - Continued
  • Multiple enhancements to query performance
    • BIGIO: leverages SecureFiles CLOBs
    • Automatic optimization of indexes with stage
      index
    • Two-level index: keep common search terms in
      memory
  • Partition maintenance without reindexing
  • Off-load filtering from database server
  • Section-specific index options
    • Choose different options, e.g. language, stopwords,
      PRINTJOINS, for each section
  • Regular expression based stopwords
  • Forward Index
    • Hugely improved performance for highlighting,
      snippets
  • PDF Native Highlighting
  • Unlimited SDATA, MDATA and Field Sections

26
The preceding is intended to outline our general
product direction. It is intended for information
purposes only, and may not be incorporated into
any contract. It is not a commitment to deliver
any material, code, or functionality, and should
not be relied upon in making purchasing
decisions. The development, release, and timing
of any features or functionality described for
Oracle's products remains at the sole discretion
of Oracle.