Full Text Search Engine Federico Ramallo Product Technical Specialist Enterprise Group Microsoft Cor - PowerPoint PPT Presentation

1
Full Text Search Engine
Federico Ramallo
Product Technical Specialist, Enterprise Group
Microsoft Corporation
t-feder_at_microsoft.com
2
Agenda
  • Benefits of Search
  • Architecture
  • Management of Search
  • Performance and Diagnostics
  • Making search more meaningful

3
Benefit Of Full-text Indexing And Search
  • Full-text indexing and search is a core
    technology for working with information
  • Information is stored in a variety of content
    sources
  • Indexing this content allows for fast, relevant
    retrieval

4
Products That Use Variants of the Full-Text Search Engine
  • Index Server, Indexing Service for Microsoft
    Windows
  • Microsoft SharePoint Portal Server 2001
  • Microsoft SQL Server 7 and SQL Server 2000
  • Microsoft Exchange Server 2000
  • Microsoft Site Server 3
  • Microsoft Office XP

5
Architecture
Content Sources
Documents
6
Search Architecture
(Diagram labels: SPS Portal, MS Search Service, Search Page, Gateway, Index Page, Search Engine, Query, Full Text Index, Web Page, Protocol Handlers, Content Source, User)
7
Management of Search
8
Managing the Server
  • Create / delete a workspace
  • An additional workspace on server
  • A dedicated content indexing workspace
  • Configure
  • Resource usage
  • Server hit rate
  • Data locations
  • Authentication and network settings
  • Set up special content sources

9
Index Exchange 5.5 Servers
  • Access one Exchange Server
  • Public folders
  • Requires Admin role on site and site
    configuration containers
  • Outlook 2000 and CDO required

10
Index Lotus Notes Servers
  • Index Lotus Notes 4.6 or R5 Servers
  • R5 client required on the SharePoint Portal
    Server
  • May index multiple servers
  • Security mapping
  • Translates Notes to NT ACLs
  • Individual or groups
  • No encryption support

11
Managing Workspaces
  • Targeted at coordinators
  • Available in MMC and Shell
  • Tasks
  • Assign users as coordinators
  • Handle top-level security
  • Configure gatherer logging parameters
  • Set up version pruning
  • Configure subscription and discussion store limits

12
Content Sources Wizards
  • File
  • \\server\share, file://server/share/folder
  • Web Sites
  • Straight http
  • Exchange 5.5 (MAPI)
  • Exchange 2000 (WebDAV)
  • Lotus Notes
  • Other SharePoint Portal Server Workspaces

13
Content Source Wizard: Lotus Notes
  • Pick a known Lotus Notes server
  • Pick a database from list
  • Pick a SharePoint Portal Server profile to
    capture the information
  • Map Notes fields to SharePoint Portal Server
    properties

14
What the Wizards Won't Tell You
  • Configure hops and depths
  • Associate with search scope
  • Schedule an update
  • Use site path rules
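
Hops and depths bound how far a crawl wanders from its start address. The following is a minimal sketch, assuming "depth" counts links followed from the start page and "hops" counts crossings to a different host. The function name, the in-memory link map, and the example hosts are illustrative and are not part of SharePoint Portal Server:

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start, links, max_depth, max_hops):
    """Breadth-first walk of `links` (url -> list of linked urls).

    Stops following links once a page is `max_depth` links away from the
    start, and never follows a link that would exceed `max_hops` host changes.
    """
    seen = {start}
    queue = deque([(start, 0, 0)])  # (url, depth, hops)
    visited = []
    while queue:
        url, depth, hops = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not follow links from here
        for target in links.get(url, []):
            crossed = urlparse(target).netloc != urlparse(url).netloc
            next_hops = hops + (1 if crossed else 0)
            if target in seen or next_hops > max_hops:
                continue
            seen.add(target)
            queue.append((target, depth + 1, next_hops))
    return visited


# Hypothetical link graph used only for illustration.
LINKS = {
    "http://a/": ["http://a/1", "http://b/"],
    "http://a/1": ["http://a/2"],
    "http://b/": ["http://c/"],
}
```

With `max_hops=0` the crawl stays on the starting host; raising `max_depth` lets it reach pages that are more links away.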

15
Site Path Rules, etc.
  • Include or exclude servers, folders
  • Associate access accounts for authentication
    during gathering
  • Include or exclude file types from filtering
  • Enable complex URLs
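
One way to picture how include and exclude rules combine with file-type exclusions is the sketch below. It assumes rules are matched by URL prefix and that the first matching rule wins; that ordering, the `PathRule` type, and the example hosts are assumptions made for illustration, not the product's documented behavior:

```python
from dataclasses import dataclass
from urllib.parse import urlparse
import posixpath


@dataclass
class PathRule:
    prefix: str    # URL prefix the rule applies to (hypothetical format)
    include: bool  # True = crawl matching URLs, False = skip them


def is_crawlable(url, rules, excluded_extensions, default=True):
    """Apply the file-type exclusion first, then the first matching path rule."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower().lstrip(".")
    if extension in excluded_extensions:
        return False
    for rule in rules:
        if url.lower().startswith(rule.prefix.lower()):
            return rule.include
    return default


# Illustrative rules: skip one folder, crawl the rest of the server.
RULES = [
    PathRule("http://intranet.example.com/private/", include=False),
    PathRule("http://intranet.example.com/", include=True),
]
```

Putting the more specific rule first is what lets a single folder be excluded from an otherwise included server in this sketch.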

16
  • Set up a web site crawl

17
Using the Log Viewer
  • Get deeper insight into the crawl
  • Analyze excluded documents
  • Analyze involved hosts
  • Some hosts don't like you, but
  • You don't like other hosts
  • Get further information

18
Search Performance and Diagnostics
19
Indexing Performance
  • Crawl tuning
  • Adaptive crawls
  • Disk tuning
  • Indexing Metrics
  • Event Viewer
  • Performance counters
  • Content source detailed log

20
Full Crawl Performance
21
Adaptive Crawling
  • Boosts the efficiency of incremental crawls on
    large content indexes
  • Predicts the likelihood that a document has
    changed
  • Sampling used to assess accuracy
  • Use Performance control panel to assess adaptive
    crawl performance
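
The slide does not describe how the prediction is made. As a rough illustration of the idea only, the sketch below estimates each document's change likelihood from its past crawl history, re-fetches the likely changers, and re-checks a random sample of the rest so the predictions can be verified. The smoothing formula, threshold, sample rate, and all names are assumptions:

```python
import random


def change_probability(checks, changes):
    """Laplace-smoothed share of past checks in which the document had changed."""
    return (changes + 1) / (checks + 2)


def plan_incremental_crawl(history, threshold=0.5, sample_rate=0.1, rng=None):
    """Split documents into (fetch, sampled, skipped) lists.

    history maps url -> (number of past checks, number of observed changes).
    Documents predicted unlikely to change are skipped, except for a random
    sample that is fetched anyway to measure how accurate the prediction is.
    """
    rng = rng or random.Random()
    fetch, sampled, skipped = [], [], []
    for url, (checks, changes) in sorted(history.items()):
        if change_probability(checks, changes) >= threshold:
            fetch.append(url)
        elif rng.random() < sample_rate:
            sampled.append(url)
        else:
            skipped.append(url)
    return fetch, sampled, skipped
```

A document with no history gets a probability of 0.5 here, so new documents are fetched until enough evidence accumulates to skip them.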

22
Adaptive Crawl Sample
  • Documents modified per crawl attempt

23
Adaptive Crawl Metrics
24
Event Viewer Metrics
  • Full, incremental and adaptive crawls
  • Start and stop times
  • Counts of successful and failing documents
  • Propagation
  • Need to check on both DCI and portal
  • Start and stop times
  • DCI-to-portal copy time
  • Portal initialization time

25
Using PerfMon
  • Start the Performance control panel
  • Select from the Search performance objects
  • Select counters from the performance object

26
Performance Counters
  • Performance control panel provides central
    location for metrics
  • Search provides counters for
  • Gatherer, indexer and query engine
  • Per-catalog metrics
  • Process performance object for CPU, memory, I/O

27
  • Use PerfMon to check number of documents and
    indexing rate

28
Disk Tuning
  • Use disk striping (RAID level 5)
  • Spread Search data files across disks
  • OS volume
  • Temporary directories
  • Content index
  • Search property store
  • Gather logs

29
Making Search Meaningful
30
Meaningful Search
  • Category Assistant
  • Understanding Portal Search
  • SharePoint Portal Server schema
  • Search predicates
  • Auto-nomination of best bets
  • Property Mapping
  • Namespaces
  • Tips and Tricks

31
Category Assistant
  • Automatically categorizes large numbers of
    documents
  • Uses a set of training documents as examples
  • Determines the characteristic words for each
    category
  • Efficient Support Vector Machine (SVM) algorithm
    from Microsoft Research

32
Category Assistant Steps
  • Create at least 5 categories
  • Assign at least 10 documents to each category
  • Enable the Category Assistant
  • Click Train Now
  • Select internal or external documents
  • Index external content
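
The two minimums above (5 categories, 10 documents each) are easy to check before clicking Train Now. A small sketch of such a check, where the function name and the dictionary layout are hypothetical and not a SharePoint Portal Server API:

```python
MIN_CATEGORIES = 5
MIN_DOCS_PER_CATEGORY = 10


def training_problems(categories):
    """Return a list of reasons the training set is too small (empty if ready).

    categories maps a category name to its list of training document URLs.
    """
    problems = []
    if len(categories) < MIN_CATEGORIES:
        problems.append(
            f"need {MIN_CATEGORIES} categories, have {len(categories)}"
        )
    for name, docs in sorted(categories.items()):
        if len(docs) < MIN_DOCS_PER_CATEGORY:
            problems.append(
                f"{name}: need {MIN_DOCS_PER_CATEGORY} documents, have {len(docs)}"
            )
    return problems
```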

33
Portal Search Internals
  • SharePoint Portal Server schema
  • Search property store
  • Indexed and retrievable properties
  • BestBets property
  • Property weighting
  • Portal search query

34
Schema
  • Documents are assigned profiles
  • Profiles are collections of properties
  • Properties have attributes, including
  • Required
  • Content indexed
  • Retrievable
  • Profiles and properties are known by URNs
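
The relationship between profiles, properties, and property attributes can be modeled roughly as follows. This is a sketch for orientation only: the class names and the `urn:example:` identifiers are made up and do not reproduce the actual SharePoint Portal Server schema:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Property:
    urn: str                      # properties are identified by URN
    required: bool = False
    content_indexed: bool = True
    retrievable: bool = True


@dataclass
class Profile:
    urn: str                      # profiles are identified by URN as well
    properties: list = field(default_factory=list)

    def missing_required(self, values):
        """URNs of required properties that have no value in `values`."""
        return [
            p.urn for p in self.properties
            if p.required and not values.get(p.urn)
        ]


# Hypothetical profile with one required and one optional property.
TITLE = Property("urn:example:props#Title", required=True)
NOTES = Property("urn:example:props#Notes", content_indexed=False)
SPEC_PROFILE = Profile("urn:example:profiles#Spec", [TITLE, NOTES])
```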

35
Query Architecture: Content Index and Property Store
(Diagram labels: Full-text Engine, Query Engine, Relational Property Value Queries, Full-text Query, Web Storage System)
36
Portal Search Query
  • Rank 1000: Exact match on BestBets value
  • Rank 999: CONTAINS predicate on BestBets value
  • Rank 999 to 500: FREETEXT predicate on BestBets value
  • Rank 600 to 0: FREETEXT predicate on weighted properties
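
The rank bands on this slide can be read as a tiered scheme in which the kind of match sets the range and the strength of the match places a result within it. A minimal sketch of that reading, assuming a relevance value between 0 and 1 is scaled linearly into the band; the linear scaling and the tier names are assumptions, and only the band limits come from the slide:

```python
# (low, high) rank limits per match type, taken from the slide.
RANK_BANDS = {
    "bestbets_exact": (1000, 1000),
    "bestbets_contains": (999, 999),
    "bestbets_freetext": (500, 999),
    "weighted_freetext": (0, 600),
}


def portal_rank(tier, relevance=1.0):
    """Scale a 0..1 relevance value into the rank band of the given tier."""
    low, high = RANK_BANDS[tier]
    relevance = min(max(relevance, 0.0), 1.0)  # clamp out-of-range input
    return round(low + (high - low) * relevance)
```

Note that the two FREETEXT bands overlap between 500 and 600, so a strong match on weighted properties can outrank a weak FREETEXT match on a BestBets value.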
37
Weighted Properties
  • Relevant properties are strongly weighted
  • Title, Subject, Keywords
  • Other properties receive a default weight
  • Content, custom properties
  • Poor indicators of content receive a weight of 0
  • Parent folder name
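
To make the effect of property weights concrete, the sketch below scores a document by counting query-term hits in each property and multiplying by that property's weight. The numeric weights and the simple hit count are placeholders chosen for illustration; the slide gives only the relative ordering (strong, default, zero):

```python
# Illustrative weights: the slide names the groups, not the numbers.
PROPERTY_WEIGHTS = {
    "Title": 3.0,
    "Subject": 3.0,
    "Keywords": 3.0,
    "ParentFolder": 0.0,  # poor indicator of content
}
DEFAULT_WEIGHT = 1.0      # content and custom properties


def weighted_score(document, terms):
    """Sum of (property weight x number of query-term hits) over all properties.

    document maps property name -> text value.
    """
    wanted = {term.lower() for term in terms}
    score = 0.0
    for name, text in document.items():
        weight = PROPERTY_WEIGHTS.get(name, DEFAULT_WEIGHT)
        hits = sum(1 for word in text.lower().split() if word in wanted)
        score += weight * hits
    return score
```

In this sketch a term that appears only in the parent folder name contributes nothing, while the same term in the title counts three times as much as in the body.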

38
Improve Query Results
  • Tag documents with Categories and specific
    BestBets values
  • Use the Office document properties
  • Search for specific words or phrases in the
    document content or properties

39
Property Mapping
  • External content sources may have well-known
    properties or meta-tags
  • Map source tags to document profiles and
    properties
  • Same mechanism used by the Notes Content Source
    wizard

40
Property Mapping Steps
  • Create a document profile
  • Profile can include custom properties
  • Create an external content source
  • But don't index yet . . .
  • Run property mapping script
  • Maps source properties to target properties
  • Restart the SharePoint Portal services
  • Index the content source

41
Property Mapping Tips
  • Namespaces for HTML tags, SharePoint Portal
    properties
  • Namespace for target properties
  • First property occurrence wins
  • Supports scalar, text properties only
  • Mapping required on both DCI and portal machines
  • For more info
  • Whitepaper from MikeFitz/SidH
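
Two of the tips above, "first property occurrence wins" and "scalar, text properties only", can be illustrated with a short sketch. This is not the property mapping script itself, which the slides do not show; the function, the source tag names, and the `urn:example:` targets are hypothetical:

```python
def map_properties(source_items, mapping):
    """Map source tags to target properties.

    source_items is an ordered list of (source_name, value) pairs, in the
    order they occur in the source document. mapping maps a source name to
    a target property name.
    """
    target = {}
    for name, value in source_items:
        destination = mapping.get(name)
        if destination is None or not isinstance(value, str):
            continue  # unmapped tag, or not a scalar text value
        target.setdefault(destination, value)  # first occurrence wins
    return target


# Hypothetical mapping from HTML meta-tag names to profile properties.
MAPPING = {
    "author": "urn:example:profile#Author",
    "keywords": "urn:example:profile#Keywords",
    "description": "urn:example:profile#Description",
}
```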

42
Extending Search
  • IFilter extracts text from documents
  • Text, HTML, Office, TIFF provided
  • MIME provided with Windows 2000
  • Platform SDK documents interfaces
  • Protocol handlers crawl content sources
  • File, HTTP, Notes, Exchange 5.5 provided
  • SharePoint Portal Server SDK (RTM) documents
    interfaces

43
Summary
  • Content source wizards and log viewer ease
    administration
  • Indexing is faster!
  • Adaptive crawling can decrease crawl times even
    more!
  • Improved search results
  • Category Assistant and hierarchies
  • Best bet tagging for queries
  • Property mapping scripts

46
Reference
47
Architecture Ref1
  • Search Engine. Component of MSSearch that runs
    queries written in the SQL full-text extension
    syntax against the full-text index.
  • Index Engine. Component of MSSearch that
    processes chunks of text and properties filtered
    from content sources, and determines which
    properties are written to the full-text index.
  • Gatherer. Component of MSSearch that manages the
    content crawling process and that has rules that
    determine what content is crawled.

48
Architecture Ref2
  • Word breakers. Components shared by the Search
    and Index engines that break up compound words
    and phrases.
  • Stemmers. Components shared by the Search and
    Index engines that generate inflected forms of a
    word.
  • Filter Daemon. Component that handles requests
    from the Gatherer. Uses protocol handlers to
    access content sources, and IFilters to filter
    files. Provides the Gatherer with a stream of
    data containing filtered chunks and properties.
  • Protocol Handlers. Open content sources in their
    native protocol and expose documents and other
    items to be filtered.
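
As a toy illustration of what word breakers and stemmers contribute, the sketch below splits text into words and expands a query word into a few inflected forms so that either form matches. Real word breakers and stemmers are language-specific components; the regular expression and the naive English suffix rule here are stand-ins and produce non-words for many inputs:

```python
import re


def break_words(text):
    """Split text into lowercase words, breaking on punctuation and hyphens."""
    return [word.lower() for word in re.findall(r"[A-Za-z0-9]+", text)]


def inflected_forms(word):
    """Naive English-only expansion of a word into a few inflected forms."""
    base = word.lower()
    return {base, base + "s", base + "ed", base + "ing"}


def matches(query_word, text):
    """True if any inflected form of the query word occurs in the text."""
    return bool(inflected_forms(query_word) & set(break_words(text)))
```

For example, a query for "crawl" matches text containing "crawling" because the expansion step generates that form before the comparison.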

49
Architecture Ref3
  • IFilters. Open documents and other content source
    items in their native format and filter into
    chunks of text and properties.
  • Content sources. Collection of data MSSearch must
    crawl, and specific rules for crawling items in
    that content source. Items in content sources are
    identified by URLs. The protocol portion of the
    URL is what distinguishes different types of
    content sources.
  • Data Access. SharePoint Portal Server uses
    protocol handlers and the Gatherer to crawl and
    provide search results over data from diverse
    content sources. Without modification, SharePoint
    Portal Server can crawl documents from file
    systems, Web sites, Exchange 2000 Server and
    Exchange Server 5.5 computers, Lotus Notes
    servers, and other SharePoint Portal Server
    workspaces.
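
The division of labor described above, where the protocol portion of the URL selects a protocol handler and the item is then filtered into chunks, can be sketched as a simple dispatch. The real components are COM interfaces documented in the SDKs; the plain functions, the dictionary registries, and the example URL below are stand-ins used only to show the flow:

```python
from urllib.parse import urlparse


def gather(url, handlers, filters):
    """Open `url` with the handler for its protocol, then filter it into chunks.

    handlers maps a URL scheme ("file", "http", ...) to a function that
    returns the raw item. filters maps a file extension to a function that
    yields text chunks; the "" entry is the fallback filter.
    """
    parsed = urlparse(url)
    handler = handlers.get(parsed.scheme.lower())
    if handler is None:
        raise ValueError(f"no protocol handler for {parsed.scheme!r}")
    raw = handler(url)
    extension = parsed.path.rsplit(".", 1)[-1].lower() if "." in parsed.path else ""
    text_filter = filters.get(extension, filters[""])
    return list(text_filter(raw))


# Stand-in handler and filter for illustration only.
HANDLERS = {"file": lambda url: "Hello index world"}
FILTERS = {"": lambda raw: raw.split()}
```

Supporting a new kind of content source in this picture means registering one more handler under its scheme, which mirrors how the protocol portion of the URL distinguishes content source types.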

50
Raffle!!!
  • T-shirts!!!