Title: Full Text Search Engine
1 Full Text Search Engine
Federico Ramallo
Product Technical Specialist, Enterprise Group
Microsoft Corporation
t-feder_at_microsoft.com
2 Agenda
- Benefits of Search
- Architecture
- Management of Search
- Performance and Diagnostics
- Making search more meaningful
3 Benefits of Full-Text Indexing and Search
- Full-text indexing and search is a core technology for working with information
- Information is stored in a variety of content sources
- Indexing this content allows for fast, relevant retrieval
4 Products That Use Variants of the Full-Text Search Engine
- Index Server, Indexing Service for Microsoft Windows
- Microsoft SharePoint Portal Server 2001
- Microsoft SQL Server 7 and SQL Server 2000
- Microsoft Exchange Server 2000
- Microsoft Site Server 3
- Microsoft Office XP
5 Architecture
Content Sources
Documents
6 Search Architecture
[Diagram. Labels: SPS, Portal, MS Search Service, Search Page, Gateway, Index Page, Search Engine, Query, Full Text Index, Web Page, Protocol Handlers, Content Source, User]
7 Management of Search
8 Managing the Server
- Create / delete a workspace
- An additional workspace on server
- A dedicated content indexing workspace
- Configure
- Resource usage
- Server hit rate
- Data locations
- Authentication and network settings
- Set up special content sources
9 Index Exchange 5.5 Servers
- Access one Exchange Server
- Public folders
- Requires Admin role on site and site configuration containers
- Outlook 2000 and CDO required
10 Index Lotus Notes Servers
- Index Lotus Notes 4.6 or R5 Servers
- R5 client required on the SharePoint Portal Server
- May index multiple servers
- Security mapping
- Translates Notes to NT ACLs
- Individual or groups
- No encryption support
11 Managing Workspaces
- Targeted at coordinators
- Available in MMC and Shell
- Tasks
- Assign users as coordinators
- Handle top-level security
- Configure gatherer logging parameters
- Set up version pruning
- Configure subscription and discussion store limits
12 Content Source Wizards
- File
- \\server\share, file://server/share/folder
- Web Sites
- Straight HTTP
- Exchange 5.5 (MAPI)
- Exchange 2000 (WebDAV)
- Lotus Notes
- Other SharePoint Portal Server Workspaces
13 Content Source Wizard: Lotus Notes
- Pick a known Lotus Notes server
- Pick a database from list
- Pick a SharePoint Portal Server profile to capture the information
- Map Notes fields to SharePoint Portal Server properties
14 What the Wizards Won't Tell You
- Configure hops and depths
- Associate with search scope
- Schedule an update
- Use site path rules
15 Site Path Rules, etc.
- Include or exclude servers, folders
- Associate access accounts for authentication during gathering
- Include or exclude file types from filtering
- Enable complex URLs
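The include/exclude behavior of site path rules can be sketched as an ordered rule list. This is an illustration only: `PathRule`, `is_crawlable`, the wildcard syntax, the first-match-wins ordering, the default for unmatched URLs, and the example host names are all assumptions, not the actual SharePoint Portal Server rule engine.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class PathRule:
    """Hypothetical site path rule: a URL wildcard pattern plus an include/exclude flag."""
    pattern: str
    include: bool


def is_crawlable(url: str, rules: list[PathRule], default: bool = True) -> bool:
    """Return the decision of the first rule whose pattern matches the URL.

    First-match-wins and the default for unmatched URLs are assumptions
    made for this sketch.
    """
    for rule in rules:
        # Compare case-insensitively by lowercasing both sides.
        if fnmatchcase(url.lower(), rule.pattern.lower()):
            return rule.include
    return default


# Example rule set with made-up hosts: exclude one folder, include the
# rest of the server, exclude every other server.
rules = [
    PathRule("http://intranet.example.com/private/*", include=False),
    PathRule("http://intranet.example.com/*", include=True),
    PathRule("*", include=False),
]
```

Placing the narrower exclude rule ahead of the broader include rule is what keeps the private folder out of the crawl in this sketch.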
16
17 Using the Log Viewer
- Get deeper insight into the crawl
- Analyze excluded documents
- Analyze involved hosts
- Some hosts don't like you, but
- You don't like other hosts
- Get further information
18 Search Performance and Diagnostics
19 Indexing Performance
- Crawl tuning
- Adaptive crawls
- Disk tuning
- Indexing Metrics
- Event Viewer
- Performance counters
- Content source detailed log
20 Full Crawl Performance
21 Adaptive Crawling
- Boosts the efficiency of incremental crawls on large content indexes
- Predicts the likelihood that a document has changed
- Sampling used to assess accuracy
- Use Performance control panel to assess adaptive crawl performance
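The idea of predicting change likelihood and sampling to check accuracy can be sketched as follows. This is a minimal illustration under stated assumptions: `CrawlHistory`, `should_fetch`, `record`, the smoothed-frequency estimate, the threshold, and the sample rate are all hypothetical and do not describe how the product actually predicts changes.

```python
import random
from dataclasses import dataclass


@dataclass
class CrawlHistory:
    """Hypothetical per-document record of past incremental crawls."""
    checks: int = 0    # times the document was actually fetched and compared
    changes: int = 0   # times it turned out to have changed

    def change_probability(self) -> float:
        # Laplace-smoothed frequency, so a never-seen document starts at 0.5.
        return (self.changes + 1) / (self.checks + 2)


def should_fetch(history: CrawlHistory, threshold: float,
                 sample_rate: float, rng: random.Random) -> bool:
    """Fetch likely-changed documents, plus a random sample of the rest.

    The sample lets the crawler measure how often its "unchanged"
    prediction was wrong.
    """
    if history.change_probability() >= threshold:
        return True
    return rng.random() < sample_rate


def record(history: CrawlHistory, changed: bool) -> None:
    """Update the history after a document was fetched and compared."""
    history.checks += 1
    history.changes += int(changed)
```

A document that has been stable over many crawls drops below the threshold and is only fetched when it falls into the random sample.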
22 Adaptive Crawl Sample
- Documents modified per crawl attempt
23 Adaptive Crawl Metrics
24 Event Viewer Metrics
- Full, incremental and adaptive crawls
- Start and stop times
- Counts of successful and failing documents
- Propagation
- Need to check on both DCI and portal
- Start and stop times
- DCI-to-portal copy time
- Portal initialization time
25 Using PerfMon
- Start the Performance control panel
- Select from the Search performance objects
- Select counters from the performance object
26 Performance Counters
- Performance control panel provides central location for metrics
- Search provides counters for
- Gatherer, indexer and query engine
- Per-catalog metrics
- Process performance object for CPU, memory, I/O
27
- Use PerfMon to check number of documents and indexing rate
28 Disk Tuning
- Use disk striping with parity (RAID level 5)
- Spread Search data files across disks
- OS volume
- Temporary directories
- Content index
- Search property store
- Gather logs
29 Making Search Meaningful
30 Meaningful Search
- Category Assistant
- Understanding Portal Search
- SharePoint Portal Server schema
- Search predicates
- Auto-nomination of best bets
- Property Mapping
- Namespaces
- Tips and Tricks
31 Category Assistant
- Automatically categorizes large numbers of documents
- Uses a set of training documents as examples
- Determines the characteristic words for each category
- Efficient Support Vector Machine (SVM) algorithm from Microsoft Research
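To make "training documents" and "characteristic words" concrete, here is a deliberately simple word-count classifier. It is not an SVM and not the Category Assistant algorithm; it is a swapped-in toy technique, and `train`, `characteristic_words`, and `categorize` are hypothetical names.

```python
from collections import Counter


def train(examples: dict[str, list[str]]) -> dict[str, Counter]:
    """Count word occurrences per category from its training documents."""
    return {
        category: Counter(word for doc in docs for word in doc.lower().split())
        for category, docs in examples.items()
    }


def characteristic_words(model: dict[str, Counter], category: str,
                         top: int = 3) -> list[str]:
    """Most frequent words seen in this category and in no other category."""
    others: set[str] = set()
    for name, counts in model.items():
        if name != category:
            others.update(counts)
    own = [(w, c) for w, c in model[category].items() if w not in others]
    own.sort(key=lambda item: (-item[1], item[0]))  # by count, then alphabetically
    return [w for w, _ in own[:top]]


def categorize(model: dict[str, Counter], document: str) -> str:
    """Assign the category whose training word counts overlap the document most."""
    words = document.lower().split()
    scores = {name: sum(counts[w] for w in words) for name, counts in model.items()}
    # Iterate names in sorted order so ties resolve deterministically.
    return max(sorted(scores), key=scores.get)
```

The more training documents each category has, the more stable its word counts become, which is one way to read the minimum document counts asked for during setup.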
32 Category Assistant Steps
- Create at least 5 categories
- Assign at least 10 documents to each category
- Enable the Category Assistant
- Click Train Now
- Select internal or external documents
- Index external content
33 Portal Search Internals
- SharePoint Portal Server schema
- Search property store
- Indexed and retrievable properties
- BestBets property
- Property weighting
- Portal search query
34 Schema
- Documents are assigned profiles
- Profiles are collections of properties
- Properties have attributes, including
- Required
- Content indexed
- Retrievable
- Profiles and properties are known by URNs
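The schema relationships listed here can be modeled with a couple of small data classes. This is a sketch for illustration: `PropertyDef`, `Profile`, `missing_required`, and every URN shown are made up and are not the real SharePoint Portal Server schema names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PropertyDef:
    """A schema property, identified by its URN, with the attributes from the slide."""
    urn: str
    required: bool = False
    content_indexed: bool = False
    retrievable: bool = False


@dataclass
class Profile:
    """A document profile: a URN plus a collection of properties."""
    urn: str
    properties: dict[str, PropertyDef] = field(default_factory=dict)

    def add(self, prop: PropertyDef) -> None:
        self.properties[prop.urn] = prop

    def missing_required(self, values: dict[str, str]) -> list[str]:
        """Return URNs of required properties that have no value."""
        return sorted(
            urn for urn, prop in self.properties.items()
            if prop.required and not values.get(urn)
        )


# Made-up URNs for illustration only.
TITLE = PropertyDef("urn:example:properties:title", required=True,
                    content_indexed=True, retrievable=True)
AUTHOR = PropertyDef("urn:example:properties:author", retrievable=True)

report = Profile("urn:example:profiles:report")
report.add(TITLE)
report.add(AUTHOR)
```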
35 Query Architecture: Content Index and Property Store
[Diagram. Labels: Full-text Engine, Query Engine, Relational Property Value Queries, Full-text Query, Web Storage System]
36 Portal Search Query
- Rank 1000: exact match on BestBets value
- Rank 999: CONTAINS predicate on BestBets value
- Rank 999 to 500: FREETEXT predicate on BestBets value
- Rank 600 to 0: FREETEXT predicate on weighted properties
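The rank tiers above can be expressed as a small lookup. The bounds come from the slide; everything else is an assumption for illustration, including the tier names, the `tier_rank` and `best_rank` helpers, the 0-to-1 relevance score, and the linear mapping of that score into each range.

```python
# (low, high) rank bounds per match tier, taken from the slide.
TIER_RANGES = {
    "bestbets_exact": (1000, 1000),
    "bestbets_contains": (999, 999),
    "bestbets_freetext": (500, 999),
    "weighted_freetext": (0, 600),
}


def tier_rank(tier: str, score: float = 1.0) -> int:
    """Map a relevance score in [0, 1] linearly into the tier's rank range.

    The linear mapping is an assumption; the slide only gives the bounds.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    low, high = TIER_RANGES[tier]
    return low + round((high - low) * score)


def best_rank(matches: dict[str, float]) -> int:
    """Return the highest rank a document earns across all tiers it matched."""
    return max((tier_rank(t, s) for t, s in matches.items()), default=0)
```

With the bounds as listed, the last two ranges overlap, so a strong hit on weighted properties (up to 600) can outrank a weak FREETEXT hit on a BestBets value (down to 500).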
37 Weighted Properties
- Relevant properties are strongly weighted
- Title, Subject, Keywords
- Other properties receive a default weight
- Content, custom properties
- Poor indicators of content receive a weight of 0
- Parent folder name
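A rough sketch of how per-property weights could feed a score. The three weight classes follow the slide, but the numeric values, the property names used as keys, and the term-count scoring in `weighted_score` are assumptions, not the product's ranking formula.

```python
STRONG_WEIGHT = 3.0   # assumed value for Title, Subject, Keywords
DEFAULT_WEIGHT = 1.0  # assumed value for content and custom properties

PROPERTY_WEIGHTS = {
    "Title": STRONG_WEIGHT,
    "Subject": STRONG_WEIGHT,
    "Keywords": STRONG_WEIGHT,
    "ParentFolderName": 0.0,  # poor indicator of content
}


def weighted_score(term: str, document: dict[str, str]) -> float:
    """Sum, over all properties, weight times occurrences of the term."""
    needle = term.lower()
    total = 0.0
    for name, text in document.items():
        # Properties without an explicit entry get the default weight.
        weight = PROPERTY_WEIGHTS.get(name, DEFAULT_WEIGHT)
        total += weight * text.lower().split().count(needle)
    return total
```

In this sketch a single hit in the title counts as much as three hits in the body, and a hit in the parent folder name counts for nothing.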
38 Improve Query Results
- Tag documents with Categories and specific BestBets values
- Use the Office document properties
- Search for specific words or phrases in the document content or properties
39 Property Mapping
- External content sources may have well-known properties or meta-tags
- Map source tags to document profiles and properties
- Same mechanism used by the Notes Content Source wizard
40 Property Mapping Steps
- Create a document profile
- Profile can include custom properties
- Create an external content source
- But don't index yet . . .
- Run property mapping script
- Maps source properties to target properties
- Restart the SharePoint Portal services
- Index the content source
41 Property Mapping Tips
- Namespaces for HTML tags, SharePoint Portal properties
- Namespace for target properties
- First property occurrence wins
- Supports scalar, text properties only
- Mapping required on both DCI and portal machines
- For more info
- Whitepaper from MikeFitz/SidH
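Two of the tips, "first property occurrence wins" and "scalar, text properties only", can be illustrated with a short mapping function. This is a sketch, not the actual property mapping script: `map_properties`, the case-insensitive source names, and the example URNs used in tests or callers are hypothetical, and reading "first occurrence" as the first value seen per target property is an assumption.

```python
def map_properties(source: list[tuple[str, object]],
                   mapping: dict[str, str]) -> dict[str, str]:
    """Copy mapped source properties onto target property names.

    `source` is the ordered list of (name, value) pairs found in a document,
    such as HTML meta-tags; `mapping` goes from lowercase source name to
    target property name. Only scalar text values are mapped, and the first
    occurrence for a target property wins.
    """
    target: dict[str, str] = {}
    for name, value in source:
        target_name = mapping.get(name.lower())
        if target_name is None or not isinstance(value, str):
            continue  # unmapped property, or not a scalar text value
        target.setdefault(target_name, value)  # keep the first value seen
    return target
```

For example, a page carrying two author meta-tags would keep only the first one, and a multi-valued keywords list would be skipped entirely.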
42 Extending Search
- IFilter extracts text from documents
- Text, HTML, Office, TIFF provided
- MIME provided with Windows 2000
- Platform SDK documents interfaces
- Protocol handlers crawl content sources
- File, HTTP, Notes, Exchange 5.5 provided
- SharePoint Portal Server SDK (RTM) documents interfaces
43 Summary
- Content source wizards and log viewer ease administration
- Indexing is faster!
- Adaptive crawling can decrease crawl times even more!
- Improved search results
- Category Assistant and hierarchies
- Best bet tagging for queries
- Property mapping scripts
46 Reference
47 Architecture Ref1
- Search Engine. Component of MSSearch that runs queries written in the SQL full-text extension syntax against the full-text index.
- Index Engine. Component of MSSearch that processes chunks of text and properties filtered from content sources, and determines which properties are written to the full-text index.
- Gatherer. Component of MSSearch that manages the content crawling process and that has rules that determine what content is crawled.
48 Architecture Ref2
- Word breakers. Components shared by the Search and Index engines that break up compound words and phrases.
- Stemmers. Components shared by the Search and Index engines that generate inflected forms of a word.
- Filter Daemon. Component that handles requests from the Gatherer. Uses protocol handlers to access content sources, and IFilters to filter files. Provides the Gatherer with a stream of data containing filtered chunks and properties.
- Protocol Handlers. Open content sources in their native protocol and expose documents and other items to be filtered.
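The flow described on these reference slides (Gatherer, Filter Daemon, protocol handler, filter, index) can be sketched end to end in a few functions. This is a toy model only: the real components are COM interfaces, and `ProtocolHandler`, `Filter`, `plain_text_filter`, `filter_daemon`, `gather`, and the whitespace-based word breaking are hypothetical stand-ins.

```python
from typing import Callable, Iterator

# Hypothetical stand-ins: a protocol handler opens a URL and returns raw
# bytes; a filter turns raw bytes into a stream of text chunks.
ProtocolHandler = Callable[[str], bytes]
Filter = Callable[[bytes], Iterator[str]]


def plain_text_filter(raw: bytes) -> Iterator[str]:
    """Toy filter: emit one chunk per non-empty line."""
    for line in raw.decode("utf-8").splitlines():
        if line.strip():
            yield line.strip()


def filter_daemon(url: str,
                  handlers: dict[str, ProtocolHandler],
                  filters: dict[str, Filter]) -> Iterator[str]:
    """Pick a handler by URL scheme and a filter by file extension, then stream chunks."""
    scheme = url.split(":", 1)[0].lower()
    extension = url.rsplit(".", 1)[-1].lower()
    raw = handlers[scheme](url)
    yield from filters[extension](raw)


def gather(urls: list[str],
           handlers: dict[str, ProtocolHandler],
           filters: dict[str, Filter]) -> dict[str, set[str]]:
    """Toy gatherer plus index engine: build an inverted index of word -> URLs."""
    index: dict[str, set[str]] = {}
    for url in urls:
        for chunk in filter_daemon(url, handlers, filters):
            for word in chunk.lower().split():  # crude word breaking
                index.setdefault(word, set()).add(url)
    return index
```

Supporting a new content source in this sketch means registering one more handler under its URL scheme, which mirrors the point that the protocol portion of the URL distinguishes content source types.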
49 Architecture Ref3
- IFilters. Open documents and other content source items in their native format and filter into chunks of text and properties.
- Content sources. Collection of data MSSearch must crawl, and specific rules for crawling items in that content source. Items in content sources are identified by URLs. The protocol portion of the URL is what distinguishes different types of content sources.
- Data Access. SharePoint Portal Server uses protocol handlers and the Gatherer to crawl and provide search results over data from diverse content sources. Without modification, SharePoint Portal Server can crawl documents from file systems, Web sites, Exchange 2000 Server and Exchange Server 5.5 computers, Lotus Notes servers, and other SharePoint Portal Server workspaces.
50 Raffle!!!