Prof. Ray Larson presentation

1
Lecture 23 Web Searching
Principles of Information Retrieval

Prof. Ray Larson
University of California, Berkeley
School of Information
Tuesday and Thursday 10:30 am - 12:00 pm
Spring 2007
http://courses.ischool.berkeley.edu/i240/s07

2
Mini-TREC

Proposed Schedule
February 15: Database and previous queries
February 27: Report on system acquisition and
setup
March 8: New queries for testing
April 19: Results due (next Thursday)
April 24 or 26: Results and system rankings
May 8: Group reports and discussion

3
All Minitrec Runs
4
All Groups Best Runs
5
All Groups Best Runs RRL
6
Results Data

trec_eval runs for each submitted file have been
put into a new directory called RESULTS in your
group directories
The trec_eval parameters used for these runs are
-o for the .res files and -o q for the
.resq files. The .dat files contain the
recall level and precision values used for the
preceding plots
The qrels for the Mini-TREC queries are available
now in the /projects/i240 directory as
MINI_TREC_QRELS

7
Mini-TREC Reports

In-Class Presentations May 8th
Written report due May 8th (Last day of Class)
4-5 pages
Content
System description
What approach/modifications were taken?
Results of official submissions (see RESULTS)
Results of post-runs: new runs with results
using MINI_TREC_QRELS and trec_eval

8
Term Paper

Should be about 8-15 pages on
some area of IR research (or practice) that you
are interested in and want to study further
Experimental tests of systems or IR algorithms
Build an IR system, test it, and describe the
system and its performance
Due May 8th (Last day of class)

9
Today

Review
Web Crawling and Search Issues
Web Search Engines and Algorithms
Web Search Processing
Parallel Architectures (Inktomi - Brewer)
Cheshire III Design

Credit for some of the slides in this lecture
goes to Marti Hearst and Eric Brewer
10
Web Crawlers

How do the web search engines get all of the
items they index?
More precisely
Put a set of known sites on a queue
Repeat the following until the queue is empty
Take the first page off of the queue
If this page has not yet been processed
Record the information found on this page
Positions of words, links going out, etc
Add each link on the current page to the queue
Record that this page has been processed
In what order should the links be followed?
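
The queue-based loop above can be sketched in a few lines of Python. This is a minimal sketch, not any search engine's crawler: the `crawl` function and the in-memory `WEB` dictionary (page -> outgoing links) are assumptions that stand in for real HTTP fetching and HTML parsing.

```python
from collections import deque

# Hypothetical toy "web": each page maps to the links found on it.
WEB = {
    "a.example/": ["a.example/x", "b.example/"],
    "a.example/x": ["a.example/"],            # links back: a cycle
    "b.example/": ["a.example/x", "c.example/"],
    "c.example/": [],
}

def crawl(seeds, fetch_links):
    """Visit every page reachable from seeds; return pages in processing order."""
    queue = deque(seeds)           # put a set of known sites on a queue
    processed = set()
    order = []
    while queue:                   # repeat until the queue is empty
        page = queue.popleft()     # take the first page off of the queue
        if page in processed:      # only handle pages not yet processed
            continue
        links = fetch_links(page)  # stands in for "record the information found"
        order.append(page)
        queue.extend(links)        # add each link on the current page to the queue
        processed.add(page)        # record that this page has been processed
    return order

print(crawl(["a.example/"], lambda page: WEB.get(page, [])))
```

Because the queue is first-in, first-out, this version follows links breadth-first; the `processed` set is what keeps the cycle between the first two pages from looping forever.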

11
Page Visit Order

Animated examples of breadth-first vs depth-first
search on trees
http://www.rci.rutgers.edu/cfs/472_html/AI_SEARCH/ExhaustiveSearch.html
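
The difference between the two visit orders comes down to which end of the frontier the next page is taken from. The sketch below assumes a small hypothetical site tree (`TREE`) and an illustrative `visit_order` function; it is meant for trees, as in the animated examples, not for general graphs.

```python
from collections import deque

# Hypothetical site tree: page -> child pages.
TREE = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1"],
        "A1": [], "A2": [], "B1": []}

def visit_order(start, breadth_first=True):
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # FIFO (popleft) gives breadth-first; LIFO (pop) gives depth-first.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        children = TREE[page]
        # For depth-first, push children reversed so the leftmost is visited first.
        for child in (children if breadth_first else reversed(children)):
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return order

print(visit_order("root", breadth_first=True))
print(visit_order("root", breadth_first=False))
```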

12
Sites Are Complex Graphs, Not Just Trees
13
Web Crawling Issues

Keep out signs
A file called robots.txt tells the crawler which
directories are off limits
Freshness
Figure out which pages change often
Recrawl these often
Duplicates, virtual hosts, etc
Convert page contents with a hash function
Compare new pages to the hash table
Lots of problems
Server unavailable
Incorrect html
Missing links
Infinite loops
Web crawling is difficult to do robustly!
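
Two of the issues above, the robots.txt "keep out" rules and duplicate detection by hashing, can be sketched with the Python standard library. This is a minimal sketch under stated assumptions: the robots.txt text is supplied inline rather than downloaded, the agent name `example-crawler` is hypothetical, and an exact SHA-1 hash only catches byte-identical pages, not near-duplicates.

```python
import hashlib
from urllib.robotparser import RobotFileParser

# Rules parsed from robots.txt text (inline here; normally fetched from the site).
rules = RobotFileParser()
rules.parse("User-agent: *\nDisallow: /private/\n".splitlines())

seen_hashes = set()   # the "hash table" of page contents already stored

def allowed(url, agent="example-crawler"):
    """True if robots.txt permits this (hypothetical) agent to fetch the URL."""
    return rules.can_fetch(agent, url)

def is_new_content(page_text):
    """Hash the page contents; report whether this exact content is new."""
    key = hashlib.sha1(page_text.encode("utf-8")).hexdigest()
    if key in seen_hashes:
        return False
    seen_hashes.add(key)
    return True
```

A crawler would call `allowed` before fetching a page and `is_new_content` after fetching it, so that mirrored or virtual-host copies are stored only once.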

14
Search Engines

Crawling
Indexing
Querying

15
Web Search Engine Layers
From description of the FAST search engine, by
Knut Risvik:
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
16
Standard Web Search Engine Architecture
[Figure: crawl the web; check for duplicates, store the documents (DocIds); create an inverted index; search engine servers take the user query, use the inverted index, and show results to the user]
17
More detailed architecture, from Brin & Page
98. Only covers the preprocessing in detail, not
the query serving.
18
Indexes for Web Search Engines

Inverted indexes are still used, even though the
web is so huge
Most current web search systems partition the
indexes across different machines
Each machine handles different parts of the data
(Google uses thousands of PC-class processors and
keeps most things in main memory)
Other systems duplicate the data across many
machines
Queries are distributed among the machines
Most do a combination of these
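
A minimal sketch of the partitioning idea: each shard holds an inverted index for its own slice of the documents, and a query is sent to every shard and the hits are merged. The class names, the modulo placement rule, and the whitespace tokenizer are assumptions for illustration; they are not taken from any particular engine.

```python
from collections import defaultdict

class Shard:
    """One machine's slice of the collection, with its own inverted index."""
    def __init__(self):
        self.postings = defaultdict(set)      # term -> set of doc ids

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, terms):
        """Doc ids in this shard that contain every query term (Boolean AND)."""
        sets = [self.postings.get(term, set()) for term in terms]
        return set.intersection(*sets) if sets else set()

class PartitionedIndex:
    def __init__(self, n_shards):
        self.shards = [Shard() for _ in range(n_shards)]

    def add(self, doc_id, text):
        # Assumed placement rule: doc id modulo the number of shards.
        self.shards[doc_id % len(self.shards)].add(doc_id, text)

    def search(self, query):
        terms = query.lower().split()
        hits = set()
        for shard in self.shards:             # a real system would do this in parallel
            hits |= shard.search(terms)
        return sorted(hits)

index = PartitionedIndex(n_shards=2)
index.add(0, "web search engines")
index.add(1, "web crawlers follow links")
index.add(2, "search engine index")
index.add(3, "inverted index for web search")
print(index.search("web search"))
```

Replication, the other strategy mentioned above, would correspond to keeping several copies of each `Shard` and sending each query to only one copy.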

19
Search Engine Querying
In this example, the data for the pages is
partitioned across machines. Additionally, each
partition is allocated multiple machines to
handle the queries. Each row can handle 120
queries per second. Each column can handle 7M
pages. To handle more queries, add another row.
From description of the FAST search engine, by
Knut Risvik:
http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
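
The row and column figures above lend themselves to a quick sizing calculation. The sketch below uses the two numbers from the slide; the workload values and the assumption of one machine per grid cell are hypothetical.

```python
QPS_PER_ROW = 120              # from the slide: each row handles 120 queries/second
PAGES_PER_COLUMN = 7_000_000   # from the slide: each column handles 7M pages

def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)

def grid_size(total_pages, peak_qps):
    """Rows, columns and machines needed, assuming one machine per grid cell."""
    columns = ceil_div(total_pages, PAGES_PER_COLUMN)
    rows = ceil_div(peak_qps, QPS_PER_ROW)
    return rows, columns, rows * columns

# Hypothetical workload: 70M pages at a peak of 600 queries per second.
print(grid_size(70_000_000, 600))
```

Growing the collection adds columns; growing the query load adds rows, which is the scaling rule the slide describes.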
20
Querying: Cascading Allocation of CPUs

A variation on this that produces cost savings
Put high-quality/common pages on many machines
Put lower quality/less common pages on fewer
machines
Query goes to high quality machines first
If no hits found there, go to other machines
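
The cascade can be sketched as a loop over tiers that stops at the first tier with hits. The tier layout (a list of term -> doc-id dictionaries, best tier first) and the Boolean AND matching are assumptions made for this illustration.

```python
def cascade_search(query_terms, tiers):
    """Try each tier in order; return (tier number, hits) for the first tier with hits.

    tiers: list of inverted indexes (term -> set of doc ids), best tier first.
    """
    for level, index in enumerate(tiers):
        sets = [index.get(term, set()) for term in query_terms]
        hits = set.intersection(*sets) if sets else set()
        if hits:
            return level, sorted(hits)   # found in this tier: stop here
    return None, []                      # no tier had a match

# Hypothetical two-tier setup: tier 0 is the high-quality/common pages.
tiers = [
    {"toyota": {1, 2}, "cars": {2}},
    {"toyota": {7}, "fanclub": {7, 9}},
]
print(cascade_search(["toyota", "cars"], tiers))
print(cascade_search(["toyota", "fanclub"], tiers))
```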

21
Google

Google maintains (probably) the world's largest
Linux cluster (over 15,000 servers)
These are partitioned between index servers and
page servers
Index servers resolve the queries (massively
parallel processing)
Page servers deliver the results of the queries
Over 8 Billion web pages are indexed and served
by Google

22
Search Engine Indexes

Starting Points for Users include
Manually compiled lists
Directories
Page popularity
Frequently visited pages (in general)
Frequently visited pages as a result of a query
Link co-citation
Which sites are linked to by other sites?

23
Starting Points: What is Really Being Used?

Today's search engines combine these methods in
various ways
Integration of Directories
Today most web search engines integrate
categories into the results listings
Lycos, MSN, Google
Link analysis
Google uses it; others are also using it
Words on the links seem to be especially useful
Page popularity
Many use DirectHit's popularity rankings

24
Web Page Ranking

Varies by search engine
Pretty messy in many cases
Details usually proprietary and fluctuating
Combining subsets of
Term frequencies
Term proximities
Term position (title, top of page, etc)
Term characteristics (boldface, capitalized, etc)
Link analysis information
Category information
Popularity information

25
Ranking: Hearst 96

Proximity search can help get high-precision
results if the query has more than one term
Combine Boolean and passage-level proximity
Shows significant improvements when retrieving
top 5, 10, 20, 30 documents
Results reproduced by Mitra et al. 98
Google uses something similar
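
One simple way to combine a Boolean filter with proximity is to require every query term and then rank by the shortest span of text that contains them all. The sketch below is an illustration of that idea only; it is not the method of Hearst 96 or of any search engine, and the function names and sample documents are hypothetical.

```python
def smallest_window(doc_tokens, query_terms):
    """Length in tokens of the shortest span containing every query term, or None."""
    needed = set(query_terms)
    last_seen = {}                  # term -> most recent position
    best = None
    for pos, token in enumerate(doc_tokens):
        if token in needed:
            last_seen[token] = pos
            if len(last_seen) == len(needed):
                # Shortest span ending here starts at the oldest "most recent" position.
                span = pos - min(last_seen.values()) + 1
                if best is None or span < best:
                    best = span
    return best

def rank_by_proximity(docs, query):
    terms = query.lower().split()
    scored = []
    for doc_id, text in docs.items():
        span = smallest_window(text.lower().split(), terms)
        if span is not None:        # Boolean AND: all terms must occur
            scored.append((span, doc_id))
    return [doc_id for span, doc_id in sorted(scored)]

docs = {
    "d1": "toyota dealers sell many used and new cars",
    "d2": "new toyota cars",
    "d3": "cars only",
}
print(rank_by_proximity(docs, "toyota cars"))
```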

26
Ranking: Link Analysis

Assumptions
If the pages pointing to this page are good, then
this is also a good page
The words on the links pointing to this page are
useful indicators of what this page is about
References: Page et al. 98, Kleinberg 98

27
Ranking: Link Analysis

Why does this work?
The official Toyota site will be linked to by
lots of other official (or high-quality) sites
The best Toyota fan-club site probably also has
many links pointing to it
Less high-quality sites do not have as many
high-quality sites linking to them

28
Ranking: PageRank

Google uses the PageRank
We assume page A has pages T1...Tn which point to
it (i.e., are citations). The parameter d is a
damping factor which can be set between 0 and 1.
d is usually set to 0.85. C(A) is defined as the
number of links going out of page A. The PageRank
of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... +
PR(Tn)/C(Tn))
Note that the PageRanks form a probability
distribution over web pages, so the sum of all
web pages' PageRanks will be one
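
The formula can be evaluated by simple iteration: start every page at 1 and reapply the formula until the values settle. The sketch below is a minimal illustration on a hypothetical three-page graph; the function name, the fixed iteration count, and the graph are assumptions. One caveat: with the (1-d) term used exactly as written, the values sum to the number of pages when every page has at least one outgoing link; dividing the (1-d) term by the number of pages is the variant whose values sum to one.

```python
def pagerank(links, d=0.85, iterations=100):
    """links: page -> list of pages it points to. Returns page -> PR value."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    out_count = {p: len(links.get(p, [])) for p in pages}    # C(p)
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            # Sum PR(T)/C(T) over every page T that links to A.
            incoming = sum(pr[t] / out_count[t]
                           for t in pages if a in links.get(t, []))
            new_pr[a] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical graph: A links to B and C, B links to C, C links back to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
for page in sorted(ranks):
    print(page, round(ranks[page], 4))
```

In this graph C ends up with the highest value because it receives links from both other pages, and A benefits from being the only page C points to.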

29
PageRank
Note: these are not real PageRanks, since they
include values > 1
[Figure: link graph with node values A Pr=4.2544375; T1 Pr=.725; T2 through T7 Pr=1; T8 Pr=2.46625; X1 and X2 shown without values]
30
PageRank

Similar to calculations used in scientific
citation analysis (e.g., Garfield et al.) and
social network analysis (e.g., Wasserman et al.)
Similar to other work on ranking (e.g., the hubs
and authorities of Kleinberg et al.)
How is Amazon similar to Google in terms of the
basic insights and techniques of PageRank?
How could PageRank be applied to other problems
and domains?

31
Today

Review
Web Crawling and Search Issues
Web Search Engines and Algorithms
Web Search Processing
Parallel Architectures (Inktomi - Eric Brewer)
Cheshire III Design

Credit for some of the slides in this lecture
goes to Marti Hearst and Eric Brewer
52
Grid-based Search and Data Mining Using Cheshire3
Presented by Ray R. Larson University of
California, Berkeley School of Information

In collaboration with
Robert Sanderson
University of Liverpool
Department of Computer Science

53
Overview

The Grid, Text Mining and Digital Libraries
Grid Architecture
Grid IR Issues
Cheshire3: Bringing Search to Grid-Based Digital
Libraries
Overview
Grid Experiments
Cheshire3 Architecture
Distributed Workflows

54
Grid Architecture -- (Dr. Eric Yen, Academia
Sinica, Taiwan.)
[Figure: layered Grid architecture, top to bottom]
Applications: High energy physics, Chemical Engineering, Climate, Astrophysics, Cosmology, Combustion, ...
Application Toolkits: Remote Computing, Remote Visualization, Collaboratories, Remote sensors, Data Grid, Portals, ...
Grid Services: Grid middleware (protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.)
Grid Fabric: Storage, networks, computers, display devices, etc. and their associated local services
55
Grid Architecture (ECAI/AS Grid Digital Library
Workshop)
[Figure: layered Grid architecture extended for digital libraries, top to bottom]
Applications: Digital Libraries, High energy physics, Humanities computing, Bio-Medical, Chemical Engineering, Astrophysics, Climate, Cosmology, Combustion
Application Toolkits: Text Mining, Remote Computing, Remote Visualization, Metadata management, Search & Retrieval, Collaboratories, Remote sensors, Data Grid, Portals
Grid Services: Grid middleware (protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.)
Grid Fabric: Storage, networks, computers, display devices, etc. and their associated local services
56
Grid-Based Digital Libraries

Large-scale distributed storage requirements and
technologies
Organizing distributed digital collections
Shared Metadata standards and requirements
Managing distributed digital collections
Security and access control
Collection Replication and backup
Distributed Information Retrieval issues and
algorithms

57
Grid IR Issues

Want to preserve the same retrieval performance
(precision/recall) while hopefully increasing
efficiency (i.e., speed)
Very large-scale distribution of resources is a
challenge for sub-second retrieval
Different from most other typical Grid processes,
IR is potentially less computing intensive and
more data intensive
In many ways Grid IR replicates the process (and
problems) of metasearch or distributed search

58
Introduction

Cheshire History
Developed at UC Berkeley originally
Solution for library data (C1), then SGML (C2),
then XML
Monolithic applications for indexing and
retrieval server in C + TCL scripting
Cheshire3
Developed at Liverpool, plus Berkeley
XML, Unicode, Grid scalable Standards based
Object Oriented Framework
Easy to develop and extend in Python

59
Introduction

Today
Version 0.9.4
Mostly stable, but needs thorough QA and docs
Grid, NLP and Classification algorithms
integrated
Near Future
June: Version 1.0
Further DM/TM integration, docs, unit tests,
stability
December: Version 1.1
Grid out-of-the-box, configuration GUI

60
Context

Environmental Requirements
Very Large scale information systems
Terabyte scale (Data Grid)
Computationally expensive processes (Comp. Grid)
Digital Preservation
Analysis of data, not just retrieval (Data/Text
Mining)
Ease of Extensibility, Customizability (Python)
Open Source
Integrate not Re-implement
"Web 2.0" interactivity and dynamic interfaces

61
Context
62
Cheshire3 Object Model
Protocol Handler
Record
63
Object Configuration

One XML 'record' per non-data object
Very simple base schema, with extensions as
needed
Identifiers for objects unique within a
context (e.g., unique at individual database
level, but not necessarily between all databases)
Allows workflows to reference by identifier but
act appropriately within different contexts.
Allows multiple administrators to define objects
without reference to each other

64
Grid

Focus on ingest, not discovery (yet)
Instantiate architecture on every node
Assign one node as master, rest as slaves. Master
then divides the processing as appropriate.
Calls between slaves possible
Calls as small, simple as possible
(objectIdentifier, functionName, arguments)
Typically: ('workflow-id', 'process',
'document-id')
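
The calling convention above, small tuples of (objectIdentifier, functionName, arguments) resolved on whichever node receives them, can be sketched as follows. This is an illustration only and not Cheshire3 code: the `OBJECTS` registry, the `ToyWorkflow` class, and the use of a local thread pool in place of Grid nodes are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

class ToyWorkflow:
    """Hypothetical stand-in for a configured workflow object."""
    def process(self, document_id):
        return f"processed {document_id}"

# Hypothetical registry: every node instantiates the same objects under the same ids.
OBJECTS = {"workflow-id": ToyWorkflow()}

def run_task(task):
    """Runs on a worker: resolve the identifier locally, then call the named function."""
    object_id, function_name, *arguments = task
    target = OBJECTS[object_id]
    return getattr(target, function_name)(*arguments)

def master(document_ids, n_workers=4):
    """Divide the work into small task tuples and hand them to the workers."""
    tasks = [("workflow-id", "process", doc_id) for doc_id in document_ids]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(run_task, tasks))

print(master(["doc-1", "doc-2", "doc-3"]))
```

Only identifiers travel between master and workers, which is why instantiating the whole architecture on every node keeps the calls small.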

65
Grid Architecture
[Figure: the Master Task sends (workflow, process, document) calls to Slave Task 1 ... Slave Task N; each slave task fetches its document from the Data Grid and writes extracted data to GPFS Temporary Storage]
66
Grid Architecture - Phase 2
[Figure: the Master Task sends (index, load) calls to Slave Task 1 ... Slave Task N; each slave task fetches extracted data from GPFS Temporary Storage and stores its index in the Data Grid]
67
Workflow Objects

Written as XML within the configuration record.
Rewrites and compiles to Python code on object
instantiation
Current instructions
object
assign
fork
for-each
break/continue
try/except/raise
return
log (send text to default logger object)
Yes, no if!

68
Workflow example
<subConfig id="buildSingleWorkflow">
  <objectType>workflow.SimpleWorkflow</objectType>
  <workflow>
    <object type="workflow" ref="PreParserWorkflow"/>
    <try>
      <object type="parser" ref="NsSaxParser"/>
    </try>
    <except>
      <log>Unparsable Record</log>
      <raise/>
    </except>
    <object type="recordStore" function="create_record"/>
    <object type="database" function="add_record"/>
    <object type="database" function="index_record"/>
    <log>Loaded Record input.id</log>
  </workflow>
</subConfig>
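
To make the control flow of this workflow easier to follow, here is a hand-written Python equivalent. It is an illustrative sketch only, not the code Cheshire3 generates: the function name, the arguments, and the method signatures of the stand-in objects are all hypothetical.

```python
import logging

log = logging.getLogger("workflow-sketch")

def build_single_workflow(input_doc, pre_parser, parser, record_store, database):
    """Hypothetical equivalent of the XML workflow: each step feeds the next."""
    doc = pre_parser(input_doc)              # object type=workflow ref=PreParserWorkflow
    try:
        record = parser(doc)                 # try: object type=parser ref=NsSaxParser
    except Exception:
        log.error("Unparsable Record")       # except: log, then re-raise
        raise
    record = record_store.create_record(record)   # recordStore: create_record
    database.add_record(record)                   # database: add_record
    database.index_record(record)                 # database: index_record
    log.info("Loaded Record %s", getattr(input_doc, "id", None))
    return record
```

The absence of an if instruction shows up here too: the only branching is the try/except around the parser.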
69
Text Mining

Integration of Natural Language Processing tools
Including
Part of Speech taggers (noun, verb,
adjective,...)
Phrase Extraction
Deep Parsing (subject, verb, object,
preposition,...)
Linguistic Stemming (is -> be, fairy -> fairy vs.
is -> is, fairy -> fairi)
Planned: Information Extraction tools

70
Data Mining

Integration of toolkits difficult unless they
support sparse vectors as input - text is high
dimensional, but has lots of zeroes
Focus on automatic classification for predefined
categories rather than clustering
Algorithms integrated/implemented
Perceptron, Neural Network (pure python)
Naïve Bayes (pure python)
SVM (libsvm integrated with python wrapper)
Classification Association Rule Mining (Java)
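
The point about sparse vectors can be made concrete with a small pure-Python Naïve Bayes classifier that stores only non-zero term counts. This is a minimal sketch of the general technique (multinomial Naïve Bayes with add-one smoothing), not the Cheshire3 implementation; the class, the tokenizer, and the training examples are assumptions.

```python
import math
from collections import Counter, defaultdict

def to_sparse(text):
    """Sparse term-frequency vector: only terms that occur are stored."""
    return Counter(text.lower().split())

class NaiveBayes:
    def __init__(self):
        self.class_docs = Counter()               # class -> number of training docs
        self.term_counts = defaultdict(Counter)   # class -> term -> count
        self.vocab = set()

    def train(self, vector, label):
        self.class_docs[label] += 1
        self.term_counts[label].update(vector)
        self.vocab.update(vector)

    def predict(self, vector):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)   # add-one smoothing
            score = math.log(n_docs / total_docs)            # log prior
            for term, tf in vector.items():
                if term in self.vocab:                       # ignore unseen terms
                    score += tf * math.log((counts[term] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

clf = NaiveBayes()
clf.train(to_sparse("inverted index query retrieval ranking"), "IR")
clf.train(to_sparse("protein gene cell sequence"), "Biology")
print(clf.predict(to_sparse("query ranking for retrieval")))
```

Prediction touches only the terms present in the document, so the cost depends on document length rather than on the size of the vocabulary.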

71
Data Mining

Modelled as multi-stage PreParser object
(training phase, prediction phase)
Plus need for AccumulatingDocumentFactory to
merge document vectors together into single
output for training some algorithms (e.g., SVM)
Prediction phase attaches metadata (predicted
class) to document object, which can be stored in
DocumentStore
Document vectors generated per index per
document, so integrated NLP document
normalization for free

72
Data Mining + Text Mining

Testing integrated environment with 500,000
MEDLINE abstracts, using various NLP tools,
classification algorithms, and evaluation
strategies.
Computational grid for distributing expensive NLP
analysis
Results show better accuracy with fewer
attributes

73
Applications (1)

Automated Collection Strength Analysis
Primary aim: Test if data mining techniques
could be used to develop a coverage map of items
available in the London libraries.
The strengths within the library collections were
automatically determined through enrichment and
analysis of bibliographic level metadata records.
This involved very large scale processing of
records to
Deduplicate millions of records
Enrich deduplicated records against database of
45 million
Automatically reclassify enriched records using
machine learning processes (Naïve Bayes)

74
Applications (1)

Data mining enhances collection mapping
strategies by making a larger proportion of the
data usable, by discovering hidden relationships
between textual subjects and hierarchically based
classification systems.
The graph shows the comparison of numbers of
books classified in the domain of Psychology
originally and after enhancement using data
mining

75
Applications (2)

Assessing the Grade Level of NSDL Education
Material
The National Science Digital Library has
assembled a collection of URLs that point to
educational material for scientific disciplines
for all grade levels. These are harvested into
the SRB data grid.
Working with SDSC we assessed the grade-level
relevance by examining the vocabulary used in the
material present at each registered URL.
We determined the vocabulary-based grade-level
with the Flesch-Kincaid grade level assessment.
The domain of each website was then determined
using data mining techniques (TF-IDF derived fast
domain classifier).
This processing was done on the Teragrid cluster
at SDSC.
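
The Flesch-Kincaid grade level is computed as 0.39 x (words per sentence) + 11.8 x (syllables per word) - 15.59. The sketch below applies that formula; the sentence splitter and the vowel-group syllable counter are crude assumptions made for illustration and are not the tools used in the NSDL study.

```python
import re

def count_syllables(word):
    """Crude heuristic (an assumption): count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

simple = "The cat sat. The dog ran."
harder = "Information retrieval evaluation requires considerable experimental methodology."
print(round(flesch_kincaid_grade(simple), 2))
print(flesch_kincaid_grade(harder) > flesch_kincaid_grade(simple))
```

Short sentences of one-syllable words score below grade zero, which is expected behaviour for the formula rather than an error.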

76
Cheshire3 Grid Tests

Running on a 30 processor cluster in Liverpool
using PVM (parallel virtual machine)
Using 16 processors with one master and 22
slave processes we were able to parse and index
MARC data at about 13000 records per second
On a similar setup 610 Mb of TEI data can be
parsed and indexed in seconds

77
SRB and SDSC Experiments

We are working with SDSC to include SRB support
We are planning to continue working with SDSC and
to run further evaluations using the TeraGrid
server(s) through a small grant for 30000 CPU
hours
SDSC's TeraGrid cluster currently consists of
256 IBM cluster nodes, each with dual 1.5 GHz
Intel Itanium 2 processors, for a peak
performance of 3.1 teraflops. The nodes are
equipped with four gigabytes (GBs) of physical
memory per node. The cluster is running SuSE
Linux and is using Myricom's Myrinet cluster
interconnect network.
Planned large-scale test collections include
NSDL, the NARA repository, CiteSeer and the
million books collections of the Internet
Archive

78
Conclusions

Scalable Grid-Based digital library services can
be created and provide support for very large
collections with improved efficiency
The Cheshire3 IR and DL architecture can provide
Grid (or single processor) services for
next-generation DLs
Available as open source via
http://cheshire3.sourceforge.net or
http://www.cheshire3.org/
