Prof. Ray Larson - PowerPoint PPT Presentation

About This Presentation
Title:

Prof. Ray Larson

Description:

Lecture 23: Web Searching Principles of Information Retrieval Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am ... – PowerPoint PPT presentation

Slides: 79
Transcript and Presenter's Notes

Title: Prof. Ray Larson


1
Lecture 23: Web Searching
Principles of Information Retrieval
  • Prof. Ray Larson
  • University of California, Berkeley
  • School of Information
  • Tuesday and Thursday 10:30 am - 12:00 pm
  • Spring 2007
  • http://courses.ischool.berkeley.edu/i240/s07

2
Mini-TREC
  • Proposed Schedule
  • February 15 Database and previous Queries
  • February 27 report on system acquisition and
    setup
  • March 8, New Queries for testing
  • April 19, Results due (Next Thursday)
  • April 24 or 26, Results and system rankings
  • May 8 Group reports and discussion

3
All Minitrec Runs
4
All Groups Best Runs
5
All Groups Best Runs RRL
6
Results Data
  • trec_eval runs for each submitted file have been
    put into a new directory called RESULTS in your
    group directories
  • The trec_eval parameters used for these runs are
    -o for the .res files and -o q for the
    .resq files. The .dat files contain the
    recall level and precision values used for the
    preceding plots
  • The qrels for the Mini-TREC queries are available
    now in the /projects/i240 directory as
    MINI_TREC_QRELS

7
Mini-TREC Reports
  • In-Class Presentations May 8th
  • Written report due May 8th (Last day of Class)
    4-5 pages
  • Content
  • System description
  • What approach/modifications were taken?
  • Results of official submissions (see RESULTS)
  • Results of post-runs: new runs with results
    using MINI_TREC_QRELS and trec_eval

8
Term Paper
  • Should be about 8-15 pages on
  • some area of IR research (or practice) that you
    are interested in and want to study further
  • Experimental tests of systems or IR algorithms
  • Build an IR system, test it, and describe the
    system and its performance
  • Due May 8th (Last day of class)

9
Today
  • Review
  • Web Crawling and Search Issues
  • Web Search Engines and Algorithms
  • Web Search Processing
  • Parallel Architectures (Inktomi - Brewer)
  • Cheshire III Design

Credit for some of the slides in this lecture
goes to Marti Hearst and Eric Brewer
10
Web Crawlers
  • How do the web search engines get all of the
    items they index?
  • More precisely
  • Put a set of known sites on a queue
  • Repeat the following until the queue is empty
  • Take the first page off of the queue
  • If this page has not yet been processed
  • Record the information found on this page
  • Positions of words, links going out, etc
  • Add each link on the current page to the queue
  • Record that this page has been processed
  • In what order should the links be followed?
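
The queue-based procedure above can be sketched in a few lines. This is only an illustration under stated assumptions: the "web" is a hypothetical in-memory dictionary (TOY_WEB), and toy_fetch stands in for a real HTTP request plus HTML parsing.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to (page text, outgoing links).
TOY_WEB = {
    "a.html": ("welcome page", ["b.html", "c.html"]),
    "b.html": ("about us", ["c.html", "a.html"]),
    "c.html": ("contact", ["d.html"]),
    "d.html": ("dead end", []),
}

def toy_fetch(url):
    # Stand-in for an HTTP request plus HTML parsing.
    return TOY_WEB.get(url, ("", []))

def crawl(seeds, fetch):
    """Queue-based crawl following the steps on the slide."""
    queue = deque(seeds)          # put a set of known sites on a queue
    processed = {}                # url -> recorded information
    while queue:                  # repeat until the queue is empty
        url = queue.popleft()     # take the first page off of the queue
        if url in processed:      # skip pages that were already processed
            continue
        text, links = fetch(url)
        # record the information found on this page
        processed[url] = {"words": text.split(), "links": links}
        queue.extend(links)       # add each link on the page to the queue
    return processed

pages = crawl(["a.html"], toy_fetch)
print(list(pages))  # pages in the order they were processed
```

Because the queue is first-in, first-out, this sketch visits pages breadth-first; the "already processed" check is what keeps link cycles (a.html and b.html point at each other) from looping forever.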

11
Page Visit Order
  • Animated examples of breadth-first vs depth-first
    search on trees
  • http://www.rci.rutgers.edu/cfs/472_html/AI_SEARCH/ExhaustiveSearch.html
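
As a rough illustration of the two visit orders, the sketch below walks a small hypothetical link structure (LINKS) once with a queue and once with a stack. The page names are made up for the example.

```python
from collections import deque

# Hypothetical link structure: page -> pages it links to.
LINKS = {
    "root": ["A", "B"],
    "A": ["A1", "A2"],
    "B": ["B1"],
    "A1": [], "A2": [], "B1": [],
}

def visit_order(start, breadth_first=True):
    frontier = deque([start])
    seen = []
    while frontier:
        # FIFO (queue) gives breadth-first; LIFO (stack) gives depth-first.
        page = frontier.popleft() if breadth_first else frontier.pop()
        if page in seen:
            continue
        seen.append(page)
        links = LINKS[page]
        # Reverse for the stack so the first link on a page is explored first.
        frontier.extend(links if breadth_first else reversed(links))
    return seen

bfs = visit_order("root", breadth_first=True)
dfs = visit_order("root", breadth_first=False)
print(bfs)
print(dfs)
```

The only difference between the two traversals is which end of the frontier the next page is taken from.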

12
Sites Are Complex Graphs, Not Just Trees
13
Web Crawling Issues
  • Keep out signs
  • A file called robots.txt tells the crawler which
    directories are off limits
  • Freshness
  • Figure out which pages change often
  • Recrawl these often
  • Duplicates, virtual hosts, etc
  • Convert page contents with a hash function
  • Compare new pages to the hash table
  • Lots of problems
  • Server unavailable
  • Incorrect html
  • Missing links
  • Infinite loops
  • Web crawling is difficult to do robustly!
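
Two of the issues above, keep-out signs and duplicates, can be sketched with the Python standard library. The robots.txt content, the agent name "toy-crawler", and the example URLs are all hypothetical; exact hashing as shown only catches byte-identical pages, not near-duplicates.

```python
import hashlib
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

def allowed(url, agent="toy-crawler"):
    # True if the parsed robots.txt rules let this agent fetch the URL.
    return robots.can_fetch(agent, url)

seen_hashes = {}  # content hash -> first URL seen with that content

def is_duplicate(url, content):
    """Hash the page contents and compare against the table of known hashes."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes[digest] = url
    return False

ok_public = allowed("http://example.com/index.html")
ok_private = allowed("http://example.com/private/data.html")
first = is_duplicate("http://example.com/a", "same text")
second = is_duplicate("http://mirror.example.com/a", "same text")
```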

14
Search Engines
  • Crawling
  • Indexing
  • Querying

15
Web Search Engine Layers
From description of the FAST search engine, by
Knut Risvik: http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
16
Standard Web Search Engine Architecture
[Diagram: crawl the web -> check for duplicates, store the documents (DocIds) -> create an inverted index -> inverted index -> search engine servers; the user query goes to the search engine servers, which show results to the user]
17
More detailed architecture, from Brin & Page
98. Only covers the preprocessing in detail, not
the query serving.
18
Indexes for Web Search Engines
  • Inverted indexes are still used, even though the
    web is so huge
  • Most current web search systems partition the
    indexes across different machines
  • Each machine handles different parts of the data
    (Google uses thousands of PC-class processors and
    keeps most things in main memory)
  • Other systems duplicate the data across many
    machines
  • Queries are distributed among the machines
  • Most do a combination of these
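
A minimal sketch of an inverted index partitioned by document, assuming a toy collection (DOCS) and a simple modulo rule for assigning documents to machines. Both are illustrative choices, not a description of any particular engine.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def partition(docs, n_machines):
    """Document partitioning: each machine indexes a different part of the data."""
    shards = [{} for _ in range(n_machines)]
    for doc_id, text in docs.items():
        shards[doc_id % n_machines][doc_id] = text
    return [build_index(shard) for shard in shards]

def search(shard_indexes, term):
    """Send the query to every partition and merge the partial results."""
    hits = []
    for index in shard_indexes:
        hits.extend(index.get(term.lower(), []))
    return sorted(hits)

DOCS = {  # hypothetical toy collection
    0: "web search engines",
    1: "inverted index for web search",
    2: "parallel search architectures",
    3: "crawling the web",
}
shards = partition(DOCS, n_machines=2)
print(search(shards, "web"))
```

Replication (the other option on the slide) would instead give every machine a full copy of the index and send each query to just one of them.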

19
Search Engine Querying
In this example, the data for the pages is
partitioned across machines. Additionally, each
partition is allocated multiple machines to
handle the queries. Each row can handle 120
queries per second. Each column can handle 7M
pages. To handle more queries, add another row.
From description of the FAST search engine, by
Knut Risvik: http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
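
The row/column arithmetic can be worked through directly. The per-row and per-column figures come from the slide; the load used in the example (70M pages, 300 queries per second) is a made-up assumption.

```python
import math

QUERIES_PER_SECOND_PER_ROW = 120   # from the slide
PAGES_PER_COLUMN = 7_000_000       # from the slide

def grid_size(total_pages, peak_qps):
    """Return (rows, columns, machines) needed for a given load."""
    columns = math.ceil(total_pages / PAGES_PER_COLUMN)
    rows = math.ceil(peak_qps / QUERIES_PER_SECOND_PER_ROW)
    return rows, columns, rows * columns

# Hypothetical load: 70M pages at 300 queries per second.
rows, columns, machines = grid_size(70_000_000, 300)
print(rows, columns, machines)
```

Under these assumptions, growing the collection adds columns and growing the query load adds rows, so the machine count is the product of the two.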
20
Querying: Cascading Allocation of CPUs
  • A variation on this that produces a cost-savings
  • Put high-quality/common pages on many machines
  • Put lower quality/less common pages on fewer
    machines
  • Query goes to high quality machines first
  • If no hits found there, go to other machines
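
A small sketch of the cascade, assuming each tier is just a term-to-pages dictionary. The tier names and contents are hypothetical.

```python
def cascade_search(tiers, term):
    """Query the high-quality tier first; fall through only when it has no hits."""
    for name, index in tiers:
        hits = index.get(term, [])
        if hits:
            return name, hits
    return None, []

# Hypothetical two-tier setup: term -> list of page ids.
TIERS = [
    ("high-quality", {"toyota": ["toyota.com"], "berkeley": ["berkeley.edu"]}),
    ("long-tail", {"toyota": ["fan-page"], "cheshire3": ["cheshire3.org"]}),
]

print(cascade_search(TIERS, "toyota"))
print(cascade_search(TIERS, "cheshire3"))
```

Most queries stop at the first tier, which is why that tier is the one replicated on many machines.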

21
Google
  • Google maintains (probably) the world's largest
    Linux cluster (over 15,000 servers)
  • These are partitioned between index servers and
    page servers
  • Index servers resolve the queries (massively
    parallel processing)
  • Page servers deliver the results of the queries
  • Over 8 Billion web pages are indexed and served
    by Google

22
Search Engine Indexes
  • Starting Points for Users include
  • Manually compiled lists
  • Directories
  • Page popularity
  • Frequently visited pages (in general)
  • Frequently visited pages as a result of a query
  • Link co-citation
  • Which sites are linked to by other sites?

23
Starting Points: What is Really Being Used?
  • Today's search engines combine these methods in
    various ways
  • Integration of Directories
  • Today most web search engines integrate
    categories into the results listings
  • Lycos, MSN, Google
  • Link analysis
  • Google uses it; others are also using it
  • Words on the links seem to be especially useful
  • Page popularity
  • Many use DirectHit's popularity rankings

24
Web Page Ranking
  • Varies by search engine
  • Pretty messy in many cases
  • Details usually proprietary and fluctuating
  • Combining subsets of
  • Term frequencies
  • Term proximities
  • Term position (title, top of page, etc)
  • Term characteristics (boldface, capitalized, etc)
  • Link analysis information
  • Category information
  • Popularity information

25
Ranking: Hearst 96
  • Proximity search can help get high-precision
    results if >1 term
  • Combine Boolean and passage-level proximity
  • Proves significant improvements when retrieving
    top 5, 10, 20, 30 documents
  • Results reproduced by Mitra et al. 98
  • Google uses something similar

26
Ranking: Link Analysis
  • Assumptions
  • If the pages pointing to this page are good, then
    this is also a good page
  • The words on the links pointing to this page are
    useful indicators of what this page is about
  • References: Page et al. 98, Kleinberg 98

27
Ranking: Link Analysis
  • Why does this work?
  • The official Toyota site will be linked to by
    lots of other official (or high-quality) sites
  • The best Toyota fan-club site probably also has
    many links pointing to it
  • Less high-quality sites do not have as many
    high-quality sites linking to them

28
Ranking: PageRank
  • Google uses the PageRank
  • We assume page A has pages T1...Tn which point to
    it (i.e., are citations). The parameter d is a
    damping factor which can be set between 0 and 1.
    d is usually set to 0.85. C(A) is defined as the
    number of links going out of page A. The PageRank
    of a page A is given as follows
  • PR(A) = (1-d) + d (PR(T1)/C(T1) + ... +
    PR(Tn)/C(Tn))
  • Note that the PageRanks form a probability
    distribution over web pages, so the sum of all
    web pages' PageRanks will be one
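
The formula can be iterated until the values settle. The sketch below is a minimal illustration, assuming a hypothetical three-page graph (LINKS) and a fixed number of iterations rather than a convergence test. The last line applies the formula once to an assumed configuration of three in-linking pages with values 1, 1 and 0.725 and one out-link each.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # Contribution of every page t whose out-links include this page.
            incoming = (pr[t] / len(out) for t, out in links.items() if page in out)
            new_pr[page] = (1 - d) + d * sum(incoming)
        pr = new_pr
    return pr

# Hypothetical graph: page -> list of pages it links to.
LINKS = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
ranks = pagerank(LINKS)
print(ranks)

# One application of the formula for an assumed set of three in-links.
single_step = (1 - 0.85) + 0.85 * (1 / 1 + 1 / 1 + 0.725 / 1)
print(single_step)  # about 2.46625
```

In this form, and for a graph where every page has at least one out-link, the values average to 1 across pages rather than summing to 1; dividing the (1-d) term by the number of pages gives the normalized variant.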

29
PageRank
Note: these are not real PageRanks, since they
include values > 1
[Figure: link graph with example values: T1 Pr=.725, T2 Pr=1, T3 Pr=1, T4 Pr=1, T5 Pr=1, T6 Pr=1, T7 Pr=1, T8 Pr=2.46625, A Pr=4.2544375; nodes X1 and X2 have no values shown]
30
PageRank
  • Similar to calculations used in scientific
    citation analysis (e.g., Garfield et al.) and
    social network analysis (e.g., Wasserman et al.)
  • Similar to other work on ranking (e.g., the hubs
    and authorities of Kleinberg et al.)
  • How is Amazon similar to Google in terms of the
    basic insights and techniques of PageRank?
  • How could PageRank be applied to other problems
    and domains?

31
Today
  • Review
  • Web Crawling and Search Issues
  • Web Search Engines and Algorithms
  • Web Search Processing
  • Parallel Architectures (Inktomi - Eric Brewer)
  • Cheshire III Design

Credit for some of the slides in this lecture
goes to Marti Hearst and Eric Brewer
32-51
(No Transcript)
52
Grid-based Search and Data Mining Using Cheshire3
Presented by Ray R. Larson, University of
California, Berkeley, School of Information
  • In collaboration with
  • Robert Sanderson
  • University of Liverpool
  • Department of Computer Science

53
Overview
  • The Grid, Text Mining and Digital Libraries
  • Grid Architecture
  • Grid IR Issues
  • Cheshire3: Bringing Search to Grid-Based Digital
    Libraries
  • Overview
  • Grid Experiments
  • Cheshire3 Architecture
  • Distributed Workflows

54
Grid Architecture -- (Dr. Eric Yen, Academia
Sinica, Taiwan.)
  • Applications: High energy physics, Chemical
    Engineering, Climate, Astrophysics, Cosmology,
    Combustion, ...
  • Application Toolkits: Remote Computing, Remote
    Visualization, Collaboratories, Remote sensors,
    Data Grid, Portals, ...
  • Grid Services: Grid middleware (protocols,
    authentication, policy, instrumentation,
    resource management, discovery, events, etc.)
  • Grid Fabric: Storage, networks, computers,
    display devices, etc. and their associated
    local services
55
Grid Architecture (ECAI/AS Grid Digital Library
Workshop)
  • Applications: Digital Libraries, High energy
    physics, Humanities computing, Bio-Medical,
    Chemical Engineering, Astrophysics, Climate,
    Cosmology, Combustion
  • Application Toolkits: Text Mining, Remote
    Computing, Remote Visualization, Metadata
    management, Search & Retrieval, Collaboratories,
    Remote sensors, Data Grid, Portals
  • Grid Services: Grid middleware (protocols,
    authentication, policy, instrumentation,
    resource management, discovery, events, etc.)
  • Grid Fabric: Storage, networks, computers,
    display devices, etc. and their associated
    local services
56
Grid-Based Digital Libraries
  • Large-scale distributed storage requirements and
    technologies
  • Organizing distributed digital collections
  • Shared Metadata standards and requirements
  • Managing distributed digital collections
  • Security and access control
  • Collection Replication and backup
  • Distributed Information Retrieval issues and
    algorithms

57
Grid IR Issues
  • Want to preserve the same retrieval performance
    (precision/recall) while hopefully increasing
    efficiency (i.e., speed)
  • Very large-scale distribution of resources is a
    challenge for sub-second retrieval
  • Different from most other typical Grid processes,
    IR is potentially less computing intensive and
    more data intensive
  • In many ways Grid IR replicates the process (and
    problems) of metasearch or distributed search

58
Introduction
  • Cheshire History
  • Developed at UC Berkeley originally
  • Solution for library data (C1), then SGML (C2),
    then XML
  • Monolithic applications for indexing and
    retrieval server in C + TCL scripting
  • Cheshire3
  • Developed at Liverpool, plus Berkeley
  • XML, Unicode, Grid scalable, Standards based
  • Object Oriented Framework
  • Easy to develop and extend in Python

59
Introduction
  • Today
  • Version 0.9.4
  • Mostly stable, but needs thorough QA and docs
  • Grid, NLP and Classification algorithms
    integrated
  • Near Future
  • June: Version 1.0
  • Further DM/TM integration, docs, unit tests,
    stability
  • December: Version 1.1
  • Grid out-of-the-box, configuration GUI

60
Context
  • Environmental Requirements
  • Very Large scale information systems
  • Terabyte scale (Data Grid)
  • Computationally expensive processes (Comp. Grid)
  • Digital Preservation
  • Analysis of data, not just retrieval (Data/Text
    Mining)
  • Ease of Extensibility, Customizability (Python)
  • Open Source
  • Integrate not Re-implement
  • "Web 2.0" interactivity and dynamic interfaces

61
Context
62
Cheshire3 Object Model
Protocol Handler
Record
63
Object Configuration
  • One XML 'record' per non-data object
  • Very simple base schema, with extensions as
    needed
  • Identifiers for objects unique within a
    context (e.g., unique at individual database
    level, but not necessarily between all databases)
  • Allows workflows to reference by identifier but
    act appropriately within different contexts.
  • Allows multiple administrators to define objects
    without reference to each other

64
Grid
  • Focus on ingest, not discovery (yet)
  • Instantiate architecture on every node
  • Assign one node as master, rest as slaves. Master
    then divides the processing as appropriate.
  • Calls between slaves possible
  • Calls as small, simple as possible
    (objectIdentifier, functionName, arguments)
  • Typically: ('workflow-id', 'process',
    'document-id')
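
The master/slave division of work can be sketched as a list of small task tuples handed to a pool of workers. This is only an illustration: a local thread pool stands in for the grid nodes, and run_task, master and the document ids are hypothetical names, not Cheshire3 functions.

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(task):
    """Stand-in for a slave node: look up the workflow and process one document."""
    workflow_id, function_name, document_id = task
    # A real node would fetch the document and run the named workflow here.
    return f"{workflow_id}.{function_name}({document_id}) done"

def master(document_ids, n_slaves=4):
    """Divide the work into small (objectIdentifier, functionName, argument) calls."""
    tasks = [("workflow-id", "process", doc_id) for doc_id in document_ids]
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        # map() returns results in input order even though tasks run concurrently.
        return list(pool.map(run_task, tasks))

results = master(["doc-1", "doc-2", "doc-3"])
print(results)
```

Keeping each call down to three short identifiers means the master sends very little data; the workers fetch the documents themselves.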

65
Grid Architecture
[Diagram: the Master Task sends (workflow, process, document) calls to Slave Task 1 ... Slave Task N; each slave task fetches its document from the Data Grid and writes its extracted data to GPFS Temporary Storage]
66
Grid Architecture - Phase 2
[Diagram: the Master Task sends (index, load) calls to Slave Task 1 ... Slave Task N; each slave task fetches extracted data from GPFS Temporary Storage and stores the index in the Data Grid]
67
Workflow Objects
  • Written as XML within the configuration record.
  • Rewrites and compiles to Python code on object
    instantiation
  • Current instructions
  • object
  • assign
  • fork
  • for-each
  • break/continue
  • try/except/raise
  • return
  • log ( send text to default logger object)
  • Yes, no if!

68
Workflow example
<subConfig id="buildSingleWorkflow">
  <objectType>workflow.SimpleWorkflow</objectType>
  <workflow>
    <object type="workflow" ref="PreParserWorkflow"/>
    <try>
      <object type="parser" ref="NsSaxParser"/>
    </try>
    <except>
      <log>Unparsable Record</log>
      <raise/>
    </except>
    <object type="recordStore" function="create_record"/>
    <object type="database" function="add_record"/>
    <object type="database" function="index_record"/>
    <log>Loaded Record input.id</log>
  </workflow>
</subConfig>
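
To make the instruction sequence easier to see, the sketch below parses a workflow configuration with Python's standard ElementTree and lists its steps. It assumes the example above corresponds to XML of roughly the shape embedded here (attribute quoting is reconstructed); the describe helper is hypothetical and is not how Cheshire3 itself compiles workflows.

```python
import xml.etree.ElementTree as ET

# Assumed well-formed rendering of the slide's example.
WORKFLOW_XML = """
<subConfig id="buildSingleWorkflow">
  <objectType>workflow.SimpleWorkflow</objectType>
  <workflow>
    <object type="workflow" ref="PreParserWorkflow"/>
    <try><object type="parser" ref="NsSaxParser"/></try>
    <except><log>Unparsable Record</log><raise/></except>
    <object type="recordStore" function="create_record"/>
    <object type="database" function="add_record"/>
    <object type="database" function="index_record"/>
    <log>Loaded Record input.id</log>
  </workflow>
</subConfig>
"""

def describe(element, depth=0):
    """Return one indented line per workflow instruction."""
    lines = []
    for child in element:
        attrs = " ".join(f"{k}={v}" for k, v in child.attrib.items())
        text = (child.text or "").strip()
        parts = [p for p in (child.tag, attrs, text) if p]
        lines.append("  " * depth + " ".join(parts))
        lines.extend(describe(child, depth + 1))
    return lines

root = ET.fromstring(WORKFLOW_XML)
steps = describe(root.find("workflow"))
print("\n".join(steps))
```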
69
Text Mining
  • Integration of Natural Language Processing tools
  • Including
  • Part of Speech taggers (noun, verb,
    adjective,...)
  • Phrase Extraction
  • Deep Parsing (subject, verb, object,
    preposition,...)
  • Linguistic Stemming (is/be, fairy/fairy vs. is/is,
    fairy/fairi)
  • Planned Information Extraction tools

70
Data Mining
  • Integration of toolkits difficult unless they
    support sparse vectors as input - text is high
    dimensional, but has lots of zeroes
  • Focus on automatic classification for predefined
    categories rather than clustering
  • Algorithms integrated/implemented
  • Perceptron, Neural Network (pure python)
  • Naïve Bayes (pure python)
  • SVM (libsvm integrated with python wrapper)
  • Classification Association Rule Mining (Java)
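
The point about sparse vectors can be illustrated with plain dictionaries that store only the non-zero term counts. The function names and the toy token lists are assumptions made for this sketch.

```python
def to_sparse(tokens, vocabulary):
    """Store only non-zero term counts as {term_index: count}."""
    vector = {}
    for token in tokens:
        # New terms get the next free column index; known terms keep theirs.
        idx = vocabulary.setdefault(token, len(vocabulary))
        vector[idx] = vector.get(idx, 0) + 1
    return vector

def dot(a, b):
    """Sparse dot product: iterate over the shorter vector only."""
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b.get(idx, 0) for idx, value in a.items())

vocab = {}  # term -> column index, grows as new terms are seen
v1 = to_sparse("grid search grid".split(), vocab)
v2 = to_sparse("grid data mining".split(), vocab)
print(v1, v2, dot(v1, v2))
```

A dense representation would need one slot per vocabulary term for every document, nearly all of them zero, which is why toolkits without sparse input are hard to integrate.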

71
Data Mining
  • Modelled as multi-stage PreParser object
    (training phase, prediction phase)
  • Plus need for AccumulatingDocumentFactory to
    merge document vectors together into single
    output for training some algorithms (e.g., SVM)
  • Prediction phase attaches metadata (predicted
    class) to document object, which can be stored in
    DocumentStore
  • Document vectors generated per index per
    document, so integrated NLP document
    normalization for free
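
The two phases can be sketched with a small pure-Python multinomial Naive Bayes classifier. This is a generic textbook version with add-one smoothing, not the Cheshire3 implementation; the training data, labels and the document dictionary are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Training phase: count documents and terms per class."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labelled_docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, term_counts, vocabulary

def predict(model, tokens):
    """Prediction phase: return the class with the highest log-probability."""
    class_counts, term_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_terms = sum(term_counts[label].values())
        for token in tokens:
            # Add-one (Laplace) smoothing avoids log(0) for unseen terms.
            count = term_counts[label][token] + 1
            score += math.log(count / (total_terms + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (tokens, class label).
TRAINING = [
    ("memory cognition behaviour".split(), "psychology"),
    ("perception memory learning".split(), "psychology"),
    ("galaxy star telescope".split(), "astronomy"),
]
model = train(TRAINING)

# Attach the predicted class to the document as metadata.
document = {"tokens": "memory learning".split()}
document["predicted_class"] = predict(model, document["tokens"])
print(document)
```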

72
Data Mining and Text Mining
  • Testing integrated environment with 500,000
    MEDLINE abstracts, using various NLP tools,
    classification algorithms, and evaluation
    strategies.
  • Computational grid for distributing expensive NLP
    analysis
  • Results show better accuracy with fewer
    attributes

73
Applications (1)
  • Automated Collection Strength Analysis
  • Primary aim: Test if data mining techniques
    could be used to develop a coverage map of items
    available in the London libraries.
  • The strengths within the library collections were
    automatically determined through enrichment and
    analysis of bibliographic level metadata records.
  • This involved very large scale processing of
    records to:
  • Deduplicate millions of records
  • Enrich deduplicated records against database of
    45 million
  • Automatically reclassify enriched records using
    machine learning processes (Naïve Bayes)

74
Applications (1)
  • Data mining enhances collection mapping
    strategies by making a larger proportion of the
    data usable, by discovering hidden relationships
    between textual subjects and hierarchically based
    classification systems.
  • The graph shows the comparison of numbers of
    books classified in the domain of Psychology
    originally and after enhancement using data
    mining

75
Applications (2)
  • Assessing the Grade Level of NSDL Education
    Material
  • The National Science Digital Library has
    assembled a collection of URLs that point to
    educational material for scientific disciplines
    for all grade levels. These are harvested into
    the SRB data grid.
  • Working with SDSC we assessed the grade-level
    relevance by examining the vocabulary used in the
    material present at each registered URL.
  • We determined the vocabulary-based grade-level
    with the Flesch-Kincaid grade level assessment.
    The domain of each website was then determined
    using data mining techniques (TF-IDF derived fast
    domain classifier).
  • This processing was done on the Teragrid cluster
    at SDSC.
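
The Flesch-Kincaid grade level is computed from average sentence length and average syllables per word. The sketch below uses the standard formula; the syllable counter is a crude vowel-group heuristic assumed for illustration, so the absolute values are only approximate.

```python
import re

def count_syllables(word):
    """Rough heuristic (an assumption): count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text):
    """Grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

simple = flesch_kincaid_grade("The cat sat on the mat. The dog ran to the cat.")
dense = flesch_kincaid_grade(
    "Photosynthesis converts electromagnetic radiation into chemical energy."
)
print(simple, dense)
```

Short sentences of one-syllable words score very low (even below zero), while long technical vocabulary pushes the grade up sharply.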

76
Cheshire3 Grid Tests
  • Running on a 30-processor cluster in Liverpool
    using PVM (parallel virtual machine)
  • Using 16 processors with one master and 22
    slave processes we were able to parse and index
    MARC data at about 13,000 records per second
  • On a similar setup 610 Mb of TEI data can be
    parsed and indexed in seconds

77
SRB and SDSC Experiments
  • We are working with SDSC to include SRB support
  • We are planning to continue working with SDSC and
    to run further evaluations using the TeraGrid
    server(s) through a small grant for 30,000 CPU
    hours
  • SDSC's TeraGrid cluster currently consists of
    256 IBM cluster nodes, each with dual 1.5 GHz
    Intel Itanium 2 processors, for a peak
    performance of 3.1 teraflops. The nodes are
    equipped with four gigabytes (GBs) of physical
    memory per node. The cluster is running SuSE
    Linux and is using Myricom's Myrinet cluster
    interconnect network.
  • Planned large-scale test collections include
    NSDL, the NARA repository, CiteSeer and the
    million books collections of the Internet
    Archive

78
Conclusions
  • Scalable Grid-Based digital library services can
    be created and provide support for very large
    collections with improved efficiency
  • The Cheshire3 IR and DL architecture can provide
    Grid (or single processor) services for
    next-generation DLs
  • Available as open source via
  • http://cheshire3.sourceforge.net or
  • http://www.cheshire3.org/