Performance Limitations of the Core Java Libraries


1
Performance Limitations of the Core Java Libraries
  • Technical paper by Allan Heydon and Marc Najork,
    Compaq Computer Corporation
  • Presented by Poorvi Parikh

2
Introduction
  • The paper describes the authors' experience using
    Java to build a web crawler. A web crawler has
    significant I/O requirements, must perform well,
    and needs to be fault tolerant
  • However, in places the Java core libraries trade
    off performance for ease of use
  • The paper focuses on the performance problems
    encountered while building the web crawler and
    the workarounds for them

3
The Mercator Web Crawler
  • Mercator is designed to be extensible and
    scalable
  • It is highly multi-threaded, typically using 100
    threads to fetch and process web pages in
    parallel (Java supports this)
  • For portability between Wintel PCs and Alpha
    Unix workstations, Mercator is written in 100%
    Java
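The "many threads fetching in parallel" idea can be illustrated with a small sketch. This is not Mercator's code: it uses java.util.concurrent, which did not exist in Java 1.1, and the class name ParallelCrawlSketch, the method processAll, and the worker function are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical sketch of a fixed pool of worker threads; not Mercator's design. */
final class ParallelCrawlSketch {
    /**
     * Applies worker to every URL using a fixed number of threads and
     * returns the results in the same order as the input list.
     */
    static <R> List<R> processAll(List<String> urls, int threads,
                                  Function<String, R> worker) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<R> task = () -> worker.apply(url);
                pending.add(pool.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> f : pending) {
                results.add(f.get()); // blocks until that task has finished
            }
            return results;
        } finally {
            pool.shutdown(); // let worker threads exit once the queue is drained
        }
    }
}
```

A call such as processAll(urls, 100, fetchAndProcess) would mirror the 100-thread configuration mentioned on the slide, where fetchAndProcess stands in for whatever download and parsing logic the crawler uses.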

4
Performance Issue 1
  • PROBLEM
  • Mercator frequently needs to concatenate strings
    and to convert numbers and IP addresses into
    strings. StringBuffers are used for this, but
    their synchronized methods introduce unnecessary
    lock acquisitions.
  • SOLUTION
  • A Formatter class was implemented. It combines
    StringBuffer functionality with formatting
    facilities, which significantly reduces Mercator's
    lock acquisition rate.
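The idea behind such a formatter can be sketched as follows. This is a minimal illustration under assumptions, not the paper's actual Formatter class: the name UnsyncFormatter and all of its methods are hypothetical. It keeps a private character buffer, takes no locks, and formats integers and IPv4 addresses directly into the buffer without creating intermediate strings.

```java
/** Hypothetical unsynchronized formatter; not Mercator's actual Formatter API. */
final class UnsyncFormatter {
    private char[] buf = new char[64];
    private int len = 0;

    /** Grows the buffer if fewer than extra free slots remain. */
    private void ensure(int extra) {
        if (len + extra > buf.length) {
            char[] bigger = new char[Math.max(buf.length * 2, len + extra)];
            System.arraycopy(buf, 0, bigger, 0, len);
            buf = bigger;
        }
    }

    UnsyncFormatter append(char c) {
        ensure(1);
        buf[len++] = c;
        return this;
    }

    UnsyncFormatter append(String s) {
        ensure(s.length());
        s.getChars(0, s.length(), buf, len);
        len += s.length();
        return this;
    }

    /** Appends the decimal form of value without allocating a String. */
    UnsyncFormatter append(int value) {
        if (value == Integer.MIN_VALUE) {
            return append("-2147483648"); // cannot be negated safely
        }
        ensure(11); // sign plus at most 10 digits
        if (value < 0) {
            buf[len++] = '-';
            value = -value;
        }
        int start = len;
        do { // emit digits least significant first
            buf[len++] = (char) ('0' + value % 10);
            value /= 10;
        } while (value != 0);
        for (int i = start, j = len - 1; i < j; i++, j--) { // reverse in place
            char t = buf[i];
            buf[i] = buf[j];
            buf[j] = t;
        }
        return this;
    }

    /** Appends a packed IPv4 address (most significant byte first) as a dotted quad. */
    UnsyncFormatter appendIPv4(int addr) {
        return append((addr >>> 24) & 0xFF).append('.')
              .append((addr >>> 16) & 0xFF).append('.')
              .append((addr >>> 8) & 0xFF).append('.')
              .append(addr & 0xFF);
    }

    /** Clears the contents so the same object can be reused. */
    void reset() {
        len = 0;
    }

    @Override
    public String toString() {
        return new String(buf, 0, len);
    }
}
```

Because nothing here is synchronized, each thread would need its own instance. In Java 5 and later, java.lang.StringBuilder provides the same unsynchronized behavior for plain concatenation.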

5
Performance Issue 2
  • PROBLEM
  • The PrintStream class is the standard way to write
    a string to an underlying stream. However, it
    performs locking of its own: printing a string
    causes at least 3 lock acquisitions.
  • SOLUTION
  • A BufferedDataOutputStream class was written. It is
    completely unsynchronized, placing the
    synchronization burden on the client. This
    reduces the number of lock acquisitions to 1 for
    writing a line to a download log.
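The division of labor described above, an unsynchronized stream plus a client that locks once per line, might look like the sketch below. The class names UnsyncBufferedOutput and DownloadLog and their methods are assumptions made for illustration; they are not the BufferedDataOutputStream API from the paper, and the string writer handles ASCII only.

```java
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical buffered writer that takes no locks of its own. */
final class UnsyncBufferedOutput {
    private final OutputStream out;
    private final byte[] buf;
    private int count = 0;

    UnsyncBufferedOutput(OutputStream out, int size) {
        this.out = out;
        this.buf = new byte[size];
    }

    void write(int b) throws IOException {
        if (count == buf.length) {
            flushBuffer();
        }
        buf[count++] = (byte) b;
    }

    /** Writes the low 8 bits of each character; intended for ASCII text only. */
    void writeAscii(String s) throws IOException {
        for (int i = 0; i < s.length(); i++) {
            write(s.charAt(i));
        }
    }

    void flush() throws IOException {
        flushBuffer();
        out.flush();
    }

    private void flushBuffer() throws IOException {
        if (count > 0) {
            out.write(buf, 0, count);
            count = 0;
        }
    }
}

/** Hypothetical client: one lock acquisition on this object covers a whole line. */
final class DownloadLog {
    private final UnsyncBufferedOutput out;

    DownloadLog(OutputStream raw) {
        this.out = new UnsyncBufferedOutput(raw, 8192);
    }

    synchronized void logLine(String line) throws IOException {
        out.writeAscii(line);
        out.write('\n');
    }

    synchronized void flush() throws IOException {
        out.flush();
    }
}
```

The underlying OutputStream may still lock internally when the buffer is flushed, so the figure of one acquisition applies to the common case where a line fits in the buffer.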

6
Performance Issue 3
  • PROBLEM
  • Host name resolution is a severe bottleneck
    in Mercator, caused by the implementation of
    InetAddress.getByName. It caches the results of
    previous resolutions, and this cache is
    protected by a single lock.
  • SOLUTION
  • A DNS resolver is used to perform host name
    resolutions in parallel. It uses DatagramSockets
    to issue DNS requests directly to a local name
    server.
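Issuing a DNS request over a DatagramSocket can be sketched as follows. This is an illustration under assumptions rather than Mercator's resolver: the class DnsQuery and its methods are hypothetical, host names are assumed to be ASCII, only an A-record question is built, and parsing of the response is left out. Each call opens its own socket, so concurrent lookups from different threads share no lock in this code.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.util.Arrays;

/** Hypothetical helper that talks to a name server directly over UDP. */
final class DnsQuery {
    /** Builds a standard query for the A record of host, with recursion desired. */
    static byte[] buildQuery(int id, String host) {
        String[] labels = host.split("\\.");
        int size = 12 + 1 + 4; // header + root label + QTYPE and QCLASS
        for (String label : labels) {
            if (label.isEmpty() || label.length() > 63) {
                throw new IllegalArgumentException("bad label in " + host);
            }
            size += 1 + label.length();
        }
        byte[] q = new byte[size];
        q[0] = (byte) (id >> 8); // 16-bit query ID, big-endian
        q[1] = (byte) id;
        q[2] = 0x01;             // flags 0x0100: recursion desired
        q[5] = 0x01;             // QDCOUNT = 1; other header counts stay 0
        int pos = 12;
        for (String label : labels) {
            q[pos++] = (byte) label.length(); // length-prefixed label
            for (int i = 0; i < label.length(); i++) {
                q[pos++] = (byte) label.charAt(i);
            }
        }
        q[pos++] = 0;            // root label terminates the name
        q[pos++] = 0;
        q[pos++] = 1;            // QTYPE = 1 (A)
        q[pos++] = 0;
        q[pos++] = 1;            // QCLASS = 1 (IN)
        return q;
    }

    /** Sends the query to server on port 53 and returns the raw response bytes. */
    static byte[] query(InetAddress server, String host, int id, int timeoutMs)
            throws IOException {
        byte[] q = buildQuery(id, host);
        try (DatagramSocket sock = new DatagramSocket()) {
            sock.setSoTimeout(timeoutMs);
            sock.send(new DatagramPacket(q, q.length, server, 53));
            byte[] resp = new byte[512]; // classic maximum size of a UDP DNS message
            DatagramPacket packet = new DatagramPacket(resp, resp.length);
            sock.receive(packet);
            return Arrays.copyOf(resp, packet.getLength());
        }
    }
}
```

The try-with-resources statement requires Java 7 or later; a Java 1.1 version would close the socket in a finally block instead.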

7
Performance Issue 4
  • PROBLEM
  • A Socket object, once associated with a
    network connection, cannot be re-opened on a
    different connection. This causes wasteful
    connections in a web crawler.
  • SOLUTION
  • An alternative version of BufferedInputStream
    is used. This promotes object reuse,
    avoiding wasteful new allocations.
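One way a buffered input stream can support reuse is sketched below. This is an assumption about the general technique, not the paper's class: ReusableBufferedInput and its open, read, and close methods are hypothetical names. The buffer is allocated once, and open rebinds the object to a new underlying stream.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical buffered reader that can be rebound to a new source. */
final class ReusableBufferedInput {
    private final byte[] buf;  // allocated once, reused across sources
    private InputStream in;    // current source, or null when closed
    private int pos = 0;
    private int limit = 0;

    ReusableBufferedInput(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.buf = new byte[size];
    }

    /** Binds this object to a new source and discards any buffered bytes. */
    void open(InputStream source) {
        in = source;
        pos = 0;
        limit = 0;
    }

    /** Returns the next byte as 0..255, or -1 at end of stream. */
    int read() throws IOException {
        if (in == null) {
            throw new IOException("stream not open");
        }
        if (pos == limit) { // buffer exhausted, refill from the source
            int n = in.read(buf, 0, buf.length);
            if (n <= 0) {
                return -1;
            }
            limit = n;
            pos = 0;
        }
        return buf[pos++] & 0xFF;
    }

    /** Closes the current source; the object can then be opened again. */
    void close() throws IOException {
        if (in != null) {
            in.close();
            in = null;
        }
        pos = 0;
        limit = 0;
    }
}
```

A worker thread could keep one such object for its whole lifetime and call open with the input stream of each new connection, rather than allocating a fresh buffer per download.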

8
Conclusion
  • Java is well suited to implementing a web crawler
  • The Java 1.1 core libraries were designed for ease
    of use rather than speed
  • A clean solution is to fix these problems in the
    core libraries themselves
  • All the solutions proposed here are backward
    compatible with the existing Java 1.1 libraries