HTTP for DB Dummies - PowerPoint PPT Presentation

Slides: 44
Provided by: grib7
Learn more at: https://dsf.berkeley.edu
Transcript and Presenter's Notes

1
HTTP for DB Dummies
  • Steve Gribble
  • gribble_at_cs.berkeley.edu

2
The Web
  • HTTP 1.0 model (slowly fading out, replaced by
    HTTP 1.1)

[Diagram labels: Client, Server, TCP, cache]
3
The Web
[Diagram labels: Client, Server, cache]
4
Basics of HTTP
5
Structure of a Request
<METHOD> <URL> <HTTPVERSION>\r\n
<HEADERNAME>: <HEADERVAL>\r\n
<HEADERNAME>: <HEADERVAL>\r\n
\r\n
<DATA, IF POST>

GET /test/index.html?foo=bar+baz&name=steve HTTP/1.0\r\n
Connection: Keep-Alive\r\n
User-Agent: Mozilla/4.07 [en] (X11; I; Linux 2.0.36 i686)\r\n
Host: ninja.cs.berkeley.edu:5556\r\n
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*\r\n
Accept-Encoding: gzip\r\n
Accept-Language: en\r\n
Accept-Charset: iso-8859-1,*,utf-8\r\n
\r\n
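The request layout above can be sketched in a few lines of Python. This is only an illustration of the wire format, not a full client; `build_request` is a hypothetical helper name, and only two of the headers from the slide are included to keep it short.

```python
def build_request(method, path, headers, version="HTTP/1.0", body=b""):
    """Serialize a request: request line, header lines, blank line, optional body."""
    lines = [f"{method} {path} {version}"]
    lines += [f"{name}: {value}" for name, value in headers]
    # Every line ends in CRLF; an empty line separates headers from data.
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("iso-8859-1") + body


request = build_request(
    "GET",
    "/test/index.html",
    [("Host", "ninja.cs.berkeley.edu:5556"), ("Accept-Encoding", "gzip")],
)
print(request.decode("iso-8859-1"))
```

For a POST, the form data would be passed as `body`, together with a matching Content-Length header.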
6
Structure of a Response
<HTTPVERSION> <STATUS CODE> <MSG>\r\n
<HEADERNAME>: <HEADERVAL>\r\n
<HEADERNAME>: <HEADERVAL>\r\n
\r\n
<DATA, IF NECESSARY>

HTTP/1.0 200 OK
Server: Netscape-Enterprise/2.01
Date: Thu, 04 Feb 1999 00:28:19 GMT
Accept-ranges: bytes
Last-modified: Wed, 01 Jul 1998 17:07:38 GMT
Content-length: 1848
Content-type: text/html
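Going the other way, a response can be split into status line, headers and data with plain string operations. The sketch below assumes the whole response is already in memory; `parse_response` is a hypothetical helper, and header names are lowercased on the assumption that they are case insensitive.

```python
def parse_response(raw):
    """Split raw response bytes into (version, code, reason, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body


raw = (
    b"HTTP/1.0 200 OK\r\n"
    b"Server: Netscape-Enterprise/2.01\r\n"
    b"Content-length: 5\r\n"
    b"Content-type: text/html\r\n"
    b"\r\n"
    b"hello"
)
version, code, reason, headers, body = parse_response(raw)
print(code, reason, headers["content-type"], body)
```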
7
TCP level analysis
[Diagram labels: HTTP 1.0; FTP (>2nd file)]
8
Interesting TCP gotchas
  • Mandatory roundtrips
  • TCP three-way handshake
  • get request, data return
  • new connections for each inlined image
    (parallelize)
  • lots of extra syn or syn/ack packets
  • Slow-start penalties
  • can be shown to affect only fast networks, not modems
  • Lots of TCP connections to server
  • spatial/processing overhead in server (TCP stack)
  • many protocol control block (PCB) TIME_WAIT
    entries
  • unfairness because of loss of congestion control
    info

9
Fix?
  • Persistent HTTP
  • in HTTP/1.0, add Connection: Keep-Alive\r\n
    header
  • in HTTP/1.1, P-HTTP built in
  • Does it help?
  • mostly for server-side reasons, not network
    efficiency
  • allows pipelining of multiple requests on one
    connection
  • Does it hurt?
  • how does a client know when document is returned?
  • when does the connection get dropped?
  • idle timeouts on server side
  • client drops connections
  • server needs to reclaim resources
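One common answer to "how does a client know when the document is returned?" is to rely on Content-Length. The sketch below reads two back-to-back responses from one simulated connection; `io.BytesIO` stands in for a socket, and `read_response` is a hypothetical helper, not a complete client.

```python
import io


def read_response(stream):
    """Read exactly one response from a persistent connection via Content-Length."""
    status_line = stream.readline().rstrip(b"\r\n")
    length = 0
    while True:
        line = stream.readline().rstrip(b"\r\n")
        if not line:  # blank line ends the header block
            break
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    # Without a length, the only end-of-document signal is connection close.
    return status_line, stream.read(length)


wire = io.BytesIO(
    b"HTTP/1.0 200 OK\r\nContent-Length: 5\r\n\r\nfirst"
    b"HTTP/1.0 200 OK\r\nContent-Length: 6\r\n\r\nsecond"
)
first = read_response(wire)
second = read_response(wire)
print(first, second)
```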

10
HTTP/1.0 Client Methods
  • GET
  • fetch and return a document
  • URL can be overloaded to submit form data
  • GET /foo/bar.html?x=bar&bam=baz
  • POST
  • submit a form, and receive response
  • HEAD
  • like GET, but only return HTTP headers and not
    the data itself. Useful for caching
  • PUT, DELETE, LINK, UNLINK
  • not really used - big security issues if not
    careful
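As a rough illustration of overloading a GET URL with form data, Python's standard `urllib.parse` can encode fields into a query string and decode them again on the server side. The field names and values here are made up for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def get_request_line(path, form_fields):
    """Build a GET request line that carries form fields in the query string."""
    return f"GET {path}?{urlencode(form_fields)} HTTP/1.0"


line = get_request_line("/foo/bar.html", {"x": "bar", "bam": "baz qux"})
print(line)

# Server side: pull the request target out of the line and decode the fields.
target = line.split(" ")[1]
fields = parse_qs(urlsplit(target).query)
print(fields)
```

A POST would carry the same encoded string in the request body instead of the URL.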

11
HTTP/1.0 Status Codes
  • Family of codes, with 5 types
  • 1xx informational
  • 2xx successful, e.g. 200 OK
  • 3xx redirection (gotcha: redirection loops?)
  • 301 Moved Permanently
  • 304 Not Modified
  • 4xx Client Error
  • 400 Bad Request
  • 401 Unauthorized
  • 403 Forbidden
  • 404 Not Found
  • 5xx Server Error
  • 501 Not Implemented
  • 503 Service Unavailable
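A minimal sketch of how a client might use the code families, plus one way to guard against the redirection-loop gotcha. The hop limit of 5 and the dictionary of redirects are assumptions for illustration; a real client would issue a request at each hop.

```python
STATUS_CLASSES = {
    1: "informational",
    2: "successful",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code):
    """Map a status code to its family using the first digit."""
    return STATUS_CLASSES.get(code // 100, "unknown")


def follow_redirects(start, redirects, max_hops=5):
    """Follow Location-style redirects, giving up after max_hops."""
    url, hops = start, 0
    while url in redirects:
        if hops == max_hops:
            raise RuntimeError("too many redirects, possible loop")
        url = redirects[url]
        hops += 1
    return url


print(status_class(304), status_class(503))
print(follow_redirects("/a", {"/a": "/b", "/b": "/c"}))
```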

12
HTTP/1.0 Headers (case insensitive?)
  • Allow - returned by server
  • Allow: GET, HEAD
  • never used in practice - clients know what they
    can do
  • Authorization - sent by client
  • Authorization: <credentials>
  • Basic Auth is commonly used
  • <credentials> = Base64( username:password )
  • ok if inside an SSL connection (encrypted)
  • Content-Encoding - sent by either
  • Content-Encoding: x-gzip
  • selects an encoding for the transport, not the
    content
  • sadly, no common support for encodings (Windows)
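The Basic Auth credentials above are just Base64 text, which is why they are only acceptable inside an encrypted connection. A short sketch using Python's standard `base64` module shows that anyone who sees the header can reverse it; the username and password are the usual textbook placeholders, not anything from the slides.

```python
import base64


def basic_credentials(username, password):
    """Build the value of an Authorization header for Basic Auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def decode_basic(header_value):
    """Recover username and password from a Basic Authorization value."""
    _scheme, _, token = header_value.partition(" ")
    username, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return username, password


value = basic_credentials("Aladdin", "open sesame")
print(value)
print(decode_basic(value))
```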

13
HTTP/1.0 Headers continued
  • Content-Length - sent by either
  • Content-Length: 56
  • how much payload is being sent?
  • necessary for persistent HTTP, or for POSTs
  • Content-Type - sent by server
  • Content-Type: text/html
  • what MIME type the payload is
  • nasty one: multipart/mixed
  • Date
  • Date: Tue, 15 Nov 1994 08:12:31 GMT
  • 3 accepted date formats (RFC 822, RFC 850,
    asctime())
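Accepting all three date formats can be sketched by trying one `strptime` pattern per format. The patterns below are an assumption about what each style looks like on the wire, and they rely on English day and month names (the default C locale).

```python
from datetime import datetime, timezone

HTTP_DATE_FORMATS = (
    "%a, %d %b %Y %H:%M:%S GMT",  # RFC 822 style: Tue, 15 Nov 1994 08:12:31 GMT
    "%A, %d-%b-%y %H:%M:%S GMT",  # RFC 850 style: Tuesday, 15-Nov-94 08:12:31 GMT
    "%a %b %d %H:%M:%S %Y",       # asctime() style: Tue Nov 15 08:12:31 1994
)


def parse_http_date(text):
    """Try each accepted format in turn and return an aware UTC datetime."""
    for fmt in HTTP_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognized HTTP date: {text!r}")


for sample in (
    "Tue, 15 Nov 1994 08:12:31 GMT",
    "Tuesday, 15-Nov-94 08:12:31 GMT",
    "Tue Nov 15 08:12:31 1994",
):
    print(parse_http_date(sample).isoformat())
```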

14
HTTP/1.0 headers, continued
  • Expires - sent by server
  • Expires: Thu, 01 Dec 1994 16:00:00 GMT
  • primitive caching expiration date
  • cannot force clients to update view, only on
    refresh
  • From - sent by client
  • From: gribble_at_cs.berkeley.edu
  • not really used
  • If-Modified-Since - sent by client
  • If-Modified-Since: Sat, 29 Oct 1994 19:43:31 GMT
  • server returns data if modified, else 304 Not
    Modified
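The If-Modified-Since decision on the server side might look roughly like this. `conditional_get` is a hypothetical helper; it assumes the server already knows the document's modification time as an aware datetime and that the client sent an RFC 822 style date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_get(last_modified, if_modified_since=None):
    """Return (status code, whether to send the body)."""
    if if_modified_since is not None:
        since = parsedate_to_datetime(if_modified_since)
        if last_modified <= since:
            return 304, False  # client copy is still current: headers only
    return 200, True


modified = datetime(1994, 10, 29, 19, 43, 31, tzinfo=timezone.utc)
print(conditional_get(modified, "Sat, 29 Oct 1994 19:43:31 GMT"))
print(conditional_get(modified, "Fri, 28 Oct 1994 19:43:31 GMT"))
```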

15
HTTP/1.0 headers, cont
  • Last-Modified - returned by server
  • Last-Modified: Sat, 29 Oct 1994 19:43:31 GMT
  • semantically imprecise - file modification?
    Record timestamp? Date in case file dynamically
    generated?
  • used with If-Modified-Since and HEAD method
  • Location - returned by server
  • Location: http://www.cs.ubc.ca
  • used in case of 3xx redirections
  • Pragma - sent by client or server
  • Pragma: no-cache
  • extensibility mechanism. No-cache is the only
    popularly used pragma, AFAIK

16
HTTP/1.0 headers, cont
  • Referer - sent by client
  • Referer: http://www.xxx-smut.com
  • specifies address from which request was
    generated
  • all sorts of privacy issues - must be careful
    with this
  • Server - returned by server
  • Server: Netscape-Enterprise/2.01
  • identifies server software. why? (measurement)
  • User-Agent - sent by client
  • User-Agent: Mozilla/4.07 [en] (X11; I; Linux
    2.0.36 i686)
  • identifies client software
  • why? Optimize layout, send based on capability of
    client.
  • Hint: just pretend to be Netscape. MSIE does.

17
HTTP/1.0 Server headers
  • WWW-Authenticate - sent by server
  • WWW-Authenticate: <challenge>
  • tells client to resend request with
    Authorization header
  • Incrementally added hacks
  • Accept: image/gif, image/jpeg, text/*, */*
  • Accept-Encoding: gzip
  • Accept-Language: en
  • Retry-After: (date) or (seconds)
  • Set-Cookie: Part_Number="Rocket_Launcher_0001";
    Version="1"; Path="/acme"
  • Title: (title)

18
HTTP/1.1 Additions
  • Lots of problems associated with HTTP/1.0
  • the network problems we talked about before
  • very poor cache consistency models
  • difficulty implementing multi-homed servers
  • want 1 IP address with multiple DNS names - how?
  • hard to precalculate content-lengths
  • connection dropped = lost data
  • no chunking
  • HTTP/1.1 is bloated spec to fix these problems
  • introduces many complexities
  • no longer an easy protocol to implement

19
HTTP/1.1 - a Taste of the New
  • Host: www.ninja.com
  • clients MUST send this - fixes multi-homed
    problem
  • already in most 1.0 and 1.1 clients
  • Range: bytes=300-304,601-993
  • useful for broken connection recovery (like FTP
    recovery)
  • Age: <seconds, date>
  • expiration from caches
  • ETag: fa898a3e3
  • unique tag to identify document (strong or weak
    forms)
  • Cache-Control: <command>
  • marking documents as private (don't keep in
    caches)
  • chunked transfer encoding
  • segmenting of documents - don't have to calculate
    entire document length. Useful for dynamic query
    responses.
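To make the chunked transfer encoding point concrete, here is a small sketch of the framing: each chunk is a hexadecimal length line, the data, and a CRLF, and a zero-length chunk ends the body. It ignores trailers and treats chunk extensions as something to skip, so it is a simplification rather than a complete implementation.

```python
def encode_chunked(pieces):
    """Frame an iterable of byte strings as a chunked body."""
    out = bytearray()
    for piece in pieces:
        if piece:  # a zero-length chunk would end the body early
            out += f"{len(piece):x}\r\n".encode("ascii") + piece + b"\r\n"
    out += b"0\r\n\r\n"  # last chunk, no trailers
    return bytes(out)


def decode_chunked(data):
    """Reassemble the document from a chunked body."""
    body = bytearray()
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        size = int(data[pos:line_end].split(b";")[0], 16)
        if size == 0:
            return bytes(body)
        start = line_end + 2
        body += data[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data


wire = encode_chunked([b"hello ", b"world"])
print(wire)
print(decode_chunked(wire))
```

The sender never needs the total length up front, which is what makes this useful for dynamically generated query results.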

20
Architectural Complexities
21
Caches
[Diagram labels: Client, Server, TCP, cache; caption: Original web]
  • Problem: no locality
  • non-local access pattern (trans-atlantic access)
  • servers serving the same bytes millions of times
    to localized communities of users

22
Solution: Cache Hierarchy
[Diagram labels: Client, Server, several Caches in a hierarchy]
  • NLANR cache hierarchy most widely developed
  • informally uses Squid cache
  • root servers squirt out 30GB per day
  • anybody can join...

23
Gotchas
  • Staleness
  • HTTP/1.1 cache consistency mechanisms mostly
    solve this
  • Security
  • what happens if I infiltrate a cache?
  • servers/clients don't even know this is happening
  • e.g. AOL used to have a very stale cache, but
    has since moved to Inktomi
  • Ad clickthrough counts
  • how does Yahoo know how many times you accessed
    their pages, or more importantly, their ads?

24
CGI-BIN gateways
[Diagram labels: Client, httpd, CGI code, File System, cache; URL and data flows]
  • CGI: Common Gateway Interface
  • interface that allows independent authors to
    develop code that interacts with web servers
  • dynamic content generation, especially from
    scripts
  • CGI programs execute in separate process,
    typically

25
CGI-BIN to DB gateways
[Diagram labels: Client, httpd, CGI code, ODBC / JDBC / etc., DB, File System, cache; URL and data flows]
  • JDBC/ODBC gateways
  • single-node DB, often running on remote host
  • long, blocking operations, usually
  • nasty transactional issues - how does client know
    that action succeeded or failed?
  • Datek/ETrade troubles

26
cgi-bin security
  • Lots of gotchas with CGI-BIN programs
  • buffer overflows (maximum length checks?)
  • shell metacharacter expansion
  • what happens if you put
  • cat /etc/passwd
  • in a form field?
  • sending mail, reading files
  • redirection - allows bypassing IP address-based
    security
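The shell metacharacter problem can be illustrated without running anything. The sketch below only builds command strings; the `mail` command line and the form field value are hypothetical, and `shlex.quote` from the Python standard library is one possible mitigation, not the only one.

```python
import shlex


def unsafe_command(recipient):
    """Naive string concatenation: the form field reaches the shell unescaped."""
    return "mail -s feedback " + recipient


def safer_command(recipient):
    """Quote the field so the shell treats it as a single literal argument."""
    return "mail -s feedback " + shlex.quote(recipient)


field = "nobody@example.com; cat /etc/passwd"
print(unsafe_command(field))  # a shell would run cat as a second command
print(safer_command(field))   # the whole field stays one argument
```

Passing an argument list to the program directly, with no shell involved at all, avoids the expansion step entirely.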

27
Multiple server support
  • We've seen how a single IP address can serve
    multiple web sites with the HTTP/1.1 Host field
  • what about having multiple physical hosts serving
    a single web site?
  • useful for scalability reasons

[Diagram labels: Client, TCP, four Servers, cache; site: www.hotbot.com]
28
Solutions
  • DNS round-robin
  • assign multiple IP addresses to single domain
    name
  • client selects amongst them in order
  • shortcomings
  • exposes individual nodes to clients
  • can't take into account machine capabilities
    (multiprocessors) and currently experienced load
  • Front-end redirection
  • single front-end node serves HTTP redirect to
    selected backend node
  • introduces extra round-trip, FE is single point
    of failure
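DNS round-robin can be mimicked with a toy resolver that hands out addresses in rotation. The class name, host name and addresses below are invented for the example (the addresses come from a documentation-only range); note that the rotation knows nothing about load or machine capability, which is the shortcoming listed above.

```python
from itertools import cycle


class RoundRobinResolver:
    """Toy resolver: each lookup returns the next address for the name."""

    def __init__(self, records):
        self._next = {name: cycle(addrs) for name, addrs in records.items()}

    def resolve(self, name):
        return next(self._next[name])


resolver = RoundRobinResolver(
    {"www.example.com": ["192.0.2.1", "192.0.2.2", "192.0.2.3"]}
)
answers = [resolver.resolve("www.example.com") for _ in range(4)]
print(answers)
```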

29
More solutions
  • IP-level multiplexing through smart router
  • munge IP packets and send them to selected host
  • Cisco, SUN, etc. make hardware to do this
  • Cisco LocalDirector
  • tricky state management issues, failure semantics
  • Smart Clients
  • Netscape Proxy Autoconfig (PAC) mechanism
  • only useful if connecting via proxy
  • Javascript selects from amongst proxies
  • No HTTP protocol support for smart client access
    to web servers

30
The Real Picture of the Web
[Diagram labels: Client, cache / firewall, Redirector, four HTTP Servers (each marked "I"), CGI code, DB, cache; URL and data flows; site: www.nytimes.com]
31
Web Characteristics
32
UCB HIP trace
  • Web traffic circa 1997 is primarily
  • GIF data
  • 27% of bytes transferred, 51% of files
    transferred
  • average size 4.1 KB
  • JPEG data
  • 31% of bytes transferred, 16% of files
    transferred
  • average size 12.8 KB
  • HTML data
  • 18% of bytes transferred, 22% of files
    transferred
  • average size 5.6 KB
  • File sizes, server latency, access patterns
  • all heavy-tailed: most small, but some very large
  • self-similarity everywhere - lots and lots of
    bursts

33
Server-Side Architecture
34
Goals of server
  • High capacity web servers must do the following:
  • rapidly update corpus of content served
  • be efficient
  • latency: serve content as quickly as possible
  • throughput: parallel requests from large numbers
    of clients
  • be extensible
  • data-types
  • cgi-bin programs
  • server plug-ins
  • not crash
  • remain secure

35
High-level Architecture
Filesystem cache
Network handler
Concurrency subsystem
Protocol parser
CGI interface
36
Concurrency
  • How many simultaneously open connections must a
    server handle?
  • 1,000,000 hits per day
  • 12 hits per second average
  • upwards of 50 hits per second peak (bursts,
    diurnal cycle)
  • latency
  • 10 milliseconds (out of memory) => 1 connection
  • 50 milliseconds (off of disk) => 3 connections
  • 200 milliseconds (CGI + disk) => 10 connections
  • 5 seconds (CGI to DB gateway) => 250 connections
  • Depending on expected usage, need very different
    concurrency models
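The connection counts on this slide appear to follow from multiplying the peak arrival rate by the per-request latency (Little's law) and rounding up. The sketch below reproduces the arithmetic under that assumption; `concurrent_connections` is a hypothetical helper name.

```python
import math


def concurrent_connections(hits_per_second, latency_ms):
    """Average number of requests in flight = arrival rate x time in system."""
    return math.ceil(hits_per_second * latency_ms / 1000)


average_rate = round(1_000_000 / 86_400)  # hits per day spread over a day
peak_rate = 50                            # peak figure taken from the slide

print("average hits per second:", average_rate)
for label, latency_ms in (
    ("out of memory", 10),
    ("off of disk", 50),
    ("CGI + disk", 200),
    ("CGI to DB gateway", 5000),
):
    print(label, concurrent_connections(peak_rate, latency_ms))
```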

37
Strategies
  • Single process, single thread, serialized
  • simplest implementation, worst performance
  • perfectly fine for low traffic sites
  • Multiple processes, single serialized thread /
    process
  • Apache web server model
  • expensive (context switching, process state, ...)
  • Multithreaded and multiprocess
  • complex synchronization primitives needed
  • thread creation/destruction vs. thread pool
    management
  • Event driven, asynchronous I/O
  • eliminates context switch overhead, better memory
    mgmt
  • very complex and delicate program flow
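As one small illustration of the thread pool option, Python's `concurrent.futures` can hand each request to a fixed set of worker threads instead of creating and destroying a thread per request. The `handle` function is a stand-in for real request processing, and the pool size of 4 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def handle(path):
    """Stand-in for parsing a request and producing a response."""
    return f"HTTP/1.0 200 OK\r\n\r\nyou asked for {path}"


def serve_all(paths, workers=4):
    """Process requests on a reusable pool; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle, paths))


responses = serve_all(["/index.html", "/a.gif", "/b.gif"])
print(len(responses), responses[1].splitlines()[-1])
```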

38
Disk I/O
  • File system overhead
  • file system buffer management not optimal
  • don't need many of the file system facilities
  • modifying files, moving files, locking files,
    seeks
  • Alternatives
  • directly interact with disk
  • very fast, very complex
  • in-memory caching on top of file system
  • works well given high locality of server access
  • be careful to not suffer from double-buffering
  • Interaction between thread subsystem and disk
  • balanced system - enough threads to saturate disk
    I/O

39
Network I/O
  • Typical server behaviour is rough on the network stack
  • multiple outstanding connections
  • very rapid TCP creation and teardown
  • often, very slow last-hop network segment
  • Redundant operations performed
  • checksum calculations, byte swapping, ...
  • Inefficiencies at packet level
  • header, body, FIN: usually three separate
    round-trips
  • Poor network stack implementations
  • TIME_WAIT and IDLE PCB entries on single linked
    list
  • Nagle's algorithm invoked when it shouldn't be

40
Inline scripting
  • Technology: server-side includes (SSIs)
  • script embedded inside content, interpreted
    before sent back to client
  • dynamically computed content inside templates
  • authorization (cert lookup or authentication)
  • DB lookup (inventory lists, product prices, ...)
  • Challenges
  • similar to CGI
  • security
  • efficiency (latency and throughput)

41
Cheetah (Exokernel)
  • Direct access to hardware primitives
  • disk, network - eliminate costly OS
    generalizations
  • scatter/gather IO primitives
  • allow for common disk/network buffers (eliminate
    copy)
  • Compiler-assisted ILP
  • eliminate redundancies, staging inefficiencies
  • HTTP-specialized network stack and file system
  • precomputed HTTP headers, minimal copies
  • minimize network packets (e.g. piggyback FINs with
    data)
  • precomputed TCP/IP checksums

42
Some Parting Thoughts
43
Other things to keep in mind
  • There are non-humans on the web
  • spiders, crawlers, worms, etc, may behave badly
  • infinite FTP directory traps, request bursts, ...
  • Netscape, MSIE, and Apache set de facto standards
  • their semantics may subtly differ from standards
  • error-tolerance of popular clients/servers means
    that everybody must achieve same levels of
    tolerance
  • otherwise, you appear to be broken to users
  • e.g. Netscape not parsing comments properly
  • SSL/X.509
  • transport-level security fixes up basic auth
    problems
  • eliminates caching or proxy mechanisms