Title: Continuous Online Extraction of HTTP Traces from Packet Traces
1 Continuous Online Extraction of HTTP Traces from Packet Traces
Anja Feldmann
anja_at_research.att.com
AT&T Labs-Research
Florham Park, NJ
2 Monitoring User Accesses to the Web
- Via modified Web Browsers
- Through Web server logs
- Through Web proxy logs
- From the wire via packet monitoring
- Passive monitoring (oblivious to users)
- No impact on network performance
- Capture TCP and HTTP events
- Potential to compute statistics about or collect downloaded pages
- Related work: IBM, Berkeley, Virginia Tech
3 Web Page Access Details
[Figure: timeline of a Web page access; labels: client, server, time]
4 From User Requests to Packets
5 Challenges of HTTP Trace Extraction
- Packets cannot be processed in isolation
- End systems cannot be throttled
- End systems may not comply with TCP/HTTP spec.
- Arbitrary fragments of Web pages and headers per packet
- HTTP header may be spread across 10 packets
- Retransmitted data may be fragmented differently
- Use of TCP connections by HTTP
- TCP connections may be terminated at any point
- HTTP GET requests may contain data
- Demultiplexing of packets to HTTP transactions
- Packet sniffer may lose packets (incl. TCP connection packets)
- Sanity checks on extracted information can fail
- Inaccurate content length
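
Two of the challenges above (a header spread over several packets, and an inaccurate content length) can be illustrated with a minimal Python sketch. This is not the code from the talk, which was written in C. The class name HeaderAccumulator, the helper content_length, and the 16 KB size cap are assumptions made for illustration only. The sketch expects payload bytes that are already in TCP sequence order.

```python
class HeaderAccumulator:
    """Collects in-order payload bytes until a complete HTTP header is seen."""

    TERMINATOR = b"\r\n\r\n"  # blank line that ends an HTTP/1.x header

    def __init__(self, max_header_bytes=16384):
        self.buffer = bytearray()
        self.max_header_bytes = max_header_bytes  # assumed safety cap

    def feed(self, payload):
        """Add one segment's payload.

        Returns (header, body_prefix) once the header is complete,
        otherwise None. Searching the whole buffer also handles a
        terminator that is itself split across two segments.
        """
        self.buffer.extend(payload)
        end = self.buffer.find(self.TERMINATOR)
        if end == -1:
            if len(self.buffer) > self.max_header_bytes:
                raise ValueError("header exceeds assumed size cap")
            return None
        split = end + len(self.TERMINATOR)
        header = bytes(self.buffer[:split])
        body_prefix = bytes(self.buffer[split:])
        self.buffer.clear()
        return header, body_prefix


def content_length(header):
    """Return the declared Content-Length, or None if absent or malformed."""
    for line in header.split(b"\r\n")[1:]:  # skip the request/status line
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"content-length":
            try:
                return int(value.strip().decode("ascii"))
            except ValueError:
                return None  # sanity check failed; caller should log it
    return None
```

The declared length is only a claim by the end system. A monitor can compare it with the number of body bytes actually observed and record a mismatch rather than trust either value.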
6 Software Design Goals
- Continuous traces
- Avoid unnecessary I/O
- High speed transmission medium
- Software needs to be robust toward packet losses
- Deployable anywhere in the network
- Handle asymmetric routing
- Implications
- Memory resident Software and Data
- Priority towards packet sniffing over packet processing
- TCP connections cannot be used as demultiplexing unit
- Offline matching of HTTP requests and responses
7 Hardware Design: AT&T Labs PacketScope
[Figure: PacketScope hardware; labels: Measured Link, 500-MHz Alpha Workstation, 10-GB RAID, 140-GB Tape Loader, Router / Terminal Server, Out-of-band Communication]
8 Software: Separation of Tasks
- Packet sniffing
- Tcpdump (with IP address encryption)
- Output files of 10,000,000 packets
- Control script
- Perl script
- Action: takes files from packet sniffing and runs header extraction
- HTTP header extraction
- C code (based on Tcpdump)
- Output: log files with HTTP and TCP events, packet header files
- HTTP header matching
- C code
- Output: log files with matched TCP/HTTP requests and responses
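
The offline matching step can be sketched as follows. This is a minimal Python illustration, not the C implementation from the talk. The record layout (dicts with the keys src, sport, dst, dport, ts) and the function name match_transactions are assumptions. The pairing rule, first unmatched request on the reversed flow that precedes the response, is a simplification chosen for the example.

```python
from collections import defaultdict, deque


def match_transactions(requests, responses):
    """Pair request and response log records after the fact.

    Both arguments are lists of dicts with the assumed keys
    src, sport, dst, dport, ts. Returns (matched, unmatched_responses,
    unmatched_requests).
    """
    pending = defaultdict(deque)
    for req in sorted(requests, key=lambda r: r["ts"]):
        key = (req["src"], req["sport"], req["dst"], req["dport"])
        pending[key].append(req)

    matched, unmatched_responses = [], []
    for resp in sorted(responses, key=lambda r: r["ts"]):
        # A response travels server -> client, so reversing its
        # addresses and ports gives the key of the request flow.
        key = (resp["dst"], resp["dport"], resp["src"], resp["sport"])
        queue = pending.get(key)
        if queue and queue[0]["ts"] <= resp["ts"]:
            matched.append((queue.popleft(), resp))
        else:
            unmatched_responses.append(resp)

    unmatched_requests = [r for q in pending.values() for r in q]
    return matched, unmatched_responses, unmatched_requests
```

Because the two directions are logged independently, this step still works when only one direction of a connection is visible at the monitor: the other side simply ends up in one of the unmatched lists.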
9 Control Flow
[Figure: control flow; labels: Packet Sniffing, Control Script, HTTP header extraction]
10 HTTP Header Extraction: Basic Steps
- Reconstruction of packet sequence
- Demux packets according to Flows
- Reorder packets according to TCP sequence numbers
- Eliminate duplicate packets
- Identify missing packets
- Extraction of information
- Extract TCP and HTTP timestamp information
- Extract HTTP header info and body parts
- Summarize data part, e.g., length, sequence number
- Discard HTTP data part
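
The reconstruction steps above (reorder by TCP sequence number, eliminate duplicates, identify missing packets) can be sketched in Python for one unidirectional flow. This is an illustrative sketch under stated assumptions, not the original C code: the function name reconstruct and the (seq, payload) tuple layout are hypothetical, and sequence number wraparound is ignored for brevity.

```python
def reconstruct(segments, initial_seq):
    """Rebuild the byte stream of one unidirectional flow.

    segments: iterable of (seq, payload) pairs in arrival order.
    initial_seq: sequence number of the first expected data byte.
    Returns (chunks, gaps): chunks is an in-order list of
    (seq, new_bytes); gaps is a list of (start, end) ranges that
    were never observed.
    """
    chunks, gaps = [], []
    next_seq = initial_seq
    for seq, payload in sorted(segments, key=lambda s: s[0]):
        if not payload:
            continue  # only data-bearing segments matter here
        end = seq + len(payload)
        if end <= next_seq:
            continue  # pure duplicate: every byte already seen
        if seq > next_seq:
            gaps.append((next_seq, seq))  # missing packet(s)
        elif seq < next_seq:
            # Retransmission fragmented differently: keep only new bytes.
            payload = payload[next_seq - seq:]
            seq = next_seq
        chunks.append((seq, payload))
        next_seq = end
    return chunks, gaps
```

Trimming partial overlaps, rather than dropping whole packets, covers the case noted earlier where retransmitted data is fragmented differently from the original transmission.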
11 HTTP Header Extraction (cont.)
[Figure: per-flow data structure (list of packets), Extract HTTP, HTTP log file, and Cleanup, which runs periodically to age flows]
12 HTTP Header Extraction
- Data structure
- Indexed by unidirectional IP flows (for demultiplexing packets)
- Per flow: list of packets and partially extracted HTTP information
- Extraction
- Execute basic steps for a list of packets
- Cleanup
- Triggers execution of Extraction
- if more than a fixed number of packets have been received
- and after a fixed timeout
- Runs after processing a fixed number of packets
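
The data structure and cleanup pass described above can be sketched as a small Python class. The names FlowKey and FlowTable, the parameter defaults, and the choice to trigger extraction when either condition holds (packet count exceeded or timeout reached) are assumptions for illustration; the slides do not specify these details, and the per-flow partial HTTP state is omitted here.

```python
from collections import namedtuple

# One key per direction, so the table never depends on seeing both
# halves of a TCP connection.
FlowKey = namedtuple("FlowKey", "src sport dst dport")


class FlowTable:
    def __init__(self, extract, max_packets=64, timeout=30.0,
                 cleanup_every=1000):
        self.flows = {}  # FlowKey -> {"packets": [...], "last_seen": ts}
        self.extract = extract  # callback(key, packets)
        self.max_packets = max_packets
        self.timeout = timeout
        self.cleanup_every = cleanup_every
        self.seen = 0

    def add_packet(self, key, ts, packet):
        flow = self.flows.setdefault(key, {"packets": [], "last_seen": ts})
        flow["packets"].append(packet)
        flow["last_seen"] = ts
        self.seen += 1
        # Cleanup runs after a fixed number of processed packets.
        if self.seen % self.cleanup_every == 0:
            self.cleanup(ts)

    def cleanup(self, now):
        for key in list(self.flows):
            flow = self.flows[key]
            expired = now - flow["last_seen"] >= self.timeout
            too_long = len(flow["packets"]) > self.max_packets
            if flow["packets"] and (expired or too_long):
                self.extract(key, flow["packets"])
                flow["packets"] = []
            if expired:
                del self.flows[key]  # age out idle flows
```

Driving cleanup from the packet count rather than from a timer keeps the whole pipeline single-threaded and memory resident, which fits the stated goal of giving packet sniffing priority over processing.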
13 Trace Environment: AT&T Labs PacketScope
Network
- Trace data
- TCP protocol events
- HTTP protocol events
- HTTP response, request headers
- URL, response codes, content length, ...
- Data length