Transcript and Presenter's Notes

Title: SCANLite: Enterprisewide analysis on the cheap


1
SCAN-Lite: Enterprise-wide analysis on the cheap
  • Craig Soules, Kimberly Keeton, Brad Morrey

2
Enterprise information management
  • Search
  • Clustering
  • Provenance
  • Classification
  • IT Trending
  • Virus scanning

(Diagram: Metadata Server)
3
Enterprise information management
(Diagram: Metadata Server)
Data is duplicated across machines! Duplicate analysis is wasted work.
4
Issues
  • Analysis programs conflict on clients
  • Contend for system resources (memory, disk)
  • Clients repeat work
  • Duplicate files on multiple clients
  • Client foreground workloads are impacted
  • Work exceeds available idle time on busy clients

5
Approaches
  • Reduce resource contention

6
Approaches
  • Avoid duplicate work

7
Approaches
  • Leverage duplication to balance client load
  • Delay analysis to identify all duplicates

(Diagram: Clients and Global Scheduler)
8
Solutions
  • Local scheduler
  • Coordinates analyses to reduce resource
    contention
  • Up to 60% improvement
  • Global scheduler
  • Identifies duplicates to remove work
  • Balance load
  • 40% reduction in impact to foreground tasks

9
Local scheduling
  • Traditionally, analyses are separate programs
  • Scheduling left to the operating system
  • Potentially at different times
  • Each program identifies files to scan
  • Each program opens and reads file data

10
Unified local scheduling
  • Each analysis routine is a separate thread
  • Control thread manages shared tasks
  • Identify files to scan, and open/read file data
  • Shared memory buffer distributes file data

(Diagram: Control Thread, Disk, Shared Memory)
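As an illustration of this structure, here is a minimal producer/consumer sketch (hypothetical names; Python threads and per-routine queues stand in for SCAN-Lite's shared memory buffer): the control thread reads each file once and hands the same data to every analysis routine.

    import threading, queue, hashlib

    NUM_ROUTINES = 2
    queues = [queue.Queue(maxsize=8) for _ in range(NUM_ROUTINES)]

    def control_thread(paths):
        # Shared tasks: identify files to scan, open and read each one exactly once.
        for path in paths:
            with open(path, 'rb') as f:
                data = f.read()
            for q in queues:              # distribute the same file data to every routine
                q.put((path, data))
        for q in queues:                  # signal end of work
            q.put(None)

    def analysis_thread(q, analyze):
        # Each analysis routine runs in its own thread and consumes from its queue.
        while True:
            item = q.get()
            if item is None:
                break
            analyze(*item)

    routines = [lambda path, data: print('size', path, len(data)),
                lambda path, data: print('sha1', path, hashlib.sha1(data).hexdigest())]
    workers = [threading.Thread(target=analysis_thread, args=(q, r))
               for q, r in zip(queues, routines)]
    for w in workers:
        w.start()
    control_thread(['example.txt'])       # hypothetical input file
    for w in workers:
        w.join()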
11
Local scheduling performance
  • Ran a fitness test using 7 analysis routines
  • 42 data sets, each containing files of a fixed
    size
  • Ran both approaches over each data set
  • Calculated per-file elapsed scan time
  • Dual-core 2.8GHz P4 Xeon, 4GB RAM, 70GB RAID 1
  • Seven-at-once
  • Run each analysis routine separately at the same
    time
  • Unified
  • SCAN-Lite's unified local scheduling approach

12
Elapsed time vs. CPU time
  • Original fitness test used CPU time
  • Gave less variable performance curves for
    modeling
  • Disk contention shows up in elapsed time
  • CPU time is multiplexed
  • Elapsed time is not

This is very bad
13
Local scheduling results
17-60% improvement
Seven-at-once benefits from deep disk queues, but
this hurts foreground apps
Small random I/Os have worse interaction than
larger ones
14
Global scheduler
  • Two goals
  • Reduce additional work from duplicate files
  • Utilize duplication to schedule work to the
    best client
  • Two-phase scanning
  • Phase one: identify duplicate files using content
    hashing
  • Phase two: analyze one copy at the appropriate
    client
  • Delaying between phase one and two provides
    opportunity for additional duplication and
    deletion
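A minimal sketch of phase one under these assumptions (names are mine, not the system's API): each client computes a SHA-1 content hash of its new files, and the server groups identical hashes so that phase two analyzes only one copy per group.

    import hashlib
    from collections import defaultdict

    def hash_file(path, chunk_size=1 << 20):
        # Phase one on the client: content hash used to detect duplicates.
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        return h.hexdigest()

    def group_duplicates(reports):
        # reports: (client, path, digest) tuples uploaded each scheduling period.
        groups = defaultdict(list)
        for client, path, digest in reports:
            groups[digest].append((client, path))
        # Phase two schedules the analysis of one (client, path) per digest.
        return groups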

15
Traditional scanning
Clients
Server
16
Phase one: Duplicate detection
17
Phase two: Scheduling
18
When to schedule
  • Clients upload hashes each scheduling period
  • The freshness delay specifies a deadline by which
    new data must be analyzed

(Timeline diagram: scheduling periods along a time axis; the data must be scheduled before its deadline period; scheduling early gives one placement option, scheduling later gives three options)
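As a rough sketch of that timing (my own convention for numbering periods, not necessarily the paper's): data created at time t with freshness delay F must be analyzed by t + F, so the global scheduler can defer it until the last scheduling period that still begins before that deadline.

    import math

    FRESHNESS = 3.0     # days until new data must be analyzed (assumed value)
    PERIOD = 1.0        # days between hash uploads / scheduling runs (assumed value)

    def last_schedulable_period(create_time):
        deadline = create_time + FRESHNESS
        # Periods are numbered 0, 1, 2, ... and period n begins at n * PERIOD;
        # return the last period that still begins before the deadline.
        return math.ceil(deadline / PERIOD) - 1

    # Data created at day 2.5 must be analyzed by day 5.5, so it can be
    # deferred until the period that begins at day 5.
    assert last_schedulable_period(2.5) == 5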
19
How to schedule
  • Scheduling is a bin packing problem
  • Files are balls, clients are bins
  • Size of bins is available idle time
  • Color of balls/bins corresponds to the location of
    duplicates
  • Size of balls is time required for analysis
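One way to represent that analogy in code (illustrative only; the talk does not describe the scheduler's data structures at this level):

    from dataclasses import dataclass, field

    @dataclass
    class Ball:                  # a file awaiting analysis
        size: float              # estimated analysis time
        colors: frozenset        # names of clients holding a duplicate of this file

    @dataclass
    class Bin:                   # a client
        name: str
        capacity: float          # available idle time this scheduling period
        balls: list = field(default_factory=list)

        def fits(self, ball):
            # A ball only fits a bin of a matching color (a client that holds
            # a copy) with enough idle time left for the analysis.
            used = sum(b.size for b in self.balls)
            return self.name in ball.colors and used + ball.size <= self.capacity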

20
How to schedule
  • We use a greedy heuristic for scheduling
  • Consider idle time and machine priorities
  • See paper for details

21
Work ahead
  • Start by scheduling all work that meets freshness
  • Schedule additional work on still idle machines
  • Any remaining idle time can be used for
    additional work
  • We refer to this as work ahead

22
Two-phase scanning: Trade-offs
23
Two-phase scanning: Trade-offs
24
Two-phase scanning: Trade-offs
  • If cost of hashing exceeds the additional work
    from duplicates, then one-phase scanning is
    better
  • Analysis of hashing costs using SHA-1 indicates
    that 3% data duplication is the minimum
  • Do we see that in practice?
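A back-of-the-envelope version of that comparison (the per-file costs below are assumptions for illustration, not the paper's measurements): one-phase scanning analyzes every copy, while two-phase scanning hashes every copy but analyzes only one.

    def one_phase_cost(copies, analysis_cost):
        # Traditional scanning: every copy is analyzed on its own client.
        return copies * analysis_cost

    def two_phase_cost(copies, analysis_cost, hash_cost):
        # Two-phase scanning: hash every copy, analyze just one of them.
        return copies * hash_cost + analysis_cost

    # Assumed relative costs: SHA-1 hashing at ~3% of full analysis time.
    analysis, hashing = 1.0, 0.03
    for copies in (1, 2, 5):
        print(copies, one_phase_cost(copies, analysis),
              two_phase_cost(copies, analysis, hashing))
    # Two-phase scanning loses slightly on unique files but wins quickly
    # once files have more than one copy.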

25
Duplication in enterprise data
  • Examined two data sources
  • 100 user home directories from a central server
  • 12 user productivity machines
  • In both datasets, saw 10% duplication
  • Even more with system files, email servers,
    sharepoints, etc.
  • This is sufficient duplication for work reduction

(Chart annotation: 4/7 duplication)
26
Global scheduling policies
  • Traditional
  • One-phase scanning, scan all copies
  • Rand
  • Two-phase scanning, random scheduling
  • BestPlace
  • Two-phase scanning, greedy scheduling
  • BestPlaceTime
  • Two-phase scanning, greedy scheduling with work
    ahead
  • Opt
  • Unreplicated data only, delayed work ahead

27
Metrics
  • Total Work
  • Total elapsed time spent on analysis and hashing
  • Client Impact
  • Time spent that exceeded client idle time

28
Metrics
  • Metrics calculated for each day
  • Summed over the entire simulation period
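A sketch of how these metrics could be computed in a simulator (variable names are mine): per-client analysis-plus-hashing time and idle time are tallied for each day, then summed over the run.

    def daily_metrics(work, idle):
        # work, idle: {client: hours of analysis+hashing / idle hours} for one day.
        total_work = sum(work.values())
        client_impact = sum(max(0.0, w - idle.get(c, 0.0)) for c, w in work.items())
        return total_work, client_impact

    def summed_metrics(days):
        # days: one (work, idle) pair per simulated day.
        per_day = [daily_metrics(w, i) for w, i in days]
        return sum(t for t, _ in per_day), sum(c for _, c in per_day)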

29
Experimental setup
  • Implemented a simulator to test a variety of
    machine configurations and scheduling policies
  • Config: 50 high priority blades, 50 low priority
    laptops
  • Blades were modeled after
  • Dual-core 2.8GHz P4 Xeon, 4GB RAM, 70GB RAID 1
  • Laptops were modeled after
  • 2GHz Pentium M, 1.5GB RAM, 60GB SATA
  • Simulated 30 days
  • Daily creation rates and layouts from traced
    workloads
  • Freshness of 3 days, scheduling period of 1 day

30
Total work
Prefers faster blade machines over laptops,
increasing their total work to reduce client
impact
Doing work ahead of the freshness delay means
analyzing files that would have been deleted
Removes duplicate work, reducing the total work
done
31
Client impact
By doing work ahead of the freshness deadline,
SCAN-Lite takes better advantage of idle time
Choosing the best place helps hit the idle time
targets, reducing average client impact
Less work means less impact
Theoretical OPT only 8% better than BestPlaceTime
32
Summary
  • Reducing local scanning interference is critical
  • 17-60% improvement from reduced contention
  • Two-phase scanning reduces analysis overheads
  • Reduce total work to near single-copy costs
  • Reduced client impact by up to 40% on our workload

33
Future work
  • This is an initial system for reducing analysis
    costs
  • Many improvements remain!
  • Vary freshness delays
  • Different applications may have different
    requirements
  • Provide freshness and scan priorities to clients
  • Could prioritize scan order to not exceed client
    idle times
  • Try more workloads
  • May need better bin packing algorithms

34
Summary
  • Ever increasing number of analyses in the
    enterprise
  • Search, provenance, trending, clustering,
    classification, etc.
  • Local scheduling to reduce resource contention on
    clients
  • Up to 60% performance improvement
  • Two-phase scanning to reduce work and balance
    load
  • Delay analysis to identify duplicate work
  • Global scheduling to balance load
  • Reduced client impact by up to 40% on our workload

35
Getting a handle on enterprise data
  • Unstructured information growing at XX per year
  • Increasing number of needs for metadata
  • eDiscovery
  • Worker productivity and search
  • IT trending and historical analysis
  • Lots of different analyses to perform
  • Term vectors, fingerprints, feature vectors,
    usage statistics, etc.
  • Data is spread across file servers, web servers,
    email servers, laptops, desktops, backups, etc.

36
Where to perform analysis?
  • On backups?
  • Not all data is backed up, encrypted, utilized
  • On idle servers?
  • Requires data migration strategies, may break
    privacy
  • On end nodes?
  • May interrupt foreground workloads, frustrate
    users
  • All approaches benefit from minimizing work and
    balancing load to reduce required resources

37
The problems
  • Most analysis tools run in isolation
  • Tools compete for resources locally, create
    interference
  • Replicated data creates replicated work
  • Tools produce the same results in multiple
    locations
  • Machines have different characteristics
  • Creation rates, performance, idle time, etc.
  • Goal: perform analysis at the best time and place

38
Best place and time?
(Diagram: machines A, B, C, D)
39
Solution: Improve scheduling
  • Local scheduler to coordinate analysis tasks
  • Single resource controller to prevent competition
  • Global scheduler to single-instance analysis
  • Centralize decision of when and where to analyze

40
Local scheduling
  • Prefetch thread reads data from disk once
  • Analysis routines run in separate parallel
    threads
  • Shared memory buffer distributes data to routines

(Diagram: Prefetch Thread, Producer/Consumer Buffer, Analysis Threads)
41
Traditional: One-phase scanning
(Diagram: Server, Client, Metadata Store, Metadata)
42
SCAN-Lite: Two-phase scanning
(Diagram: Server, Client, Metadata Store)
43
Global scheduling
  • Time is broken into scheduling periods based on some
    freshness delay (the maximum time until new data must be scanned)
  • At the start of each scheduling period, the global
    scheduler picks which client will scan which data
  • First: schedule data that has met its freshness
    delay
  • Idle time, priorities, worst-fit, and ordering
  • Second: schedule any possible additional data
  • Work-ahead
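A simplified sketch of one scheduling period under these assumptions (names are mine; the actual heuristic also applies the priority and ordering rules sketched on the following slides): data that has reached its freshness delay is placed first, then any remaining idle time is filled with work ahead.

    def schedule_period(period, pending, idle):
        # pending: objects with .deadline_period, .analysis_time, .holders (clients with a copy)
        # idle:    {client: idle time remaining in this period}
        assignments = {}
        due = [d for d in pending if d.deadline_period <= period]
        ahead = [d for d in pending if d.deadline_period > period]

        for d in due + ahead:                       # freshness-due data first, then work ahead
            candidates = [c for c in d.holders if c in idle]
            if not candidates:
                continue
            best = max(candidates, key=lambda c: idle[c])      # crude stand-in for worst-fit
            if d in due or idle[best] >= d.analysis_time:      # work ahead only if it fits
                assignments[d] = best
                idle[best] -= d.analysis_time
        return assignments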

44
Idle time, priorities, and worst-fit
  • For a given piece of data
  • Choose the set of machines that have available
    idle time
  • If none, then choose all machines
  • From that, choose the machines with the highest
    priority
  • From that, choose the machine with the most idle
    time
  • If none, choose the machine with the least client
    impact
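These selection steps could look roughly like the following (a sketch with assumed names; idle time is allowed to go negative once a machine is overcommitted, so "most idle time left" also approximates "least additional client impact"):

    def pick_client(data, idle, priority):
        # data.holders: machines holding a copy; idle: machine -> idle time remaining;
        # priority: machine -> priority class (higher value = higher priority).
        machines = [m for m in data.holders if idle.get(m, 0) > 0]
        if not machines:                       # no holder has idle time left
            machines = list(data.holders)
        top = max(priority[m] for m in machines)
        machines = [m for m in machines if priority[m] == top]
        # Worst-fit among the highest-priority machines: most idle time remaining.
        return max(machines, key=lambda m: idle.get(m, 0))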

45
Ordering
  • There is still a problem

(Diagram: machine idle time across priority classes P1 and P2)
46
Ordering
  • Assign each piece of data a number based on the
    number of machines at each priority class
  • Order all data by its ordering number

(Diagram: machine idle time across priority classes P1 and P2)
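One way to realize such an ordering (a sketch; the paper may compute the ordering number differently): count each data item's copies per priority class and place the most constrained items first, so they are not squeezed out by data with many placement options.

    def ordering_key(data, priority):
        # Copies of this data at each priority class, highest class first.
        classes = sorted(set(priority.values()), reverse=True)
        counts = {p: 0 for p in classes}
        for m in data.holders:
            counts[priority[m]] += 1
        return tuple(counts[p] for p in classes)

    def order_by_constraint(pending, priority):
        # Fewest high-priority options first.
        return sorted(pending, key=lambda d: ordering_key(d, priority))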
47
Work ahead
  • Once all data that has met its freshness delay
    has been scheduled, assign additional data to any
    machines with available idle time

48
How to schedule
  • First, schedule any work that will meet its
    freshness deadline during this scheduling period
  • Second, schedule any additional work that will
    fit within the remaining idle time of clients

49
Local scheduling results
50
Local performance improvements
  • What happens when one or more analysis routines
    try to improve performance?
  • For example, using direct I/O to reduce memory
    footprint, and thus impact on client workloads
  • Seven Direct
  • Analysis programs implement direct I/O
  • Unified Direct
  • SCAN-Lite implements direct I/O

51
Local scheduling with direct I/O