1
CS533 Concepts of Operating Systems Class 6
  • Micro-kernels
  • Mach vs L3 vs L4

2
Binary Compatibility
  • Emulation libraries
  • Trampoline mechanism
  • Single server architecture
  • Multi-server architecture
  • IPC overhead proportional to number of servers
    (independent protection domains)

3
Optimizing IPC
  • Liedtke argues Mach's overhead is due to poor
    implementation!
  • Optimized IPC implementation in L3
  • Architectural level
  • System Calls, Messages, Direct Transfer, Strict
    Process Orientation, Control Blocks.
  • Algorithmic level
  • Thread Identifier, Virtual Queues,
    Timeouts/Wakeups, Lazy Scheduling, Direct Process
    Switch, Short Messages.
  • Interface level
  • Unnecessary Copies, Parameter passing.
  • Coding level
  • Cache Misses, TLB Misses, Segment Registers,
    General Registers, Jumps and Checks, Process
    Switch.

4
L3 IPC Performance vs Mach IPC
5
L3 RPC Performance vs Previous Systems
6
But Is That Enough?
  • What is the impact on overall system performance?
  • Härtig et al. explore the performance and
    extensibility of an L4-based Linux OS vs Mach-based
    Linux and native Linux
  • L4 has even more IPC optimizations than L3!

7
L4Linux Design & Implementation
  • Fully binary compatible with Linux/x86
  • Restricted modifications to architecture-dependent
    part of Linux
  • No Linux-specific modifications to L4 kernel

8
Experiment
  • What is the penalty of using L4Linux?
  • Compare L4Linux to native Linux
  • Does the performance of the underlying
    micro-kernel matter?
  • Compare L4Linux to MkLinux
  • Does co-location improve performance?
  • Compare L4Linux to an in-kernel version of MkLinux

9
Microbenchmarks
  • Measured system call overhead on the shortest
    system call, getpid()

10
Microbenchmarks (cont.)
  • Measures specific system calls to determine basic
    performance.

11
Macrobenchmarks
  • Measured time to recompile the Linux server

12
Macrobenchmarks (cont.)
  • Next, a commercial test suite (AIM) is used to
    simulate a system under full load.

13
Performance Analysis
  • L4Linux is, on average, 8.3% slower than native
    Linux, and only 6.8% slower at maximum load.
  • MkLinux: 49% average, 60% at maximum.
  • Co-located MkLinux: 29% average, 37% at maximum.

14
Conclusion?
  • Can hardware-based protection be made to work
    efficiently enough?
  • Did these experiments explore the cost of fine
    grained protection?

15
Spare Slides
16
The IPC Dilemma
  • IPC is very important in µ-kernel design
  • Increases modularity, flexibility, security and
    scalability.
  • Past implementations have been inefficient.
  • Message transfer takes 50-500 µs.

17
The L3 (µ-kernel based) OS
  • A task consists of
  • Threads
  • Communicate via messages that consist of strings
    and/or memory objects.
  • Dataspaces
  • Memory objects.
  • Address space
  • Where dataspaces are mapped.

18
Redesign Principles
  • IPC performance is the Master.
  • All design decisions require a performance
    discussion.
  • If something performs poorly, look for new
    techniques.
  • Synergetic effects have to be taken into
    consideration.
  • The design has to cover all levels from
    architecture down to coding.
  • The design has to be made on a concrete basis.
  • The design has to aim at a concrete performance
    goal.

19
Achievable Performance
  • A simple scenario
  • Thread A sends a null message to thread B
  • Minimum of 172 cycles
  • Will aim at 350 cycles (7 µs at 50 MHz)
  • Will actually achieve 250 cycles (5 µs)

20
Levels of the redesign
  • Architectural
  • System Calls, Messages, Direct Transfer, Strict
    Process Orientation, Control Blocks.
  • Algorithmic
  • Thread Identifier, Virtual Queues,
    Timeouts/Wakeups, Lazy Scheduling, Direct Process
    Switch, Short Messages.
  • Interface
  • Unnecessary Copies, Parameter passing.
  • Coding
  • Cache Misses, TLB Misses, Segment Registers,
    General Registers, Jumps and Checks, Process
    Switch.

21
Architectural Level
  • System Calls
  • Expensive! So, require as few as possible.
  • Implement two calls
  • Call
  • Reply & Receive Next
  • Combines sending an outgoing message with waiting
    for an incoming message (see the sketch below).
  • Schedulers can handle replies the same as
    requests.
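
Below is a minimal user-space sketch of the pattern these two calls enable. ipc_call and ipc_reply_and_receive_next are hypothetical stand-ins for the L3 traps, stubbed with a global mailbox so the file compiles and runs; the point is that a complete RPC costs two kernel entries rather than four.

```c
#include <stdio.h>

typedef struct { int op, arg, result; } msg_t;

static msg_t mailbox;    /* stands in for the kernel's message transfer */

/* client: send a request and block for the reply -- one trap in L3 */
static void ipc_call(msg_t *m) { mailbox = *m; }

/* server: reply to the previous sender and atomically wait for the
   next request -- also one trap, so a full RPC needs two kernel
   entries instead of four */
static void ipc_reply_and_receive_next(const msg_t *reply, msg_t *next)
{
    (void)reply;         /* reply delivery elided in this stub */
    *next = mailbox;     /* receive the next request */
}

int main(void)
{
    msg_t req = { .op = 0, .arg = 41, .result = 0 }, got;
    ipc_call(&req);                          /* client's single trap */
    ipc_reply_and_receive_next(NULL, &got);  /* server's single trap */
    printf("server received request, arg=%d\n", got.arg);
    return 0;
}
```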

22
Messages
  • Complex Messages
  • Direct String, Indirect Strings (optional)
  • Memory Objects
  • Used to combine sends if no reply is needed.
  • Can transfer values directly from the sender's
    variables to the receiver's variables.

23
Direct Transfer
  • Each address space has a fixed kernel accessible
    part.
  • Messages transferred via the kernel part
  • User A space → Kernel → User B space
  • Requires 2 copies.
  • Larger Messages lead to higher costs

24
  • Shared User Level memory (LRPC, SRC RPC)
  • Security can be penetrated.
  • Cannot check a message's legality.
  • Long messages → the address space becomes a
    critical resource.
  • Explicit opening of communication channels.
  • Not application friendly.

25
Temporary Mapping
  • L3 uses a Communication Window
  • Kernel-accessible only; one exists per address
    space.
  • Target region is temporarily mapped there.
  • Then the message is copied to the communication
    window and ends up in the correct place in the
    target address space.

26
Temporary Mapping
  • Must be fast!
  • A 2-level page table requires only one word to be
    copied (see the sketch below).
  • pdir A → pdir B
  • TLB must be clean of entries relating to the use
    of the communication window by other operations.
  • One thread
  • TLB is always window clean.
  • Multiple threads
  • Interrupts: TLB is flushed
  • Thread switch: invalidate communication-window
    entries.
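
A sketch of that one-word remap, assuming the two-level x86 page table described above; the slot number, flag bits, and function name are illustrative, not L3's actual layout.

```c
#include <stdint.h>
#include <stdio.h>

#define PDIR_ENTRIES      1024  /* each entry covers 4 MB on 2-level x86 */
#define COMM_WINDOW_SLOT   800  /* kernel-only window slot (assumed) */

typedef uint32_t pde_t;

/* pdir A <- pdir B: the single word copied per IPC */
static void map_comm_window(pde_t *pdir_a, const pde_t *pdir_b,
                            unsigned target_slot)
{
    pdir_a[COMM_WINDOW_SLOT] = pdir_b[target_slot];
    /* before the window is reused, the TLB must be "window clean":
       no stale entries for the window from an earlier operation */
}

int main(void)
{
    static pde_t pdir_a[PDIR_ENTRIES], pdir_b[PDIR_ENTRIES];
    pdir_b[5] = 0x00400000u | 0x3u;   /* fake PDE: frame address + flags */
    map_comm_window(pdir_a, pdir_b, 5);
    printf("window PDE = 0x%08x\n", (unsigned)pdir_a[COMM_WINDOW_SLOT]);
    return 0;
}
```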

27
Strict Process Orientation
  • Kernel mode handled in same way as User mode
  • One kernel stack per thread
  • May lead to a large number of stacks
  • Minor problem if stacks are objects in virtual
    memory

28
Thread Control Blocks (tcbs)
  • Hold kernel, hardware, and thread-specific data.
  • Stored in a virtual array in shared kernel space.

29
Tcb Benefits
  • Fast tcb access
  • Saves 3 TLB misses per IPC
  • Threads can be locked by unmapping the tcb
  • Helps make threads persistent
  • IPC independent from memory management

30
Algorithmic Level
  • Thread IDs
  • L3 uses a 64 bit unique identifier (uid)
    containing the thread number.
  • Tcb address is easily obtained by ANDing the lower
    32 bits with a bit mask and adding the tcb base
    address (see the sketch after this list).
  • Virtual Queues
  • Busy queue, present queue, polling-me queue.
  • Unmapping the tcb includes removal from queues
  • Prevents page faults from parsing/adding/deleting
    from the queues.
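
The uid-to-tcb computation from the Thread IDs bullet, as a compilable sketch; the mask, per-tcb size, and base address are assumed values, not L3's real constants.

```c
#include <stdint.h>
#include <stdio.h>

#define TCB_BASE        0xE0000000u  /* base of the virtual tcb array (assumed) */
#define TCB_SIZE_SHIFT  10           /* 1 KiB per tcb (assumed) */
#define THREAD_NO_MASK  0x0001FFFFu  /* thread-number bits of the uid (assumed) */

/* AND the low 32 bits of the 64-bit uid with a mask, scale by the tcb
   size, add the array base: no table walk on the IPC path */
static inline uintptr_t tcb_addr(uint64_t uid)
{
    uint32_t thread_no = (uint32_t)uid & THREAD_NO_MASK;
    return (uintptr_t)TCB_BASE + ((uintptr_t)thread_no << TCB_SIZE_SHIFT);
}

int main(void)
{
    printf("tcb for uid 42 at %#lx\n", (unsigned long)tcb_addr(42));
    return 0;
}
```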

31
Algorithmic Level
  • Timeouts and Wakeups
  • Operation fails if message transfer has not
    started t ms after invoking it.
  • Kept in n unordered wakeup lists.
  • A new thread's tcb is linked into list (t mod n),
    as sketched below.
  • Threads with wakeups far in the future are kept in
    a long-time wakeup list and reinserted into the
    normal lists as their time approaches.
  • Scheduler will only have to check k/n entries per
    clock interrupt.
  • Usually costs less than 4% of IPC time.
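
A sketch of the unordered wakeup lists under the scheme above; the list count and all names are illustrative.

```c
#include <stdio.h>

#define NLISTS 8                 /* "n" in the slide; value assumed */

typedef struct tcb {
    unsigned long wakeup;        /* absolute wakeup tick */
    struct tcb *next;
} tcb_t;

static tcb_t *wakeup_list[NLISTS];

/* a tcb with wakeup time t goes into list (t mod n) */
static void insert_wakeup(tcb_t *t)
{
    unsigned i = t->wakeup % NLISTS;
    t->next = wakeup_list[i];
    wakeup_list[i] = t;
}

/* called once per clock interrupt: only one list is scanned,
   i.e. roughly k/n of the k pending entries */
static tcb_t *pop_expired(unsigned long now)
{
    tcb_t **pp = &wakeup_list[now % NLISTS];
    while (*pp) {
        if ((*pp)->wakeup <= now) {   /* due: unlink and wake it */
            tcb_t *e = *pp;
            *pp = e->next;
            return e;
        }
        pp = &(*pp)->next;
    }
    return NULL;
}

int main(void)
{
    tcb_t a = { 13, NULL };
    insert_wakeup(&a);
    printf("expired at t=13: %p\n", (void *)pop_expired(13));
    return 0;
}
```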

32
Algorithmic Level
  • Lazy Scheduling
  • Only a thread state variable is changed
    (ready/waiting).
  • Deletion from queues happens when queues are
    parsed.
  • Reduces delete operations.
  • Reduces insert operations when a thread that
    hasn't been deleted yet needs to be re-inserted
    (see the sketch below).
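
A sketch of lazy deletion from the ready queue; types and names are illustrative.

```c
#include <stdio.h>

typedef enum { READY, WAITING } tstate_t;

typedef struct thr {
    tstate_t state;
    struct thr *next;
} thr_t;

static thr_t *ready_head;

/* blocking only flips the state word -- no unlink here */
static void block(thr_t *t) { t->state = WAITING; }

/* stale entries are unlinked later, when the scheduler walks the
   queue anyway */
static thr_t *pick_next(void)
{
    while (ready_head && ready_head->state != READY)
        ready_head = ready_head->next;   /* lazy deletion */
    return ready_head;
}

int main(void)
{
    thr_t b = { READY, NULL }, a = { READY, &b };
    ready_head = &a;
    block(&a);                           /* cheap: one store */
    printf("next ready: %s\n", pick_next() == &b ? "b" : "a");
    return 0;
}
```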

33
Algorithmic Level
  • Short messages via registers
  • Register transfers are fast
  • 50-80% of messages are ≤ 8 bytes
  • Up to 8-byte messages can be transferred in
    registers with a decent performance gain (sketched
    below).
  • May not pay off for other processors.
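
A sketch of the short-message path; which registers carry the two words (and the 32-bit word size) is an assumption here, not L3's documented register convention.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* two 32-bit words that travel in general registers, so the kernel
   never touches a message buffer in memory */
typedef struct { uint32_t w0, w1; } short_msg_t;

static short_msg_t pack_short(const void *buf, size_t len)
{
    short_msg_t m = { 0, 0 };
    if (len > 8) len = 8;    /* longer messages take the memory path */
    memcpy(&m, buf, len);
    return m;
}

int main(void)
{
    short_msg_t m = pack_short("hi!", 4);   /* 4 bytes incl. NUL */
    printf("w0=0x%08x w1=0x%08x\n", (unsigned)m.w0, (unsigned)m.w1);
    return 0;
}
```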

34
Interface Level
  • Unnecessary Copies
  • Message objects grouped by types
  • Send/receive buffers structured in the same way
  • Use same variable for sending and receiving
  • Avoid unnecessary copies
  • Parameter Passing
  • Use registers whenever possible.
  • Far more efficient
  • Give compilers better opportunities to optimize
    code.

35
Coding Level
  • Cache Misses
  • Cache line fill sequence should match the usual
    data access sequence.
  • TLB Misses
  • Try to pack into one page:
  • Ipc related kernel code
  • Processor internal tables
  • Start/end of Larger tables
  • Most heavily used entries

36
Coding Level
  • Registers
  • Segment register loading is expensive.
  • One flat segment covering the complete address
    space.
  • On entry, kernel checks if registers contain the
    flat descriptor.
  • Guarantees they contain it when returning to user
    level.
  • Jumps and Checks
  • Basic code blocks should be arranged so that as
    few jumps are taken as possible.
  • Process switch
  • Save/restore of stack pointer and address space
    only invoked when really necessary.

37
L4 Slides
38
Introduction
  • µ-kernels have a reputation for being too slow and
    inflexible
  • Can 2nd generation µ-kernel (L4) overcome
    limitations?
  • Experiment
  • Port Linux to run on L4
  • Compare to native Linux and MkLinux (Linux on a
    1st-generation µ-kernel derived from Mach 3.0)

39
Introduction (cont.)
  • Test speed of a standard OS personality on top of
    a fast µ-kernel: Linux implemented on L4
  • Test extensibility of system
  • pipe-based communication implemented directly on
    µ-kernel
  • mapping-related OS extensions implemented as user
    tasks
  • user-level real-time memory management
    implemented
  • Test whether L4 abstractions are independent of
    the platform

40
L4 Essentials
  • Based on threads and address spaces
  • Recursive construction of address spaces by
    user-level servers
  • Initial address space σ0 represents physical
    memory
  • Basic operations: granting, mapping, and
    unmapping.
  • Owner of address space can grant or map page to
    another address space
  • All address spaces maintained by user-level
    servers (pagers)

41
L4Linux Design & Implementation
  • Fully binary compatible with Linux/x86
  • Restricted modifications to architecture-dependent
    part of Linux
  • No Linux-specific modifications to L4 kernel

42
L4Linux Design & Implementation
  • Address Spaces
  • Initial address space σ0 represents physical
    memory
  • Basic operations: granting, mapping, and
    unmapping.
  • L4 uses flexpages: logical memory regions ranging
    from one physical page up to a complete address
    space.
  • An invoker can only map and unmap pages that have
    been mapped into its own address space (see the
    sketch below).
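
A toy user-space model of grant/map/unmap over single page slots; it ignores flexpage sizes and the recursive mapping database, and none of these names are the real L4 API.

```c
#include <assert.h>
#include <stdio.h>

#define SLOTS 16
typedef struct { int frame[SLOTS]; } space_t;   /* -1 = unmapped */

/* map: the source keeps the page; the target sees it too */
static void as_map(space_t *src, int s, space_t *dst, int d)
{
    assert(src->frame[s] >= 0);   /* can only pass on what you hold */
    dst->frame[d] = src->frame[s];
}

/* grant: the page moves and vanishes from the granter's space */
static void as_grant(space_t *src, int s, space_t *dst, int d)
{
    as_map(src, s, dst, d);
    src->frame[s] = -1;
}

/* unmap: withdraw the page from this space (recursive withdrawal
   from spaces mapped further down is elided) */
static void as_unmap(space_t *sp, int s) { sp->frame[s] = -1; }

int main(void)
{
    space_t sigma0, pager, app;
    for (int i = 0; i < SLOTS; i++)
        sigma0.frame[i] = pager.frame[i] = app.frame[i] = -1;

    sigma0.frame[0] = 7;            /* sigma0 holds physical frame 7 */
    as_map(&sigma0, 0, &pager, 0);  /* pager shares the page */
    as_grant(&pager, 0, &app, 3);   /* pager passes it on and loses it */
    as_unmap(&sigma0, 0);
    printf("app slot 3 -> frame %d, pager slot 0 -> %d\n",
           app.frame[3], pager.frame[0]);
    return 0;
}
```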

43
L4Linux Design & Implementation
44
L4Linux Design & Implementation
  • Address Spaces (cont.)
  • I/O ports are parts of address spaces.
  • Hardware interrupts are handled by user-level
    processes. The L4 kernel will send a message via
    IPC.

45
L4Linux Design & Implementation
  • The Linux server
  • L4Linux will use a single-server approach.
  • A single Linux server will run on top of L4,
    multiplexing a single thread for system calls and
    page faults.
  • The Linux server maps physical memory into its
    address space, and acts as the pager for any user
    processes it creates.
  • The Server cannot directly access the hardware
    page tables, and must maintain logical pages in
    its own address space.

46
L4Linux Design & Implementation
  • Interrupt Handling
  • All interrupt handlers are mapped to messages.
  • The Linux server contains threads that do nothing
    but wait for interrupt messages (see the sketch
    below).
  • Interrupt threads have a higher priority than the
    main thread.
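
The interrupt-thread pattern as a compilable sketch; l4_wait_for_irq_msg and run_linux_top_half are hypothetical stand-ins for the L4 IPC receive and the Linux top-half handler, not real interfaces.

```c
/* hypothetical stand-in: blocks until the L4 kernel converts the
   hardware interrupt into an IPC message */
static void l4_wait_for_irq_msg(int irq) { (void)irq; }

/* hypothetical stand-in: the ordinary Linux interrupt handler,
   run inside the server */
static void run_linux_top_half(int irq) { (void)irq; }

/* body of one interrupt thread; one such thread exists per interrupt
   source and runs at a higher priority than the server's main thread */
void irq_thread(int irq)
{
    for (;;) {
        l4_wait_for_irq_msg(irq);   /* sleep until the IRQ message */
        run_linux_top_half(irq);    /* then run the Linux handler */
    }
}
```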

47
L4Linux Design & Implementation
  • User Processes
  • Each user process is implemented as a separate L4
    task: it has its own address space and threads.
  • The Linux server is the pager for these processes.
    Any fault by a user-level process is sent via RPC
    from the L4 kernel to the server.

48
L4Linux Design & Implementation
  • System Calls
  • Three system call interfaces
  • A modified version of libc.so that uses L4
    primitives.
  • A modified version of libc.a
  • A user-level exception handler (trampoline) calls
    the corresponding routine in the modified shared
    library.
  • The first two options are the fastest. The third
    is maintained for compatibility.

49
L4Linux Design & Implementation
  • Signalling
  • Each user-level process has an additional thread
    for signal handling.
  • The main server thread sends a message to the
    signal-handling thread, which tells the user
    thread to save its state and enter Linux.

50
L4Linux Design & Implementation
  • Scheduling
  • All thread scheduling is done by the L4 kernel.
  • The Linux server's schedule() routine is only
    used for multiplexing its single thread.
  • After each system call, if no other system call
    is pending, it simply resumes the user process
    thread and sleeps.

51
L4Linux Design & Implementation
  • Tagged TLBs & Small Spaces
  • To reduce TLB conflicts, L4Linux has a special
    library to customize code and data for
    communicating with the Linux server.
  • The emulation library and signal thread are mapped
    close to the application, instead of the default
    high-memory area.

52
Performance
  • What is the penalty of using L4Linux?
  • Compare L4Linux to native Linux
  • Does the performance of the underlying
    micro-kernel matter?
  • Compare L4Linux to MkLinux
  • Does co-location improve performance?
  • Compare L4Linux to an in-kernel version of MkLinux

53
Microbenchmarks
  • Measured system call overhead on the shortest
    system call, getpid()

54
Microbenchmarks (cont.)
  • Measures specific system calls to determine basic
    performance.

55
Macrobenchmarks
  • Measured time to recompile the Linux server

56
Macrobenchmarks (cont.)
  • Next, a commercial test suite (AIM) is used to
    simulate a system under full load.

57
Performance Analysis
  • L4Linux is, on average, 8.3% slower than native
    Linux, and only 6.8% slower at maximum load.
  • MkLinux: 49% average, 60% at maximum.
  • Co-located MkLinux: 29% average, 37% at maximum.

58
Extensibility Performance
  • A micro-kernel must provide more than just the
    features of the OS running on top of it.
  • Specialization: improved implementation of OS
    functionality.
  • Extensibility permits implementation of new
    services that cannot be easily added to a
    conventional OS.

59
Pipes and RPC
  • The first five measurements, labeled (1), use the
    standard pipe mechanism of the Linux kernel.
  • (2) is asynchronous and uses only L4 IPC
    primitives. It emulates POSIX standard pipes,
    without signalling; a thread is added for
    buffering and cross-address-space communication.
  • (3) is synchronous and uses blocking IPC without
    buffering data.
  • (4) maps pages into the receiver's address space.

60
Virtual Memory Operations
  • Fault: an example of extensibility; measures the
    time to resolve a page fault by a user-defined
    pager in a separate address space.
  • Trap: latency between a write operation to a
    protected page and the invocation of the related
    exception handler.
  • Appel1: time to access a random protected page;
    the fault handler unprotects the page, protects
    some other page, and resumes.
  • Appel2: time to access a random protected page,
    where the fault handler only unprotects the page
    and resumes.

61
Conclusion
  • Using the L4 micro-kernel imposes a 5-10% slowdown
    relative to native Linux, much faster than
    previous micro-kernels.
  • Further optimizations, such as co-locating the
    Linux server and exploiting extensibility, could
    improve L4Linux even further.