Title: Understanding Linux Kernel - Booting, Syscalls, Interrupts
Understanding Linux Kernel - Booting, Syscalls, Interrupts, Context Switching
By Jayant Upadhyay (2003CS50214), Pankaj K. Sharma (2003CS50219), Sohit Bansal (2003CS50224), Akshay Gaur (2003CS50209)
Overview of Booting
- The process can be divided into the following six logical stages:
  - BIOS selects the boot device
  - BIOS loads the boot sector from the boot device
  - Boot sector loads setup, decompression routines and the compressed kernel image
  - Kernel is uncompressed in protected mode
  - Low-level initialization is performed by the asm code
  - High-level C initialization
 
BIOS POST
- POST: Power On Self Test
  - Power supply starts the clock generator and asserts the POWERGOOD signal on the bus
  - CPU RESET line is asserted
  - POST checks are performed with interrupts disabled
  - IVT is initialized at address zero
  - BIOS bootstrap function is invoked via INT 0x19. This loads track 0, sector 1 at physical address 0x7C00 (0x07C0:0000)
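The segment:offset notation used for the boot addresses can be checked with a small calculation. This is a minimal sketch of real-mode address translation (physical = segment * 16 + offset); the helper name real_mode_phys is hypothetical and not part of the kernel source.

```c
#include <stdint.h>

/* Real-mode address translation: physical = segment * 16 + offset.
 * real_mode_phys is an illustrative name, not a kernel function. */
uint32_t real_mode_phys(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

For example, real_mode_phys(0x07C0, 0x0000) and real_mode_phys(0x0000, 0x7C00) name the same physical byte, which is why the boot sector address can be written either way.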
Boot-sector and Setup
- The boot sector used to boot the Linux kernel could be either:
  - Linux boot sector (arch/i386/boot/bootsect.S)
  - LILO (or other bootloaders) boot sector
 
Linux Boot-sector
- bootsect.S
  - First moves the boot sector code from 0x7C00 to 0x90000
  - Then jumps to the newly made copy of the boot sector, i.e. in segment 0x9000
  - Prepares the stack at INITSEG:0x4000-0xC
    - This is where the limit on setup size comes from
  - Setup sectors are loaded immediately after the boot sector, i.e. at physical address 0x90200, using BIOS service INT 0x13
  - If loading fails for some reason, the error code is dumped and the load is retried in an endless loop
  - If loading setup_sects sectors of setup code succeeds, we jump to label ok_load_setup
  - The kernel image is then loaded at 0x10000. This is done to preserve the firmware data in low memory (0-64K)
  - After the kernel is loaded we jump to SETUPSEG:0 (arch/i386/boot/setup.S)
- setup.S
  - Once the firmware data in low memory is no longer needed (e.g. no more calls to BIOS) it is overwritten by moving the entire (compressed) kernel image from 0x10000 to 0x1000
  - Sets things up for protected mode and jumps to 0x1000, which is the head of the compressed kernel, i.e. arch/i386/boot/compressed/{head.S,misc.c}
  - This sets up the stack and calls decompress_kernel(), which uncompresses the kernel to address 0x100000 and jumps to it
How to load a big kernel?
- The setup sectors are loaded as usual at 0x90200, but the kernel is loaded one 64K chunk at a time using a special helper routine that calls the BIOS to move data from low to high memory
- This helper routine is referred to by bootsect_kludge in bootsect.S and is defined as bootsect_helper in setup.S. The bootsect_kludge label in setup.S contains the value of the setup segment and the offset of the bootsect_helper code in it, so that the boot sector can use the lcall instruction to jump to it (inter-segment jump)
- This routine uses BIOS service int 0x15 (ax=0x8700) to move data to high memory and resets es to always point to 0x10000. This ensures that the code in bootsect.S doesn't run out of low memory when copying data from disk
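The chunked loading scheme can be illustrated in user space. This is only a sketch under stated assumptions: plain memcpy calls stand in for the disk read and for the BIOS int 0x15 block move, and the names load_big_kernel, low_buffer and CHUNK_SIZE are invented for the illustration.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 0x10000u  /* 64K, the chunk size named on the slide */

/* Stand-in for the low-memory buffer that is reused for every chunk. */
static unsigned char low_buffer[CHUNK_SIZE];

/*
 * Copy len bytes from disk to high one chunk at a time, always staging
 * through the same low buffer. Returns the number of chunks moved.
 */
size_t load_big_kernel(const unsigned char *disk, unsigned char *high, size_t len)
{
    size_t chunks = 0;
    size_t done = 0;

    while (done < len) {
        size_t n = len - done;
        if (n > CHUNK_SIZE)
            n = CHUNK_SIZE;

        memcpy(low_buffer, disk + done, n);  /* stands in for the disk read */
        memcpy(high + done, low_buffer, n);  /* stands in for the int 0x15 move */

        done += n;
        chunks++;
    }
    return chunks;
}
```

Because the staging buffer is reused for every chunk, the amount of low memory needed stays fixed at 64K regardless of the image size, which is the point of the helper routine.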
Using LILO as bootloader
- There are several advantages in using a specialised bootloader (LILO) over a bare-bones Linux boot sector:
  - Ability to choose between multiple Linux kernels or even multiple OSes
  - Ability to pass kernel command line parameters
  - Ability to load much larger bzImage kernels: up to 2.5M vs 1M
- Old versions of LILO (v17 and earlier) could not load bzImage kernels. The newer versions (as of a couple of years ago or earlier) use the same technique as bootsect+setup of moving data from low into high memory by means of BIOS services
High Level Initialization
- By "high-level initialisation" we mean anything which is not directly related to bootstrap, even though parts of the code that perform it are written in asm, namely arch/i386/kernel/head.S, which is the head of the uncompressed kernel. The following steps are performed:
  - Initialise segment values (ds = es = fs = gs = __KERNEL_DS = 0x18)
  - Initialise page tables
  - Enable paging by setting the PG bit in cr0
  - Zero-clean BSS (on SMP, only the first CPU does this)
  - Copy the first 2k of bootup parameters (kernel command line)
  - Check CPU type using EFLAGS and, if possible, cpuid; able to detect 386 and higher
  - The first CPU calls start_kernel(); all others call arch/i386/kernel/smpboot.c:initialize_secondary() if ready=1, which just reloads esp/eip and doesn't return
- init/main.c:start_kernel() is written in C and does the following:
  - Perform arch-specific setup (memory layout analysis, copying boot command line again, etc.)
  - Print the Linux kernel "banner" containing the version, compiler used to build it etc. to the kernel ring buffer for messages. This is taken from the variable linux_banner defined in init/version.c and is the same string as displayed by cat /proc/version
  - Initialise traps, irqs, data required for scheduler
  - Parse boot command line options
  - Initialise console
  - If module support was compiled into the kernel, initialise the dynamic module loading facility
  - If the "profile" command line option was supplied, initialise profiling buffers
  - kmem_cache_init(): initialise most of the slab allocator
  - Enable interrupts
  - Calculate the BogoMips value for this CPU
  - Call mem_init(), which calculates max_mapnr, totalram_pages and high_memory and prints out the "Memory ..." line
  - kmem_cache_sizes_init(): finish slab allocator initialisation
  - Initialise data structures used by procfs
  - fork_init(): create uid_cache, initialise max_threads based on the amount of memory available and configure RLIMIT_NPROC for init_task to be max_threads/2
  - Create various slab caches needed for VFS, VM, buffer cache, etc.
  - If System V IPC support is compiled in, initialise the IPC subsystem. Note that for System V shm, this includes mounting an internal (in-kernel) instance of the shmfs filesystem
  - If quota support is compiled into the kernel, create and initialise a special slab cache for it
  - Perform arch-specific "check for bugs" and, whenever possible, activate workarounds for processor/bus/etc. bugs. Comparing various architectures reveals that "ia64 has no bugs" and "ia32 has quite a few bugs"; a good example is the "f00f bug", which is only checked if the kernel is compiled for less than 686 and worked around accordingly
  - Set a flag to indicate that a schedule should be invoked at the "next opportunity" and create a kernel thread init(), which execs execute_command if supplied via the "init=" boot parameter, or tries to exec /sbin/init, /etc/init, /bin/init, /bin/sh in this order; if all these fail, panic with a "suggestion" to use the "init=" parameter
  - Go into the idle loop; this is an idle thread with pid=0
Interrupts and Exceptions
- Hardware support for getting the CPU's attention
  - Often transfers from user to kernel mode
  - Nested interrupts are possible: an interrupt can occur while an interrupt handler is already executing (in kernel mode)
  - Asynchronous: device or timer generated
    - Unrelated to currently executing process
  - Synchronous: immediate result of last instruction
    - Often represents a hardware error condition
- Intel terminology and hardware
  - Irqs, vectors, IDT, gates, PIC, APIC
- Interrupt handling: data structures, flow of control
  - Handlers: softirqs, tasklets, bottom halves
 
Basic Ideas
- Similar to context switch (but lighter weight)
  - Hardware saves a small amount of context on the stack
  - Includes the interrupted instruction if restart is needed
  - Execution resumes with the special iret instruction
- Structure: top and bottom halves
  - Top half: do minimum work and return
  - Bottom half: deferred processing
- Handler code executed in response
  - Possible to temporarily mask interrupts
  - Handlers need not be reentrant
  - But other interrupts can occur, causing nesting
 
Interrupts vs Exceptions
- Terminology varies, but for Intel:
  - Interrupts (asynchronous, device generated)
    - Maskable: device-generated, associated with IRQs (interrupt request lines); may be temporarily disabled (still pending)
    - Nonmaskable: some critical hardware failures
  - Exceptions (synchronous)
    - Processor-detected
      - Faults: correctable (restartable), e.g. page fault
      - Traps: no reexecution needed, e.g. breakpoint
      - Aborts: severe error; process usually terminated (by signal)
    - Programmed exceptions (software interrupts)
      - int (system call), int3 (breakpoint)
      - into (overflow), bound (address check)
 
Vectors, IDT
- Vector: index (0-255) into descriptor table (IDT)
  - Special register idtr points to the table (use lidt to load)
- IDT: table of gate descriptors
  - Segment selector + offset for handler
  - Descriptor Privilege Level (DPL)
  - Gates (slightly different ways of entering kernel)
    - Task gate: includes TSS to transfer to (not used by Linux)
    - Interrupt gate: disables further interrupts
    - Trap gate: further interrupts still allowed
- Vector assignments
  - Exceptions, NMI are fixed
  - Maskable interrupts can be assigned as needed
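To make the gate descriptor fields concrete, here is a minimal sketch that packs a selector, handler offset, gate type and DPL into the 8-byte i386 gate layout. The helper names make_gate, gate_offset and gate_dpl are hypothetical; the kernel builds its descriptors with its own macros, which are not shown here.

```c
#include <stdint.h>

#define GATE_INTERRUPT 0xE  /* 32-bit interrupt gate: interrupts disabled on entry */
#define GATE_TRAP      0xF  /* 32-bit trap gate: interrupts left as they were */

/* Pack an 8-byte i386 gate descriptor into a 64-bit value (illustrative helper). */
uint64_t make_gate(uint32_t offset, uint16_t selector, unsigned type, unsigned dpl)
{
    uint64_t d = 0;
    d |= (uint64_t)(offset & 0xFFFFu);        /* bits 0-15: handler offset, low half */
    d |= (uint64_t)selector << 16;            /* bits 16-31: code segment selector */
    d |= (uint64_t)(type & 0xFu) << 40;       /* bits 40-43: gate type */
    d |= (uint64_t)(dpl & 0x3u) << 45;        /* bits 45-46: descriptor privilege level */
    d |= (uint64_t)1 << 47;                   /* bit 47: present */
    d |= (uint64_t)(offset >> 16) << 48;      /* bits 48-63: handler offset, high half */
    return d;
}

/* Recover the handler offset from a packed gate. */
uint32_t gate_offset(uint64_t d)
{
    return (uint32_t)(d & 0xFFFFu) | (uint32_t)((d >> 48) << 16);
}

/* Recover the DPL from a packed gate. */
unsigned gate_dpl(uint64_t d)
{
    return (unsigned)((d >> 45) & 0x3u);
}
```

A gate with DPL 3 can be invoked from user mode with an int instruction, while a gate with DPL 0 cannot; the only difference between the interrupt and trap types is whether further interrupts are disabled on entry.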
 
PIC
- Programmable Interrupt Controller (PIC)
  - Chip between devices and CPU
  - Fixed number of wires in from devices
    - IRQs: Interrupt ReQuest lines
  - Single wire to CPU + some registers
- PIC translates IRQ to vector
  - Raises interrupt to CPU
  - Vector available in register
  - Waits for ack from CPU
- Other interrupts may be pending
- Possible to mask interrupts at PIC or CPU
- Early systems cascaded two 8-input chips (8259A)
 
Interrupt Handling Components
[Figure: block diagram. Labels: IRQs (0-15), INTR, idtr, table entries 0-255, Memory Bus, Mask points.]
IO-APIC, LAPIC
- Advanced PIC for SMP systems
  - Used in all modern systems
  - Interrupts routed to CPU over system bus
  - IPI: inter-processor interrupt
- Local APIC versus "frontend" IO-APIC
  - Devices connect to front-end IO-APIC
  - IO-APIC communicates (over bus) with Local APIC
- Interrupt routing
  - Allows broadcast or selective routing of interrupts
  - Need to distribute interrupt handling load
  - Routes to the CPU running the lowest priority process
    - Special register: Task Priority Register (TPR)
  - Arbitrates (round-robin) if equal priority
 
Intel Exceptions
- Architecture (processor) dependent
  - Intel has about 20 (out of 32 possible)
- Most exceptions send a signal to the current process
  - Default action often just kills the process
  - Page fault is the one exception: very complex handler
- Some examples (vector, signal, description):
  - 0   SIGFPE   Divide by zero error
  - 3   SIGTRAP  Breakpoint
  - 6   SIGILL   Invalid op-code
  - 11  SIGBUS   Segment not present
  - 12  SIGBUS   Stack segment fault
  - 13  SIGSEGV  General protection fault (DPL violation)
  - 14  SIGSEGV  Page fault
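The vector-to-signal examples listed on this slide can be written as a small lookup. This is only a sketch covering the vectors named here; exception_signal is a hypothetical helper, and the kernel's real handlers do more than pick a signal.

```c
#include <signal.h>

/* Map an exception vector to the signal given on the slide, or 0 if not listed. */
int exception_signal(int vector)
{
    switch (vector) {
    case 0:  return SIGFPE;   /* divide error */
    case 3:  return SIGTRAP;  /* breakpoint */
    case 6:  return SIGILL;   /* invalid opcode */
    case 11: return SIGBUS;   /* segment not present */
    case 12: return SIGBUS;   /* stack segment fault */
    case 13: return SIGSEGV;  /* general protection fault */
    case 14: return SIGSEGV;  /* page fault */
    default: return 0;        /* not covered by this sketch */
    }
}
```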
 
Hardware Handling
- On entry:
  - Which vector?
  - Get corresponding descriptor in IDT
  - Find specified descriptor in GDT (for handler)
  - Check privilege levels (CPL, DPL)
    - If entering kernel mode, set kernel stack
  - Save eflags, cs, (original) eip on stack
  - -> Jump to appropriate handler
    - Assembly code prepares C stack, calls handler
- On return (i.e. iret):
  - Restore registers from stack
  - If returning to user mode, restore user stack
  - Clear segment registers (if privileged selectors)
 
Nested Execution
- Interrupts can be interrupted
  - By different interrupts; handlers need not be reentrant
  - No notion of priority in Linux
  - Small portions execute with interrupts disabled
  - Interrupts remain pending until acked by CPU
- Exceptions can be interrupted
  - By interrupts (devices needing service)
- Exceptions can nest two levels deep
  - Exceptions indicate coding error
  - Exception code (kernel code) shouldn't have bugs
  - Page fault is possible (trying to touch user data)
IDT Initialization
- Initialized once by BIOS in real mode
  - Linux re-initializes during kernel init
- Must not expose kernel to user mode access
  - Start by zeroing all descriptors
- Linux lingo:
  - Interrupt gate (same as Intel; no user access)
    - Not accessible from user mode
  - System gate (Intel trap gate; user access)
    - Used for int, int3, into, bound
  - Trap gate (same as Intel; no user access)
    - Used for exceptions
 
Exception Handling
- Some exceptions push an error code on the stack
- IDT points to small individual handlers (assembly):
      handler_name:
          pushl $0                /* placeholder if no error code */
          pushl $do_handler_name
          jmp error_code
- Common code sets up for C call
  - Pops handler address from stack, calls
- All handlers check if kernel mode
  - Exceptions caused by touching bad syscall params
    - Return to userland with error code
  - Other exceptions -> die() // kernel Oops
- Most handlers just generate a signal for current:
      current->tss.error_code = error_code;
      current->tss.trap_no = vector;
      force_sig(sig_number, current);
 
Interrupt Handling
- More complex than exceptions
  - Requires registry, deferred processing, etc.
- Some issues:
  - IRQs are often shared: all handlers (ISRs) are executed, so they must query the device
  - IRQs are dynamically allocated to reduce contention
    - Example: floppy allocates when accessed
- Three types of actions:
  - Critical: top half (interrupts disabled, briefly!)
    - Example: acknowledge interrupt
  - Non-critical: top half (interrupts enabled)
    - Example: read key scan code, add to buffer
  - Non-critical deferrable: do it later (interrupts enabled)
    - Example: copy keyboard buffer to terminal handler process
    - Softirqs, tasklets, bottom halves (deprecated)
 
IRQ, Vector Assignment
- PCI bus usually assigns IRQs at boot
- Vectors usually IRQ + 32
  - Below 32 reserved for non-maskable interrupts and exceptions
  - Vector 128 used for syscall
  - Vectors 251-255 used for IPI
- Some IRQs are fixed by architecture
  - IRQ0: interval timer
  - IRQ2: cascade pin for 8259A
- See /proc/interrupts for assignments
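The vector ranges on this slide can be summarised in a few lines of C. This is a sketch that only encodes the numbers stated above for the legacy 16-line PIC case; the function and enum names are invented for the illustration.

```c
#define FIRST_EXTERNAL_VECTOR 32    /* vectors below this are exceptions and NMI */
#define SYSCALL_VECTOR        0x80  /* vector 128 */

/* Returns the vector for a legacy PIC IRQ (0-15), or -1 if out of range. */
int irq_to_vector(int irq)
{
    if (irq < 0 || irq > 15)
        return -1;
    return FIRST_EXTERNAL_VECTOR + irq;
}

enum vector_class { VEC_INVALID, VEC_EXCEPTION, VEC_SYSCALL, VEC_IPI, VEC_EXTERNAL };

/* Classify a vector number according to the ranges listed on the slide. */
enum vector_class classify_vector(int v)
{
    if (v < 0 || v > 255)
        return VEC_INVALID;
    if (v < FIRST_EXTERNAL_VECTOR)
        return VEC_EXCEPTION;
    if (v == SYSCALL_VECTOR)
        return VEC_SYSCALL;
    if (v >= 251)
        return VEC_IPI;
    return VEC_EXTERNAL;
}
```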
 
IRQ Data Structures
- irq_desc: array of IRQ descriptors
  - status (flags), lock, depth (for nested disables)
  - handler: PIC device driver!
  - action: linked list of irqaction structs (containing ISRs)
- irqaction: ISR info
  - handler: actual ISR!
  - flags:
    - SA_INTERRUPT: interrupts disabled if set
    - SA_SHIRQ: sharing allowed
    - SA_SAMPLE_RANDOM: input for /dev/random entropy pool
  - name: for /proc/interrupts
  - dev_id, next
- irq_stat: per-cpu counters (for /proc/interrupts)
 
Interrupt Processing
- BUILD_IRQ macro generates:
      IRQn_interrupt:
          pushl $n-256            /* negative to distinguish from syscalls */
          jmp common_interrupt
- Common code:
      common_interrupt:
          SAVE_ALL                /* save a few more registers than hardware */
          call do_IRQ
          jmp ret_from_intr
- do_IRQ() is C code that handles all interrupts
 
Low-level IRQ Processing
- do_IRQ()
  - Get vector, index into irq_desc for appropriate struct
  - Grab per-vector spinlock, ack (to PIC) and mask line
  - Set flags (IRQ_PENDING)
  - Really process IRQ? (may be disabled, etc.)
  - Call handle_IRQ_event()
  - Some logic for handling lost IRQs on SMP systems
- handle_IRQ_event()
  - Enable interrupts if needed (SA_INTERRUPT clear)
  - Execute all ISRs for this vector:
      action->handler(irq, action->dev_id, regs);
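The "execute all ISRs" step on a shared line can be sketched in user space as a walk over a linked list. This is a simplified model, not the kernel code: the struct here keeps only three fields, the handler returns an int (an assumption made so the sketch can report whether a device claimed the interrupt), and demo_isr is a hypothetical driver routine.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's irqaction: one registered ISR. */
struct irqaction {
    int (*handler)(int irq, void *dev_id);  /* returns 1 if its device raised the IRQ */
    void *dev_id;                           /* identifies the device to the handler */
    struct irqaction *next;                 /* next ISR sharing this line */
};

/* Call every ISR registered on this line; returns how many claimed the interrupt. */
int handle_irq_event(int irq, struct irqaction *action)
{
    int handled = 0;

    for (; action != NULL; action = action->next)
        handled += action->handler(irq, action->dev_id);
    return handled;
}

/* Hypothetical device ISR: dev_id points to an "interrupt pending" flag. */
int demo_isr(int irq, void *dev_id)
{
    int *pending = dev_id;

    (void)irq;
    if (!*pending)
        return 0;   /* not our device: ISRs on a shared line must check */
    *pending = 0;   /* service the device and clear its request */
    return 1;
}
```

Every handler on the list is called because the line alone does not say which device raised it; each ISR has to query its own device, as noted on the Interrupt Handling slide.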
 
Deferrable Functions
- Bottom halves (deprecated)
  - Old static array of function pointers that are marked for execution (can be masked temporarily)
  - Executed on kernel to user transition
  - Executed serially (globally) on SMP system
- Mostly for networking code
  - Tasklets: different tasklets can execute concurrently
  - Softirqs: the same softirq can execute concurrently
- Layered implementation
  - Bottom halves implemented using tasklets
  - Tasklets implemented using softirqs
- When executed? (pretty frequently)
  - When last (nested) interrupt handler terminates
  - When network packet received
  - When idle: per-cpu ksoftirqd kernel thread
- Lots of detail in book; a bit complex
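The "mark now, run later" idea behind deferrable functions can be shown with a pending bitmask. This is a single-threaded user-space sketch, assuming a fixed table of 32 slots; all names ending in _demo are invented, and the real softirq code additionally deals with per-CPU state and locking.

```c
#include <stdint.h>

#define NR_SOFTIRQS_DEMO 32

typedef void (*softirq_fn)(void);

static softirq_fn softirq_vec[NR_SOFTIRQS_DEMO];  /* registered handlers */
static uint32_t softirq_pending;                  /* one bit per marked handler */
static int demo_count;                            /* bumped by demo_handler */

/* Hypothetical deferred routine used for the illustration. */
void demo_handler(void)
{
    demo_count++;
}

/* Register a handler in slot nr. */
void open_softirq_demo(unsigned nr, softirq_fn fn)
{
    if (nr < NR_SOFTIRQS_DEMO)
        softirq_vec[nr] = fn;
}

/* Top half: only mark the work as pending, which is cheap. */
void raise_softirq_demo(unsigned nr)
{
    if (nr < NR_SOFTIRQS_DEMO)
        softirq_pending |= (uint32_t)1 << nr;
}

/* Later: run each pending handler once, lowest slot first; returns how many ran. */
int do_softirq_demo(void)
{
    uint32_t pending = softirq_pending;
    int ran = 0;

    softirq_pending = 0;  /* handlers may raise again for the next pass */
    for (unsigned nr = 0; nr < NR_SOFTIRQS_DEMO; nr++) {
        if ((pending & ((uint32_t)1 << nr)) && softirq_vec[nr]) {
            softirq_vec[nr]();
            ran++;
        }
    }
    return ran;
}
```

Raising the same slot twice before it runs still results in one execution, since the pending state is a single bit.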
 
Return Code Path
- Interleaved assembly entry points
  - ret_from_exception()
  - ret_from_intr()
  - ret_from_sys_call()
  - ret_from_fork()
  - See flowchart in text (Fig 4-5 page 158)
- Things that happen:
  - Run scheduler if necessary
  - Return to user mode if no nested handlers
    - Restore context, user-stack, switch mode
    - Re-enable interrupts if necessary
  - Deliver pending signals
  - (Some DOS emulation stuff: VM86 Mode)
 
System Calls
- Interface between user-level processes and hardware devices
  - CPU, memory, disks etc.
- Make programming easier
  - Let kernel take care of hardware-specific issues
- Increase system security
  - Let kernel check requested service via syscall
- Provide portability
  - Maintain interface but change functional implementation
Mode, Space, Context
- Mode: hardware restricted execution state
  - Restricted access, privileged instructions
  - User mode vs. kernel mode
  - "Dual-mode architecture", "protected mode"
  - Intel supports 4 protection rings: 0 kernel, 1 unused, 2 unused, 3 user
- Space: kernel (system) vs. user (process) address space
  - Requires MMU support (virtual memory)
  - "Userland": any process address space; there are many user address spaces
  - Reality: kernel is often mapped into user process space
- Context: kernel activity on behalf of ???
  - Process: on behalf of current process
  - System: unrelated to current process (maybe no process!)
    - Example: interrupt context
    - Blocking not allowed!
 
POSIX APIs
- API = Application Programmer Interface
  - Function definition specifying how to obtain a service
  - By contrast, a system call is an explicit request to the kernel made via a software interrupt
- Standard C library (libc) contains wrapper routines that make system calls
  - e.g., malloc, free are libc routines that use the brk system call
- POSIX-compliant = having a standard set of APIs
  - Non-UNIX systems can be POSIX-compliant if they offer the required set of APIs
Interrupts and Exceptions
- Interrupts: async device to CPU communication
  - Example: service request, completion notification
  - Aside: IPI = interprocessor interrupt (another CPU!)
  - System may be interrupted in either kernel or user mode
  - Interrupts are logically unrelated to current processing
- Exceptions: sync hardware error notification
  - Example: divide-by-zero (AU), illegal address (MMU)
  - Exceptions are caused by current processing
- Software interrupts (traps)
  - Synchronous "simulated" interrupt
  - Allows controlled entry into the kernel from userland
Linux System Calls
- Invoked by executing int 0x80
  - Programmed exception, vector number 128
  - CPU switches to kernel mode and executes a kernel function
- Calling process passes the syscall number identifying the system call in the eax register (on Intel processors)
- Syscall handler responsible for:
  - Saving registers on kernel mode stack
  - Invoking syscall service routine
  - Exiting by calling ret_from_sys_call()
 
Linux System Calls
- System call dispatch table
  - Associates syscall number with corresponding service routine
  - Stored in sys_call_table array having up to NR_syscalls entries (usually 256 maximum)
  - nth entry contains service routine address of syscall n
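The dispatch table idea is an array of function pointers indexed by the call number. The following is a user-space sketch with a made-up four-entry table; the demo_ names and the three-argument signature are assumptions for the illustration and do not reflect the kernel's actual table.

```c
#include <errno.h>

#define NR_SYSCALLS_DEMO 4

typedef long (*syscall_fn)(long, long, long);

/* Placeholder for unimplemented entries: always fails with -ENOSYS. */
static long demo_ni_syscall(long a, long b, long c)
{
    (void)a; (void)b; (void)c;
    return -ENOSYS;
}

/* Two hypothetical service routines. */
static long demo_add(long a, long b, long c)
{
    (void)c;
    return a + b;
}

static long demo_neg(long a, long b, long c)
{
    (void)b; (void)c;
    return -a;
}

/* The nth entry holds the service routine for call number n. */
static const syscall_fn demo_call_table[NR_SYSCALLS_DEMO] = {
    demo_ni_syscall,  /* 0: guard entry */
    demo_add,         /* 1 */
    demo_ni_syscall,  /* 2: hole for an unused number */
    demo_neg,         /* 3 */
};

/* Validate the number, then call through the table. */
long demo_dispatch(unsigned long nr, long a, long b, long c)
{
    if (nr >= NR_SYSCALLS_DEMO)
        return -ENOSYS;
    return demo_call_table[nr](a, b, c);
}
```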
Kernel Entry and Exit
[Figure: kernel entry paths. Labels: boot; trap; 80h; exceptions (error traps); page faults; interrupt; device dialog; IPI (inter-processor interrupt).]
Initializing System Calls
- trap_init(), called during kernel initialization, sets up the IDT (interrupt descriptor table) entry corresponding to vector 128:
      set_system_gate(0x80, &system_call);
- A system gate descriptor is placed in the IDT, identifying the address of the system_call routine
  - Does not disable maskable interrupts
  - Sets the descriptor privilege level (DPL) to 3
    - Allows User Mode processes to invoke exception handlers (i.e. syscall routines)
The system_call() Function
- Saves the syscall number and the CPU registers used by the exception handler on the stack, except those automatically saved by the control unit
- Checks for valid system call
- Invokes the specific service routine associated with the syscall number (contained in eax):
      call *sys_call_table(0, %eax, 4)
- Return code of the system call is stored in eax
 
Parameter Passing
- As with the syscall number, user-space must relay the parameters to the kernel during the exception trap
- The parameters are stored in registers: on x86, the registers ebx, ecx, edx, esi, and edi contain, in order, the first five arguments
- In the unlikely case of six or more arguments, a single register is used to hold a pointer to user-space where all the parameters reside
- The return value is sent to user-space via a register, eax on x86
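From user space, the number-plus-arguments convention is visible through the C library's generic syscall() wrapper, which takes the call number followed by the arguments and leaves the register setup to libc. This is a small sketch assuming a Linux system with glibc; raw_getpid and raw_write are hypothetical wrapper names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* makes syscall() visible in <unistd.h> on glibc */
#endif
#include <sys/syscall.h>     /* SYS_getpid, SYS_write */
#include <sys/types.h>
#include <unistd.h>

/* getpid takes no arguments: only the call number is passed. */
pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}

/* write(fd, buf, count): call number plus three arguments. */
long raw_write(int fd, const void *buf, unsigned long count)
{
    return syscall(SYS_write, fd, buf, count);
}
```

Note that the syscall() wrapper returns -1 and sets errno on failure, whereas the kernel itself hands back a negative error code in the return register.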
Writing a system call for Linux
- Define its purpose, i.e., exactly one purpose
- Decide arguments, return value, and error codes
- Design the interface with forward compatibility in mind
- Return appropriate error codes

- Verifying the parameters: for a user-supplied pointer, check that
  - The pointer points to a region of memory in user-space
  - The pointer points to a region of memory in the process's address space
  - If reading, the memory is marked readable. If writing, the memory is marked writable
- copy_to_user(usr_dst, krnl_src, len)
- copy_from_user(krnl_dst, usr_src, len)

    asmlinkage long sys_scopy(unsigned long *src, unsigned long *dst, unsigned long len)
    {
            unsigned long buf;

            /* fail if the kernel wordsize and user wordsize do not match */
            if (len != sizeof(buf))
                    return -EINVAL;
            /* copy src, which is in the user's address space, into buf */
            if (copy_from_user(&buf, src, len))
                    return -EFAULT;
            /* copy buf into dst, which is in the user's address space */
            if (copy_to_user(dst, &buf, len))
                    return -EFAULT;
            /* return amount of data copied */
            return len;
    }
 
System Call Context
- In process context, the kernel is capable of sleeping (e.g., blocked on a call or calling schedule()) and so can make use of the majority of the kernel's functionality, simplifying kernel programming
- In process context, the kernel is preemptible, so system calls must be reentrant (the current task may be preempted by another task that may then execute the same system call)
Blocking System Calls
- System calls may block "in the kernel"
- "Slow" system calls may block indefinitely
  - Reads, writes of pipes, terminals, net devices
  - Some ipc calls, pause, some opens and ioctls
  - Disk I/O is NOT slow (it will eventually complete)
- Blocking slow calls may be interrupted by a signal
  - Returns EINTR
- Problem: slow calls must be wrapped in a loop
  - BSD introduced "automatic restart" of slow interrupted calls
  - POSIX didn't specify semantics
- Linux
  - No automatic restart by default
  - Specify restart when setting signal handler (SA_RESTART)
Linux Files Relating to Syscalls
- Main files:
  - arch/i386/kernel/entry.S
    - System call and low-level fault handling routines
  - include/asm-i386/unistd.h
    - System call numbers and macros
  - kernel/sys.c
    - System call service routines
 
arch/i386/kernel/entry.S
- Add system calls by appending an entry to sys_call_table:
      .long SYMBOL_NAME(sys_my_system_call)
 
include/asm-i386/unistd.h
- Each system call needs a number in the system call table
  - e.g., #define __NR_write 4
  - #define __NR_my_system_call nnn, where nnn is the next free entry in the system call table
kernel/sys.c
- Service routine bodies are defined here, e.g.:
      asmlinkage retval
      sys_my_system_call(parameters)
      {
              body of service routine
              return retval;
      }
 
Example System Calls
- sys_foo, do_foo idiom
  - All system calls proper begin with sys_
  - Often delegate to do_ function for the real work
- asmlinkage
  - gcc magic to keep parameters on the stack
  - Avoids register optimizations
- sys_ni_syscall
  - Just returns -ENOSYS!
  - Guards position 0 in table (catch uninitialized bugs)
  - Fills holes for obsolete syscalls or library implemented calls
Example System Calls: sys_time
- kernel/time.c: sys_time
  - Just return the number of seconds since Jan 1, 1970
    - Available as volatile CURRENT_TIME (xtime.tv_sec)
  - Snapshot current time
  - Check user-supplied pointer for validity
  - Copy time to user space (asm/uaccess.h: put_user)
  - Return time snapshot or error
 
Example System Calls: sys_reboot
- kernel/sys.c: sys_reboot
  - Require CAP_SYS_BOOT capability
  - Check magic numbers (0xfee1dead, Torvalds family birthdays)
  - Acquire the big kernel lock
  - Switch on options:
    - Shutdown in various ways: restart, halt, poweroff
    - User-specified shutdown command for some architectures
    - Toggle control-alt-delete processing
  - Go through reboot_notifier callbacks as appropriate
  - Unlock and return error if failure
 
Example System Calls: sys_sysinfo
- kernel/info.c: sys_sysinfo
  - Allocate a local struct to return info to user space
  - Disable (clear) interrupts to keep info consistent
  - Calculate uptime
  - Calculate 1, 5, 15 minute load averages
    - Average length of run queue over interval
    - Use confusing int math to avoid floating-point inefficiency
  - Enable (set) interrupts
  - Return number of processes and some mem stats
  - Copy local struct values to user space (copy_to_user)
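The "confusing int math" is fixed-point arithmetic: values are stored scaled by a power of two so that fractions survive integer operations. The sketch below shows an exponentially decayed average in that style. The constants (11 fractional bits, a one-minute decay factor of 1884 for a 5-second sampling interval) are written from memory of the kernel headers of that era and should be treated as assumptions, as should the function name calc_load.

```c
#define FSHIFT   11                  /* number of fractional bits */
#define FIXED_1  (1UL << FSHIFT)     /* 1.0 in fixed point (2048) */
#define EXP_1    1884UL              /* about FIXED_1 * exp(-5s / 60s) */

/* Integer part and two-digit fractional part of a fixed-point value. */
#define LOAD_INT(x)  ((x) >> FSHIFT)
#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1 - 1)) * 100)

/*
 * One decay step: new = old * decay + active * (1 - decay),
 * with load, decay and active all in fixed point.
 */
unsigned long calc_load(unsigned long load, unsigned long decay, unsigned long active)
{
    load *= decay;
    load += active * (FIXED_1 - decay);
    return load >> FSHIFT;
}
```

A value of FIXED_1 / 2 prints as "0.50" via LOAD_INT and LOAD_FRAC, and feeding a constant run-queue length into calc_load repeatedly makes the average converge towards that length without any floating-point operations.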
Context switch in Linux
Memory layout: general picture
[Figure: memory layout. Labels: Process X user memory, Process Y user memory, Process Z user memory, Kernel memory.]
1. Kernel stack after any system call, before context switch
[Figure: kernel stack of prev. Saved on the kernel stack during a transition to kernel mode by a jump to interrupt and by the SAVE_ALL macro: ss, esp, eflags, cs, eip, orig_eax, es, ds, eax, ebp, edi, esi, edx, ecx, ebx; followed by the schedule() function frame. Other labels: User Stack, User Code, TSS (tss->esp0), task_struct (thread.esp0).]
2. Stack of prev before the switch_to macro in schedule()
[Figure: stack of prev holding schedule()'s saved EAX, ECX, EDX; the arguments to context_switch(); the return address to schedule(); the old (schedule()'s) EBP. Other labels: TSS (tss->esp0).]
3. switch_to: save esi, edi, ebp on the stack of prev
[Figure: stack of prev holding schedule()'s saved EAX, ECX, EDX; the arguments to context_switch(); the return address to schedule(); the old (schedule()'s) EBP; then ESI, EDI, EBP. Other labels: TSS (tss->esp0).]
4. switch_to: save esp in prev->thread.esp
[Figure: stack of prev holding schedule()'s saved EAX, ECX, EDX; the arguments to context_switch(); the return address to schedule(); the old (schedule()'s) EBP; ESI, EDI, EBP. Other labels: TSS (tss->esp0).]
5. switch_to: load next->thread.esp into esp
[Figure: stacks of prev and next, each holding schedule()'s saved EAX, ECX, EDX; the arguments to context_switch(); the return address to schedule(); the old (schedule()'s) EBP; ESI, EDI, EBP. Other labels: esp, TSS (tss->esp0), task_struct fields thread.eip (1f), thread.esp, thread.esp0.]
6. switch_to: save return address in prev->thread.eip
[Figure: stacks of prev and next, each holding schedule()'s saved EAX, ECX, EDX; the arguments to context_switch(); the return address to schedule(); the old (schedule()'s) EBP; ESI, EDI, EBP. Other labels: esp, TSS (tss->esp0), task_struct fields thread.eip (1f in both tasks), thread.esp, thread.esp0.]
7. switch_to: save return address on the stack of next
[Figure: stacks of prev and next, each holding schedule()'s saved EAX, ECX, EDX; the arguments to context_switch(); the return address to schedule(); the old (schedule()'s) EBP; ESI, EDI, EBP; the address 1f is pushed on the stack of next. Other labels: esp, TSS (tss->esp0), task_struct fields thread.eip (1f in both tasks), thread.esp, thread.esp0.]
8. __switch_to: save the base of next's stack in the TSS
[Figure: stacks of prev and next, each holding schedule()'s saved EAX, ECX, EDX; the arguments to context_switch(); the return address to schedule(); the old (schedule()'s) EBP; ESI, EDI, EBP; the address 1f on the stack of next. Other labels: esp, task_struct fields thread.eip (1f in both tasks), thread.esp, thread.esp0.]
9. Back in switch_to: eip points to the 1f instruction label
[Figure: stacks of prev and next, each holding schedule()'s saved EAX, ECX, EDX; the arguments to context_switch(); the return address to schedule(); the old (schedule()'s) EBP; ESI, EDI, EBP. Other labels: esp, task_struct fields thread.eip (1f in both tasks), thread.esp, thread.esp0.]
10. switch_to: restore esi, edi, ebp from the stack of next
[Figure: stacks of prev and next, each holding schedule()'s saved EAX, ECX, EDX; the arguments to context_switch(); the return address to schedule(); the old (schedule()'s) EBP; ESI, EDI, EBP now remain on one stack only. Other labels: esp, task_struct fields thread.eip (1f in both tasks), thread.esp, thread.esp0.]
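The save-stack-pointer, load-stack-pointer, resume-at-saved-address sequence in steps 1 to 10 has a user-space analogue in the POSIX ucontext interface, where swapcontext() saves the current registers and stack pointer and loads another set. The sketch below is an analogy only, not the kernel's switch_to; the names prev_ctx, next_ctx, next_task and run_switch_demo are invented for the illustration.

```c
#include <ucontext.h>

#define DEMO_STACK_SIZE (64 * 1024)

static ucontext_t prev_ctx, next_ctx;      /* saved register state of each task */
static char next_stack[DEMO_STACK_SIZE];   /* private stack for the second task */
static int trace[4];                       /* records the order of execution */
static int trace_len;

/* Body of the second task: runs on next_stack. */
static void next_task(void)
{
    trace[trace_len++] = 2;
    swapcontext(&next_ctx, &prev_ctx);     /* save next, resume prev */
    trace[trace_len++] = 4;                /* resumed exactly where it left off */
}

/* Switch prev -> next -> prev -> next; returns the number of trace entries. */
int run_switch_demo(void)
{
    trace_len = 0;

    getcontext(&next_ctx);
    next_ctx.uc_stack.ss_sp = next_stack;
    next_ctx.uc_stack.ss_size = sizeof next_stack;
    next_ctx.uc_link = &prev_ctx;          /* resume prev when next_task returns */
    makecontext(&next_ctx, next_task, 0);

    trace[trace_len++] = 1;
    swapcontext(&prev_ctx, &next_ctx);     /* save prev's state, load next's */
    trace[trace_len++] = 3;
    swapcontext(&prev_ctx, &next_ctx);     /* switch again: next continues after its swap */
    return trace_len;
}
```

As in the slides, each task only ever sees its own call returning: the switch happens inside the call, and the task continues later from the saved point on its own stack.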
Thank you