Virtual Machines

Intro / Flashback (05:12)

Virtual Machines (03:24)

  • System Virtual Machines
  • Goal: Imitate hardware
  • What hardware?
    • Whatever's easiest
  • Running Operating Systems
  • Terms
    • Hypervisor (or "VM Monitor")
      • e.g. VirtualBox
      • Sometimes a part of host OS
    • Guest OS
      • e.g. Linux
      • Application for the Hypervisor
    • Host OS
      • e.g. Windows
      • Sometimes Host OS = Hypervisor
  • We'll assume Hypervisor = Host OS
    • e.g. Hypervisor will have exception handlers

Imitate: How close? (02:08)

  • Full Virtualization
    • Guest OS unmodified
  • Paravirtualization
    • Small changes to guest OS
  • Fuzzy line

Other techniques (01:16)

  1. Extra HW support for VMs
  2. Compile guest OS' machine code to new machine code
    • Not as slow as you might think

VM Layering (00:34)

  • Run Hypervisor in Kernel mode
  • Run Guest OS in User mode
    • Can think of Guest OS as Hypervisor's "process"

Pretend Modes (00:41)

  • Pretend User mode and Pretend kernel mode for guest OS
    • Guest OS needs to run stuff in kernel mode
    • But the whole thing actually runs in (real) user mode

Keeping track of stuff (01:36)

  • Hypervisor keeps track of regular process stuff
    • Some extras, though
      • Whether in User/kernel mode
      • Page table ptr of guest OS (HARD)
      • Exception table ptr
      • Whether interrupts are disabled
        • (whether we're pretending they are or not)
  • Guest OS runs like a process, with extra state to keep track of
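
A rough C sketch of that extra state (the struct and field names here are made up for illustration, not from the lecture):

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical per-guest bookkeeping: the usual process state plus
       the "pretend" machine state of the guest OS. */
    struct vm_state {
        /* regular process stuff */
        uint64_t regs[32];
        uint64_t pc;

        /* extras for the guest OS */
        bool     pretend_kernel_mode;    /* fake user/kernel mode bit        */
        uint64_t guest_page_table;       /* guest OS's page table ptr (HARD) */
        uint64_t guest_exception_table;  /* guest OS's exception table ptr   */
        bool     interrupts_disabled;    /* whether we're *pretending* they
                                            are disabled                     */
    };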

Basic Hypervisor Flow (02:32)

How do we make it look like it's on real HW?

  • Run it normally until it stops (exceptions)
  • Hypervisor simulates what processor would "normally" do
    • e.g. Screen system call: talks to screen on your behalf
    • e.g. fault on a disable-interrupts instruction: record that interrupts are (pretend) disabled
    • e.g. System call: Run Guest OS' system call handler
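
A minimal sketch of that loop in C, using the hypothetical vm_state struct above; the cause codes and the run_guest_until_exception / emulate_* / reflect_* helpers are invented names, not a real API:

    /* Hypothetical top-level loop: run the guest OS as an ordinary
       user-mode "process" until it traps, then simulate what the real
       processor would have done. */
    void run_vm(struct vm_state *vm) {
        for (;;) {
            int cause = run_guest_until_exception(vm);  /* runs normally until it stops */

            switch (cause) {
            case CAUSE_SYSCALL:
                /* run the guest OS's own system call handler */
                reflect_into_guest(vm, CAUSE_SYSCALL);
                break;
            case CAUSE_PRIVILEGED_INSTRUCTION:
                /* e.g. disable-interrupts: just record the pretend state */
                emulate_privileged_instruction(vm);
                break;
            case CAUSE_PROTECTION_FAULT:
                /* e.g. screen/keyboard access: talk to the device on the
                   guest's behalf */
                emulate_device_access(vm);
                break;
            default:
                handle_other_exception(vm, cause);      /* page faults, timers, ... */
            }
        }
    }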

Virtual machine execution pieces (00:50)

  1. Making IO and kernel-mode related instructions work
    • Trap and emulate
    • Force instruction to cause fault
    • Make fault handler do what HW would do
    • might require reading machine code to emulate instruction
  2. Making exceptions/interrupts work
    • "reflect" exceptions/interrupts into guest OS
  3. Making page tables work
    • Whole thing

Trap and Emulate (00:56)

  • Normally: Privileged instrs trigger fault
  • Normal OS: crash the program
  • Instead: Hypervisor pretends it did the right thing
    • We'll do the privileged thing on your behalf

Privileged I/O flow (00:59)

  1. Guest OS tries to access device
    • Not allowed: Guest OS in user mode
  2. Protection fault triggered
  3. Handler in hypervisor
    • "Oh you were trying to get keyboard input"
    • talk to device for you
    • Update guest OS w/ any results
    • Switch back

What this looks like (pseudocode) (05:04)

  • Handler actually looks at the faulting instruction
  • Also check what mode it thinks it's in
  • Straightforward but tedious
    • Why I/O isn't super fast in VMs
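
A hedged reconstruction of what such a handler might look like (keyboard-read example; the instruction-decoding helpers and device functions are invented):

    /* Hypothetical protection-fault handler: inspect the faulting
       instruction and do the privileged work on the guest's behalf. */
    void emulate_device_access(struct vm_state *vm) {
        uint32_t insn = read_guest_code(vm, vm->pc);   /* look at the faulting instruction */

        if (is_keyboard_read(insn)) {
            if (vm->pretend_kernel_mode) {
                /* the guest "kernel" is allowed: do the real device access */
                vm->regs[insn_dest_reg(insn)] = real_keyboard_read();
                vm->pc += 4;            /* skip the instruction we just emulated */
            } else {
                /* a guest *user* program tried it: let the guest OS decide */
                reflect_into_guest(vm, CAUSE_PROTECTION_FAULT);
            }
        } else {
            /* ...one tedious case like this per privileged instruction /
               device, which is part of why I/O in a VM is not super fast */
            emulate_other_instruction(vm, insn);
        }
    }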

What if we're in pretend user mode? (02:00)

Read from keyboard

  • If in pretend kernel mode: Do the actual privileged instruction
  • If in pretend user mode: Invoke guest OS' exception handler (after switching to fake kernel mode, etc)
    • How nested VMs work
    • (pass handling down the VM chain)
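
Roughly what "invoke the guest OS' exception handler" could look like, again with invented helpers; nested VMs just repeat this trick one level down:

    /* Hypothetical "reflection" of an exception into the guest OS: make it
       look as if the fault happened on real hardware, then resume the guest
       in its own handler (still in real user mode). */
    void reflect_into_guest(struct vm_state *vm, int cause) {
        /* save what real hardware would save, where the guest OS expects it */
        save_guest_trap_frame(vm, cause, vm->pc);

        /* switch to *pretend* kernel mode (and, later, the matching shadow
           page table) */
        vm->pretend_kernel_mode = true;

        /* continue at the handler the guest OS registered for this cause */
        vm->pc = lookup_guest_handler(vm->guest_exception_table, cause);
    }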

System calls (03:07)

  1. Program in Guest OS makes system call
  2. Invokes handler in hypervisor
  3. Hypervisor forwards system call to the guest OS' system call handler
    • Also mark fake kernel mode, change PC, etc.
    • Also have to update page table (more on this later)

After system call

  • Return from exception in guest OS' syscall handler
    • Privileged instruction --> Protection fault because pretend kernel mode isn't real kernel mode
    • hypervisor has to emulate the actual return to the user program
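
A sketch of emulating that return (invented names); it undoes what reflecting the exception set up:

    /* Hypothetical emulation of the guest OS's return-from-exception
       instruction, which itself faults because pretend kernel mode is
       still real user mode. */
    void emulate_return_from_exception(struct vm_state *vm) {
        /* leave pretend kernel mode... */
        vm->pretend_kernel_mode = false;

        /* ...switch back to the user-mode page table (more on this later)... */
        switch_shadow_page_table(vm, /*kernel=*/false);

        /* ...and resume the guest user program where its OS said to */
        vm->pc = guest_saved_return_pc(vm);
    }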

How this works for Memory-Mapped I/O (02:22)

  • Emulating writing to a control register
    • Have to emulate every memory-writing instruction to make devices work
    • (at least) 2 types of page faults for hypervisor:
      1. Guest OS trying to access device memory (emulate the device)
      2. Guest OS trying to access memory not in its page table (run exception handler in guest OS)
        • Trigger page fault in guest OS
    • tl;dr: Is it not mapped because it's a device, or because we don't have this mapping in the guest OS?
    • Virtual mem on top of this? Extra cases (next)
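
The two fault cases as a sketch (invented helpers):

    /* Hypothetical page-fault handler in the hypervisor: decide whether the
       address is "missing" because it's a device we're emulating, or because
       the guest OS genuinely hasn't mapped it. */
    void handle_guest_page_fault(struct vm_state *vm, uint64_t fault_addr) {
        uint64_t guest_phys;

        if (guest_pt_lookup(vm, fault_addr, &guest_phys) &&
            is_emulated_device(guest_phys)) {
            /* case 1: memory-mapped I/O -- emulate the device access */
            emulate_device_access(vm);
        } else {
            /* case 2: not mapped by the guest OS -- trigger the page fault
               in the guest OS so *its* handler runs */
            reflect_into_guest(vm, CAUSE_PAGE_FAULT);
        }
    }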

Exercise: How many exceptions? (02:37)

Exceptions that occur:

  1. System call
  2. Write 1
  3. Write 2
  4. Write 3
  5. Write 4
  6. Return from system call

  7. More if page faults occur to bring in memory, etc.

Making this faster

  • Paravirtualization

This doesn't always work... (01:29)

  • Works on some architectures
    • Relies on all special instrs triggering fault
    • Not all do in x86
      • Some instructions behave differently in user and kernel mode, rather than just faulting vs. not faulting
    • Why the original VMware compiled guest OS' machine code to other machine code
  • Modern x86 added some ISA extensions that solve this problem

What about virtual memory? (01:01)

Things Virtual Memory needs

  • Change page table from user to kernel mode
    • Not automatic because it's actually all in user mode
  • Virtual memory most complicated part of virtualization
    • also source of slowdowns

Terms (01:21)

  • Virtual address: Fake virtual addr for guest OS
  • Physical address: Fake physical addr for guest OS
    • Virtual address for hypervisor
  • Machine address: Real physical address for hypervisor / host OS
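
In code terms, the three kinds of addresses chain together like this (hypothetical helper names):

    /* Hypothetical full translation for a guest virtual address. */
    uint64_t guest_virtual_to_machine(struct vm_state *vm, uint64_t virt) {
        /* guest page table: virtual -> "physical" (fake, the guest's view) */
        uint64_t phys = guest_pt_translate(vm, virt);

        /* hypervisor's mapping: "physical" -> machine (real RAM) */
        uint64_t machine = hypervisor_phys_to_machine(vm, phys);

        return machine;
    }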

Three page tables (05:01)

  1. Guest page table
    • Hypervisor records where it is, etc
      • Privileged instruction sets base ptr
    • Translates virtual to physical
  2. Hypervisor page table?
    • Translates physical to machine
    • Might not really be a page table
    • We do need to know this mapping, though
    • Need to have a mapping
      • Where would you store hypervisor?
      • What if you want to run multiple VMs?

But guest OS needs translation from virtual to machine directly

  3. Shadow page table
    • Virtual to machine translation
    • "Shadow" of guest page table
    • Hypervisor constructs this from guest page table and its own page table

  • Guest OS only knows about guest page table
  • Hardware only knows about shadow page table

Creating the shadow page table (07:44)

  • Simple
  • 2 page table lookups:
    • guest PT lookup
    • hypervisor ~PT lookup
  • Combine the two
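
A sketch of combining the two lookups into one shadow entry (invented names; the device case is the memory-mapped I/O "nit" covered below):

    /* Hypothetical construction of one shadow PTE: compose the guest's
       virtual->physical mapping with the hypervisor's physical->machine
       mapping. */
    bool make_shadow_pte(struct vm_state *vm, uint64_t virt, pte_t *out) {
        pte_t guest_pte;
        if (!guest_pt_lookup(vm, virt, &guest_pte))  /* lookup 1: guest PT    */
            return false;                            /* guest hasn't mapped it */

        uint64_t phys = pte_frame(guest_pte);
        if (is_emulated_device(phys)) {
            *out = INVALID_PTE;   /* force a fault so we can emulate the device */
            return true;
        }

        uint64_t machine = hypervisor_phys_to_machine(vm, phys);  /* lookup 2 */

        /* keep the guest's permission bits, point at the real machine frame */
        *out = make_pte(machine, pte_permissions(guest_pte));
        return true;
    }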

When do we update the shadow page table?

  • Needs to be up to date
  • What do you have to do when you update the page table?
    • Flush the TLB (Translation Lookaside Buffer)
      • A privileged instruction!
    • Processor actually uses TLB to do address translation w/ normal PTs
      • Same problem
        • Synthesized from a page table
        • Gets out of date
        • Needs to be told when to update
      • So can use same strategy
        • Manage the Shadow PT the same way the hardware manages the TLB
        • Shadow page table is basically a "virtual TLB" (stored as a PT instead of a cache)
      • Usual strategy
        • Want addr translated
        • actually go through TLB
        • Fetch addr from PT if not in TLB (Fetch on demand)
          • Shadow PT: fetch on page fault
        • When PT is modified
          • OS invalidates TLB entry (or flushes them all)
          • This is the privileged instruction we'll take advantage of
      • Our strategy
        • Guest OS edits guest PT
          • triggers instruction to flush TLB
          • Hypervisor clears part of shadow page table (fill in later if page fault)
        • Guest page faults
          • Hypervisor fetches on demand if necessary (like how HW does this automatically)
          • Hypervisor does conversion from 2 PTs
  • More on shadow page table
    • Caches commonly used PTEs in translated form
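
A sketch of treating the shadow PT like a TLB, reusing the make_shadow_pte sketch from above (all names invented):

    /* Hypothetical "virtual TLB" handlers for the shadow page table. */

    /* Guest OS edited its page table and executed its (privileged)
       flush-TLB instruction: discard shadow entries, refill lazily. */
    void on_guest_tlb_flush(struct vm_state *vm) {
        clear_shadow_page_table(vm);   /* fill back in later, on page faults */
        vm->pc += 4;                   /* the emulated flush is done */
    }

    /* Shadow-PT miss on a guest page fault: like hardware filling the TLB
       on demand from the page table. */
    void on_shadow_pt_miss(struct vm_state *vm, uint64_t virt) {
        pte_t pte;
        if (!make_shadow_pte(vm, virt, &pte))
            reflect_into_guest(vm, CAUSE_PAGE_FAULT);  /* guest OS's page fault */
        else
            install_shadow_pte(vm, virt, pte);         /* fetched on demand */
    }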

nit: memory-mapped I/O (00:45)

  • If physical addr is for fake I/O device
    • Make shadow PTE invalid
    • We want page fault for these to emulate access to the device

Page tables and kernel mode? (02:41)

  • remember: Can mark pages as kernel-only
  • Hypervisor needs to make this work
  • If guest OS in pretend kernel mode
    • Shadow PTE: marked as user-mode accessible (the guest really is in user mode)
  • If guest OS in pretend user mode
    • Shadow PTE: marked inaccessible

One solution: Two shadow page tables

  • One for pretend user mode
  • One for pretend kernel mode
  • Switch between them on exceptions etc
  • Higher cost of emulation (switching page tables every time)
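
A sketch of that two-shadow-PT solution, assuming the vm_state sketch from earlier grew hypothetical shadow_pt_user / shadow_pt_kernel fields:

    /* Hypothetical switch between the two shadow page tables whenever the
       guest's pretend mode changes (reflecting an exception, emulating a
       return-from-exception, ...). */
    void switch_shadow_page_table(struct vm_state *vm, bool kernel) {
        uint64_t base = kernel ? vm->shadow_pt_kernel  /* kernel-only pages usable  */
                               : vm->shadow_pt_user;   /* kernel-only pages blocked */

        /* the hypervisor really is in kernel mode, so it may do this */
        set_hardware_page_table_base(base);
    }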

Alternate solution: clear PT on kernel/user switch

  • Also not great for overhead

Exercise (06:52)

Guest PT switches

  1. Switch to another program
  2. Switch back to original program

Shadow PT switches

  1. read(): To kernel mode (assume it doesn't switch back yet)
  2. switch(): progA to progB
  3. switch(): Back to user mode
  4. keyboard(): To kernel mode (assume it doesn't switch back yet)
  5. switch(): progB to progA
  6. switch(): Back to user mode

  7. 32-bit x86: Every time you change the PT ptr or invalidate the TLB, you have to flush the whole TLB

    • So, all this gets really expensive

Tagged TLBs (01:39)

  • HW sometimes includes "address space ID" in TLB entries
    • kind of like a process ID
  • Helpful for normal OSes
    • faster context switching
  • Super useful for hypervisor
    • Lots of switches
    • less expensive per switch
  • not used by modern OSes until recently

Proactively filling page tables (06:52)

Problem with filling on demand

  • Many OSes invalidate entire TLB on context switch
    • Especially without tagged TLBs
  • So, rebuild shadow PT on every OS context switch
    • Often unacceptably slow
    • Want to cache shadow page tables
    • problem: the guest OS won't tell you when it modifies those page tables

Solution: Shadow page table for multiple processes

  • Actually 2 for each one (kernel and user)
  • Problem: What if guest OS modifies another process' page table (e.g. fork, copy on write, evicting pages)
    • Guest OS thinks TLB only stores current process
      • So won't tell the hypervisor of these changes
  • Solution: Trap and emulate
    • Track what physical pages are part of PTs the guest OS knows about
    • Mark them as read-only in shadow PTs
    • When guest OS tries to modify, triggers protection fault
    • Sounds really expensive, but cheaper than updating shadow PT every time
    • Real VM monitors do this
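
A sketch of that trap-and-emulate path (invented helpers): guest page-table pages are mapped read-only in the shadow PTs, so this handler runs whenever the guest OS writes one of them.

    /* Hypothetical handler for a write fault on a page the hypervisor
       knows holds one of the guest's page tables. */
    void on_guest_pt_write(struct vm_state *vm, uint64_t fault_addr) {
        /* 1. do the store the guest OS was trying to do */
        perform_guest_store(vm, fault_addr);

        /* 2. fix up the cached shadow PTEs (for whichever guest process that
              page table belongs to) that depended on the old entry */
        update_shadow_ptes_for(vm, fault_addr);

        /* 3. resume the guest after the emulated store */
        vm->pc += 4;
    }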

Pros and cons (02:34)

Proactive (trap and emulate) over on-demand

  • pro: works with guest OSes that make assumptions about TLB size
  • pro: maintain shadow PT for each guest process (avoids rebuilding on every context switch)
  • pro: better fit with tagged TLBs
  • con: more instructions spent doing copy-on-write
  • con: What happens when PT memory recycled?
  • con: Super complicated

Hardware Hypervisor support (not on exam) (02:13)

  • Common in modern processors
  • Benefits:
    • Lets processor be in kernel mode but still switch to VM monitor when something important happens
      • Processor will track User/Kernel mode PTEs for you (don't need separate shadow PT's for these)
    • Lets you configure when to run hypervisor
      • HW can run guest handlers, and specify which things go to hypervisor
    • Nested page table support
      • HW will do 2 page table lookups itself (to help with Virtual Machines)
      • Even though this is super complicated in the HW
  • Why VMs are much faster today