PowerPC Book E MMU architecture
This is a high-level overview that glosses over some details of the MMU. For the full specification, please see the Power Instruction Set Architecture.
PowerPC Book E has three address spaces: Effective, Virtual, and Real, which roughly correspond to Logical, Linear, and Physical in Intel x86 terminology. While x86 processors translate 32-bit logical addresses to 32-bit linear addresses using segmentation, Book E processors extend the 32-bit effective address into a 41-bit virtual address using data from supervisor-only registers.
This extension means that the virtual address space is 41 bits wide, and it is virtual addresses that the TLB translates to real addresses. Since user tasks have 32-bit (effective) address spaces, MMU mappings for multiple tasks may be present in the TLB simultaneously, which means that the TLB need not be flushed on every task switch. For example, effective address 0x0 in two different tasks may be virtual addresses 0x100000000 and 0x200000000, simply because the contents of the PID register differ for each task. (Note: the PID register is only 8 bits and has no direct connection to the OS concept of a "process ID".)
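As a rough illustration of how the 41 bits are put together, the sketch below concatenates the AS bit, the 8-bit PID, and the 32-bit effective address. The function name and the exact bit placement (AS above PID above the effective address) are assumptions chosen to match the example addresses above, not a statement of how any particular core lays out its TLB tags.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: build a 41-bit virtual address from
 * AS (1 bit), PID (8 bits), and a 32-bit effective address. */
static inline uint64_t booke_virtual_address(unsigned as, uint8_t pid, uint32_t ea)
{
    assert(as <= 1);  /* AS is a single bit */
    return ((uint64_t)as << 40) | ((uint64_t)pid << 32) | ea;
}
```

With this layout, effective address 0x0 under PID 1 and PID 2 (both AS=0) yields 0x100000000 and 0x200000000, which is why both mappings can sit in the TLB at once.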
Isolating the Guest Address Space
The essential problem is this: guests will attempt to program the TLB with a full 41-bit address space, yet to fit within a single task on the host we must somehow compress this into a 32-bit address space.
However, there is a shortcut available: currently, Linux does not use the AS bit at all. Linux's virtual address space therefore lies entirely within AS=0, where each PID value selects one 32-bit address space, and the AS=1 half is left unused.
That means we can use the unused 40-bit AS=1 address space for guest mappings! Essentially, every time a Linux guest creates a mapping in its 40-bit AS=0 space, we can force AS=1 in the entry that actually goes into the hardware TLB.
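A minimal sketch of that rewrite follows. The `tlb_entry` structure and the `shadow_of` function are hypothetical and heavily simplified: a real entry carries page size, permissions, and storage attributes, and the host would also have to translate the guest's real page number into a host one, which is omitted here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified TLB entry for illustration only. */
struct tlb_entry {
    uint32_t epn;    /* effective page number */
    uint32_t rpn;    /* real page number (guest view; translation omitted) */
    uint8_t  pid;    /* 8-bit PID tag */
    uint8_t  as;     /* address space bit, 0 or 1 */
    uint8_t  valid;
};

/* Derive the entry the host installs from the entry the guest requested. */
static inline struct tlb_entry shadow_of(struct tlb_entry guest)
{
    assert(guest.as == 0);   /* this sketch only covers guests that never use AS=1 */
    struct tlb_entry shadow = guest;
    shadow.as = 1;           /* move the mapping into the space the host leaves unused */
    return shadow;
}
```

The guest's own view of the entry (with AS=0) still has to be kept somewhere so that it can be read back unchanged.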
Accessing guest memory
When an interrupt occurs, the AS bit is set to 0 by hardware, which is fine because our exception handlers are part of the host address space. However, it is difficult to access memory from the other address space, and in the case of an illegal instruction trap we must copy the errant instruction from the guest (AS 1) to the host (AS 0) in order to decode and emulate it. We will need a small piece of assembly:
- Set MSR[DS]=1 (leaving MSR[IS]=0).
- Load from the specified address into a register.
- Set MSR[DS]=0.
For a single 32-bit instruction this is acceptable (though probably slow), but it will become very difficult if we have large amounts of guest memory to copy into the host. We can try to use as many registers as possible to transfer data back and forth, but anything over ~512 bytes will require multiple iterations through this sequence, which will probably be very slow.
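The real sequence has to be written in privileged assembly, so the following is only a software model of the three steps, meant to show the order of operations. Everything in it is hypothetical: `model_cpu` stands in for the processor, the two arrays stand in for what AS 0 and AS 1 map, and the MSR bit values (0x20 for IS, 0x10 for DS) are assumed here rather than taken from this page.

```c
#include <stdint.h>

#define MSR_IS (1u << 5)   /* instruction address space (assumed bit position) */
#define MSR_DS (1u << 4)   /* data address space (assumed bit position) */

/* Toy model: mem[0] is what AS 0 (host) maps, mem[1] what AS 1 (guest) maps. */
struct model_cpu {
    uint32_t msr;
    const uint32_t *mem[2];
};

/* A data load is translated in the address space selected by MSR[DS]. */
static inline uint32_t model_load32(const struct model_cpu *cpu, uint32_t word_index)
{
    unsigned as = (cpu->msr & MSR_DS) ? 1 : 0;
    return cpu->mem[as][word_index];
}

/* Fetch one guest instruction word while still executing host code. */
static inline uint32_t fetch_guest_insn(struct model_cpu *cpu, uint32_t word_index)
{
    uint32_t insn;

    cpu->msr |= MSR_DS;                     /* step 1: MSR[DS]=1, MSR[IS] stays 0 */
    insn = model_load32(cpu, word_index);   /* step 2: load from the guest space */
    cpu->msr &= ~MSR_DS;                    /* step 3: MSR[DS]=0 again */
    return insn;
}
```

Because MSR[IS] stays 0 throughout, instruction fetch never leaves the host address space; only the one data load is redirected.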
Guests using AS 1
Of course, additional work is needed for guests that use both AS 0 and AS 1, and there is at least one commercial RTOS that does this. However, we can still take advantage of the unused address space because we know that the host isn't using it. If a guest uses AS 0 to map the kernel and AS 1 for user space (which is the intended purpose of these address spaces), we can:
- Record both sets of mappings separately.
- Only insert one set of mappings into the hardware TLB at a time.
- When gMSR[IS] != gMSR[DS], the IS bit will have to take precedence in installing the TLB mappings. Loads/stores will have to be emulated one by one.
- Swap sets every time the gMSR[IS/DS] bit changes.
We cannot allow guest mappings with both gAS=0 and gAS=1 to be present in the hardware TLB at the same time, since at best this breaks isolation within the guests, and at worst it creates conflicting TLB mappings in the hardware.
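The bookkeeping in the list above can be sketched as a small decision function. The names `as_policy`, `policy_for`, and `needs_swap` are hypothetical, and the guest MSR bit values are assumed. The sketch reads "IS takes precedence" as meaning that the set selected by gMSR[IS] is the one installed in hardware, so a swap is only triggered when that bit changes.

```c
#include <stdbool.h>
#include <stdint.h>

#define GMSR_IS (1u << 5)   /* assumed bit position in the guest's MSR */
#define GMSR_DS (1u << 4)

struct as_policy {
    unsigned active_set;      /* recorded set (gAS 0 or 1) to install in hardware */
    bool emulate_loadstore;   /* data accesses target the set that is not installed */
};

static inline struct as_policy policy_for(uint32_t gmsr)
{
    unsigned is = (gmsr & GMSR_IS) ? 1 : 0;
    unsigned ds = (gmsr & GMSR_DS) ? 1 : 0;
    struct as_policy p = { .active_set = is, .emulate_loadstore = (is != ds) };
    return p;
}

/* True when a guest MSR update requires swapping the installed set. */
static inline bool needs_swap(uint32_t old_gmsr, uint32_t new_gmsr)
{
    return policy_for(old_gmsr).active_set != policy_for(new_gmsr).active_set;
}
```

Only one set is ever marked active, which is what keeps gAS=0 and gAS=1 mappings from coexisting in the hardware TLB.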
For an initial implementation, we can focus on Linux as the guest and defer the mass TLB swapping.
Virtualizing the TLB
Book E MMUs do not use a hardware table walk. Instead, there is a software-controlled TLB, containing e.g. 64 entries. These entries are directly modifiable by software via the tlbwe instruction, which specifies a TLB entry to replace by its index.
Unlike classic or server PowerPC architecture, the Book E MMU is not disabled when an interrupt occurs. The exception handlers must always be mapped (i.e. there must be a TLB entry present for them). In a hosted environment, the exception handlers must belong to the host, which means that there must be at least one entry in the hardware TLB that is a host mapping.
This conflicts with tlbwe, since the guest can specify exactly which TLB entry to overwrite. Accordingly, we must "borrow" a TLB entry for the host exception handlers, and opaquely fix up any faults that replacement incurs. For example, if the host uses TLB entry 0 for the exception handlers, and the guest executes tlbwe specifying entry 0, we cannot insert that mapping into the TLB. Later on, if the guest touches memory that would have been mapped by the missing entry, the host must satisfy the fault itself.
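One possible shape for this is sketched below, assuming a 64-entry TLB. All names (`vtlb`, `emulate_tlbwe`, `fault_in_borrowed`) are hypothetical, the `hw` array merely stands in for the real hardware TLB, and entries are reduced to a base address and size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TLB_SIZE 64   /* assumed size, per the "e.g. 64 entries" above */

struct tlb_entry {
    uint32_t base;   /* first effective address covered */
    uint32_t size;   /* bytes covered */
    bool valid;
};

struct vtlb {
    struct tlb_entry guest[TLB_SIZE];  /* what the guest believes it wrote */
    struct tlb_entry hw[TLB_SIZE];     /* stand-in for the hardware TLB */
    unsigned borrowed;                 /* index held by the host's exception handlers */
};

/* Emulate a guest tlbwe: always record it, but never overwrite the borrowed slot. */
static inline void emulate_tlbwe(struct vtlb *t, unsigned index, struct tlb_entry e)
{
    assert(index < TLB_SIZE);
    t->guest[index] = e;
    if (index != t->borrowed)
        t->hw[index] = e;
}

/* On a TLB miss: would the guest's entry in the borrowed slot have covered ea? */
static inline bool fault_in_borrowed(const struct vtlb *t, uint32_t ea)
{
    const struct tlb_entry *e = &t->guest[t->borrowed];
    return e->valid && (uint32_t)(ea - e->base) < e->size;
}
```

When `fault_in_borrowed` returns true, the miss is an artifact of the borrowing and the host has to resolve it without the guest noticing; otherwise the miss is presumably a genuine one to be reflected into the guest.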
Ideally, we could indicate to the guest that the size of its TLB is smaller than the hardware supports, so that it will not try to use all TLB entries. Freescale processors are already somewhat flexible in this regard, since they have cores with non-power-of-2 TLB sizes. However, for the purposes of full virtualization, the host must emulate the native TLB size.
The algorithm to select which TLB entry to borrow is critical. If, for example, the host borrows a 256MB mapping of frequently accessed memory, performance will suffer dramatically. Accordingly, the host probably should avoid borrowing large page mappings. The host may want to borrow entries on a rotating basis, rather than hardcoding a particular guest entry to replace. The host also needs to keep track of the true guest TLB state in order to emulate tlbre (which reads a TLB entry by index).
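As one illustration of a rotating policy that steers away from large pages, the sketch below scans forward from the previously borrowed slot and takes the first entry that is either invalid or smaller than a threshold. The function name and the 16MB threshold are arbitrary assumptions made for the example; nothing here is a tuned or measured policy.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_SIZE   64            /* assumed TLB size */
#define LARGE_PAGE (16u << 20)   /* hypothetical threshold: 16MB and up counts as large */

struct guest_entry {
    uint32_t size;   /* bytes mapped by the guest's entry */
    bool valid;
};

/* Pick the next slot to borrow, rotating forward from the previous victim. */
static inline unsigned pick_borrowed(const struct guest_entry g[TLB_SIZE], unsigned prev)
{
    for (unsigned i = 1; i <= TLB_SIZE; i++) {
        unsigned idx = (prev + i) % TLB_SIZE;
        if (!g[idx].valid || g[idx].size < LARGE_PAGE)
            return idx;          /* unused or small mapping: cheap to borrow */
    }
    return (prev + 1) % TLB_SIZE;  /* every entry is large: rotate anyway */
}
```

The array passed in is the recorded guest TLB state, the same state that answers tlbre, so the choice is based on what the guest wrote rather than on what happens to be in hardware.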
Guest <-> Host Transitions
When an interrupt occurs while the guest is running and control returns to the host, the hardware holds almost entirely guest state. That obviously includes guest register state, but also TLB entries. Accordingly, the exception handler code must not only save/restore guest and host register state, but often it must also save/restore the TLB before calling into the rest of the host kernel.
Possible optimization: if the host kernel is mapped by a single large page, KVM's interrupt handlers could restore just that single TLB entry before calling out into the host. The rest of the TLB would still contain AS=1 mappings, which would be ignored. However, if the host kernel decides to context switch to another host process, we will need a hook so that KVM can restore the rest of the host TLB state. This would be a win if most guest->host transitions do not block, which should be true with most instruction emulation and interrupt reflection into the guest.
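The control flow of this optimization can be modeled with a single flag and a few counters. This is only a sketch: `lazy_tlb` and the three hook names are hypothetical, the actual TLB writes are replaced by counters, and the context-switch hook is the one the paragraph above says would have to be added.

```c
#include <stdbool.h>

struct lazy_tlb {
    bool guest_resident;            /* guest (AS=1) entries are still in the hardware TLB */
    unsigned single_entry_restores; /* cheap path: only the host kernel mapping rewritten */
    unsigned full_restores;         /* expensive path: whole host TLB restored */
    unsigned guest_reloads;         /* guest entries had to be written back */
};

/* Guest -> host: restore just the one large-page host kernel mapping. */
static inline void on_guest_exit(struct lazy_tlb *s)
{
    s->single_entry_restores++;
    /* guest_resident is unchanged: the AS=1 entries stay put and are ignored */
}

/* Hook for when the host switches to another host process. */
static inline void on_host_context_switch(struct lazy_tlb *s)
{
    if (s->guest_resident) {
        s->full_restores++;
        s->guest_resident = false;
    }
}

/* Host -> guest: reload guest entries only if they were evicted meanwhile. */
static inline void on_guest_entry(struct lazy_tlb *s)
{
    if (!s->guest_resident) {
        s->guest_reloads++;
        s->guest_resident = true;
    }
}
```

In this model an exit that is handled and resumed without blocking costs one entry restore and no reload, which is the case the optimization is betting on.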