Architecture: Snapshotting

Snapshotting is a highly experimental, x86_64-only feature currently under development. It is entirely unsupported and works with only a very limited set of devices. This page summarizes how the system works and how device authors should think about it when writing new devices.

The snapshot & restore sequence

The data required for a snapshot is stored in several places, including guest memory and the devices running on the host. To take an accurate snapshot, we need a point-in-time view of all of this state. Since there is no way to fetch this state atomically, we have to freeze the guest (VCPUs) and the device backends. Similarly, on restore we must freeze in the same way to prevent partially restored state from being modified.

Snapshotting a running VM

In code, this is implemented by vm_control::do_snapshot. We always freeze the VCPUs first (vm_control::VcpuSuspendGuard). This is done so that we can flush all pending interrupts to the irqchip (LAPIC) without triggering further activity from the driver (which could in turn trigger more device activity). With the VCPUs frozen, we freeze devices (vm_control::DeviceSleepGuard). From here, it's just a matter of serializing VCPU state, guest memory, and device state.
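The ordering described above can be sketched with RAII guards. This is a minimal illustration, not the crosvm implementation: the two guard types only borrow the names of the real vm_control guards, the `Log` type and the `do_snapshot_sketch` function are hypothetical, and waking on drop is an assumption made for the sketch.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Shared log used only to observe the order of operations in this sketch.
type Log = Rc<RefCell<Vec<&'static str>>>;

/// Hypothetical stand-in for vm_control::VcpuSuspendGuard.
struct VcpuSuspendGuard {
    log: Log,
}

impl VcpuSuspendGuard {
    fn new(log: Log) -> Self {
        log.borrow_mut().push("suspend vcpus");
        VcpuSuspendGuard { log }
    }
}

impl Drop for VcpuSuspendGuard {
    // Assumption: dropping the guard resumes the VCPUs.
    fn drop(&mut self) {
        self.log.borrow_mut().push("resume vcpus");
    }
}

/// Hypothetical stand-in for vm_control::DeviceSleepGuard.
struct DeviceSleepGuard {
    log: Log,
}

impl DeviceSleepGuard {
    fn new(log: Log) -> Self {
        log.borrow_mut().push("sleep devices");
        DeviceSleepGuard { log }
    }
}

impl Drop for DeviceSleepGuard {
    // Assumption: dropping the guard wakes the devices.
    fn drop(&mut self) {
        self.log.borrow_mut().push("wake devices");
    }
}

/// Freeze VCPUs, then devices, then serialize everything.
fn do_snapshot_sketch(log: Log) {
    let _vcpu_guard = VcpuSuspendGuard::new(log.clone());
    let _device_guard = DeviceSleepGuard::new(log.clone());
    log.borrow_mut().push("serialize vcpu state");
    log.borrow_mut().push("serialize guest memory");
    log.borrow_mut().push("serialize device state");
    // Locals drop in reverse declaration order, so devices wake before
    // the VCPUs resume.
}
```

Using guards means that an early return from the snapshot path still unfreezes everything, in the reverse of the order in which it was frozen.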

A word about interrupts

Interrupts come in two primary flavors from the snapshotting perspective: legacy interrupts (e.g. IOAPIC interrupt lines), and MSIs.

Legacy interrupts

These are a little tricky because they are allocated as part of device creation, and device creation happens before we snapshot or restore. To avoid actually having to snapshot or restore the Event object wiring for these interrupts, we rely on the fact that as long as the VM is created with the right shape (e.g. devices), the interrupt Events will be wired between the device & the irqchip correctly. As part of restoring, we will set the routing table, which ensures that those events map to the right GSIs in the hypervisor.

MSIs

These are much simpler because of how MSIs are implemented in crosvm. In MsixConfig, we save the MSI routing information for every IRQ. At restore time, we just register these MSIs with the hypervisor using the exact same mechanism that would be invoked on device activation (albeit bypassing GSI allocation, since we know from the saved state exactly which GSI must be used).
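A rough sketch of that save and re-register flow follows. All names here (`MsiRoute`, `FakeHypervisor`, `MsixSketch`) are hypothetical and exist only for illustration; the real MsixConfig and hypervisor interfaces differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical saved routing entry for one MSI vector.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct MsiRoute {
    gsi: u32,
    address: u64,
    data: u32,
}

/// Hypothetical hypervisor model: a GSI allocator plus a routing table.
#[derive(Default)]
struct FakeHypervisor {
    next_gsi: u32,
    routes: BTreeMap<u32, (u64, u32)>,
}

impl FakeHypervisor {
    fn allocate_gsi(&mut self) -> u32 {
        let gsi = self.next_gsi;
        self.next_gsi += 1;
        gsi
    }

    /// The single registration path shared by activation and restore.
    fn register_msi(&mut self, route: MsiRoute) {
        self.routes.insert(route.gsi, (route.address, route.data));
    }
}

/// Hypothetical stand-in for the MSI-X state a device keeps.
struct MsixSketch {
    routes: Vec<MsiRoute>,
}

impl MsixSketch {
    /// Normal activation: allocate a GSI, then register the route.
    fn activate_vector(&mut self, hv: &mut FakeHypervisor, address: u64, data: u32) {
        let route = MsiRoute { gsi: hv.allocate_gsi(), address, data };
        hv.register_msi(route);
        self.routes.push(route);
    }

    /// Snapshot: save the routing information for every vector.
    fn snapshot(&self) -> Vec<MsiRoute> {
        self.routes.clone()
    }

    /// Restore: reuse register_msi, but skip allocation because the saved
    /// state already names the GSI each route must use.
    fn restore(hv: &mut FakeHypervisor, saved: &[MsiRoute]) -> Self {
        for route in saved {
            hv.register_msi(*route);
        }
        MsixSketch { routes: saved.to_vec() }
    }
}
```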

Flushing IRQs to the irqchip

IRQs sometimes pass through multiple host Events before reaching the hypervisor (or VCPU loop) for injection. Rather than trying to snapshot the Event state, we freeze all interrupt sources (devices) and flush all pending interrupts into the irqchip. This way, snapshotting the irqchip state is sufficient to capture all pending interrupts.
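To illustrate why flushing makes the irqchip state sufficient, here is a toy model in which interrupts hop through a chain of intermediate stages (standing in for host Events) before reaching the irqchip. The `IrqPipeline` type and its methods are hypothetical, and the model assumes that all interrupt sources are already frozen so no new interrupts enter the chain during the flush.

```rust
/// Toy model: each stage holds interrupt numbers that have been signaled
/// but not yet forwarded. Stage 0 is closest to the device.
struct IrqPipeline {
    stages: Vec<Vec<u32>>,
    irqchip_pending: Vec<u32>,
}

impl IrqPipeline {
    /// Move every in-flight interrupt exactly one hop toward the irqchip.
    /// Returns true if anything moved.
    fn flush_step(&mut self) -> bool {
        let mut moved = false;
        // Walk from the stage nearest the irqchip back toward the device so
        // that an interrupt advances only one hop per call.
        for i in (0..self.stages.len()).rev() {
            let drained = std::mem::take(&mut self.stages[i]);
            if drained.is_empty() {
                continue;
            }
            moved = true;
            if i + 1 == self.stages.len() {
                self.irqchip_pending.extend(drained);
            } else {
                self.stages[i + 1].extend(drained);
            }
        }
        moved
    }

    /// Repeat until nothing is in flight; returns the number of passes.
    /// Terminates only because the sources are assumed to be frozen.
    fn flush_all(&mut self) -> usize {
        let mut passes = 0;
        while self.flush_step() {
            passes += 1;
        }
        passes
    }
}
```

Once `flush_all` returns, every stage is empty, so a snapshot of `irqchip_pending` alone captures all outstanding interrupts in this model.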

Two-step snapshotting

Crosvm performs snapshots in two steps so that data sent by host-side processes while crosvm is suspended is not lost.

Problem definition:

  1. VMM Manager requests crosvm to suspend.
  2. Crosvm suspends, however host-side processes are still running.
  3. VMM Manager requests host-side processes to suspend.
  4. VMM Manager requests snapshot from crosvm.
  5. VMM Manager snapshots host-side processes.
  6. VMM Manager requests host-side processes and crosvm to resume (or stop).

The problem is that data may be lost in steps 4 & 5 because of the time between steps 2 & 3. After step 2, crosvm is suspended but host-side processes are still running, which means they may send data to crosvm that the device in crosvm has not yet read.

If the VM simply resumes, there is no issue: the data gets read and processing continues normally. However, if the VM is later restored from the snapshot, that data is lost because it was never saved.

The solution is two-step snapshotting. We modify step 4 so that crosvm reads any data coming from the host just before snapshotting, saves that data as part of the snapshot, and then processes it when the VM resumes.
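A minimal sketch of the idea, using a standard library channel to stand in for the host-side connection. The `DeviceSketch` and `DeviceSnapshot` types are hypothetical and do not correspond to actual crosvm types.

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical serialized device state.
#[derive(Clone, Debug, PartialEq)]
struct DeviceSnapshot {
    /// Data read from the host but not yet processed.
    pending: Vec<Vec<u8>>,
}

/// Hypothetical device that receives data from a host-side process.
struct DeviceSketch {
    from_host: Receiver<Vec<u8>>,
    pending: Vec<Vec<u8>>,
    processed: Vec<Vec<u8>>,
}

impl DeviceSketch {
    fn snapshot(&mut self) -> DeviceSnapshot {
        // Step one: read whatever the host sent while the device was
        // suspended, without blocking and without processing it.
        while let Ok(msg) = self.from_host.try_recv() {
            self.pending.push(msg);
        }
        // Step two: save the unread data as part of the device state.
        DeviceSnapshot { pending: self.pending.clone() }
    }

    fn restore(from_host: Receiver<Vec<u8>>, snapshot: DeviceSnapshot) -> Self {
        DeviceSketch {
            from_host,
            pending: snapshot.pending,
            processed: Vec::new(),
        }
    }

    /// On resume, process saved data before anything newly received.
    fn resume(&mut self) {
        self.processed.append(&mut self.pending);
    }
}
```

The same `resume` path serves both a plain resume and a restore, since in either case the buffered data is processed only once the VM is running again.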

Restoring a VM in lieu of booting

Restoring onto a running VM is not supported, and may never be. Our preferred approach is to instead create a new VM from a snapshot. This is why vm_control::do_restore can be invoked as part of the VM creation process.
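One way to picture restore as part of VM creation is shown below. This is a sketch under simplifying assumptions: `VmConfig`, `VmState`, `RunningVm`, and `create_vm` are hypothetical names, and the state is reduced to a register vector and a memory buffer.

```rust
/// Hypothetical snapshot contents, heavily simplified.
#[derive(Clone, Debug, PartialEq)]
struct VmState {
    vcpu_regs: Vec<u64>,
    memory: Vec<u8>,
}

/// Hypothetical description of the VM's shape.
struct VmConfig {
    vcpu_count: usize,
    memory_size: usize,
}

struct RunningVm {
    state: VmState,
}

/// Create a VM, optionally loading saved state before it first runs.
fn create_vm(config: &VmConfig, restore_from: Option<VmState>) -> Result<RunningVm, String> {
    // Creation happens the same way in both paths, so the VM always has
    // the shape described by the config.
    let fresh = VmState {
        vcpu_regs: vec![0; config.vcpu_count],
        memory: vec![0; config.memory_size],
    };
    let state = match restore_from {
        // Normal boot: start from the freshly created state.
        None => fresh,
        // Restore: the saved state must match the shape just created.
        Some(saved) => {
            if saved.vcpu_regs.len() != fresh.vcpu_regs.len()
                || saved.memory.len() != fresh.memory.len()
            {
                return Err("snapshot does not match VM shape".to_string());
            }
            saved
        }
    };
    Ok(RunningVm { state })
}
```

Because the only entry point that accepts a snapshot also constructs the VM, there is no code path in this sketch that overwrites the state of a VM that is already running.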

Implications for device authors

New devices SHOULD be compatible with the devices::Suspendable trait, but MAY defer actual implementation to the future. This trait's implementation defines how the device will sleep/wake, and how its state will be saved & restored as part of snapshotting.
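The shape of such a trait can be sketched as follows. This is not the actual devices::Suspendable definition: the trait name, method signatures, error type, and byte-based serialization are all assumptions chosen to keep the example dependency-free. The default method bodies show one way a device could be compatible with the trait while deferring the real implementation.

```rust
use std::convert::TryFrom;

/// Hypothetical trait modeled loosely on the idea of devices::Suspendable.
trait SuspendableSketch {
    fn sleep(&mut self) -> Result<(), String> {
        Err("sleep not implemented".to_string())
    }
    fn wake(&mut self) -> Result<(), String> {
        Err("wake not implemented".to_string())
    }
    fn snapshot(&mut self) -> Result<Vec<u8>, String> {
        Err("snapshot not implemented".to_string())
    }
    fn restore(&mut self, _data: &[u8]) -> Result<(), String> {
        Err("restore not implemented".to_string())
    }
}

/// A device with a full implementation.
struct CounterDevice {
    count: u32,
    asleep: bool,
}

impl SuspendableSketch for CounterDevice {
    fn sleep(&mut self) -> Result<(), String> {
        self.asleep = true;
        Ok(())
    }
    fn wake(&mut self) -> Result<(), String> {
        self.asleep = false;
        Ok(())
    }
    fn snapshot(&mut self) -> Result<Vec<u8>, String> {
        // Serialize the only piece of state as little-endian bytes.
        Ok(self.count.to_le_bytes().to_vec())
    }
    fn restore(&mut self, data: &[u8]) -> Result<(), String> {
        let bytes = <[u8; 4]>::try_from(data).map_err(|_| "bad snapshot length".to_string())?;
        self.count = u32::from_le_bytes(bytes);
        Ok(())
    }
}

/// A device that is compatible with the trait but defers implementation.
struct StubDevice;

impl SuspendableSketch for StubDevice {}
```

With defaults like these, a snapshot of a VM containing `StubDevice` fails with an explicit error instead of silently producing incomplete state.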

New virtio devices SHOULD implement the virtio device snapshot methods on VirtioDevice: virtio_sleep, virtio_wake, virtio_snapshot, and virtio_restore.