Discussion:
[libvirt-users] Nested KVM: L0 guest produces kernel BUG on wakeup from managed save (while a nested VM is running)
Florian Haas
2018-02-06 15:11:46 UTC
Permalink
Hi everyone,

I hope this is the correct list to discuss this issue; please feel
free to redirect me otherwise.

I have a nested virtualization setup that looks as follows:

- Host: Ubuntu 16.04, kernel 4.4.0 (an OpenStack Nova compute node)
- L0 guest: openSUSE Leap 42.3, kernel 4.4.104-39-default
- Nested guest: SLES 12, kernel 3.12.28-4-default

The nested guest is configured with "<type arch='x86_64'
machine='pc-i440fx-1.4'>hvm</type>".

This is working just beautifully, except when the L0 guest wakes up
from managed save (openstack server resume in OpenStack parlance).
Then, in the L0 guest we immediately see this:

[Tue Feb 6 07:00:37 2018] ------------[ cut here ]------------
[Tue Feb 6 07:00:37 2018] kernel BUG at ../arch/x86/kvm/x86.c:328!
[Tue Feb 6 07:00:37 2018] invalid opcode: 0000 [#1] SMP
[Tue Feb 6 07:00:37 2018] Modules linked in: fuse vhost_net vhost
macvtap macvlan xt_CHECKSUM iptable_mangle ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT
nf_reject_ipv4 xt_tcpudp tun br_netfilter bridge stp llc
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
ip_tables x_tables vboxpci(O) vboxnetadp(O) vboxnetflt(O) af_packet
iscsi_ibft iscsi_boot_sysfs vboxdrv(O) kvm_intel kvm irqbypass
crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel
hid_generic usbhid jitterentropy_rng drbg ansi_cprng ppdev parport_pc
floppy parport joydev aesni_intel processor button aes_x86_64
virtio_balloon virtio_net lrw gf128mul glue_helper pcspkr serio_raw
ablk_helper cryptd i2c_piix4 ext4 crc16 jbd2 mbcache ata_generic
[Tue Feb 6 07:00:37 2018] virtio_blk ata_piix ahci libahci cirrus(O)
drm_kms_helper(O) syscopyarea sysfillrect sysimgblt fb_sys_fops ttm(O)
drm(O) virtio_pci virtio_ring virtio uhci_hcd ehci_hcd usbcore
usb_common libata sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc
scsi_dh_alua scsi_mod autofs4
[Tue Feb 6 07:00:37 2018] CPU: 2 PID: 2041 Comm: CPU 0/KVM Tainted: G
W O 4.4.104-39-default #1
[Tue Feb 6 07:00:37 2018] Hardware name: OpenStack Foundation
OpenStack Nova, BIOS 1.10.1-1ubuntu1~cloud0 04/01/2014
[Tue Feb 6 07:00:37 2018] task: ffff880037108d80 ti: ffff88042e964000
task.ti: ffff88042e964000
[Tue Feb 6 07:00:37 2018] RIP: 0010:[<ffffffffa04f20e5>]
[<ffffffffa04f20e5>] kvm_spurious_fault+0x5/0x10 [kvm]
[Tue Feb 6 07:00:37 2018] RSP: 0018:ffff88042e967d70 EFLAGS: 00010246
[Tue Feb 6 07:00:37 2018] RAX: 0000000000000000 RBX: ffff88042c4f0040
RCX: 0000000000000000
[Tue Feb 6 07:00:37 2018] RDX: 0000000000006820 RSI: 0000000000000282
RDI: ffff88042c4f0040
[Tue Feb 6 07:00:37 2018] RBP: ffff88042c4f00d8 R08: ffff88042e964000
R09: 0000000000000002
[Tue Feb 6 07:00:37 2018] R10: 0000000000000004 R11: 0000000000000000
R12: 0000000000000001
[Tue Feb 6 07:00:37 2018] R13: 0000021d34fbb21d R14: 0000000000000001
R15: 000055d2157cf840
[Tue Feb 6 07:00:37 2018] FS: 00007f7c52b96700(0000)
GS:ffff88043fd00000(0000) knlGS:0000000000000000
[Tue Feb 6 07:00:37 2018] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Feb 6 07:00:37 2018] CR2: 00007f823b15f000 CR3: 0000000429334000
CR4: 0000000000362670
[Tue Feb 6 07:00:37 2018] DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
[Tue Feb 6 07:00:37 2018] DR3: 0000000000000000 DR6: 00000000fffe0ff0
DR7: 0000000000000400
[Tue Feb 6 07:00:37 2018] Stack:
[Tue Feb 6 07:00:37 2018] ffffffffa07939b1 ffffffffa0787875
ffffffffa0503a60 ffff88042c4f0040
[Tue Feb 6 07:00:37 2018] ffffffffa04e5ede ffff88042c4f0040
ffffffffa04e6f0f ffff880037108d80
[Tue Feb 6 07:00:37 2018] ffff88042c4f00e0 ffff88042c4f00e0
ffff88042c4f0040 ffff88042e968000
[Tue Feb 6 07:00:37 2018] Call Trace:
[Tue Feb 6 07:00:37 2018] [<ffffffffa07939b1>]
intel_pmu_set_msr+0xfc1/0x2341 [kvm_intel]
[Tue Feb 6 07:00:37 2018] DWARF2 unwinder stuck at
intel_pmu_set_msr+0xfc1/0x2341 [kvm_intel]
[Tue Feb 6 07:00:37 2018] Leftover inexact backtrace:
[Tue Feb 6 07:00:37 2018] [<ffffffffa0787875>] ?
vmx_interrupt_allowed+0x15/0x30 [kvm_intel]
[Tue Feb 6 07:00:37 2018] [<ffffffffa0503a60>] ?
kvm_arch_vcpu_runnable+0xa0/0xd0 [kvm]
[Tue Feb 6 07:00:37 2018] [<ffffffffa04e5ede>] ?
kvm_vcpu_check_block+0xe/0x60 [kvm]
[Tue Feb 6 07:00:37 2018] [<ffffffffa04e6f0f>] ?
kvm_vcpu_block+0x8f/0x310 [kvm]
[Tue Feb 6 07:00:37 2018] [<ffffffffa0503c17>] ?
kvm_arch_vcpu_ioctl_run+0x187/0x400 [kvm]
[Tue Feb 6 07:00:37 2018] [<ffffffffa04ea6d9>] ?
kvm_vcpu_ioctl+0x359/0x680 [kvm]
[Tue Feb 6 07:00:37 2018] [<ffffffff81016689>] ? __switch_to+0x1c9/0x460
[Tue Feb 6 07:00:37 2018] [<ffffffff81224f02>] ? do_vfs_ioctl+0x322/0x5d0
[Tue Feb 6 07:00:37 2018] [<ffffffff811362ef>] ?
__audit_syscall_entry+0xaf/0x100
[Tue Feb 6 07:00:37 2018] [<ffffffff8100383b>] ?
syscall_trace_enter_phase1+0x15b/0x170
[Tue Feb 6 07:00:37 2018] [<ffffffff81225224>] ? SyS_ioctl+0x74/0x80
[Tue Feb 6 07:00:37 2018] [<ffffffff81634a02>] ?
entry_SYSCALL_64_fastpath+0x16/0xae
[Tue Feb 6 07:00:37 2018] Code: d7 fe ff ff 8b 2d 04 6e 06 00 e9 c2
fe ff ff 48 89 f2 e9 65 ff ff ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00
00 00 00 0f 1f 44 00 00 <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00
00 55 89 ff 48 89
[Tue Feb 6 07:00:37 2018] RIP [<ffffffffa04f20e5>]
kvm_spurious_fault+0x5/0x10 [kvm]
[Tue Feb 6 07:00:37 2018] RSP <ffff88042e967d70>
[Tue Feb 6 07:00:37 2018] ---[ end trace e15c567f77920049 ]---

We only hit this kernel bug if we have a nested VM running. The exact
same setup, sent into managed save after shutting down the nested VM,
wakes up just fine.

Now I am aware of https://bugzilla.redhat.com/show_bug.cgi?id=1076294,
which talks about live migration — but I think the same considerations
apply.

I am also aware of
https://fedoraproject.org/wiki/How_to_enable_nested_virtualization_in_KVM,
which strongly suggests using host-passthrough or host-model. I have
tried both, to no avail. The stack trace persists. I have also tried
running a 4.15 kernel in the L0 guest, from
https://kernel.opensuse.org/packages/stable, but again, the stack
trace persists.

What does fix things, of course, is to switch the nested guest
from KVM to QEMU — but that also makes things significantly slower.
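For reference, that switch amounts to changing the domain type in the
L2 guest's libvirt definition (a sketch of the relevant fragment only;
everything else in the XML stays the same):

```xml
<!-- Nested guest accelerated by KVM (the problematic case): -->
<domain type='kvm'>
  <os>
    <type arch='x86_64' machine='pc-i440fx-1.4'>hvm</type>
  </os>
  <!-- ... rest of the definition ... -->
</domain>

<!-- Fallback: pure QEMU (TCG) emulation, no nested VMX involved, but slow: -->
<domain type='qemu'>
  <!-- same <os> element and devices as above -->
</domain>
```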

So I'm wondering: is there someone reading this who does run nested
KVM and has managed to successfully live-migrate or managed-save? If
so, would you be able to share a working host kernel / L0 guest kernel
/ nested guest kernel combination, or any other hints for tuning the
L0 guest to support managed save and live migration?

I'd be extraordinarily grateful for any suggestions. Thanks!

Cheers,
Florian
Kashyap Chamarthy
2018-02-07 15:31:08 UTC
Permalink
[Cc: KVM upstream list.]
Post by Florian Haas
Hi everyone,
I hope this is the correct list to discuss this issue; please feel
free to redirect me otherwise.
- Host: Ubuntu 16.04, kernel 4.4.0 (an OpenStack Nova compute node)
- L0 guest: openSUSE Leap 42.3, kernel 4.4.104-39-default
- Nested guest: SLES 12, kernel 3.12.28-4-default
The nested guest is configured with "<type arch='x86_64'
machine='pc-i440fx-1.4'>hvm</type>".
This is working just beautifully, except when the L0 guest wakes up
from managed save (openstack server resume in OpenStack parlance).
[...] # Snip the call trace from Florian. It is here:
https://www.redhat.com/archives/libvirt-users/2018-February/msg00014.html
Post by Florian Haas
What does fix things, of course, is to switch the nested guest
from KVM to QEMU — but that also makes things significantly slower.
So I'm wondering: is there someone reading this who does run nested
KVM and has managed to successfully live-migrate or managed-save? If
so, would you be able to share a working host kernel / L0 guest kernel
/ nested guest kernel combination, or any other hints for tuning the
L0 guest to support managed save and live migration?
Following up from our IRC discussion (on #kvm, Freenode). Re-posting my
comment here:

So I just did a test of 'managedsave' (which is just "save the state of
the running VM to a file" in libvirt parlance) of L1, _while_ L2 is
running, and I seem to reproduce your case (see the call trace
attached).

# Ensure L2 (the nested guest) is running on L1. Then, from L0, do
# the following:
[L0] $ virsh managedsave L1
[L0] $ virsh start L1 --console

Result: See the call trace attached to this bug. But L1 goes on to
start "fine", and L2 keeps running, too. However, things start to seem
weird. As in: I try to safely mount the L2 disk image read-only via
libguestfs (by setting `export LIBGUESTFS_BACKEND=direct`, which uses
direct QEMU): `guestfish --ro -a ./cirros.qcow2 -i`. It throws the call
trace again on the L1 serial console, and the `guestfish` command just
sits there forever.


- L0 (bare metal) Kernel: 4.13.13-300.fc27.x86_64+debug
- L1 (guest hypervisor) kernel: 4.11.10-300.fc26.x86_64
- L2 is a CirrOS 3.5 image

I have reproduced this at least 3 times with the above versions.

I'm using libvirt 'host-passthrough' for CPU (meaning: '-cpu host' in
QEMU parlance) for both L1 and L2.

My L0 CPU is: Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz.
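As a quick sanity check (my addition, not part of Kashyap's mail),
whether nesting is enabled and whether VMX is visible can be probed from
a shell; the guard only exists so the probe degrades gracefully on
machines where kvm_intel isn't loaded:

```shell
#!/bin/sh
# Report nested-virt status of kvm_intel and VMX visibility of the CPU.
probe_nesting() {
    nested=/sys/module/kvm_intel/parameters/nested
    if [ -r "$nested" ]; then
        printf 'kvm_intel nested: %s\n' "$(cat "$nested")"  # Y or 1 = enabled
    else
        printf 'kvm_intel not loaded (on AMD hosts, check kvm_amd instead)\n'
    fi
    if grep -qw vmx /proc/cpuinfo 2>/dev/null; then
        printf 'vmx flag: present\n'   # this kernel could act as a hypervisor
    else
        printf 'vmx flag: absent\n'
    fi
}
probe_nesting
```

Run it on L0 and again inside L1; L1 should report the vmx flag when the
configured CPU model exposes it.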

Thoughts?

---

[/me wonders if I'll be asked to reproduce this with newest upstream
kernels.]

[...]
--
/kashyap
David Hildenbrand
2018-02-07 22:26:14 UTC
Permalink
Post by Kashyap Chamarthy
[Cc: KVM upstream list.]
Post by Florian Haas
Hi everyone,
I hope this is the correct list to discuss this issue; please feel
free to redirect me otherwise.
- Host: Ubuntu 16.04, kernel 4.4.0 (an OpenStack Nova compute node)
- L0 guest: openSUSE Leap 42.3, kernel 4.4.104-39-default
- Nested guest: SLES 12, kernel 3.12.28-4-default
The nested guest is configured with "<type arch='x86_64'
machine='pc-i440fx-1.4'>hvm</type>".
This is working just beautifully, except when the L0 guest wakes up
from managed save (openstack server resume in OpenStack parlance).
https://www.redhat.com/archives/libvirt-users/2018-February/msg00014.html
Post by Florian Haas
What does fix things, of course, is to switch the nested guest
from KVM to QEMU — but that also makes things significantly slower.
So I'm wondering: is there someone reading this who does run nested
KVM and has managed to successfully live-migrate or managed-save? If
so, would you be able to share a working host kernel / L0 guest kernel
/ nested guest kernel combination, or any other hints for tuning the
L0 guest to support managed save and live migration?
Following up from our IRC discussion (on #kvm, Freenode). Re-posting my
So I just did a test of 'managedsave' (which is just "save the state of
the running VM to a file" in libvirt parlance) of L1, _while_ L2 is
running, and I seem to reproduce your case (see the call trace
attached).
# Ensure L2 (the nested guest) is running on L1. Then, from L0, do
[L0] $ virsh managedsave L1
[L0] $ virsh start L1 --console
Result: See the call trace attached to this bug. But L1 goes on to
start "fine", and L2 keeps running, too. However, things start to seem
weird. As in: I try to safely mount the L2 disk image read-only via
libguestfs (by setting `export LIBGUESTFS_BACKEND=direct`, which uses
direct QEMU): `guestfish --ro -a ./cirros.qcow2 -i`. It throws the call
trace again on the L1 serial console, and the `guestfish` command just
sits there forever.
- L0 (bare metal) Kernel: 4.13.13-300.fc27.x86_64+debug
- L1 (guest hypervisor) kernel: 4.11.10-300.fc26.x86_64
- L2 is a CirrOS 3.5 image
I can reproduce this at least 3 times, with the above versions.
I'm using libvirt 'host-passthrough' for CPU (meaning: '-cpu host' in
QEMU parlance) for both L1 and L2.
Thoughts?
Sounds like a similar problem as in
https://bugzilla.kernel.org/show_bug.cgi?id=198621

In short: there is no (live) migration support for nested VMX yet. So as
soon as your guest is using VMX itself ("nVMX"), this is not expected to
work.
--
Thanks,

David / dhildenb
Florian Haas
2018-02-08 08:19:17 UTC
Permalink
Post by David Hildenbrand
Post by Kashyap Chamarthy
[Cc: KVM upstream list.]
Post by Florian Haas
Hi everyone,
I hope this is the correct list to discuss this issue; please feel
free to redirect me otherwise.
- Host: Ubuntu 16.04, kernel 4.4.0 (an OpenStack Nova compute node)
- L0 guest: openSUSE Leap 42.3, kernel 4.4.104-39-default
- Nested guest: SLES 12, kernel 3.12.28-4-default
The nested guest is configured with "<type arch='x86_64'
machine='pc-i440fx-1.4'>hvm</type>".
This is working just beautifully, except when the L0 guest wakes up
from managed save (openstack server resume in OpenStack parlance).
https://www.redhat.com/archives/libvirt-users/2018-February/msg00014.html
Post by Florian Haas
What does fix things, of course, is to switch the nested guest
from KVM to QEMU — but that also makes things significantly slower.
So I'm wondering: is there someone reading this who does run nested
KVM and has managed to successfully live-migrate or managed-save? If
so, would you be able to share a working host kernel / L0 guest kernel
/ nested guest kernel combination, or any other hints for tuning the
L0 guest to support managed save and live migration?
Following up from our IRC discussion (on #kvm, Freenode). Re-posting my
So I just did a test of 'managedsave' (which is just "save the state of
the running VM to a file" in libvirt parlance) of L1, _while_ L2 is
running, and I seem to reproduce your case (see the call trace
attached).
# Ensure L2 (the nested guest) is running on L1. Then, from L0, do
[L0] $ virsh managedsave L1
[L0] $ virsh start L1 --console
Result: See the call trace attached to this bug. But L1 goes on to
start "fine", and L2 keeps running, too. However, things start to seem
weird. As in: I try to safely mount the L2 disk image read-only via
libguestfs (by setting `export LIBGUESTFS_BACKEND=direct`, which uses
direct QEMU): `guestfish --ro -a ./cirros.qcow2 -i`. It throws the call
trace again on the L1 serial console, and the `guestfish` command just
sits there forever.
- L0 (bare metal) Kernel: 4.13.13-300.fc27.x86_64+debug
- L1 (guest hypervisor) kernel: 4.11.10-300.fc26.x86_64
- L2 is a CirrOS 3.5 image
I can reproduce this at least 3 times, with the above versions.
I'm using libvirt 'host-passthrough' for CPU (meaning: '-cpu host' in
QEMU parlance) for both L1 and L2.
Thoughts?
Sounds like a similar problem as in
https://bugzilla.kernel.org/show_bug.cgi?id=198621
In short: there is no (live) migration support for nested VMX yet. So as
soon as your guest is using VMX itself ("nVMX"), this is not expected to
work.
Hi David, thanks for getting back to us on this.

I see your point, except the issue Kashyap and I are describing does
not occur with live migration; it occurs with savevm/loadvm (virsh
managedsave/virsh start in libvirt terms, nova suspend/resume in
OpenStack lingo). And it's not immediately self-evident that the
limitations for the former also apply to the latter. Even for the live
migration limitation, I've been unsuccessful at finding documentation
that warns users to not attempt live migration when using nesting, and
this discussion sounds like a good opportunity for me to help fix
that.

Just to give an example,
https://www.redhat.com/en/blog/inception-how-usable-are-nested-kvm-guests
from just last September talks explicitly about how "guests can be
snapshot/resumed, migrated to other hypervisors and much more" in the
opening paragraph, and then talks at length about nested guests —
without ever pointing out that those very features aren't expected to
work for them. :)

So to clarify things, could you enumerate the currently known
limitations when enabling nesting? I'd be happy to summarize those and
add them to the linux-kvm.org FAQ so others are less likely to hit
their head on this issue. In particular:

- Is https://fedoraproject.org/wiki/How_to_enable_nested_virtualization_in_KVM
still accurate in that -cpu host (libvirt "host-passthrough") is the
strongly recommended configuration for the L2 guest?

- If so, are there any recommendations for how to configure the L1
guest with regard to CPU model?

- Is live migration with nested guests _always_ expected to break on
all architectures, and if not, which are safe?

- Idem, for savevm/loadvm?

- With regard to the problem that Kashyap and I (and Dennis, the
kernel.org bugzilla reporter) are describing, is this expected to work
any better on AMD CPUs? (All reports are on Intel)

- Do you expect nested virtualization functionality to be adversely
affected by KPTI and/or other Meltdown/Spectre mitigation patches?

Kashyap, can you think of any other limitations that would benefit
from improved documentation?

Cheers,
Florian
David Hildenbrand
2018-02-08 12:07:33 UTC
Permalink
Post by Florian Haas
Post by David Hildenbrand
In short: there is no (live) migration support for nested VMX yet. So as
soon as your guest is using VMX itself ("nVMX"), this is not expected to
work.
Hi David, thanks for getting back to us on this.
Hi Florian,

(somebody please correct me if I'm wrong)
Post by Florian Haas
I see your point, except the issue Kashyap and I are describing does
not occur with live migration, it occurs with savevm/loadvm (virsh
managedsave/virsh start in libvirt terms, nova suspend/resume in
OpenStack lingo). And it's not immediately self-evident that the
limitations for the former also apply to the latter. Even for the live
migration limitation, I've been unsuccessful at finding documentation
that warns users to not attempt live migration when using nesting, and
this discussion sounds like a good opportunity for me to help fix
that.
Just to give an example,
https://www.redhat.com/en/blog/inception-how-usable-are-nested-kvm-guests
from just last September talks explicitly about how "guests can be
snapshot/resumed, migrated to other hypervisors and much more" in the
opening paragraph, and then talks at length about nested guests —
without ever pointing out that those very features aren't expected to
work for them. :)
Well, it still is a kernel parameter "nested" that is disabled by
default. So things should be expected to be shaky. :) While running
nested guests usually works fine, migrating a nested hypervisor is the
problem.

Especially see e.g.
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/nested_virt

"However, note that nested virtualization is not supported or
recommended in production user environments, and is primarily intended
for development and testing. "
Post by Florian Haas
So to clarify things, could you enumerate the currently known
limitations when enabling nesting? I'd be happy to summarize those and
add them to the linux-kvm.org FAQ so others are less likely to hit
The general problem is that migration of an L1 will not work when it is
running L2, i.e. when L1 is using VMX ("nVMX").

Migrating an L2 should work as before.

The problem is that in order for L1 to make use of VMX to run L2, we
have to run L2 in L0, simulating VMX (nested VMX, a.k.a. "nVMX"). This
requires additional state information about L1 (the "nVMX" state), which
is not properly migrated when migrating L1. Therefore, after migration
the CPU state of L1 might be corrupted, resulting in L1 crashes.

In addition, certain VMX features might be missing on the target, which
also still has to be handled via the CPU model in the future.

L0 should hopefully not crash; I hope that you are not seeing that.
Post by Florian Haas
- Is https://fedoraproject.org/wiki/How_to_enable_nested_virtualization_in_KVM
still accurate in that -cpu host (libvirt "host-passthrough") is the
strongly recommended configuration for the L2 guest?
- If so, are there any recommendations for how to configure the L1
guest with regard to CPU model?
You have to indicate the VMX feature to your L1 ("nested hypervisor"),
that is usually automatically done by using the "host-passthrough" or
"host-model" value. If you're using a custom CPU model, you have to
enable it explicitly.
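In L1's libvirt domain XML, the two usual ways to get VMX exposed look
like this (a sketch; the custom model shown is only an example):

```xml
<!-- Either pass the host CPU straight through to L1... -->
<cpu mode='host-passthrough'/>

<!-- ...or, with a custom model, require vmx explicitly: -->
<cpu mode='custom' match='exact'>
  <model fallback='forbid'>Haswell-noTSX</model>
  <feature policy='require' name='vmx'/>
</cpu>
```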
Post by Florian Haas
- Is live migration with nested guests _always_ expected to break on
all architectures, and if not, which are safe?
x86 VMX: running nested guests works, migrating nested hypervisors does
not work

x86 SVM: running nested guests works, migrating nested hypervisors does
not work (somebody correct me if I'm wrong)

s390x: running nested guests works, migrating nested hypervisors works

power: running nested guests works only via KVM-PR ("trap and emulate");
migrating nested hypervisors therefore works. But we are not using
hardware virtualization for L1->L2. (my latest status)

arm: running nested guests is in the works (my latest status), migration
is therefore also not possible.
Post by Florian Haas
- Idem, for savevm/loadvm?
savevm/loadvm is not expected to work correctly on an L1 if it is
running L2 guests. It should work on L2 however.
Post by Florian Haas
- With regard to the problem that Kashyap and I (and Dennis, the
kernel.org bugzilla reporter) are describing, is this expected to work
any better on AMD CPUs? (All reports are on Intel)
No, remember that they are also still missing migration support for the
nested SVM state.
Post by Florian Haas
- Do you expect nested virtualization functionality to be adversely
affected by KPTI and/or other Meltdown/Spectre mitigation patches?
Not an expert on this. I think it should be affected in a similar way
to ordinary guests. :)
Post by Florian Haas
Kashyap, can you think of any other limitations that would benefit
from improved documentation?
We should certainly document what I have summarized here properly in a
central place!
Post by Florian Haas
Cheers,
Florian
--
Thanks,

David / dhildenb
Florian Haas
2018-02-08 13:29:46 UTC
Permalink
Hi David,

thanks for the added input! I'm taking the liberty to snip a few
paragraphs to trim this email down a bit.
Post by David Hildenbrand
Post by Florian Haas
Just to give an example,
https://www.redhat.com/en/blog/inception-how-usable-are-nested-kvm-guests
from just last September talks explicitly about how "guests can be
snapshot/resumed, migrated to other hypervisors and much more" in the
opening paragraph, and then talks at length about nested guests —
without ever pointing out that those very features aren't expected to
work for them. :)
Well, it still is a kernel parameter "nested" that is disabled by
default. So things should be expected to be shaky. :) While running
nested guests usually works fine, migrating a nested hypervisor is the
problem.
Especially see e.g.
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/nested_virt
"However, note that nested virtualization is not supported or
recommended in production user environments, and is primarily intended
for development and testing. "
Sure, I do understand that Red Hat (or any other vendor) is taking no
support responsibility for this. At this point I'd just like to
contribute to a better understanding of what's expected to definitely
_not_ work, so that people don't bloody their noses on that. :)
Post by David Hildenbrand
Post by Florian Haas
So to clarify things, could you enumerate the currently known
limitations when enabling nesting? I'd be happy to summarize those and
add them to the linux-kvm.org FAQ so others are less likely to hit
The general problem is that migration of an L1 will not work when it is
running L2, i.e. when L1 is using VMX ("nVMX").
Migrating an L2 should work as before.
The problem is that in order for L1 to make use of VMX to run L2, we
have to run L2 in L0, simulating VMX (nested VMX, a.k.a. "nVMX"). This
requires additional state information about L1 (the "nVMX" state), which
is not properly migrated when migrating L1. Therefore, after migration
the CPU state of L1 might be corrupted, resulting in L1 crashes.
In addition, certain VMX features might be missing on the target, which
also still has to be handled via the CPU model in the future.
Thanks a bunch for the added detail. Now I got a primer today from
Kashyap on IRC on how savevm/loadvm is very similar to migration, but
I'm still struggling to wrap my head around it. What you say makes
perfect sense to me in that _migration_ might blow up in subtle ways,
but can you try to explain to me why the same considerations would
apply with savevm/loadvm?
Post by David Hildenbrand
L0 should hopefully not crash; I hope that you are not seeing that.
No I am not; we're good there. :)
Post by David Hildenbrand
Post by Florian Haas
- Is https://fedoraproject.org/wiki/How_to_enable_nested_virtualization_in_KVM
still accurate in that -cpu host (libvirt "host-passthrough") is the
strongly recommended configuration for the L2 guest?
- If so, are there any recommendations for how to configure the L1
guest with regard to CPU model?
You have to indicate the VMX feature to your L1 ("nested hypervisor"),
that is usually automatically done by using the "host-passthrough" or
"host-model" value. If you're using a custom CPU model, you have to
enable it explicitly.
Roger. Without that we can't do nesting at all.
Post by David Hildenbrand
Post by Florian Haas
- Is live migration with nested guests _always_ expected to break on
all architectures, and if not, which are safe?
x86 VMX: running nested guests works, migrating nested hypervisors does
not work
x86 SVM: running nested guests works, migrating nested hypervisor does
not work (somebody correct me if I'm wrong)
s390x: running nested guests works, migrating nested hypervisors works
power: running nested guests works only via KVM-PR ("trap and emulate").
migrating nested hypervisors therefore works. But we are not using
hardware virtualization for L1->L2. (my latest status)
arm: running nested guests is in the works (my latest status), migration
is therefore also not possible.
Great summary, thanks!
Post by David Hildenbrand
Post by Florian Haas
- Idem, for savevm/loadvm?
savevm/loadvm is not expected to work correctly on an L1 if it is
running L2 guests. It should work on L2 however.
Again, I'm somewhat struggling to understand this vs. live migration —
but it's entirely possible that I'm sorely lacking in my knowledge of
kernel and CPU internals.
Post by David Hildenbrand
Post by Florian Haas
- With regard to the problem that Kashyap and I (and Dennis, the
kernel.org bugzilla reporter) are describing, is this expected to work
any better on AMD CPUs? (All reports are on Intel)
No, remember that they are also still missing migration support for the
nested SVM state.
Understood, thanks.
Post by David Hildenbrand
Post by Florian Haas
- Do you expect nested virtualization functionality to be adversely
affected by KPTI and/or other Meltdown/Spectre mitigation patches?
Not an expert on this. I think it should be affected in a similar way as
ordinary guests :)
Fair enough. :)
Post by David Hildenbrand
Post by Florian Haas
Kashyap, can you think of any other limitations that would benefit
from improved documentation?
We should certainly document what I have summarized here properly in a
central place!
I tried getting registered on the linux-kvm.org wiki to do exactly
that, and ran into an SMTP/DNS configuration issue with the
verification email. Kashyap said he was going to poke the site admin
about that.

Now, here's a bit more information on my continued testing. As I
mentioned on IRC, one of the things that struck me as odd was that if
I ran into the issue previously described, the L1 guest would enter a
reboot loop if configured with kernel.panic_on_oops=1. In other words,
I would savevm the L1 guest (with a running L2), then loadvm it, and
then the L1 would stack-trace, reboot, and then keep doing that
indefinitely. I found that weird because on the second reboot, I would
expect the system to come up cleanly.
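For context, that reboot-loop behaviour follows from sysctls like the
following (a sketch of the standard knobs; the file name is
hypothetical):

```
# /etc/sysctl.d/90-panic.conf
# Promote any kernel oops to a full panic:
kernel.panic_on_oops = 1
# Reboot automatically 5 seconds after a panic:
kernel.panic = 5
```

With both set, an oops after loadvm panics L1 and triggers a reboot; if
the oops recurs, the cycle repeats.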

I've now changed my L2 guest's CPU configuration so that libvirt (in
L1) starts the L2 guest with the following settings:

<cpu>
  <model fallback='forbid'>Haswell-noTSX</model>
  <vendor>Intel</vendor>
  <feature policy='disable' name='vme'/>
  <feature policy='disable' name='ss'/>
  <feature policy='disable' name='f16c'/>
  <feature policy='disable' name='rdrand'/>
  <feature policy='disable' name='hypervisor'/>
  <feature policy='disable' name='arat'/>
  <feature policy='disable' name='tsc_adjust'/>
  <feature policy='disable' name='xsaveopt'/>
  <feature policy='disable' name='abm'/>
  <feature policy='disable' name='aes'/>
  <feature policy='disable' name='invpcid'/>
</cpu>

Basically, I am disabling every single feature that my L1's "virsh
capabilities" reports. Now this does not make my L1 come up happily
from loadvm. But it does seem to initiate a clean reboot after loadvm,
and after that clean reboot it lives happily.
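For the record, a block like the one above can be generated mechanically
from a feature list (entirely my own sketch; `emit_cpu_block` is a
hypothetical helper, and the feature names are the ones my L1 reported):

```shell
#!/bin/sh
# Emit a libvirt <cpu> element that disables each named feature.
# In practice the feature names would come from `virsh capabilities`
# inside L1.
emit_cpu_block() {
    model=$1
    features=$2
    printf '<cpu>\n'
    printf "  <model fallback='forbid'>%s</model>\n" "$model"
    for f in $features; do   # word-splitting on the list is intended
        printf "  <feature policy='disable' name='%s'/>\n" "$f"
    done
    printf '</cpu>\n'
}

emit_cpu_block "Haswell-noTSX" \
    "vme ss f16c rdrand hypervisor arat tsc_adjust xsaveopt abm aes invpcid"
```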

If this is as good as it gets (for now), then I can totally live with
that. It certainly beats running the L2 guest with Qemu (without KVM
acceleration). But I would still love to understand the issue a little
bit better.

Cheers,
Florian
David Hildenbrand
2018-02-08 13:47:26 UTC
Permalink
Post by Florian Haas
Sure, I do understand that Red Hat (or any other vendor) is taking no
support responsibility for this. At this point I'd just like to
contribute to a better understanding of what's expected to definitely
_not_ work, so that people don't bloody their noses on that. :)
Indeed. Nesting is nice to enable, as it works in 99% of all cases. It
just doesn't work when trying to migrate a nested hypervisor (on x86).

That's what most people don't realize, as it works "just fine" for 99%
of all use cases.

[...]
Post by Florian Haas
Post by David Hildenbrand
savevm/loadvm is not expected to work correctly on an L1 if it is
running L2 guests. It should work on L2 however.
Again, I'm somewhat struggling to understand this vs. live migration —
but it's entirely possible that I'm sorely lacking in my knowledge of
kernel and CPU internals.
(savevm/loadvm is also called "migration to file")

When we migrate to a file, it really is the same migration stream. You
"dump" the VM state into a file, instead of sending it over to another
(running) target.

Once you load your VM state from that file, it is a completely fresh
VM/KVM environment. So you have to restore all the state. Now, as nVMX
state is not contained in the migration stream, you cannot restore that
state. The L1 state is therefore "damaged" or incomplete.

[...]
Post by Florian Haas
Post by David Hildenbrand
Post by Florian Haas
Kashyap, can you think of any other limitations that would benefit
from improved documentation?
We should certainly document what I have summaries here properly at a
central palce!
I tried getting registered on the linux-kvm.org wiki to do exactly
that, and ran into an SMTP/DNS configuration issue with the
verification email. Kashyap said he was going to poke the site admin
about that.
Now, here's a bit more information on my continued testing. As I
mentioned on IRC, one of the things that struck me as odd was that if
I ran into the issue previously described, the L1 guest would enter a
reboot loop if configured with kernel.panic_on_oops=1. In other words,
I would savevm the L1 guest (with a running L2), then loadvm it, and
then the L1 would stack-trace, reboot, and then keep doing that
indefinitely. I found that weird because on the second reboot, I would
expect the system to come up cleanly.
I guess the L1 state (in the kernel) is broken so badly that even a
reset cannot fix it.
Post by Florian Haas
I've now changed my L2 guest's CPU configuration so that libvirt (in
<cpu>
  <model fallback='forbid'>Haswell-noTSX</model>
  <vendor>Intel</vendor>
  <feature policy='disable' name='vme'/>
  <feature policy='disable' name='ss'/>
  <feature policy='disable' name='f16c'/>
  <feature policy='disable' name='rdrand'/>
  <feature policy='disable' name='hypervisor'/>
  <feature policy='disable' name='arat'/>
  <feature policy='disable' name='tsc_adjust'/>
  <feature policy='disable' name='xsaveopt'/>
  <feature policy='disable' name='abm'/>
  <feature policy='disable' name='aes'/>
  <feature policy='disable' name='invpcid'/>
</cpu>
Maybe one of these features is the root cause of the "messed up" state
in KVM. So disabling it also makes the L1 state "less broken".
Post by Florian Haas
Basically, I am disabling every single feature that my L1's "virsh
capabilities" reports. Now this does not make my L1 come up happily
from loadvm. But it does seem to initiate a clean reboot after loadvm,
and after that clean reboot it lives happily.
If this is as good as it gets (for now), then I can totally live with
that. It certainly beats running the L2 guest with Qemu (without KVM
acceleration). But I would still love to understand the issue a little
bit better.
I mean the real solution to the problem is of course restoring the L1
state correctly (migrating nVMX state, what people are working on right
now). So what you are seeing is a bad "side effect" of that.

For now, nested=true should never be used along with savevm/loadvm/live
migration.
Post by Florian Haas
Cheers,
Florian
--
Thanks,

David / dhildenb
Florian Haas
2018-02-08 13:57:33 UTC
Permalink
Post by David Hildenbrand
Post by Florian Haas
Again, I'm somewhat struggling to understand this vs. live migration —
but it's entirely possible that I'm sorely lacking in my knowledge of
kernel and CPU internals.
(savevm/loadvm is also called "migration to file")
When we migrate to a file, it really is the same migration stream. You
"dump" the VM state into a file, instead of sending it over to another
(running) target.
Once you load your VM state from that file, it is a completely fresh
VM/KVM environment. So you have to restore all the state. Now, as nVMX
state is not contained in the migration stream, you cannot restore that
state. The L1 state is therefore "damaged" or incomplete.
*lightbulb* Thanks a lot, that's a perfectly logical explanation. :)
Post by David Hildenbrand
Post by Florian Haas
Now, here's a bit more information on my continued testing. As I
mentioned on IRC, one of the things that struck me as odd was that if
I ran into the issue previously described, the L1 guest would enter a
reboot loop if configured with kernel.panic_on_oops=1. In other words,
I would savevm the L1 guest (with a running L2), then loadvm it, and
then the L1 would stack-trace, reboot, and then keep doing that
indefinitely. I found that weird because on the second reboot, I would
expect the system to come up cleanly.
Guess the L1 state (in the kernel) is broken so badly that even a
reset cannot fix it.
... which would also explain that in contrast to that, a virsh
destroy/virsh start cycle does fix things.
Post by David Hildenbrand
Post by Florian Haas
I've now changed my L2 guest's CPU configuration so that libvirt (in
<cpu>
<model fallback='forbid'>Haswell-noTSX</model>
<vendor>Intel</vendor>
<feature policy='disable' name='vme'/>
<feature policy='disable' name='ss'/>
<feature policy='disable' name='f16c'/>
<feature policy='disable' name='rdrand'/>
<feature policy='disable' name='hypervisor'/>
<feature policy='disable' name='arat'/>
<feature policy='disable' name='tsc_adjust'/>
<feature policy='disable' name='xsaveopt'/>
<feature policy='disable' name='abm'/>
<feature policy='disable' name='aes'/>
<feature policy='disable' name='invpcid'/>
</cpu>
Maybe one of these features is the root cause of the "messed up" state
in KVM. So disabling it also makes the L1 state "less broken".
Would you try a guess as to which of the above features is a likely culprit?
Post by David Hildenbrand
Post by Florian Haas
Basically, I am disabling every single feature that my L1's "virsh
capabilities" reports. Now this does not make my L1 come up happily
from loadvm. But it does seem to initiate a clean reboot after loadvm,
and after that clean reboot it lives happily.
If this is as good as it gets (for now), then I can totally live with
that. It certainly beats running the L2 guest with Qemu (without KVM
acceleration). But I would still love to understand the issue a little
bit better.
I mean the real solution to the problem is of course restoring the L1
state correctly (migrating nVMX state, what people are working on right
now). So what you are seeing is a bad "side effect" of that.
For now, nested=true should never be used along with savevm/loadvm/live
migration.
Yes, I gathered as much. :) Thanks again!

Cheers,
Florian
David Hildenbrand
2018-02-08 14:55:51 UTC
Permalink
Post by Florian Haas
Post by David Hildenbrand
Post by Florian Haas
I've now changed my L2 guest's CPU configuration so that libvirt (in
<cpu>
<model fallback='forbid'>Haswell-noTSX</model>
<vendor>Intel</vendor>
<feature policy='disable' name='vme'/>
<feature policy='disable' name='ss'/>
<feature policy='disable' name='f16c'/>
<feature policy='disable' name='rdrand'/>
<feature policy='disable' name='hypervisor'/>
<feature policy='disable' name='arat'/>
<feature policy='disable' name='tsc_adjust'/>
<feature policy='disable' name='xsaveopt'/>
<feature policy='disable' name='abm'/>
<feature policy='disable' name='aes'/>
<feature policy='disable' name='invpcid'/>
</cpu>
Maybe one of these features is the root cause of the "messed up" state
in KVM. So disabling it also makes the L1 state "less broken".
Would you try a guess as to which of the above features is a likely culprit?
Hmm, actually no idea, but you can bisect :)

(but watch out, it could also just be "coincidence". Especially if you
migrate while all VCPUs of L1 are currently not executing L2, chances
might be better for L1 to survive a migration - L2 will still fail hard,
and L1 certainly will too, when it tries to run L2 again)
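As a sketch of the bisection suggested here (assuming exactly one feature is responsible; `survives_loadvm` is a hypothetical stand-in for one manual savevm/loadvm test run of the L1 guest, not a real API):

```python
# Bisect the list of disabled CPU features to find a single culprit.
# Assumes exactly one feature is responsible; survives_loadvm() stands
# in for one manual savevm/loadvm test run of the L1 guest.

FEATURES = ["vme", "ss", "f16c", "rdrand", "hypervisor", "arat",
            "tsc_adjust", "xsaveopt", "abm", "aes", "invpcid"]

def bisect_culprit(features, survives_loadvm):
    """Return the one feature that must stay disabled for L1 to survive."""
    candidates = list(features)
    while len(candidates) > 1:
        half = candidates[:len(candidates) // 2]
        # Disable only `half`; if L1 now survives loadvm, the culprit
        # is in `half`, otherwise it is among the remaining candidates.
        if survives_loadvm(disabled=half):
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0]
```

Each iteration halves the candidate set, so the eleven features above need about four test runs instead of eleven.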
--
Thanks,

David / dhildenb
Daniel P. Berrangé
2018-02-08 14:59:33 UTC
Permalink
Post by David Hildenbrand
Post by Florian Haas
Sure, I do understand that Red Hat (or any other vendor) is taking no
support responsibility for this. At this point I'd just like to
contribute to a better understanding of what's expected to definitely
_not_ work, so that people don't bloody their noses on that. :)
Indeed. Nesting is nice to enable, as it works in 99% of all cases. It
just doesn't work when trying to migrate a nested hypervisor (on x86).
Hmm, if migration of the L1 is going to cause things to crash and
burn, then ideally libvirt on L0 would block the migration from being
done.

Naively we could do that if the guest has vmx or svm features in its
CPU, except that's probably way too conservative as many guests with
those features won't actually do any nested VMs. It would also be
desirable to still be able to migrate the L1, if no L2s are running
currently.

Is there any way QEMU can expose whether there's any L2s activated
to libvirt, so we can prevent migration in that case ? Or should
QEMU itself refuse to start migration perhaps ?
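A minimal sketch of the naive libvirt-level check being discussed (`exposes_nested_virt` is a hypothetical helper, not an actual libvirt API; it inspects a domain XML fragment for an exposed vmx/svm feature):

```python
# Naive check: flag a guest whose CPU definition exposes vmx/svm.
# As noted in the thread this is too conservative (vmx being present
# does not mean any L2 is running), and it misses host-passthrough,
# where no explicit <feature> elements appear in the XML at all.
import xml.etree.ElementTree as ET

def exposes_nested_virt(domain_xml: str) -> bool:
    root = ET.fromstring(domain_xml)
    for feature in root.iter("feature"):
        if (feature.get("name") in ("vmx", "svm")
                and feature.get("policy") != "disable"):
            return True
    return False
```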


Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
David Hildenbrand
2018-02-08 15:11:03 UTC
Permalink
Post by Daniel P. Berrangé
Post by David Hildenbrand
Post by Florian Haas
Sure, I do understand that Red Hat (or any other vendor) is taking no
support responsibility for this. At this point I'd just like to
contribute to a better understanding of what's expected to definitely
_not_ work, so that people don't bloody their noses on that. :)
Indeed. Nesting is nice to enable, as it works in 99% of all cases. It
just doesn't work when trying to migrate a nested hypervisor (on x86).
Hmm, if migration of the L1 is going to cause things to crash and
burn, then ideally libvirt on L0 would block the migration from being
done.
Yes, in an ideal world. Usually we assume that people who turn on
experimental features ("nested=true") are aware of what the implications
are. The main problem is that the implications are not really documented :)

Eventually, with a new KVM _and_ a new QEMU, it will be supported.
Post by Daniel P. Berrangé
Naively we could do that if the guest has vmx or svm features in its
CPU, except that's probably way too conservative as many guests with
those features won't actually do any nested VMs. It would also be
desirable to still be able to migrate the L1, if no L2s are running
currently.
No, using CPU feature flags for that purpose on the libvirt level is no
good, and especially once we support migration we would have to find
another interface to say "but it is now working".

QEMU could try to warn the user if VMX is enabled in the CPU model, but
as you said, that might also hold true for guests that don't use nVMX.

On the other hand, VMX will only pop up as a valid feature if
nested=true is set. So the amount of affected users is minimal.

So we could e.g. abort migration on the QEMU level if VMX is specified
right now. Once we have the migration support in place, we can allow it
again.
Post by Daniel P. Berrangé
Is there any way QEMU can expose whether there's any L2s activated
to libvirt, so we can prevent migration in that case ? Or should
QEMU itself refuse to start migration perhaps ?
Not without another kernel interface.

But I am no expert on that matter. Maybe there would be an easy way to
block that I just don't see right now.
Post by Daniel P. Berrangé
Regards,
Daniel
--
Thanks,

David / dhildenb
Kashyap Chamarthy
2018-02-08 14:45:17 UTC
Permalink
On Thu, Feb 08, 2018 at 01:07:33PM +0100, David Hildenbrand wrote:

[...]
Post by David Hildenbrand
Post by Florian Haas
So to clarify things, could you enumerate the currently known
limitations when enabling nesting? I'd be happy to summarize those and
add them to the linux-kvm.org FAQ so others are less likely to hit
[...] # Snip description of what works in context of migration
Post by David Hildenbrand
Post by Florian Haas
- Is https://fedoraproject.org/wiki/How_to_enable_nested_virtualization_in_KVM
still accurate in that -cpu host (libvirt "host-passthrough") is the
strongly recommended configuration for the L2 guest?
That wiki is a bit outdated. And it is not accurate — if we can just
expose the Intel 'vmx' (or AMD 'svm') CPU feature flag to the L2 guest,
that should be sufficient. No need for a full passthrough.

The above document should definitely be modified to add more verbiage
comparing 'host-passthrough' vs. 'host-model' vs. custom CPU.
Post by David Hildenbrand
Post by Florian Haas
- If so, are there any recommendations for how to configure the L1
guest with regard to CPU model?
You have to indicate the VMX feature to your L1 ("nested hypervisor"),
that is usually automatically done by using the "host-passthrough" or
"host-model" value. If you're using a custom CPU model, you have to
enable it explicitly.
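For the custom-model case, a minimal sketch of such an L1 CPU definition (the Haswell-noTSX model is just an illustrative choice; the key part is requiring 'vmx'):

```xml
<cpu mode='custom' match='exact'>
  <model fallback='forbid'>Haswell-noTSX</model>
  <!-- expose hardware virtualization to the L1 guest -->
  <feature policy='require' name='vmx'/>
</cpu>
```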
Post by Florian Haas
- Is live migration with nested guests _always_ expected to break on
all architectures, and if not, which are safe?
x86 VMX: running nested guests works, migrating nested hypervisors does
not work
x86 SVM: running nested guests works, migrating nested hypervisor does
not work (somebody correct me if I'm wrong)
s390x: running nested guests works, migrating nested hypervisors works
power: running nested guests works only via KVM-PR ("trap and emulate").
migrating nested hypervisors therefore works. But we are not using
hardware virtualization for L1->L2. (my latest status)
arm: running nested guests is in the works (my latest status), migration
is therefore also not possible.
That's a great summary.
Post by David Hildenbrand
Post by Florian Haas
- Idem, for savevm/loadvm?
savevm/loadvm is not expected to work correctly on an L1 if it is
running L2 guests. It should work on L2 however.
Yes, that works as intended.
Post by David Hildenbrand
Post by Florian Haas
- With regard to the problem that Kashyap and I (and Dennis, the
kernel.org bugzilla reporter) are describing, is this expected to work
any better on AMD CPUs? (All reports are on Intel)
No, remember that they are also still missing migration support of the
nested SVM state.
Right. I partly mixed up migration of L1-running-L2 (which doesn't fly
for reasons David already explained) vs. migrating L2 (which works).
Post by David Hildenbrand
Post by Florian Haas
- Do you expect nested virtualization functionality to be adversely
affected by KPTI and/or other Meltdown/Spectre mitigation patches?
Not an expert on this. I think it should be affected in a similar way as
ordinary guests :)
Post by Florian Haas
Kashyap, can you think of any other limitations that would benefit
from improved documentation?
We should certainly document what I have summarized here properly at a
central place!
Yeah, agreed. Also, when documenting things in the context of nesting,
it'd be useful to explicitly spell out what works or doesn't work at
each level — e.g. L2 can be migrated to a destination L1 just fine;
migrating an L1-running-L2 to a destination L0 will be in dodgy waters
for reasons X, etc.

[...]
--
/kashyap
Florian Haas
2018-02-08 17:44:43 UTC
Permalink
Post by David Hildenbrand
We should certainly document what I have summarized here properly at a
central place!
Please review the three edits I've submitted to the wiki:
https://www.linux-kvm.org/page/Special:Contributions/Fghaas

Feel free to ruthlessly edit/roll back anything that is inaccurate. Thanks!

Cheers,
Florian
Kashyap Chamarthy
2018-02-09 10:48:30 UTC
Permalink
Post by Florian Haas
Post by David Hildenbrand
We should certainly document what I have summarized here properly at a
central place!
https://www.linux-kvm.org/page/Special:Contributions/Fghaas
Feel free to ruthlessly edit/roll back anything that is inaccurate. Thanks!
I've made some minor edits to clarify a bunch of bits, and added a link
to the kernel doc about Intel nVMX. (Hope that looks fine.)

You wrote: "L2...which does no further virtualization". Not quite true
— "under right circumstances" (read: sufficiently huge machine with tons
of RAM), L2 _can_ in turn L3. :-)

Last time I checked (this morning), Rich W.M. Jones had 4 levels of
nesting tested with the 'supernested' program[1] he wrote. (Related
aside: This program is packaged as part of the 2016 QEMU Advent
Calendar[2] -- if you want to play around on a powerful test machine
with tons of free memory.)

[1] http://git.annexia.org/?p=supernested.git;a=blob;f=README
[2] http://www.qemu-advent-calendar.org/2016/#day-13
--
/kashyap
Florian Haas
2018-02-09 11:02:25 UTC
Permalink
Post by Kashyap Chamarthy
Post by Florian Haas
Post by David Hildenbrand
We should certainly document what I have summarized here properly at a
central place!
https://www.linux-kvm.org/page/Special:Contributions/Fghaas
Feel free to ruthlessly edit/roll back anything that is inaccurate. Thanks!
I've made some minor edits to clarify a bunch of bits, and added a link
to the kernel doc about Intel nVMX. (Hope that looks fine.)
I'm sure it does, but just so you know, I currently don't see any
edits from you on the Nested Guests page. Are you sure you
saved/published your changes?
Post by Kashyap Chamarthy
You wrote: "L2...which does no further virtualization". Not quite true
— "under right circumstances" (read: sufficiently huge machine with tons
of RAM), L2 _can_ in turn L3. :-)
Insert "normally" between "which" and "does", then. :)
Post by Kashyap Chamarthy
Last time I checked (this morning), Rich W.M. Jones had 4 levels of
nesting tested with the 'supernested' program[1] he wrote. (Related
aside: This program is packaged as part of the 2016 QEMU Advent
Calendar[2] -- if you want to play around on a powerful test machine
with tons of free memory.)
[1] http://git.annexia.org/?p=supernested.git;a=blob;f=README
[2] http://www.qemu-advent-calendar.org/2016/#day-13
Interesting, thanks for the pointer!

Cheers,
Florian
Kashyap Chamarthy
2018-02-12 09:27:16 UTC
Permalink
[...]
Post by Florian Haas
Post by Kashyap Chamarthy
I've made some minor edits to clarify a bunch of bits, and added a link
to the kernel doc about Intel nVMX. (Hope that looks fine.)
I'm sure it does, but just so you know, I currently don't see any
edits from you on the Nested Guests page. Are you sure you
saved/published your changes?
Thanks for catching that. _Now_ it's updated.

https://www.linux-kvm.org/page/Nested_Guests

(I also didn't have permissions to add external links; had to get that
sorted out with the admin.)
Post by Florian Haas
Post by Kashyap Chamarthy
You wrote: "L2...which does no further virtualization". Not quite true
— "under right circumstances" (read: sufficiently huge machine with tons
of RAM), L2 _can_ in turn L3. :-)
Insert "normally" between "which" and "does", then. :)
:-)
Post by Florian Haas
Post by Kashyap Chamarthy
Last time I checked (this morning), Rich W.M. Jones had 4 levels of
nesting tested with the 'supernested' program[1] he wrote. (Related
aside: This program is packaged as part of the 2016 QEMU Advent
Calendar[2] -- if you want to play around on a powerful test machine
with tons of free memory.)
[1] http://git.annexia.org/?p=supernested.git;a=blob;f=README
[2] http://www.qemu-advent-calendar.org/2016/#day-13
Interesting, thanks for the pointer!
Cheers,
Florian
--
/kashyap
Florian Haas
2018-02-12 14:07:50 UTC
Permalink
Post by Kashyap Chamarthy
[...]
Post by Florian Haas
Post by Kashyap Chamarthy
I've made some minor edits to clarify a bunch of bits, and added a link
to the kernel doc about Intel nVMX. (Hope that looks fine.)
I'm sure it does, but just so you know, I currently don't see any
edits from you on the Nested Guests page. Are you sure you
saved/published your changes?
Thanks for catching that. _Now_ it's updated.
https://www.linux-kvm.org/page/Nested_Guests
Got it. Thanks for those additions!
Post by Kashyap Chamarthy
(I also didn't have permissions to add external links; had to get that
sorted out with the admin.)
Right, I saw that on my first edit attempt too.

I took the liberty of back-referencing this wiki page from those two
bugzilla entries too (the kernel.org one and the Red Hat one). Since,
as I must confess, I don't follow KVM development on a day-to-day
basis, I'm hopeful that at least one of those bugs will get updated,
triggering a notification, so that I can update that page once
migration in combination with nVMX does work.

I've also added a link to
https://bugzilla.kernel.org/show_bug.cgi?id=53851 to the wiki — 5
years old just this week, as it happens. :)

Thanks for everyone's help explaining this issue to me!

Cheers,
Florian

Kashyap Chamarthy
2018-02-08 10:46:24 UTC
Permalink
[...]
Post by David Hildenbrand
Sounds like a similar problem as in
https://bugzilla.kernel.org/show_bug.cgi?id=198621
In short: there is no (live) migration support for nested VMX yet. So as
soon as your guest is using VMX itself ("nVMX"), this is not expected to
work.
Actually, live migration with nVMX _does_ work insofar as you have
_identical_ CPUs on both source and destination — i.e. use the QEMU
'-cpu host' for the L1 guests. At least that's been the case in my
experience. FWIW, I frequently use that setup in my test environments.

Just to be quadruple sure, I did the test: Migrate an L2 guest (with
non-shared storage), and it worked just fine. (No 'oops'es, no stack
traces, no "kernel BUG" in `dmesg` or serial consoles on L1s. And I can
login to the L2 guest on the destination L1 just fine.)

With password-less SSH between source and destination and a bit of
libvirt config set up, I ran the migrate command as follows:

$ virsh migrate --verbose --copy-storage-all \
--live cvm1 qemu+tcp://***@f26-vm2/system
Migration: [100 %]
$ echo $?
0

Full details:
https://kashyapc.fedorapeople.org/virt/Migrate-a-nested-guest-08Feb2018.txt

(At the end of the document above, I also posted the libvirt config and
the version details across L0, L1 and L2. So this is a fully repeatable
test.)
--
/kashyap
Kashyap Chamarthy
2018-02-08 11:34:24 UTC
Permalink
Post by Kashyap Chamarthy
[...]
Post by David Hildenbrand
Sounds like a similar problem as in
https://bugzilla.kernel.org/show_bug.cgi?id=198621
In short: there is no (live) migration support for nested VMX yet. So as
soon as your guest is using VMX itself ("nVMX"), this is not expected to
work.
Actually, live migration with nVMX _does_ work insofar as you have
_identical_ CPUs on both source and destination — i.e. use the QEMU
'-cpu host' for the L1 guests. At least that's been the case in my
experience. FWIW, I frequently use that setup in my test environments.
Correcting my erroneous statement above: For live migration to work in a
nested KVM setup, it is _not_ mandatory to use "-cpu host".

I just did another test. Here I used libvirt's 'host-model' for both
source and destination L1 guests, _and_ for L2 guest. Migrated the L2
to destination L1, worked great.

In my setup, both my L1 guests received the following CPU configuration
(on the QEMU command line):

[...]
-cpu Haswell-noTSX,vme=on,ss=on,vmx=on,f16c=on,rdrand=on,\
hypervisor=on,arat=on,tsc_adjust=on,xsaveopt=on,pdpe1gb=on,abm=on,aes=off
[...]

And the L2 guest received this:

[...]
-cpu Haswell-noTSX,vme=on,ss=on,f16c=on,rdrand=on,hypervisor=on,\
arat=on,tsc_adjust=on,xsaveopt=on,pdpe1gb=on,abm=on,aes=off,invpcid=off
[...]
--
/kashyap
Daniel P. Berrangé
2018-02-08 11:40:48 UTC
Permalink
Post by Kashyap Chamarthy
Post by Kashyap Chamarthy
[...]
Post by David Hildenbrand
Sounds like a similar problem as in
https://bugzilla.kernel.org/show_bug.cgi?id=198621
In short: there is no (live) migration support for nested VMX yet. So as
soon as your guest is using VMX itself ("nVMX"), this is not expected to
work.
Actually, live migration with nVMX _does_ work insofar as you have
_identical_ CPUs on both source and destination — i.e. use the QEMU
'-cpu host' for the L1 guests. At least that's been the case in my
experience. FWIW, I frequently use that setup in my test environments.
Correcting my erroneous statement above: For live migration to work in a
nested KVM setup, it is _not_ mandatory to use "-cpu host".
Yes, assuming the L1 guests both get given the same CPU model, then you
can use any CPU model at all for the L2 guests and still be migrate safe,
since your L1 guests provide homogeneous hardware to host L2, regardless
of whether the L0 host is homogeneous.


Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
David Hildenbrand
2018-02-08 11:48:46 UTC
Permalink
Post by Kashyap Chamarthy
[...]
Post by David Hildenbrand
Sounds like a similar problem as in
https://bugzilla.kernel.org/show_bug.cgi?id=198621
In short: there is no (live) migration support for nested VMX yet. So as
soon as your guest is using VMX itself ("nVMX"), this is not expected to
work.
Actually, live migration with nVMX _does_ work insofar as you have
_identical_ CPUs on both source and destination — i.e. use the QEMU
'-cpu host' for the L1 guests. At least that's been the case in my
experience. FWIW, I frequently use that setup in my test environments.
You're mixing use cases. While you talk about migrating an L2, this is
about migrating an L1 that is running an L2.

Migrating an L2 is expected to work just like migrating an L1 that is
not running an L2. (Of course, the usual trouble with CPU models
applies, but upper layers should check and handle that.)
Post by Kashyap Chamarthy
Just to be quadruple sure, I did the test: Migrate an L2 guest (with
non-shared storage), and it worked just fine. (No 'oops'es, no stack
traces, no "kernel BUG" in `dmesg` or serial consoles on L1s. And I can
login to the L2 guest on the destination L1 just fine.)
Once you have the password-less SSH between source and destination, and
$ virsh migrate --verbose --copy-storage-all \
Migration: [100 %]
$ echo $?
0
https://kashyapc.fedorapeople.org/virt/Migrate-a-nested-guest-08Feb2018.txt
(At the end of the document above, I also posted the libvirt config and
the version details across L0, L1 and L2. So this is a fully repeatable
test.)
--
Thanks,

David / dhildenb
Kashyap Chamarthy
2018-02-08 15:23:11 UTC
Permalink
Post by David Hildenbrand
Post by Kashyap Chamarthy
[...]
Post by David Hildenbrand
Sounds like a similar problem as in
https://bugzilla.kernel.org/show_bug.cgi?id=198621
In short: there is no (live) migration support for nested VMX yet. So as
soon as your guest is using VMX itself ("nVMX"), this is not expected to
work.
Actually, live migration with nVMX _does_ work insofar as you have
_identical_ CPUs on both source and destination — i.e. use the QEMU
'-cpu host' for the L1 guests. At least that's been the case in my
experience. FWIW, I frequently use that setup in my test environments.
You're mixing use cases. While you talk about migrating an L2, this is
about migrating an L1 that is running an L2.
Yes, you're right. I mixed up briefly, and corrected myself in the
other email. We're on the same page.
Post by David Hildenbrand
Migrating an L2 is expected to work just like migrating an L1 that is
not running an L2. (Of course, the usual trouble with CPU models
applies, but upper layers should check and handle that.)
Yep.

---

Aside:

I also remember seeing Vitaly's nice talk[*] at FOSDEM last weekend ("A
slightly different kind of nesting"), where he talks about how nVMX
actually works in the context of Intel's "VMCS Shadowing" hardware
feature, which reduces the number of VMEXITs and VMENTRYs.

(Particularly look at his slides 8, 9 and 10.)

I reproduced his diagram from his slide-10 ("How nested virtualization
really works on Intel") in ASCII here:

.---------------------------------------.
|           | VMCS L1->L2 |             |
|           '-------------'             |
|                   |                   |
|   L1 (guest       |   L2 (nested      |
|   hypervisor)     |   guest)          |
|                   |                   |
|                   |                   |
.---------------------------------------.
|    VMCS L0->L1    |    VMCS L0->L2    |
.---------------------------------------.
|                                       |
|            L0 hypervisor              |
'---------------------------------------'
|                                       |
|              Hardware                 |
'---------------------------------------'


[*] https://fosdem.org/2018/schedule/event/vai_kvm_on_hyperv/attachments/slides/2200/export/events/attachments/vai_kvm_on_hyperv/slides/2200/slides_fosdem2018_vkuznets.pdf

[...]
--
/kashyap