Discussion:
[libvirt-users] Nested KVM: L0 guest produces kernel BUG on wakeup from managed save (while a nested VM is running)
Florian Haas
2018-02-06 15:11:46 UTC
Permalink
Hi everyone,

I hope this is the correct list to discuss this issue; please feel
free to redirect me otherwise.

I have a nested virtualization setup that looks as follows:

- Host: Ubuntu 16.04, kernel 4.4.0 (an OpenStack Nova compute node)
- L0 guest: openSUSE Leap 42.3, kernel 4.4.104-39-default
- Nested guest: SLES 12, kernel 3.12.28-4-default

The nested guest is configured with "<type arch='x86_64'
machine='pc-i440fx-1.4'>hvm</type>".

This is working just beautifully, except when the L0 guest wakes up
from managed save (openstack server resume in OpenStack parlance).
Then, in the L0 guest we immediately see this:

[Tue Feb 6 07:00:37 2018] ------------[ cut here ]------------
[Tue Feb 6 07:00:37 2018] kernel BUG at ../arch/x86/kvm/x86.c:328!
[Tue Feb 6 07:00:37 2018] invalid opcode: 0000 [#1] SMP
[Tue Feb 6 07:00:37 2018] Modules linked in: fuse vhost_net vhost
macvtap macvlan xt_CHECKSUM iptable_mangle ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT
nf_reject_ipv4 xt_tcpudp tun br_netfilter bridge stp llc
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
ip_tables x_tables vboxpci(O) vboxnetadp(O) vboxnetflt(O) af_packet
iscsi_ibft iscsi_boot_sysfs vboxdrv(O) kvm_intel kvm irqbypass
crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel
hid_generic usbhid jitterentropy_rng drbg ansi_cprng ppdev parport_pc
floppy parport joydev aesni_intel processor button aes_x86_64
virtio_balloon virtio_net lrw gf128mul glue_helper pcspkr serio_raw
ablk_helper cryptd i2c_piix4 ext4 crc16 jbd2 mbcache ata_generic
[Tue Feb 6 07:00:37 2018] virtio_blk ata_piix ahci libahci cirrus(O)
drm_kms_helper(O) syscopyarea sysfillrect sysimgblt fb_sys_fops ttm(O)
drm(O) virtio_pci virtio_ring virtio uhci_hcd ehci_hcd usbcore
usb_common libata sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc
scsi_dh_alua scsi_mod autofs4
[Tue Feb 6 07:00:37 2018] CPU: 2 PID: 2041 Comm: CPU 0/KVM Tainted: G
W O 4.4.104-39-default #1
[Tue Feb 6 07:00:37 2018] Hardware name: OpenStack Foundation
OpenStack Nova, BIOS 1.10.1-1ubuntu1~cloud0 04/01/2014
[Tue Feb 6 07:00:37 2018] task: ffff880037108d80 ti: ffff88042e964000
task.ti: ffff88042e964000
[Tue Feb 6 07:00:37 2018] RIP: 0010:[<ffffffffa04f20e5>]
[<ffffffffa04f20e5>] kvm_spurious_fault+0x5/0x10 [kvm]
[Tue Feb 6 07:00:37 2018] RSP: 0018:ffff88042e967d70 EFLAGS: 00010246
[Tue Feb 6 07:00:37 2018] RAX: 0000000000000000 RBX: ffff88042c4f0040
RCX: 0000000000000000
[Tue Feb 6 07:00:37 2018] RDX: 0000000000006820 RSI: 0000000000000282
RDI: ffff88042c4f0040
[Tue Feb 6 07:00:37 2018] RBP: ffff88042c4f00d8 R08: ffff88042e964000
R09: 0000000000000002
[Tue Feb 6 07:00:37 2018] R10: 0000000000000004 R11: 0000000000000000
R12: 0000000000000001
[Tue Feb 6 07:00:37 2018] R13: 0000021d34fbb21d R14: 0000000000000001
R15: 000055d2157cf840
[Tue Feb 6 07:00:37 2018] FS: 00007f7c52b96700(0000)
GS:ffff88043fd00000(0000) knlGS:0000000000000000
[Tue Feb 6 07:00:37 2018] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Feb 6 07:00:37 2018] CR2: 00007f823b15f000 CR3: 0000000429334000
CR4: 0000000000362670
[Tue Feb 6 07:00:37 2018] DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
[Tue Feb 6 07:00:37 2018] DR3: 0000000000000000 DR6: 00000000fffe0ff0
DR7: 0000000000000400
[Tue Feb 6 07:00:37 2018] Stack:
[Tue Feb 6 07:00:37 2018] ffffffffa07939b1 ffffffffa0787875
ffffffffa0503a60 ffff88042c4f0040
[Tue Feb 6 07:00:37 2018] ffffffffa04e5ede ffff88042c4f0040
ffffffffa04e6f0f ffff880037108d80
[Tue Feb 6 07:00:37 2018] ffff88042c4f00e0 ffff88042c4f00e0
ffff88042c4f0040 ffff88042e968000
[Tue Feb 6 07:00:37 2018] Call Trace:
[Tue Feb 6 07:00:37 2018] [<ffffffffa07939b1>]
intel_pmu_set_msr+0xfc1/0x2341 [kvm_intel]
[Tue Feb 6 07:00:37 2018] DWARF2 unwinder stuck at
intel_pmu_set_msr+0xfc1/0x2341 [kvm_intel]
[Tue Feb 6 07:00:37 2018] Leftover inexact backtrace:
[Tue Feb 6 07:00:37 2018] [<ffffffffa0787875>] ?
vmx_interrupt_allowed+0x15/0x30 [kvm_intel]
[Tue Feb 6 07:00:37 2018] [<ffffffffa0503a60>] ?
kvm_arch_vcpu_runnable+0xa0/0xd0 [kvm]
[Tue Feb 6 07:00:37 2018] [<ffffffffa04e5ede>] ?
kvm_vcpu_check_block+0xe/0x60 [kvm]
[Tue Feb 6 07:00:37 2018] [<ffffffffa04e6f0f>] ?
kvm_vcpu_block+0x8f/0x310 [kvm]
[Tue Feb 6 07:00:37 2018] [<ffffffffa0503c17>] ?
kvm_arch_vcpu_ioctl_run+0x187/0x400 [kvm]
[Tue Feb 6 07:00:37 2018] [<ffffffffa04ea6d9>] ?
kvm_vcpu_ioctl+0x359/0x680 [kvm]
[Tue Feb 6 07:00:37 2018] [<ffffffff81016689>] ? __switch_to+0x1c9/0x460
[Tue Feb 6 07:00:37 2018] [<ffffffff81224f02>] ? do_vfs_ioctl+0x322/0x5d0
[Tue Feb 6 07:00:37 2018] [<ffffffff811362ef>] ?
__audit_syscall_entry+0xaf/0x100
[Tue Feb 6 07:00:37 2018] [<ffffffff8100383b>] ?
syscall_trace_enter_phase1+0x15b/0x170
[Tue Feb 6 07:00:37 2018] [<ffffffff81225224>] ? SyS_ioctl+0x74/0x80
[Tue Feb 6 07:00:37 2018] [<ffffffff81634a02>] ?
entry_SYSCALL_64_fastpath+0x16/0xae
[Tue Feb 6 07:00:37 2018] Code: d7 fe ff ff 8b 2d 04 6e 06 00 e9 c2
fe ff ff 48 89 f2 e9 65 ff ff ff 0f 1f 44 00 00 66 2e 0f 1f 84 00 00
00 00 00 0f 1f 44 00 00 <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00
00 55 89 ff 48 89
[Tue Feb 6 07:00:37 2018] RIP [<ffffffffa04f20e5>]
kvm_spurious_fault+0x5/0x10 [kvm]
[Tue Feb 6 07:00:37 2018] RSP <ffff88042e967d70>
[Tue Feb 6 07:00:37 2018] ---[ end trace e15c567f77920049 ]---

We only hit this kernel bug if we have a nested VM running. The exact
same setup, sent into managed save after shutting down the nested VM,
wakes up just fine.

Now I am aware of https://bugzilla.redhat.com/show_bug.cgi?id=1076294,
which talks about live migration — but I think the same considerations
apply.

I am also aware of
https://fedoraproject.org/wiki/How_to_enable_nested_virtualization_in_KVM,
which strongly suggests using host-passthrough or host-model. I have
tried both, to no avail. The stack trace persists. I have also tried
running a 4.15 kernel in the L0 guest, from
https://kernel.opensuse.org/packages/stable, but again, the stack
trace persists.

What does fix things, of course, is to switch the nested guest
from KVM to QEMU — but that also makes things significantly slower.
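For reference, that switch amounts to changing the domain type in the
L2 guest's libvirt definition (a sketch of the relevant fragment only;
everything else in the XML stays the same):

```xml
<!-- Nested guest accelerated by KVM (the problematic case): -->
<domain type='kvm'>
  <os>
    <type arch='x86_64' machine='pc-i440fx-1.4'>hvm</type>
  </os>
  <!-- ... rest of the definition ... -->
</domain>

<!-- Fallback: pure QEMU (TCG) emulation, no nested VMX involved, but slow: -->
<domain type='qemu'>
  <!-- same <os> element and devices as above -->
</domain>
```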

So I'm wondering: is there someone reading this who does run nested
KVM and has managed to successfully live-migrate or managed-save? If
so, would you be able to share a working host kernel / L0 guest kernel
/ nested guest kernel combination, or any other hints for tuning the
L0 guest to support managed save and live migration?

I'd be extraordinarily grateful for any suggestions. Thanks!

Cheers,
Florian
Kashyap Chamarthy
2018-02-07 15:31:08 UTC
Permalink
[Cc: KVM upstream list.]
Post by Florian Haas
Hi everyone,
I hope this is the correct list to discuss this issue; please feel
free to redirect me otherwise.
- Host: Ubuntu 16.04, kernel 4.4.0 (an OpenStack Nova compute node)
- L0 guest: openSUSE Leap 42.3, kernel 4.4.104-39-default
- Nested guest: SLES 12, kernel 3.12.28-4-default
The nested guest is configured with "<type arch='x86_64'
machine='pc-i440fx-1.4'>hvm</type>".
This is working just beautifully, except when the L0 guest wakes up
from managed save (openstack server resume in OpenStack parlance).
[...] # Snip the call trace from Florian. It is here:
https://www.redhat.com/archives/libvirt-users/2018-February/msg00014.html
Post by Florian Haas
What does fix things, of course, is to switch the nested guest
from KVM to QEMU — but that also makes things significantly slower.
So I'm wondering: is there someone reading this who does run nested
KVM and has managed to successfully live-migrate or managed-save? If
so, would you be able to share a working host kernel / L0 guest kernel
/ nested guest kernel combination, or any other hints for tuning the
L0 guest to support managed save and live migration?
Following up from our IRC discussion (on #kvm, Freenode). Re-posting my
comment here:

So I just did a test of 'managedsave' (which is just "save the state of
the running VM to a file" in libvirt parlance) of L1, _while_ L2 is
running, and I seem to reproduce your case (see the call trace
attached).

# Ensure L2 (the nested guest) is running on L1. Then, from L0, do
# the following:
[L0] $ virsh managedsave L1
[L0] $ virsh start L1 --console

Result: See the call trace attached to this bug. But L1 goes on to
start "fine", and L2 keeps running, too. However, things start to seem
weird. As in: I try to safely mount the L2 disk image read-only via
libguestfs (by setting `export LIBGUESTFS_BACKEND=direct`, which uses
direct QEMU): `guestfish --ro -a ./cirros.qcow2 -i`. It throws the call
trace again on the L1 serial console, and the `guestfish` command just
sits there forever.


- L0 (bare metal) Kernel: 4.13.13-300.fc27.x86_64+debug
- L1 (guest hypervisor) kernel: 4.11.10-300.fc26.x86_64
- L2 is a CirrOS 3.5 image

I have reproduced this at least 3 times with the above versions.

I'm using libvirt 'host-passthrough' for CPU (meaning: '-cpu host' in
QEMU parlance) for both L1 and L2.

My L0 CPU is: Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz.
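As a quick sanity check (my addition, not part of Kashyap's mail),
whether nesting is enabled and whether VMX is visible can be probed from
a shell; the guard only exists so the probe degrades gracefully on
machines where kvm_intel isn't loaded:

```shell
#!/bin/sh
# Report nested-virt status of kvm_intel and VMX visibility of the CPU.
probe_nesting() {
    nested=/sys/module/kvm_intel/parameters/nested
    if [ -r "$nested" ]; then
        printf 'kvm_intel nested: %s\n' "$(cat "$nested")"  # Y or 1 = enabled
    else
        printf 'kvm_intel not loaded (on AMD hosts, check kvm_amd instead)\n'
    fi
    if grep -qw vmx /proc/cpuinfo 2>/dev/null; then
        printf 'vmx flag: present\n'   # this kernel could act as a hypervisor
    else
        printf 'vmx flag: absent\n'
    fi
}
probe_nesting
```

Run it on L0 and again inside L1; L1 should report the vmx flag when the
configured CPU model exposes it.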

Thoughts?

---

[/me wonders if I'll be asked to reproduce this with newest upstream
kernels.]

[...]
--
/kashyap
David Hildenbrand
2018-02-07 22:26:14 UTC
Permalink
Post by Kashyap Chamarthy
[Cc: KVM upstream list.]
Post by Florian Haas
Hi everyone,
I hope this is the correct list to discuss this issue; please feel
free to redirect me otherwise.
- Host: Ubuntu 16.04, kernel 4.4.0 (an OpenStack Nova compute node)
- L0 guest: openSUSE Leap 42.3, kernel 4.4.104-39-default
- Nested guest: SLES 12, kernel 3.12.28-4-default
The nested guest is configured with "<type arch='x86_64'
machine='pc-i440fx-1.4'>hvm</type>".
This is working just beautifully, except when the L0 guest wakes up
from managed save (openstack server resume in OpenStack parlance).
https://www.redhat.com/archives/libvirt-users/2018-February/msg00014.html
Post by Florian Haas
What does fix things, of course, is to switch the nested guest
from KVM to QEMU — but that also makes things significantly slower.
So I'm wondering: is there someone reading this who does run nested
KVM and has managed to successfully live-migrate or managed-save? If
so, would you be able to share a working host kernel / L0 guest kernel
/ nested guest kernel combination, or any other hints for tuning the
L0 guest to support managed save and live migration?
Following up from our IRC discussion (on #kvm, Freenode). Re-posting my
So I just did a test of 'managedsave' (which is just "save the state of
the running VM to a file" in libvirt parlance) of L1, _while_ L2 is
running, and I seem to reproduce your case (see the call trace
attached).
# Ensure L2 (the nested guest) is running on L1. Then, from L0, do
[L0] $ virsh managedsave L1
[L0] $ virsh start L1 --console
Result: See the call trace attached to this bug. But L1 goes on to
start "fine", and L2 keeps running, too. However, things start to seem
weird. As in: I try to safely mount the L2 disk image read-only via
libguestfs (by setting `export LIBGUESTFS_BACKEND=direct`, which uses
direct QEMU): `guestfish --ro -a ./cirros.qcow2 -i`. It throws the call
trace again on the L1 serial console, and the `guestfish` command just
sits there forever.
- L0 (bare metal) Kernel: 4.13.13-300.fc27.x86_64+debug
- L1 (guest hypervisor) kernel: 4.11.10-300.fc26.x86_64
- L2 is a CirrOS 3.5 image
I can reproduce this at least 3 times, with the above versions.
I'm using libvirt 'host-passthrough' for CPU (meaning: '-cpu host' in
QEMU parlance) for both L1 and L2.
Thoughts?
Sounds like a similar problem as in
https://bugzilla.kernel.org/show_bug.cgi?id=198621

In short: there is no (live) migration support for nested VMX yet. So as
soon as your guest is using VMX itself ("nVMX"), this is not expected to
work.
--
Thanks,

David / dhildenb
Florian Haas
2018-02-08 08:19:17 UTC
Permalink
Post by David Hildenbrand
Post by Kashyap Chamarthy
[Cc: KVM upstream list.]
Post by Florian Haas
Hi everyone,
I hope this is the correct list to discuss this issue; please feel
free to redirect me otherwise.
- Host: Ubuntu 16.04, kernel 4.4.0 (an OpenStack Nova compute node)
- L0 guest: openSUSE Leap 42.3, kernel 4.4.104-39-default
- Nested guest: SLES 12, kernel 3.12.28-4-default
The nested guest is configured with "<type arch='x86_64'
machine='pc-i440fx-1.4'>hvm</type>".
This is working just beautifully, except when the L0 guest wakes up
from managed save (openstack server resume in OpenStack parlance).
https://www.redhat.com/archives/libvirt-users/2018-February/msg00014.html
Post by Florian Haas
What does fix things, of course, is to switch the nested guest
from KVM to QEMU — but that also makes things significantly slower.
So I'm wondering: is there someone reading this who does run nested
KVM and has managed to successfully live-migrate or managed-save? If
so, would you be able to share a working host kernel / L0 guest kernel
/ nested guest kernel combination, or any other hints for tuning the
L0 guest to support managed save and live migration?
Following up from our IRC discussion (on #kvm, Freenode). Re-posting my
So I just did a test of 'managedsave' (which is just "save the state of
the running VM to a file" in libvirt parlance) of L1, _while_ L2 is
running, and I seem to reproduce your case (see the call trace
attached).
# Ensure L2 (the nested guest) is running on L1. Then, from L0, do
[L0] $ virsh managedsave L1
[L0] $ virsh start L1 --console
Result: See the call trace attached to this bug. But L1 goes on to
start "fine", and L2 keeps running, too. However, things start to seem
weird. As in: I try to safely mount the L2 disk image read-only via
libguestfs (by setting `export LIBGUESTFS_BACKEND=direct`, which uses
direct QEMU): `guestfish --ro -a ./cirros.qcow2 -i`. It throws the call
trace again on the L1 serial console, and the `guestfish` command just
sits there forever.
- L0 (bare metal) Kernel: 4.13.13-300.fc27.x86_64+debug
- L1 (guest hypervisor) kernel: 4.11.10-300.fc26.x86_64
- L2 is a CirrOS 3.5 image
I can reproduce this at least 3 times, with the above versions.
I'm using libvirt 'host-passthrough' for CPU (meaning: '-cpu host' in
QEMU parlance) for both L1 and L2.
Thoughts?
Sounds like a similar problem as in
https://bugzilla.kernel.org/show_bug.cgi?id=198621
In short: there is no (live) migration support for nested VMX yet. So as
soon as your guest is using VMX itself ("nVMX"), this is not expected to
work.
Hi David, thanks for getting back to us on this.

I see your point, except the issue Kashyap and I are describing does
not occur with live migration; it occurs with savevm/loadvm (virsh
managedsave/virsh start in libvirt terms, nova suspend/resume in
OpenStack lingo). And it's not immediately self-evident that the
limitations for the former also apply to the latter. Even for the live
migration limitation, I've been unsuccessful at finding documentation
that warns users to not attempt live migration when using nesting, and
this discussion sounds like a good opportunity for me to help fix
that.

Just to give an example,
https://www.redhat.com/en/blog/inception-how-usable-are-nested-kvm-guests
from just last September talks explicitly about how "guests can be
snapshot/resumed, migrated to other hypervisors and much more" in the
opening paragraph, and then talks at length about nested guests —
without ever pointing out that those very features aren't expected to
work for them. :)

So to clarify things, could you enumerate the currently known
limitations when enabling nesting? I'd be happy to summarize those and
add them to the linux-kvm.org FAQ so others are less likely to hit
their head on this issue. In particular:

- Is https://fedoraproject.org/wiki/How_to_enable_nested_virtualization_in_KVM
still accurate in that -cpu host (libvirt "host-passthrough") is the
strongly recommended configuration for the L2 guest?

- If so, are there any recommendations for how to configure the L1
guest with regard to CPU model?

- Is live migration with nested guests _always_ expected to break on
all architectures, and if not, which are safe?

- Idem, for savevm/loadvm?

- With regard to the problem that Kashyap and I (and Dennis, the
kernel.org bugzilla reporter) are describing, is this expected to work
any better on AMD CPUs? (All reports are on Intel)

- Do you expect nested virtualization functionality to be adversely
affected by KPTI and/or other Meltdown/Spectre mitigation patches?

Kashyap, can you think of any other limitations that would benefit
from improved documentation?

Cheers,
Florian
David Hildenbrand
2018-02-08 12:07:33 UTC
Permalink
Post by Florian Haas
Post by David Hildenbrand
In short: there is no (live) migration support for nested VMX yet. So as
soon as your guest is using VMX itself ("nVMX"), this is not expected to
work.
Hi David, thanks for getting back to us on this.
Hi Florian,

(somebody please correct me if I'm wrong)
Post by Florian Haas
I see your point, except the issue Kashyap and I are describing does
not occur with live migration, it occurs with savevm/loadvm (virsh
managedsave/virsh start in libvirt terms, nova suspend/resume in
OpenStack lingo). And it's not immediately self-evident that the
limitations for the former also apply to the latter. Even for the live
migration limitation, I've been unsuccessful at finding documentation
that warns users to not attempt live migration when using nesting, and
this discussion sounds like a good opportunity for me to help fix
that.
Just to give an example,
https://www.redhat.com/en/blog/inception-how-usable-are-nested-kvm-guests
from just last September talks explicitly about how "guests can be
snapshot/resumed, migrated to other hypervisors and much more" in the
opening paragraph, and then talks at length about nested guests —
without ever pointing out that those very features aren't expected to
work for them. :)
Well, it still is a kernel parameter "nested" that is disabled by
default. So things should be expected to be shaky. :) While running
nested guests usually works fine, migrating a nested hypervisor is the
problem.

Especially see e.g.
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/nested_virt

"However, note that nested virtualization is not supported or
recommended in production user environments, and is primarily intended
for development and testing. "
Post by Florian Haas
So to clarify things, could you enumerate the currently known
limitations when enabling nesting? I'd be happy to summarize those and
add them to the linux-kvm.org FAQ so others are less likely to hit
The general problem is that migration of an L1 will not work when it is
running L2, i.e. when L1 is using VMX ("nVMX").

Migrating an L2 should work as before.

The problem is that in order for L1 to make use of VMX to run L2, we
have to run L2 in L0, simulating VMX (nested VMX, a.k.a. "nVMX"). This
requires additional state information about L1 (the "nVMX" state), which
is not properly migrated when migrating L1. Therefore, after migration
the CPU state of L1 might be corrupted, resulting in L1 crashes.

In addition, certain VMX features might be missing on the target, which
also still has to be handled via the CPU model in the future.

L0 should hopefully not crash; I hope that you are not seeing that.
Post by Florian Haas
- Is https://fedoraproject.org/wiki/How_to_enable_nested_virtualization_in_KVM
still accurate in that -cpu host (libvirt "host-passthrough") is the
strongly recommended configuration for the L2 guest?
- If so, are there any recommendations for how to configure the L1
guest with regard to CPU model?
You have to indicate the VMX feature to your L1 ("nested hypervisor"),
that is usually automatically done by using the "host-passthrough" or
"host-model" value. If you're using a custom CPU model, you have to
enable it explicitly.
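In L1's libvirt domain XML, the two usual ways to get VMX exposed look
like this (a sketch; the custom model shown is only an example):

```xml
<!-- Either pass the host CPU straight through to L1... -->
<cpu mode='host-passthrough'/>

<!-- ...or, with a custom model, require vmx explicitly: -->
<cpu mode='custom' match='exact'>
  <model fallback='forbid'>Haswell-noTSX</model>
  <feature policy='require' name='vmx'/>
</cpu>
```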
Post by Florian Haas
- Is live migration with nested guests _always_ expected to break on
all architectures, and if not, which are safe?
x86 VMX: running nested guests works, migrating nested hypervisors does
not work

x86 SVM: running nested guests works, migrating nested hypervisors does
not work (somebody correct me if I'm wrong)

s390x: running nested guests works, migrating nested hypervisors works

power: running nested guests works only via KVM-PR ("trap and emulate");
migrating nested hypervisors therefore works. But we are not using
hardware virtualization for L1->L2. (my latest status)

arm: running nested guests is in the works (my latest status), migration
is therefore also not possible.
Post by Florian Haas
- Idem, for savevm/loadvm?
savevm/loadvm is not expected to work correctly on an L1 if it is
running L2 guests. It should work on L2 however.
Post by Florian Haas
- With regard to the problem that Kashyap and I (and Dennis, the
kernel.org bugzilla reporter) are describing, is this expected to work
any better on AMD CPUs? (All reports are on Intel)
No, remember that they are also still missing migration support for the
nested SVM state.
Post by Florian Haas
- Do you expect nested virtualization functionality to be adversely
affected by KPTI and/or other Meltdown/Spectre mitigation patches?
Not an expert on this. I think it should be affected in a similar way
to ordinary guests. :)
Post by Florian Haas
Kashyap, can you think of any other limitations that would benefit
from improved documentation?
We should certainly document what I have summarized here properly in a
central place!
Post by Florian Haas
Cheers,
Florian
--
Thanks,

David / dhildenb
Florian Haas
2018-02-08 13:29:46 UTC
Permalink
Hi David,

thanks for the added input! I'm taking the liberty to snip a few
paragraphs to trim this email down a bit.
Post by David Hildenbrand
Post by Florian Haas
Just to give an example,
https://www.redhat.com/en/blog/inception-how-usable-are-nested-kvm-guests
from just last September talks explicitly about how "guests can be
snapshot/resumed, migrated to other hypervisors and much more" in the
opening paragraph, and then talks at length about nested guests —
without ever pointing out that those very features aren't expected to
work for them. :)
Well, it still is a kernel parameter "nested" that is disabled by
default. So things should be expected to be shaky. :) While running
nested guests usually works fine, migrating a nested hypervisor is the
problem.
Especially see e.g.
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/nested_virt
"However, note that nested virtualization is not supported or
recommended in production user environments, and is primarily intended
for development and testing. "
Sure, I do understand that Red Hat (or any other vendor) is taking no
support responsibility for this. At this point I'd just like to
contribute to a better understanding of what's expected to definitely
_not_ work, so that people don't bloody their noses on that. :)
Post by David Hildenbrand
Post by Florian Haas
So to clarify things, could you enumerate the currently known
limitations when enabling nesting? I'd be happy to summarize those and
add them to the linux-kvm.org FAQ so others are less likely to hit
The general problem is that migration of an L1 will not work when it is
running L2, i.e. when L1 is using VMX ("nVMX").
Migrating an L2 should work as before.
The problem is that in order for L1 to make use of VMX to run L2, we
have to run L2 in L0, simulating VMX (nested VMX, a.k.a. "nVMX"). This
requires additional state information about L1 (the "nVMX" state), which
is not properly migrated when migrating L1. Therefore, after migration
the CPU state of L1 might be corrupted, resulting in L1 crashes.
In addition, certain VMX features might be missing on the target, which
also still has to be handled via the CPU model in the future.
Thanks a bunch for the added detail. Now I got a primer today from
Kashyap on IRC on how savevm/loadvm is very similar to migration, but
I'm still struggling to wrap my head around it. What you say makes
perfect sense to me in that _migration_ might blow up in subtle ways,
but can you try to explain to me why the same considerations would
apply with savevm/loadvm?
Post by David Hildenbrand
L0 should hopefully not crash; I hope that you are not seeing that.
No I am not; we're good there. :)
Post by David Hildenbrand
Post by Florian Haas
- Is https://fedoraproject.org/wiki/How_to_enable_nested_virtualization_in_KVM
still accurate in that -cpu host (libvirt "host-passthrough") is the
strongly recommended configuration for the L2 guest?
- If so, are there any recommendations for how to configure the L1
guest with regard to CPU model?
You have to indicate the VMX feature to your L1 ("nested hypervisor"),
that is usually automatically done by using the "host-passthrough" or
"host-model" value. If you're using a custom CPU model, you have to
enable it explicitly.
Roger. Without that we can't do nesting at all.
Post by David Hildenbrand
Post by Florian Haas
- Is live migration with nested guests _always_ expected to break on
all architectures, and if not, which are safe?
x86 VMX: running nested guests works, migrating nested hypervisors does
not work
x86 SVM: running nested guests works, migrating nested hypervisor does
not work (somebody correct me if I'm wrong)
s390x: running nested guests works, migrating nested hypervisors works
power: running nested guests works only via KVM-PR ("trap and emulate").
migrating nested hypervisors therefore works. But we are not using
hardware virtualization for L1->L2. (my latest status)
arm: running nested guests is in the works (my latest status), migration
is therefore also not possible.
Great summary, thanks!
Post by David Hildenbrand
Post by Florian Haas
- Idem, for savevm/loadvm?
savevm/loadvm is not expected to work correctly on an L1 if it is
running L2 guests. It should work on L2 however.
Again, I'm somewhat struggling to understand this vs. live migration —
but it's entirely possible that I'm sorely lacking in my knowledge of
kernel and CPU internals.
Post by David Hildenbrand
Post by Florian Haas
- With regard to the problem that Kashyap and I (and Dennis, the
kernel.org bugzilla reporter) are describing, is this expected to work
any better on AMD CPUs? (All reports are on Intel)
No, remember that they are also still missing migration support for the
nested SVM state.
Understood, thanks.
Post by David Hildenbrand
Post by Florian Haas
- Do you expect nested virtualization functionality to be adversely
affected by KPTI and/or other Meltdown/Spectre mitigation patches?
Not an expert on this. I think it should be affected in a similar way as
ordinary guests :)
Fair enough. :)
Post by David Hildenbrand
Post by Florian Haas
Kashyap, can you think of any other limitations that would benefit
from improved documentation?
We should certainly document what I have summarized here properly in a
central place!
I tried getting registered on the linux-kvm.org wiki to do exactly
that, and ran into an SMTP/DNS configuration issue with the
verification email. Kashyap said he was going to poke the site admin
about that.

Now, here's a bit more information on my continued testing. As I
mentioned on IRC, one of the things that struck me as odd was that if
I ran into the issue previously described, the L1 guest would enter a
reboot loop if configured with kernel.panic_on_oops=1. In other words,
I would savevm the L1 guest (with a running L2), then loadvm it, and
then the L1 would stack-trace, reboot, and then keep doing that
indefinitely. I found that weird because on the second reboot, I would
expect the system to come up cleanly.
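For context, that reboot-loop behaviour follows from sysctls like the
following (a sketch of the standard knobs; the file name is
hypothetical):

```
# /etc/sysctl.d/90-panic.conf
# Promote any kernel oops to a full panic:
kernel.panic_on_oops = 1
# Reboot automatically 5 seconds after a panic:
kernel.panic = 5
```

With both set, an oops after loadvm panics L1 and triggers a reboot; if
the oops recurs, the cycle repeats.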

I've now changed my L2 guest's CPU configuration so that libvirt (in
L1) starts the L2 guest with the following settings:

<cpu>
  <model fallback='forbid'>Haswell-noTSX</model>
  <vendor>Intel</vendor>
  <feature policy='disable' name='vme'/>
  <feature policy='disable' name='ss'/>
  <feature policy='disable' name='f16c'/>
  <feature policy='disable' name='rdrand'/>
  <feature policy='disable' name='hypervisor'/>
  <feature policy='disable' name='arat'/>
  <feature policy='disable' name='tsc_adjust'/>
  <feature policy='disable' name='xsaveopt'/>
  <feature policy='disable' name='abm'/>
  <feature policy='disable' name='aes'/>
  <feature policy='disable' name='invpcid'/>
</cpu>

Basically, I am disabling every single feature that my L1's "virsh
capabilities" reports. Now this does not make my L1 come up happily
from loadvm. But it does seem to initiate a clean reboot after loadvm,
and after that clean reboot it lives happily.
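For the record, a block like the one above can be generated mechanically
from a feature list (entirely my own sketch; `emit_cpu_block` is a
hypothetical helper, and the feature names are the ones my L1 reported):

```shell
#!/bin/sh
# Emit a libvirt <cpu> element that disables each named feature.
# In practice the feature names would come from `virsh capabilities`
# inside L1.
emit_cpu_block() {
    model=$1
    features=$2
    printf '<cpu>\n'
    printf "  <model fallback='forbid'>%s</model>\n" "$model"
    for f in $features; do   # word-splitting on the list is intended
        printf "  <feature policy='disable' name='%s'/>\n" "$f"
    done
    printf '</cpu>\n'
}

emit_cpu_block "Haswell-noTSX" \
    "vme ss f16c rdrand hypervisor arat tsc_adjust xsaveopt abm aes invpcid"
```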

If this is as good as it gets (for now), then I can totally live with
that. It certainly beats running the L2 guest with Qemu (without KVM
acceleration). But I would still love to understand the issue a little
bit better.

Cheers,
Florian
David Hildenbrand
2018-02-08 13:47:26 UTC
Permalink
Post by Florian Haas
Sure, I do understand that Red Hat (or any other vendor) is taking no
support responsibility for this. At this point I'd just like to
contribute to a better understanding of what's expected to definitely
_not_ work, so that people don't bloody their noses on that. :)
Indeed. Nesting is nice to enable, as it works in 99% of all cases. It
just doesn't work when trying to migrate a nested hypervisor (on x86).

That's what most people don't realize, as it works "just fine" for 99%
of all use cases.

[...]
Post by Florian Haas
Post by David Hildenbrand
savevm/loadvm is not expected to work correctly on an L1 if it is
running L2 guests. It should work on L2 however.
Again, I'm somewhat struggling to understand this vs. live migration —
but it's entirely possible that I'm sorely lacking in my knowledge of
kernel and CPU internals.
(savevm/loadvm is also called "migration to file")

When we migrate to a file, it really is the same migration stream. You
"dump" the VM state into a file, instead of sending it over to another
(running) target.

Once you load your VM state from that file, it is a completely fresh
VM/KVM environment. So you have to restore all the state. Now, as nVMX
state is not contained in the migration stream, you cannot restore that
state. The L1 state is therefore "damaged" or incomplete.

[...]
Post by Florian Haas
Post by David Hildenbrand
Post by Florian Haas
Kashyap, can you think of any other limitations that would benefit
from improved documentation?
We should certainly document what I have summaries here properly at a
central palce!
I tried getting registered on the linux-kvm.org wiki to do exactly
that, and ran into an SMTP/DNS configuration issue with the
verification email. Kashyap said he was going to poke the site admin
about that.
Now, here's a bit more information on my continued testing. As I
mentioned on IRC, one of the things that struck me as odd was that if
I ran into the issue previously described, the L1 guest would enter a
reboot loop if configured with kernel.panic_on_oops=1. In other words,
I would savevm the L1 guest (with a running L2), then loadvm it, and
then the L1 would stack-trace, reboot, and then keep doing that
indefinitely. I found that weird because on the second reboot, I would
expect the system to come up cleanly.
I guess the L1 state (in the kernel) is broken so badly that even a
reset cannot fix it.
Post by Florian Haas
I've now changed my L2 guest's CPU configuration so that libvirt (in
<cpu>
  <model fallback='forbid'>Haswell-noTSX</model>
  <vendor>Intel</vendor>
  <feature policy='disable' name='vme'/>
  <feature policy='disable' name='ss'/>
  <feature policy='disable' name='f16c'/>
  <feature policy='disable' name='rdrand'/>
  <feature policy='disable' name='hypervisor'/>
  <feature policy='disable' name='arat'/>
  <feature policy='disable' name='tsc_adjust'/>
  <feature policy='disable' name='xsaveopt'/>
  <feature policy='disable' name='abm'/>
  <feature policy='disable' name='aes'/>
  <feature policy='disable' name='invpcid'/>
</cpu>
Maybe one of these features is the root cause of the "messed up" state
in KVM. So disabling it also makes the L1 state "less broken".
Post by Florian Haas
Basically, I am disabling every single feature that my L1's "virsh
capabilities" reports. Now this does not make my L1 come up happily
from loadvm. But it does seem to initiate a clean reboot after loadvm,
and after that clean reboot it lives happily.
If this is as good as it gets (for now), then I can totally live with
that. It certainly beats running the L2 guest with Qemu (without KVM
acceleration). But I would still love to understand the issue a little
bit better.
I mean the real solution to the problem is of course restoring the L1
state correctly (migrating nVMX state, what people are working on right
now). So what you are seeing is a bad "side effect" of that.

For now, nested=true should never be used along with savevm/loadvm/live
migration.
Post by Florian Haas
Cheers,
Florian
--
Thanks,

David / dhildenb
Florian Haas
2018-02-08 13:57:33 UTC
Permalink
Post by David Hildenbrand
Post by Florian Haas
Again, I'm somewhat struggling to understand this vs. live migration —
but it's entirely possible that I'm sorely lacking in my knowledge of
kernel and CPU internals.
(savevm/loadvm is also called "migration to file")
When we migrate to a file, it really is the same migration stream. You
"dump" the VM state into a file, instead of sending it over to another
(running) target.
Once you load your VM state from that file, it is a completely fresh
VM/KVM environment. So you have to restore all the state. Now, as nVMX
state is not contained in the migration stream, you cannot restore that
state. The L1 state is therefore "damaged" or incomplete.
*lightbulb* Thanks a lot, that's a perfectly logical explanation. :)
Post by David Hildenbrand
Post by Florian Haas
Now, here's a bit more information on my continued testing. As I
mentioned on IRC, one of the things that struck me as odd was that if
I ran into the issue previously described, the L1 guest would enter a
reboot loop if configured with kernel.panic_on_oops=1. In other words,
I would savevm the L1 guest (with a running L2), then loadvm it, and
then the L1 would stack-trace, reboot, and then keep doing that
indefinitely. I found that weird because on the second reboot, I would
expect the system to come up cleanly.
Guess the L1 state (in the kernel) is broken so badly that even a
reset cannot fix it.
... which would also explain that in contrast to that, a virsh
destroy/virsh start cycle does fix things.
Post by David Hildenbrand
Post by Florian Haas
I've now changed my L2 guest's CPU configuration so that libvirt (in
<cpu>
<model fallback='forbid'>Haswell-noTSX</model>
<vendor>Intel</vendor>
<feature policy='disable' name='vme'/>
<feature policy='disable' name='ss'/>
<feature policy='disable' name='f16c'/>
<feature policy='disable' name='rdrand'/>
<feature policy='disable' name='hypervisor'/>
<feature policy='disable' name='arat'/>
<feature policy='disable' name='tsc_adjust'/>
<feature policy='disable' name='xsaveopt'/>
<feature policy='disable' name='abm'/>
<feature policy='disable' name='aes'/>
<feature policy='disable' name='invpcid'/>
</cpu>
Maybe one of these features is the root cause of the "messed up" state
in KVM. So disabling it also makes the L1 state "less broken".
Would you try a guess as to which of the above features is a likely culprit?
Post by David Hildenbrand
Post by Florian Haas
Basically, I am disabling every single feature that my L1's "virsh
capabilities" reports. Now this does not make my L1 come up happily
from loadvm. But it does seem to initiate a clean reboot after loadvm,
and after that clean reboot it lives happily.
If this is as good as it gets (for now), then I can totally live with
that. It certainly beats running the L2 guest with Qemu (without KVM
acceleration). But I would still love to understand the issue a little
bit better.
I mean the real solution to the problem is of course restoring the L1
state correctly (migrating nVMX state, what people are working on right
now). So what you are seeing is a bad "side effect" of that.
For now, nested=true should never be used along with savevm/loadvm/live
migration.
Yes, I gathered as much. :) Thanks again!

Cheers,
Florian
David Hildenbrand
2018-02-08 14:55:51 UTC
Permalink
Post by Florian Haas
Post by David Hildenbrand
Post by Florian Haas
I've now changed my L2 guest's CPU configuration so that libvirt (in
<cpu>
<model fallback='forbid'>Haswell-noTSX</model>
<vendor>Intel</vendor>
<feature policy='disable' name='vme'/>
<feature policy='disable' name='ss'/>
<feature policy='disable' name='f16c'/>
<feature policy='disable' name='rdrand'/>
<feature policy='disable' name='hypervisor'/>
<feature policy='disable' name='arat'/>
<feature policy='disable' name='tsc_adjust'/>
<feature policy='disable' name='xsaveopt'/>
<feature policy='disable' name='abm'/>
<feature policy='disable' name='aes'/>
<feature policy='disable' name='invpcid'/>
</cpu>
Maybe one of these features is the root cause of the "messed up" state
in KVM. So disabling it also makes the L1 state "less broken".
Would you try a guess as to which of the above features is a likely culprit?
Hmm, actually no idea, but you can bisect :)

(but watch out, it could also just be "coincidence". Especially if you
migrate while all VCPUs of L1 are currently not executing L2, chances
might be better for L1 to survive a migration - L2 will still fail hard,
and L1 certainly will too, when it tries to run L2 again)
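As a sketch of the bisection suggested here (assuming exactly one feature is responsible; `survives_loadvm` is a hypothetical stand-in for one manual savevm/loadvm test run of the L1 guest, not a real API):

```python
# Bisect the list of disabled CPU features to find a single culprit.
# Assumes exactly one feature is responsible; survives_loadvm() stands
# in for one manual savevm/loadvm test run of the L1 guest.

FEATURES = ["vme", "ss", "f16c", "rdrand", "hypervisor", "arat",
            "tsc_adjust", "xsaveopt", "abm", "aes", "invpcid"]

def bisect_culprit(features, survives_loadvm):
    """Return the one feature that must stay disabled for L1 to survive."""
    candidates = list(features)
    while len(candidates) > 1:
        half = candidates[:len(candidates) // 2]
        # Disable only `half`; if L1 now survives loadvm, the culprit
        # is in `half`, otherwise it is among the remaining candidates.
        if survives_loadvm(disabled=half):
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0]
```

Each iteration halves the candidate set, so the eleven features above need about four test runs instead of eleven.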
--
Thanks,

David / dhildenb
Daniel P. Berrangé
2018-02-08 14:59:33 UTC
Permalink
Post by David Hildenbrand
Post by Florian Haas
Sure, I do understand that Red Hat (or any other vendor) is taking no
support responsibility for this. At this point I'd just like to
contribute to a better understanding of what's expected to definitely
_not_ work, so that people don't bloody their noses on that. :)
Indeed. Nesting is nice to enable, as it works in 99% of all cases. It
just doesn't work when trying to migrate a nested hypervisor (on x86).
Hmm, if migration of the L1 is going to cause things to crash and
burn, then ideally libvirt on L0 would block the migration from being
done.

Naively we could do that if the guest has vmx or svm features in its
CPU, except that's probably way too conservative as many guests with
those features won't actually do any nested VMs. It would also be
desirable to still be able to migrate the L1, if no L2s are running
currently.

Is there any way QEMU can expose whether there's any L2s activated
to libvirt, so we can prevent migration in that case ? Or should
QEMU itself refuse to start migration perhaps ?
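A minimal sketch of the naive libvirt-level check being discussed (`exposes_nested_virt` is a hypothetical helper, not an actual libvirt API; it inspects a domain XML fragment for an exposed vmx/svm feature):

```python
# Naive check: flag a guest whose CPU definition exposes vmx/svm.
# As noted in the thread this is too conservative (vmx being present
# does not mean any L2 is running), and it misses host-passthrough,
# where no explicit <feature> elements appear in the XML at all.
import xml.etree.ElementTree as ET

def exposes_nested_virt(domain_xml: str) -> bool:
    root = ET.fromstring(domain_xml)
    for feature in root.iter("feature"):
        if (feature.get("name") in ("vmx", "svm")
                and feature.get("policy") != "disable"):
            return True
    return False
```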


Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
David Hildenbrand
2018-02-08 15:11:03 UTC
Permalink
Post by Daniel P. Berrangé
Post by David Hildenbrand
Post by Florian Haas
Sure, I do understand that Red Hat (or any other vendor) is taking no
support responsibility for this. At this point I'd just like to
contribute to a better understanding of what's expected to definitely
_not_ work, so that people don't bloody their noses on that. :)
Indeed. Nesting is nice to enable, as it works in 99% of all cases. It
just doesn't work when trying to migrate a nested hypervisor (on x86).
Hmm, if migration of the L1 is going to cause things to crash and
burn, then ideally libvirt on L0 would block the migration from being
done.
Yes, in an ideal world. Usually we assume that people who turn on
experimental features ("nested=true") are aware of what the implications
are. The main problem is that the implications are not really documented :)

Eventually, with a new KVM _and_ a new QEMU, it will be supported.
Post by Daniel P. Berrangé
Naively we could do that if the guest has vmx or svm features in its
CPU, except that's probably way too conservative as many guests with
those features won't actually do any nested VMs. It would also be
desirable to still be able to migrate the L1, if no L2s are running
currently.
No, using CPU feature flags for that purpose on the libvirt level is no
good, and especially once we support migration we would have to find
another interface to say "but it is now working".

QEMU could try to warn the user if VMX is enabled in the CPU model, but
as you said, that might also hold true for guests that don't use nVMX.

On the other hand, VMX will only pop up as a valid feature if
nested=true is set. So the amount of affected users is minimal.

So we could e.g. abort migration on the QEMU level if VMX is specified
right now. Once we have the migration support in place, we can allow it
again.
Post by Daniel P. Berrangé
Is there any way QEMU can expose whether there's any L2s activated
to libvirt, so we can prevent migration in that case ? Or should
QEMU itself refuse to start migration perhaps ?
Not without another kernel interface.

But I am no expert on that matter. Maybe there would be an easy way to
block that I just don't see right now.
Post by Daniel P. Berrangé
Regards,
Daniel
--
Thanks,

David / dhildenb
Kashyap Chamarthy
2018-02-08 14:45:17 UTC
Permalink
On Thu, Feb 08, 2018 at 01:07:33PM +0100, David Hildenbrand wrote:

[...]
Post by David Hildenbrand
Post by Florian Haas
So to clarify things, could you enumerate the currently known
limitations when enabling nesting? I'd be happy to summarize those and
add them to the linux-kvm.org FAQ so others are less likely to hit
[...] # Snip description of what works in context of migration
Post by David Hildenbrand
Post by Florian Haas
- Is https://fedoraproject.org/wiki/How_to_enable_nested_virtualization_in_KVM
still accurate in that -cpu host (libvirt "host-passthrough") is the
strongly recommended configuration for the L2 guest?
That wiki is a bit outdated. And it is not accurate — if we can just
expose the Intel 'vmx' (or AMD 'svm') CPU feature flag to the L2 guest,
that should be sufficient. No need for a full passthrough.

The above document should definitely be modified to add more verbiage
comparing 'host-passthrough' vs. 'host-model' vs. custom CPU.
Post by David Hildenbrand
Post by Florian Haas
- If so, are there any recommendations for how to configure the L1
guest with regard to CPU model?
You have to indicate the VMX feature to your L1 ("nested hypervisor"),
that is usually automatically done by using the "host-passthrough" or
"host-model" value. If you're using a custom CPU model, you have to
enable it explicitly.
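For the custom-model case, a minimal sketch of such an L1 CPU definition (the Haswell-noTSX model is just an illustrative choice; the key part is requiring 'vmx'):

```xml
<cpu mode='custom' match='exact'>
  <model fallback='forbid'>Haswell-noTSX</model>
  <!-- expose hardware virtualization to the L1 guest -->
  <feature policy='require' name='vmx'/>
</cpu>
```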
Post by Florian Haas
- Is live migration with nested guests _always_ expected to break on
all architectures, and if not, which are safe?
x86 VMX: running nested guests works, migrating nested hypervisors does
not work
x86 SVM: running nested guests works, migrating nested hypervisor does
not work (somebody correct me if I'm wrong)
s390x: running nested guests works, migrating nested hypervisors works
power: running nested guests works only via KVM-PR ("trap and emulate").
migrating nested hypervisors therefore works. But we are not using
hardware virtualization for L1->L2. (my latest status)
arm: running nested guests is in the works (my latest status), migration
is therefore also not possible.
That's a great summary.
Post by David Hildenbrand
Post by Florian Haas
- Idem, for savevm/loadvm?
savevm/loadvm is not expected to work correctly on an L1 if it is
running L2 guests. It should work on L2 however.
Yes, that works as intended.
Post by David Hildenbrand
Post by Florian Haas
- With regard to the problem that Kashyap and I (and Dennis, the
kernel.org bugzilla reporter) are describing, is this expected to work
any better on AMD CPUs? (All reports are on Intel)
No, remember that they are also still missing migration support of the
nested SVM state.
Right. I partly mixed up migration of L1-running-L2 (which doesn't fly
for reasons David already explained) vs. migrating L2 (which works).
Post by David Hildenbrand
Post by Florian Haas
- Do you expect nested virtualization functionality to be adversely
affected by KPTI and/or other Meltdown/Spectre mitigation patches?
Not an expert on this. I think it should be affected in a similar way as
ordinary guests :)
Post by Florian Haas
Kashyap, can you think of any other limitations that would benefit
from improved documentation?
We should certainly document what I have summarized here properly at a
central place!
Yeah, agreed. Also, when documenting things in the context of nesting,
it'd be useful to explicitly spell out what works or doesn't work at
each level — e.g. L2 can be migrated to a destination L1 just fine;
migrating an L1-running-L2 to a destination L0 will be in dodgy waters
for reasons X, etc.

[...]
--
/kashyap
Florian Haas
2018-02-08 17:44:43 UTC
Permalink
Post by David Hildenbrand
We should certainly document what I have summarized here properly at a
central place!
Please review the three edits I've submitted to the wiki:
https://www.linux-kvm.org/page/Special:Contributions/Fghaas

Feel free to ruthlessly edit/roll back anything that is inaccurate. Thanks!

Cheers,
Florian
Kashyap Chamarthy
2018-02-09 10:48:30 UTC
Permalink
Post by Florian Haas
Post by David Hildenbrand
We should certainly document what I have summarized here properly at a
central place!
https://www.linux-kvm.org/page/Special:Contributions/Fghaas
Feel free to ruthlessly edit/roll back anything that is inaccurate. Thanks!
I've made some minor edits to clarify a bunch of bits, and added a link
to the kernel doc about Intel nVMX. (Hope that looks fine.)

You wrote: "L2...which does no further virtualization". Not quite true
— "under right circumstances" (read: sufficiently huge machine with tons
of RAM), L2 _can_ in turn L3. :-)

Last time I checked (this morning), Rich W.M. Jones had 4 levels of
nesting tested with the 'supernested' program[1] he wrote. (Related
aside: This program is packaged as part of the 2016 QEMU Advent
Calendar[2] -- if you want to play around on a powerful test machine
with tons of free memory.)

[1] http://git.annexia.org/?p=supernested.git;a=blob;f=README
[2] http://www.qemu-advent-calendar.org/2016/#day-13
--
/kashyap
Florian Haas
2018-02-09 11:02:25 UTC
Permalink
Post by Kashyap Chamarthy
Post by Florian Haas
Post by David Hildenbrand
We should certainly document what I have summarized here properly at a
central place!
https://www.linux-kvm.org/page/Special:Contributions/Fghaas
Feel free to ruthlessly edit/roll back anything that is inaccurate. Thanks!
I've made some minor edits to clarify a bunch of bits, and added a link
to the kernel doc about Intel nVMX. (Hope that looks fine.)
I'm sure it does, but just so you know, I currently don't see any
edits from you on the Nested Guests page. Are you sure you
saved/published your changes?
Post by Kashyap Chamarthy
You wrote: "L2...which does no further virtualization". Not quite true
— "under right circumstances" (read: sufficiently huge machine with tons
of RAM), L2 _can_ in turn L3. :-)
Insert "normally" between "which" and "does", then. :)
Post by Kashyap Chamarthy
Last time I checked (this morning), Rich W.M. Jones had 4 levels of
nesting tested with the 'supernested' program[1] he wrote. (Related
aside: This program is packaged as part of the 2016 QEMU Advent
Calendar[2] -- if you want to play around on a powerful test machine
with tons of free memory.)
[1] http://git.annexia.org/?p=supernested.git;a=blob;f=README
[2] http://www.qemu-advent-calendar.org/2016/#day-13
Interesting, thanks for the pointer!

Cheers,
Florian
Kashyap Chamarthy
2018-02-12 09:27:16 UTC
Permalink
[...]
Post by Florian Haas
Post by Kashyap Chamarthy
I've made some minor edits to clarify a bunch of bits, and added a link
to the kernel doc about Intel nVMX. (Hope that looks fine.)
I'm sure it does, but just so you know, I currently don't see any
edits from you on the Nested Guests page. Are you sure you
saved/published your changes?
Thanks for catching that. _Now_ it's updated.

https://www.linux-kvm.org/page/Nested_Guests

(I also didn't have permissions to add external links; had to get that
sorted out with the admin.)
Post by Florian Haas
Post by Kashyap Chamarthy
You wrote: "L2...which does no further virtualization". Not quite true
— "under right circumstances" (read: sufficiently huge machine with tons
of RAM), L2 _can_ in turn L3. :-)
Insert "normally" between "which" and "does", then. :)
:-)
Post by Florian Haas
Post by Kashyap Chamarthy
Last time I checked (this morning), Rich W.M. Jones had 4 levels of
nesting tested with the 'supernested' program[1] he wrote. (Related
aside: This program is packaged as part of the 2016 QEMU Advent
Calendar[2] -- if you want to play around on a powerful test machine
with tons of free memory.)
[1] http://git.annexia.org/?p=supernested.git;a=blob;f=README
[2] http://www.qemu-advent-calendar.org/2016/#day-13
Interesting, thanks for the pointer!
Cheers,
Florian
--
/kashyap
Florian Haas
2018-02-12 14:07:50 UTC
Permalink
Post by Kashyap Chamarthy
[...]
Post by Florian Haas
Post by Kashyap Chamarthy
I've made some minor edits to clarify a bunch of bits, and added a link
to the kernel doc about Intel nVMX. (Hope that looks fine.)
I'm sure it does, but just so you know, I currently don't see any
edits from you on the Nested Guests page. Are you sure you
saved/published your changes?
Thanks for catching that. _Now_ it's updated.
https://www.linux-kvm.org/page/Nested_Guests
Got it. Thanks for those additions!
Post by Kashyap Chamarthy
(I also didn't have permissions to add external links; had to get that
sorted out with the admin.)
Right, I saw that on my first edit attempt too.

I took the liberty of back-referencing this wiki page from those two
bugzilla entries too (the kernel.org one and the Red Hat one). Since,
as I must confess, I don't follow KVM development on a day-to-day
basis, I'm hopeful that at least one of those bugs will get updated,
triggering a notification, so that I can update that page once
migration in combination with nVMX does work.

I've also added a link to
https://bugzilla.kernel.org/show_bug.cgi?id=53851 to the wiki — 5
years old just this week, as it happens. :)

Thanks for everyone's help explaining this issue to me!

Cheers,
Florian

Kashyap Chamarthy
2018-02-08 10:46:24 UTC
Permalink
[...]
Post by David Hildenbrand
Sounds like a similar problem as in
https://bugzilla.kernel.org/show_bug.cgi?id=198621
In short: there is no (live) migration support for nested VMX yet. So as
soon as your guest is using VMX itself ("nVMX"), this is not expected to
work.
Actually, live migration with nVMX _does_ work insofar as you have
_identical_ CPUs on both source and destination — i.e. use the QEMU
'-cpu host' for the L1 guests. At least that's been the case in my
experience. FWIW, I frequently use that setup in my test environments.

Just to be quadruple sure, I did the test: Migrate an L2 guest (with
non-shared storage), and it worked just fine. (No 'oops'es, no stack
traces, no "kernel BUG" in `dmesg` or serial consoles on L1s. And I can
login to the L2 guest on the destination L1 just fine.)

With password-less SSH between source and destination and a bit of
libvirt config set up, I ran the migrate command as follows:

$ virsh migrate --verbose --copy-storage-all \
--live cvm1 qemu+tcp://***@f26-vm2/system
Migration: [100 %]
$ echo $?
0

Full details:
https://kashyapc.fedorapeople.org/virt/Migrate-a-nested-guest-08Feb2018.txt

(At the end of the document above, I also posted the libvirt config and
the version details across L0, L1 and L2. So this is a fully repeatable
test.)
--
/kashyap
Kashyap Chamarthy
2018-02-08 11:34:24 UTC
Permalink
Post by Kashyap Chamarthy
[...]
Post by David Hildenbrand
Sounds like a similar problem as in
https://bugzilla.kernel.org/show_bug.cgi?id=198621
In short: there is no (live) migration support for nested VMX yet. So as
soon as your guest is using VMX itself ("nVMX"), this is not expected to
work.
Actually, live migration with nVMX _does_ work insofar as you have
_identical_ CPUs on both source and destination — i.e. use the QEMU
'-cpu host' for the L1 guests. At least that's been the case in my
experience. FWIW, I frequently use that setup in my test environments.
Correcting my erroneous statement above: For live migration to work in a
nested KVM setup, it is _not_ mandatory to use "-cpu host".

I just did another test. Here I used libvirt's 'host-model' for both
source and destination L1 guests, _and_ for L2 guest. Migrated the L2
to destination L1, worked great.

In my setup, both my L1 guests received the following CPU configuration
(on the QEMU command line):

[...]
-cpu Haswell-noTSX,vme=on,ss=on,vmx=on,f16c=on,rdrand=on,\
hypervisor=on,arat=on,tsc_adjust=on,xsaveopt=on,pdpe1gb=on,abm=on,aes=off
[...]

And the L2 guest received this:

[...]
-cpu Haswell-noTSX,vme=on,ss=on,f16c=on,rdrand=on,hypervisor=on,\
arat=on,tsc_adjust=on,xsaveopt=on,pdpe1gb=on,abm=on,aes=off,invpcid=off
[...]
--
/kashyap
Daniel P. Berrangé
2018-02-08 11:40:48 UTC
Permalink
Post by Kashyap Chamarthy
Post by Kashyap Chamarthy
[...]
Post by David Hildenbrand
Sounds like a similar problem as in
https://bugzilla.kernel.org/show_bug.cgi?id=198621
In short: there is no (live) migration support for nested VMX yet. So as
soon as your guest is using VMX itself ("nVMX"), this is not expected to
work.
Actually, live migration with nVMX _does_ work insofar as you have
_identical_ CPUs on both source and destination — i.e. use the QEMU
'-cpu host' for the L1 guests. At least that's been the case in my
experience. FWIW, I frequently use that setup in my test environments.
Correcting my erroneous statement above: For live migration to work in a
nested KVM setup, it is _not_ mandatory to use "-cpu host".
Yes, assuming the L1 guests both get given the same CPU model, then you
can use any CPU model at all for the L2 guests and still be migrate safe,
since your L1 guests provide homogeneous hardware to host L2, regardless
of whether the L0 host is homogeneous.


Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
David Hildenbrand
2018-02-08 11:48:46 UTC
Permalink
Post by Kashyap Chamarthy
[...]
Post by David Hildenbrand
Sounds like a similar problem as in
https://bugzilla.kernel.org/show_bug.cgi?id=198621
In short: there is no (live) migration support for nested VMX yet. So as
soon as your guest is using VMX itself ("nVMX"), this is not expected to
work.
Actually, live migration with nVMX _does_ work insofar as you have
_identical_ CPUs on both source and destination — i.e. use the QEMU
'-cpu host' for the L1 guests. At least that's been the case in my
experience. FWIW, I frequently use that setup in my test environments.
You're mixing use cases. While you talk about migrating an L2, this is
about migrating an L1 that is running an L2.

Migrating an L2 is expected to work just like migrating an L1 that is
not running an L2. (Of course, the usual trouble with CPU models
applies, but upper layers should check and handle that.)
Post by Kashyap Chamarthy
Just to be quadruple sure, I did the test: Migrate an L2 guest (with
non-shared storage), and it worked just fine. (No 'oops'es, no stack
traces, no "kernel BUG" in `dmesg` or serial consoles on L1s. And I can
login to the L2 guest on the destination L1 just fine.)
Once you have the password-less SSH between source and destination, and
$ virsh migrate --verbose --copy-storage-all \
Migration: [100 %]
$ echo $?
0
https://kashyapc.fedorapeople.org/virt/Migrate-a-nested-guest-08Feb2018.txt
(At the end of the document above, I also posted the libvirt config and
the version details across L0, L1 and L2. So this is a fully repeatable
test.)
--
Thanks,

David / dhildenb
Kashyap Chamarthy
2018-02-08 15:23:11 UTC
Permalink
Post by David Hildenbrand
Post by Kashyap Chamarthy
[...]
Post by David Hildenbrand
Sounds like a similar problem as in
https://bugzilla.kernel.org/show_bug.cgi?id=198621
In short: there is no (live) migration support for nested VMX yet. So as
soon as your guest is using VMX itself ("nVMX"), this is not expected to
work.
Actually, live migration with nVMX _does_ work insofar as you have
_identical_ CPUs on both source and destination — i.e. use the QEMU
'-cpu host' for the L1 guests. At least that's been the case in my
experience. FWIW, I frequently use that setup in my test environments.
You're mixing use cases. While you talk about migrating an L2, this is
about migrating an L1 that is running an L2.
Yes, you're right. I mixed up briefly, and corrected myself in the
other email. We're on the same page.
Post by David Hildenbrand
Migrating an L2 is expected to work just like migrating an L1 that is
not running an L2. (Of course, the usual trouble with CPU models
applies, but upper layers should check and handle that.)
Yep.

---

Aside:

I also remember seeing Vitaly's nice talk[*] at FOSDEM last weekend ("A
slightly different kind of nesting"), where he talks about how nVMX
actually works in the context of Intel's "VMCS Shadowing" hardware
feature, which reduces the number of VMEXITs and VMENTRYs.

(Particularly look at his slides 8, 9 and 10.)

I reproduced his diagram from his slide-10 ("How nested virtualization
really works on Intel") in ASCII here:

.---------------------------------------.
|           | VMCS L1->L2 |             |
|           '-------------'             |
|                   |                   |
|   L1 (guest       |   L2 (nested      |
|   hypervisor)     |   guest)          |
|                   |                   |
|                   |                   |
.---------------------------------------.
|    VMCS L0->L1    |    VMCS L0->L2    |
.---------------------------------------.
|                                       |
|            L0 hypervisor              |
'---------------------------------------'
|                                       |
|              Hardware                 |
'---------------------------------------'


[*] https://fosdem.org/2018/schedule/event/vai_kvm_on_hyperv/attachments/slides/2200/export/events/attachments/vai_kvm_on_hyperv/slides/2200/slides_fosdem2018_vkuznets.pdf

[...]
--
/kashyap