Discussion:
[libvirt-users] e1000 network interface takes a long time to set the link ready
Ihar Hrachyshka
2018-05-10 19:06:26 UTC
Hi,
try to use virtio instead...
That is exactly what I tried, and indeed the link is ready almost
immediately. But there are some issues with virtio, like missing
drivers in default Windows images. Kubevirt doesn't currently allow
choosing the NIC type (arguably it should), and its current default is
e1000. Just switching the default to virtio won't fly with Windows, so
unless we expose the choice of NIC type to kubevirt users, they are hit
by one issue (slow cirros boot) or the other (no networking in Windows
machines). While we plan to explore exposing the choice in future
kubevirt versions, it would be nice to have cirros behave correctly
regardless of NIC type.

Ihar
Laine Stump
2018-05-10 21:07:22 UTC
Hi,
In kubevirt, we discovered [1] that whenever e1000 is used for vNIC,
link on the interface becomes ready several seconds after 'ifup' is
executed
What is your definition of "becomes ready"? Are you looking at the
output of "ip link show" in the guest? Or are you watching "brctl
showstp" for the bridge device on the host? Or something else?
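For concreteness, the two checks could look like this (a sketch; "eth0"
and "br0" are example names, and the parsing below runs against a
captured sample line rather than a live system):

```shell
# In the guest: the link is up once LOWER_UP appears in the flags.
#   ip link show eth0
# On the host: watch the tap port's STP state on the bridge.
#   brctl showstp br0
#
# Guest-side check, demonstrated on a captured "ip link" sample line:
sample='2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP'
case "$sample" in
  *LOWER_UP*) echo "link ready" ;;
  *)          echo "link not ready" ;;
esac
```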
which for some buggy images like cirros may slow down boot
process for up to 1 minute [2]. If we switch from e1000 to virtio, the
link is brought up and ready almost immediately.
- L0 kernel: 4.16.5-200.fc27.x86_64 #1 SMP
- libvirt: 3.7.0-4.fc27
- guest kernel: 4.4.0-28-generic #47-Ubuntu
Is there something specific about e1000 that makes it initialize the
link too slowly on libvirt or guest side?
There isn't anything libvirt could do that would cause the link to go
IFF_UP any faster or slower, so if there is an issue it's elsewhere.
Since switching to the virtio device eliminates the problem, my guess
would be that it's something about the implementation of the emulated
device in qemu that is causing a delay in the e1000 driver in the guest.
That's just a guess though.
[1] https://github.com/kubevirt/kubevirt/issues/936
[2] https://bugs.launchpad.net/cirros/+bug/1768955
(I discount the idea of the STP forward-delay timer having an effect, as
suggested in one of the comments on github that points to my explanation
of STP in a libvirt bugzilla record, because that would cause the same
problem for both e1000 and virtio.)

I hesitate to suggest this, because the rtl8139 code in qemu is
considered less well maintained and lower performance than e1000, but
have you tried setting that model to see how it behaves? You may be
forced to make that the default when virtio isn't available.

Another thought - I guess the virtio driver in Cirros is always
available? Perhaps kubevirt could use libosinfo to auto-decide what
device to use for networking based on OS.
Ihar Hrachyshka
2018-05-10 21:44:14 UTC
Post by Laine Stump
Hi,
In kubevirt, we discovered [1] that whenever e1000 is used for vNIC,
link on the interface becomes ready several seconds after 'ifup' is
executed
What is your definition of "becomes ready"? Are you looking at the
output of "ip link show" in the guest? Or are you watching "brctl
showstp" for the bridge device on the host? Or something else?
I was watching the guest dmesg for the following messages:

[ 4.773275] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 6.769235] e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow
Control: RX
[ 6.771408] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

For e1000, there are 2 seconds between those messages; for virtio,
it's near instant. Interestingly, it happens only on the very first
ifup; when I do it a second time after the guest has booted, it's
instant.
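The delay can be measured directly from the kernel timestamps in those
messages; a small sketch (the excerpt is hard-coded from the dmesg
output above):

```shell
# Compute the link-up delay by subtracting the bracketed kernel
# timestamps of the NETDEV_UP and NETDEV_CHANGE messages.
dmesg_excerpt='[    4.773275] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[    6.771408] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready'
printf '%s\n' "$dmesg_excerpt" | awk -F'[][]' '
  /link is not ready/  { up = $2 }
  /link becomes ready/ { ready = $2 }
  END { printf "link-up delay: %.3f s\n", ready - up }'
# prints "link-up delay: 1.998 s"
```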
Post by Laine Stump
which for some buggy images like cirros may slow down boot
process for up to 1 minute [2]. If we switch from e1000 to virtio, the
link is brought up and ready almost immediately.
- L0 kernel: 4.16.5-200.fc27.x86_64 #1 SMP
- libvirt: 3.7.0-4.fc27
- guest kernel: 4.4.0-28-generic #47-Ubuntu
Is there something specific about e1000 that makes it initialize the
link too slowly on libvirt or guest side?
There isn't anything libvirt could do that would cause the link to go
IFF_UP any faster or slower, so if there is an issue it's elsewhere.
Since switching to the virtio device eliminates the problem, my guess
would be that it's something about the implementation of the emulated
device in qemu that is causing a delay in the e1000 driver in the guest.
That's just a guess though.
[1] https://github.com/kubevirt/kubevirt/issues/936
[2] https://bugs.launchpad.net/cirros/+bug/1768955
(I discount the idea of the STP forward-delay timer having an effect, as
suggested in one of the comments on github that points to my explanation
of STP in a libvirt bugzilla record, because that would cause the same
problem for both e1000 and virtio.)
Yes, it's not STP, and I also tried explicitly setting all bridge
timers to 0, with no result. I also ran "tcpdump -i any" inside the
container that hosts the VM VIF, and there was no relevant traffic on
the tap device.
Post by Laine Stump
I hesitate to suggest this, because the rtl8139 code in qemu is
considered less well maintained and lower performance than e1000, but
have you tried setting that model to see how it behaves? You may be
forced to make that the default when virtio isn't available.
Indeed rtl8139 is near instant too:

[ 4.156872] 8139cp 0000:07:01.0 eth0: link up, 100Mbps,
full-duplex, lpa 0x05E1
[ 4.177520] 8139cp 0000:07:01.0 eth0: link up, 100Mbps,
full-duplex, lpa 0x05E1

Thanks for the tip, we will consider it too (also thanks for the
background info about the driver support state).
Post by Laine Stump
Another thought - I guess the virtio driver in Cirros is always
available? Perhaps kubevirt could use libosinfo to auto-decide what
device to use for networking based on OS.
This, or we can introduce explicit tags for NICs / guest type to use.

Thanks a lot for the reply,
Ihar
Daniel P. Berrangé
2018-05-11 08:42:56 UTC
Hi,
In kubevirt, we discovered [1] that whenever e1000 is used for vNIC,
link on the interface becomes ready several seconds after 'ifup' is
executed, which for some buggy images like cirros may slow down boot
process for up to 1 minute [2]. If we switch from e1000 to virtio, the
link is brought up and ready almost immediately.
- L0 kernel: 4.16.5-200.fc27.x86_64 #1 SMP
- libvirt: 3.7.0-4.fc27
- guest kernel: 4.4.0-28-generic #47-Ubuntu
Is there something specific about e1000 that makes it initialize the
link too slowly on libvirt or guest side?
Try the e1000e device instead perhaps.

If all other NIC models work, then this is likely to be a QEMU problem
and should be reported as a bug to them. I notice you're running Fedora 27
though, so before reporting bugs please try with the latest upstream QEMU
release (2.12) to see if that's better.
[1] https://github.com/kubevirt/kubevirt/issues/936
[2] https://bugs.launchpad.net/cirros/+bug/1768955
Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
Ihar Hrachyshka
2018-05-11 18:34:25 UTC
Post by Daniel P. Berrangé
Hi,
In kubevirt, we discovered [1] that whenever e1000 is used for vNIC,
link on the interface becomes ready several seconds after 'ifup' is
executed, which for some buggy images like cirros may slow down boot
process for up to 1 minute [2]. If we switch from e1000 to virtio, the
link is brought up and ready almost immediately.
- L0 kernel: 4.16.5-200.fc27.x86_64 #1 SMP
- libvirt: 3.7.0-4.fc27
- guest kernel: 4.4.0-28-generic #47-Ubuntu
Is there something specific about e1000 that makes it initialize the
link too slowly on libvirt or guest side?
Try the e1000e device instead perhaps.
Thanks a lot for the suggestion; it works indeed. My understanding is
that it's the default NIC for q35 machines starting with QEMU 2.12, so
that's a great choice.
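For anyone following along, switching a domain to e1000e is the same
one-line change in the interface definition (bridge name is a
placeholder; per the thread, e1000e is the built-in default only on q35
machine types starting with QEMU 2.12):

```xml
<interface type='bridge'>
  <source bridge='br0'/>   <!-- placeholder bridge name -->
  <model type='e1000e'/>
</interface>
```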
Post by Daniel P. Berrangé
If all other NIC models work, then this is likely to be a QEMU problem
and should be reported as a bug to them. I notice you're running Fedora 27
though, so before reporting bugs please try with latest upstream QEMU
releases (2.12) to see if that's better
Thanks for the suggestion. I reported a bug here:
https://bugs.launchpad.net/qemu/+bug/1770724

I tried to reproduce it with 2.12 (built the kubevirt stack with Fedora 29
packages), but I hit some fundamental issues in the guest that block me
from reproducing the slow-link bug (with the new qemu/libvirt stack, I
get kernel traces and IRQ errors and no network link at all in the
guest). I hope my report against 2.10 still fits their bug report
requirements.

Thanks again,
Ihar
