Discussion:
[libvirt-users] issue with openssh-server running in a libvirt based centos virtual machine
Adrian Pascalau
2018-01-27 13:13:14 UTC
Permalink
Hi,

I have a strange issue in a libvirt environment, and I do not know how
to solve it.

I have two CentOS hosts: the first is a physical server called
server1, which acts as a host for the second one, called centos1.
centos1 is a virtual machine (VM) running on server1. A Linux bridge
in forwarding mode is used to connect the centos1 VM network interface
to the server1 network interface and to the external network. The
centos1 VM and the Linux bridge are managed with libvirt (well, the
bridge itself in this case is created manually).

# virsh net-dumpxml br0
<network connections='1'>
<name>br0</name>
<uuid>5aaf72a5-023d-4b84-9d7c-d68b0918f620</uuid>
<forward mode='bridge'/>
<bridge name='br0'/>
</network>

# brctl show
bridge name bridge id STP enabled interfaces
br0 8000.fc15b4137688 no eno1
vnet0

Both server1 and centos1 have IP addresses in the same subnet, and
both are reachable with ping from every other host in my network. In
both server1 and centos1, the openssh-server configuration in
/etc/ssh/sshd_config is the default one, and has not been changed.

When I ssh with PuTTY to the IP address of the physical server
server1, everything works as expected: I get a login prompt, I enter
my password and I log in.

However, when I use PuTTY to connect to the centos1 VM, I do not get a
login prompt at all. So I think there might be some issue between the
server1 physical interface and my centos1 VM.

I ran openssh-server in debug mode, to see where the SSH connection
hangs, and here is what I get:

[...]
debug1: Server will not fork when running in debugging mode.
debug1: rexec start in 5 out 5 newsock 5 pipe -1 sock 8
debug1: sshd version OpenSSH_7.4, OpenSSL 1.0.2k-fips 26 Jan 2017
debug1: private host key #0: ssh-rsa
SHA256:pEuFQsodwK+0PoRzbVRba1ahHLEpwp8DG2KGQmxOGJk
debug1: private host key #1: ecdsa-sha2-nistp256
SHA256:F6HrSNWZhYaU7LMweI+RBviqTCHcTYyMBGPDz5OjT4c
debug1: private host key #2: ssh-ed25519
SHA256:aG3V6jjPHXUnNeavbxT/xozqrb5q3yWDkkAmXBCdnGk
debug1: inetd sockets after dupping: 3, 3
Connection from x.x.x.181 port 49436 on x.x.x.115 port 22
debug1: Client protocol version 2.0; client software version PuTTY_Release_0.70
debug1: no match: PuTTY_Release_0.70
debug1: Local version string SSH-2.0-OpenSSH_7.4
debug1: Enabling compatibility mode for protocol 2.0
debug1: SELinux support enabled [preauth]
debug1: permanently_set_uid: 74/74 [preauth]
debug1: list_hostkey_types:
ssh-rsa,rsa-sha2-512,rsa-sha2-256,ecdsa-sha2-nistp256,ssh-ed25519
[preauth]
debug1: SSH2_MSG_KEXINIT sent [preauth]

I tried other Windows-based SSH clients (MobaXterm) and the same
issue happens. I discussed this with people on the openssh mailing
list, and they said the issue could most probably be caused by a path
MTU/fragmentation problem...

Then I moved my centos1 qcow2 image to another physical server called
server2, with exactly the same hardware specs and network connections,
where I have installed an all-in-one OpenStack Pike. The network is
managed with neutron in this case; however, I have configured neutron
so that the centos1 VM interface connects through a Linux bridge
(managed by neutron) to the server2 physical network interface, just
like in the libvirt case.

# brctl show
bridge name bridge id STP enabled interfaces
brqa13eec69-a4 8000.0e7faabad6d4 no eno1
tap8cb53db0-fb
tapb24a1cc5-20

Above, tapb24a1cc5-20 is the tap interface towards my centos1 VM.

In this case, the PuTTY issue is gone, and I do not have any problem
anymore. If I go back to the libvirt environment on server1, I get the
same issue again.

So I tend to think that my SSH connection issue is caused by libvirt
and the way its networking is configured; however, I do not know how
to troubleshoot this any further.

Any help is greatly appreciated.

Adrian
Peter Crowther
2018-01-27 13:44:57 UTC
Permalink
You say you can ping but not ssh. If you install tcpdump on the VM, can you
see the ping packets arriving and leaving? If not, I suspect an address
collision - especially if ping continues to work with the VM shut down. If
you can't ping, check the other end of your bridge. I'm more familiar with
Open vSwitch, but I'm somewhat concerned that your bridge definition
doesn't include a physical NIC as one of its connections.

Peter

On 27 Jan 2018 1:13 p.m., "Adrian Pascalau" <***@gmail.com>
wrote:

[...]
Adrian Pascalau
2018-01-27 14:35:35 UTC
Permalink
On Sat, Jan 27, 2018 at 3:44 PM, Peter Crowther
Post by Peter Crowther
You say you can ping but not ssh. If you install tcpdump on the VM, can you
see the ping packets arriving and leaving? If not, I suspect an address
collision - especially if ping continues to work with the VM shut down. If
you can't ping, check the other end of your bridge. I'm more familiar with
open vSwitch, but I'm somewhat concerned that your bridge definition doesn't
include a physical NIC as one of its connections.
Peter, thanks for your reply. Yes, I see the ICMP request coming into
the centos1 VM and the ICMP reply going out. I am sure this is not an
IP address collision.

The bridge in the server1 libvirt environment is created like this:

# cat /etc/sysconfig/network-scripts/ifcfg-eno1
DEVICE=eno1
BOOTPROTO=none
BRIDGE=br0
ONBOOT=YES

# cat /etc/sysconfig/network-scripts/ifcfg-br0
DEVICE=br0
TYPE=Bridge
BOOTPROTO=static
IPADDR=x.x.219.54
NETMASK=255.255.255.0
GATEWAY=x.x.219.1
ONBOOT=YES

The result of the above is the following:
# brctl show
bridge name bridge id STP enabled interfaces
br0 8000.fc15b4137688 no eno1

Then I define the above br0 bridge in libvirt, like below:

# virsh net-dumpxml br0
<network>
<name>br0</name>
<uuid>5aaf72a5-023d-4b84-9d7c-d68b0918f620</uuid>
<forward mode='bridge'/>
<bridge name='br0'/>
</network>

# virsh net-list
Name State Autostart Persistent
----------------------------------------------------------
br0 active no yes

As soon as I have the br0 bridge defined in libvirt, I start the
centos1 VM, that has eth0 interface connected to this br0 bridge:

# virsh dumpxml centos1
[...]
<interface type='network'>
<mac address='52:54:00:40:31:85'/>
<source network='br0'/>
<model type='e1000'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x03'
function='0x0'/>
</interface>
[...]

# brctl show
bridge name bridge id STP enabled interfaces
br0 8000.fc15b4137688 no eno1
vnet0

And that is all. With this setup I have the centos1 VM interface eth0
directly connected to the br0 bridge through the vnet0 tap interface.
The br0 bridge is also connected to the eno1 physical interface in
server1, so my centos1 VM should be accessible to the outside world.

However, I have the ssh issue described in my initial email, while
ping is working. In the openssh-server debug log, I see the ssh
connection established and later hanging with the last debug message
being "debug1: SSH2_MSG_KEXINIT sent [preauth]".

Am I doing something wrong with my libvirt setup above?
Adrian Pascalau
2018-01-28 17:07:20 UTC
Permalink
On Sat, Jan 27, 2018 at 3:44 PM, Peter Crowther
Post by Peter Crowther
[...]
Ok, so I have investigated a bit further by taking some tcpdump and
Wireshark traces, as you suggested, and here is what I have found:

When an Ethernet frame shorter than 60 bytes goes through the
network, it is padded with 0x00 bytes until it is 60 bytes long (64
with the frame check sequence). When such a padded frame goes from the
centos1 VM through the Linux bridge br0 to the Windows host, the 0x00
padding bytes are wrongly treated as part of the user data above the
IP and TCP headers, so the upstream protocol (SSH in my case) tries to
interpret them, and this is why PuTTY hangs. Those 0x00 padding bytes
belong to the layer-2 Ethernet frame and should not be counted as user
data by the higher-level protocols.

About the padding bytes I have found some info here:
https://wiki.wireshark.org/Ethernet#Allowed_Packet_Lengths
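The effect can be sketched with a toy frame in Python (nothing here comes from the actual capture; the layout is just the standard 14-byte Ethernet header plus a 20-byte IPv4 header with no options): a receiver that honours the IP total-length field strips the padding, while one that blindly takes everything after the headers hands the 0x00 bytes to the application.

```python
import struct

ETH_HLEN = 14          # destination MAC + source MAC + EtherType
MIN_FRAME = 60         # minimum Ethernet frame size, excluding the FCS

def build_frame(payload: bytes) -> bytes:
    """Build a toy Ethernet+IPv4 frame and pad it to the 60-byte minimum."""
    ip_total_len = 20 + len(payload)              # 20-byte IPv4 header, no options
    ip_header = struct.pack("!BBHHHBBH4s4s",
                            0x45, 0, ip_total_len,  # version/IHL, TOS, total length
                            0, 0, 64, 6, 0,         # id, frag, TTL, proto=TCP, csum
                            b"\x0a\x00\x00\x01", b"\x0a\x00\x00\x02")
    frame = b"\xff" * 12 + b"\x08\x00" + ip_header + payload
    return frame + b"\x00" * max(0, MIN_FRAME - len(frame))   # 0x00 padding

def payload_correct(frame: bytes) -> bytes:
    """Trim to the IP total-length field, as a conforming stack does."""
    ip_total_len = struct.unpack("!H", frame[ETH_HLEN + 2:ETH_HLEN + 4])[0]
    return frame[ETH_HLEN + 20:ETH_HLEN + ip_total_len]

def payload_naive(frame: bytes) -> bytes:
    """Take everything after the headers -- the padding leaks into the data."""
    return frame[ETH_HLEN + 20:]

frame = build_frame(b"SSH-")
print(payload_correct(frame))   # b'SSH-'
print(payload_naive(frame))     # b'SSH-' followed by 22 0x00 bytes
```

Trimming to the IP total length is what a conforming stack does; the padding only exists at layer 2 and the IP header already says exactly where the real data ends.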

The flow in my environment is like this:

[windows host]<---->[server1 host br0(eno1,vnet0)]<---->[eth0 centos1 VM]

All of the above hosts are in the same subnet, so there are no routers
in between. server1 has the br0 Linux bridge in forwarding mode, which
connects the eno1 physical interface with the vnet0 tap interface. The
vnet0 tap interface is connected to the centos1 VM eth0 interface.

When I (1) ssh from the Windows host to server1, there is no issue.
When I (2) ssh from the same Windows host to the centos1 VM, going
through the br0 bridge, I hit the SSH issue I mentioned. I took
several tcpdump traces and compared the working ones with the
non-working ones, and this is the conclusion. So at this stage
everything points to the Linux bridge, since in the working scenario
(1) those 0x00 padding bytes are left alone and not counted as user
data by the IP and TCP protocols.
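For anyone comparing traces the same way, here is a sketch of how the suspect frames can be flagged programmatically (assuming the raw IPv4-over-Ethernet frame bytes have already been extracted from the pcap with whatever tool you prefer; the `short_ack` frame below is hand-built for illustration):

```python
import struct

ETH_HLEN = 14  # destination MAC + source MAC + EtherType

def is_padded(frame: bytes) -> bool:
    """True if the frame carries bytes beyond what the IPv4 total-length covers."""
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype != 0x0800:          # only look at IPv4 frames
        return False
    ip_total_len = struct.unpack("!H", frame[ETH_HLEN + 2:ETH_HLEN + 4])[0]
    return len(frame) > ETH_HLEN + ip_total_len

# A 54-byte bare ACK padded to 60: 6 trailing bytes beyond the IP total length.
short_ack = (b"\xff" * 12 + b"\x08\x00"                     # Ethernet header
             + struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40,    # IPv4, total length 40
                           0, 0, 64, 6, 0,
                           b"\x0a\x00\x00\x01", b"\x0a\x00\x00\x02")
             + b"\x00" * 20                                  # bare TCP header
             + b"\x00" * 6)                                  # the layer-2 padding
print(is_padded(short_ack))   # True
```

Running every frame of a capture through such a check shows which ones even have padding to mishandle; the broken element is whichever hop starts counting those bytes as TCP payload.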
Adrian Pascalau
2018-01-29 09:44:21 UTC
Permalink
On Sun, Jan 28, 2018 at 7:07 PM, Adrian Pascalau
Post by Adrian Pascalau
[...]
Ok, so I found a workaround for this, even though I do not know what
is causing the issue.

Basically, I noticed that I have this SSH connection issue only when
the ssh client runs on a Windows host. If the client runs on a Linux
host, the connection works without any problem. So I compared tcpdump
traces of SSH connections initiated from both Windows and Linux, and
what I noticed is that on CentOS Linux the TCP stack uses timestamps
in the TCP options by default, and because of this the Ethernet frames
are never below 60 bytes, while on my Windows host TCP timestamps are
not used, so some Ethernet frames end up shorter than 60 bytes.
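The size arithmetic behind that observation, as a quick sketch (assuming IPv4 with no IP options and a bare ACK segment; the TCP timestamp option is 10 bytes, normally padded with two NOP bytes to 12):

```python
ETH_HEADER = 14   # MACs + EtherType
IP_HEADER = 20    # IPv4, no options
TCP_HEADER = 20   # base TCP header
TS_OPTION = 12    # timestamp option (10 bytes) + 2 NOP bytes for alignment
MIN_FRAME = 60    # Ethernet minimum, excluding the FCS

# A bare ACK segment carries no payload at all.
ack_without_ts = ETH_HEADER + IP_HEADER + TCP_HEADER            # 54 bytes
ack_with_ts = ETH_HEADER + IP_HEADER + TCP_HEADER + TS_OPTION   # 66 bytes

print(ack_without_ts, ack_without_ts < MIN_FRAME)  # 54 True  -> gets padded
print(ack_with_ts, ack_with_ts < MIN_FRAME)        # 66 False -> never padded
```

So with timestamps enabled even an empty ACK clears the 60-byte minimum, which is why enabling them on the client hides the padding bug rather than fixing it.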

So I enabled TCP timestamps on Windows as well, by running the command
'netsh int tcp set global timestamps=enabled', and just like that, ssh
started to work. Still, I do not know what is causing this issue, or
who to blame for this behavior...

Any suggestion on how to identify which network element wrongly
assigns the Ethernet padding to the TCP payload is more than welcome.
Michal Privoznik
2018-01-29 13:56:45 UTC
Permalink
Post by Adrian Pascalau
[...]
Any suggestion how to identify which network element wrongly assigns
the Ethernet padding to the TCP payload is more than welcome.
Since this is happening only on Windows and with PuTTY, I'd suspect
the latter. Does switching to a different client (say, WinSCP or Tera
Term) help?

Michal
Adrian Pascalau
2018-01-29 19:43:07 UTC
Permalink
Post by Michal Privoznik
Since this is happening only on Windows and Putty, I'd suspect the
latter one. Does switching to different client (say winscp or Tera Term)
help?
Exactly the same thing happens with WinSCP and with the MobaXterm SSH
client, which is based on Cygwin.
