Discussion:
[libvirt-users] error : virHashForEach:597 : Hash operation not allowed during iteration
Vincent Bernat
2018-04-04 08:36:31 UTC
Hey!

On many of our servers, we often have the following error:

error : virHashForEach:597 : Hash operation not allowed during iteration

When querying the list of domains, this means libvirt will silently
return 0 domains (most callers of virHashForEach don't check the return
code).
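A minimal client sketch that makes the symptom visible from the API side
(assuming a local qemu:///system daemon; on an affected host this
intermittently prints 0 with no libvirt error raised):

    #include <stdio.h>
    #include <stdlib.h>
    #include <libvirt/libvirt.h>

    int main(void)
    {
        virConnectPtr conn = virConnectOpenReadOnly("qemu:///system");
        virDomainPtr *domains = NULL;
        int n, i;

        if (!conn)
            return 1;

        /* When the daemon-side iteration fails and the return code is
         * ignored, this returns 0 instead of the real domain count,
         * with no error reported. */
        n = virConnectListAllDomains(conn, &domains, 0);
        printf("%d domains\n", n);

        for (i = 0; i < n; i++)
            virDomainFree(domains[i]);
        free(domains);
        virConnectClose(conn);
        return 0;
    }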

We are using a hook, but the hook doesn't query libvirt back, so it
should be safe from this point of view. However, we have several sources
querying libvirt at regular intervals (once every 5 seconds from two
different sources). I am unable to reproduce the problem by just
running "virsh list" in parallel.
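For reference, the reproduction attempt was along these lines (two virsh
clients hammering the daemon concurrently; the count and flags are just
an example):

    for i in $(seq 1000); do
        virsh list >/dev/null &
        virsh list --all >/dev/null
    done
    wait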

We are using libvirt 3.6.

Is it an expected constraint to not be able to reliably query libvirt
from several sources? Have fixes for this problem been pushed since 3.6?
--
A kind of Batman of contemporary letters.
-- Philip Larkin on Anthony Burgess
Michal Privoznik
2018-04-04 09:17:46 UTC
Post by Vincent Bernat
Hey!
error : virHashForEach:597 : Hash operation not allowed during iteration
When querying the list of domains, this means libvirt will silently
return 0 domains (most callers of virHashForEach don't check the return
code).
We are using a hook, but the hook doesn't query libvirt back, so it
should be safe from this point of view. However, we have several sources
querying libvirt at regular intervals (once every 5 seconds from two
different sources). I am unable to reproduce the problem by just
running "virsh list" in parallel.
We are using libvirt 3.6.
Is this plain 3.6.0 or do you have any patches on top of it (e.g. some
in-house patches)?
Post by Vincent Bernat
Is it an expected constraint to not be able to reliably query libvirt
from several sources? Have fixes for this problem been pushed since 3.6?
Dunno, this is the first time I hear about this issue. Maybe you can
try to set a breakpoint on virHashIterationError() and, when it's hit,
get a stack trace of all threads ('t a a bt'). That might shed more
light on the issue. Smells like we are not locking properly somewhere.
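Something along these lines, assuming debug symbols for libvirtd are
installed:

    # gdb -p $(pidof libvirtd)
    (gdb) break virHashIterationError
    (gdb) continue
    ...wait for the error to fire...
    (gdb) thread apply all bt     # the long form of 't a a bt'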

Also, you might want to try the latest release, 4.2.0, and see if the
bug still reproduces.

Michal
Vincent Bernat
2018-04-04 11:31:43 UTC
Post by Michal Privoznik
Post by Vincent Bernat
We are using a hook, but the hook doesn't query libvirt back, so it
should be safe from this point of view. However, we have several sources
querying libvirt at regular intervals (once every 5 seconds from two
different sources). I am unable to reproduce the problem by just
running "virsh list" in parallel.
We are using libvirt 3.6.
Is this plain 3.6.0 or do you have any patches on top of it (e.g. some
in-house patches)?
It's the one from Ubuntu Artful with some additional patches to support
Skylake and the IBRS CPU flag.
Post by Michal Privoznik
Post by Vincent Bernat
Is it an expected constraint to not be able to reliably query libvirt
from several sources? Have fixes for this problem been pushed since 3.6?
Dunno, this is the first time I hear about this issue. Maybe you can
try to set a breakpoint on virHashIterationError() and, when it's hit,
get a stack trace of all threads ('t a a bt'). That might shed more
light on the issue. Smells like we are not locking properly somewhere.
I'll try. It doesn't happen that often on a given host.
Post by Michal Privoznik
Also, you might want to try the latest release, 4.2.0, and see if the
bug still reproduces.
I'll try 4.0, which is the version that will be in Ubuntu Bionic (I am
trying not to diverge from the distribution, so we keep getting security
updates for free).
--
Choose variable names that won't be confused.
- The Elements of Programming Style (Kernighan & Plauger)
Vincent Bernat
2018-04-04 12:59:45 UTC
Post by Michal Privoznik
Dunno, this is the first time I hear about this issue. Maybe you can
try to set a breakpoint on virHashIterationError() and, when it's hit,
get a stack trace of all threads ('t a a bt'). That might shed more
light on the issue. Smells like we are not locking properly somewhere.
So, we have two threads iterating. Both of them originate from
virNetServerProgramDispatchCall. So, it seems a lock is missing?

Thread 10 (Thread 0x7f931f814700 (LWP 126453)):
#0 virHashForEach (table=0x7f92fc69a480, iter=iter@entry=0x7f932ea8fbf0 <virDomainObjListCollectIterator>, data=data@entry=0x7f931f813a20) at ../../../src/util/virhash.c:597
#1 0x00007f932ea911c3 in virDomainObjListCollect (domlist=0x7f92fc82dd50, conn=conn@entry=0x7f92e0000f80, vms=vms@entry=0x7f931f813a80, nvms=nvms@entry=0x7f931f813a90, filter=0x7f932ebdc6b0 <virConnectListAllDomainsCheckACL>, flags=48)
at ../../../src/conf/virdomainobjlist.c:935
#2 0x00007f932ea91482 in virDomainObjListExport (domlist=<optimized out>, conn=0x7f92e0000f80, domains=0x7f931f813af0, filter=<optimized out>, flags=<optimized out>) at ../../../src/conf/virdomainobjlist.c:1019
#3 0x00007f932eaf3b75 in virConnectListAllDomains (conn=0x7f92e0000f80, domains=0x7f931f813af0, flags=48) at ../../../src/libvirt-domain.c:6491
#4 0x0000559f219f6dca in remoteDispatchConnectListAllDomains (server=0x559f23795530, msg=0x559f237b78c0, ret=0x7f92d0000920, args=0x7f92d0000ad0, rerr=0x7f931f813bc0, client=<optimized out>) at ../../../daemon/remote_dispatch.h:1244
#5 remoteDispatchConnectListAllDomainsHelper (server=0x559f23795530, client=<optimized out>, msg=0x559f237b78c0, rerr=0x7f931f813bc0, args=0x7f92d0000ad0, ret=0x7f92d0000920) at ../../../daemon/remote_dispatch.h:1220
#6 0x00007f932eb5c5b9 in virNetServerProgramDispatchCall (msg=0x559f237b78c0, client=0x559f237b7bd0, server=0x559f23795530, prog=0x559f237b2a50) at ../../../src/rpc/virnetserverprogram.c:437
#7 virNetServerProgramDispatch (prog=0x559f237b2a50, server=server@entry=0x559f23795530, client=0x559f237b7bd0, msg=0x559f237b78c0) at ../../../src/rpc/virnetserverprogram.c:307
#8 0x0000559f21a20268 in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x559f23795530) at ../../../src/rpc/virnetserver.c:148
#9 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x559f23795530) at ../../../src/rpc/virnetserver.c:169
#10 0x00007f932ea3871b in virThreadPoolWorker (opaque=opaque@entry=0x559f2377da20) at ../../../src/util/virthreadpool.c:167
#11 0x00007f932ea37ac8 in virThreadHelper (data=<optimized out>) at ../../../src/util/virthread.c:206
#12 0x00007f932e0ed6ba in start_thread (arg=0x7f931f814700) at pthread_create.c:333
#13 0x00007f932de2341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Thread 6 (Thread 0x7f9321818700 (LWP 126449)):
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f932e0efdbd in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7f92ec011c80) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007f932ea37c45 in virMutexLock (m=m@entry=0x7f92ec011c80) at ../../../src/util/virthread.c:89
#3 0x00007f932ea180ca in virObjectLock (anyobj=anyobj@entry=0x7f92ec011c70) at ../../../src/util/virobject.c:388
#4 0x00007f932ea8fd81 in virDomainObjListCopyActiveIDs (payload=0x7f92ec011c70, name=<optimized out>, opaque=0x7f9321817aa0) at ../../../src/conf/virdomainobjlist.c:679
#5 0x00007f932e9ee6cd in virHashForEach (table=0x7f92fc69a480, iter=iter@entry=0x7f932ea8fd70 <virDomainObjListCopyActiveIDs>, data=data@entry=0x7f9321817aa0) at ../../../src/util/virhash.c:606
#6 0x00007f932ea90fba in virDomainObjListGetActiveIDs (doms=0x7f92fc82dd50, ids=<optimized out>, maxids=<optimized out>, filter=<optimized out>, conn=<optimized out>) at ../../../src/conf/virdomainobjlist.c:701
#7 0x00007f932eae1c6e in virConnectListDomains (conn=0x7f92e40018b0, ids=0x7f92f4015ae0, maxids=14) at ../../../src/libvirt-domain.c:66
#8 0x0000559f21a0d3c1 in remoteDispatchConnectListDomains (server=<optimized out>, msg=0x559f237beea0, ret=0x7f92f4009eb0, args=0x7f92f4016290, rerr=0x7f9321817bc0, client=<optimized out>) at ../../../daemon/remote_dispatch.h:2108
#9 remoteDispatchConnectListDomainsHelper (server=<optimized out>, client=<optimized out>, msg=0x559f237beea0, rerr=0x7f9321817bc0, args=0x7f92f4016290, ret=0x7f92f4009eb0) at ../../../daemon/remote_dispatch.h:2076
#10 0x00007f932eb5c5b9 in virNetServerProgramDispatchCall (msg=0x559f237beea0, client=0x559f237c0ce0, server=0x559f23795530, prog=0x559f237b2a50) at ../../../src/rpc/virnetserverprogram.c:437
#11 virNetServerProgramDispatch (prog=0x559f237b2a50, server=***@entry=0x559f23795530, client=0x559f237c0ce0, msg=0x559f237beea0) at ../../../src/rpc/virnetserverprogram.c:307
#12 0x0000559f21a20268 in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x559f23795530) at ../../../src/rpc/virnetserver.c:148
#13 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x559f23795530) at ../../../src/rpc/virnetserver.c:169
#14 0x00007f932ea3871b in virThreadPoolWorker (opaque=opaque@entry=0x559f2377d860) at ../../../src/util/virthreadpool.c:167
#15 0x00007f932ea37ac8 in virThreadHelper (data=<optimized out>) at ../../../src/util/virthread.c:206
#16 0x00007f932e0ed6ba in start_thread (arg=0x7f9321818700) at pthread_create.c:333
#17 0x00007f932de2341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Post by Michal Privoznik
Also, you might want to try the latest release, 4.2.0, and see if the
bug still reproduces.
Michal
--
Lay on, MacDuff, and curs'd be him who first cries, "Hold, enough!".
-- Shakespeare
Michal Privoznik
2018-04-04 13:19:03 UTC
Post by Vincent Bernat
Post by Michal Privoznik
Dunno, this is the first time I hear about this issue. Maybe you can
try to set a breakpoint on virHashIterationError() and, when it's hit,
get a stack trace of all threads ('t a a bt'). That might shed more
light on the issue. Smells like we are not locking properly somewhere.
So, we have two threads iterating. Both of them originate from
virNetServerProgramDispatchCall. So, it seems a lock is missing?
No. As suspected, both threads are reading the internal list of domains
(actually a hash table). They are not modifying the list, so they grab a
read lock.
Post by Vincent Bernat
at ../../../src/conf/virdomainobjlist.c:935
#2 0x00007f932ea91482 in virDomainObjListExport (domlist=<optimized out>, conn=0x7f92e0000f80, domains=0x7f931f813af0, filter=<optimized out>, flags=<optimized out>) at ../../../src/conf/virdomainobjlist.c:1019
#3 0x00007f932eaf3b75 in virConnectListAllDomains (conn=0x7f92e0000f80, domains=0x7f931f813af0, flags=48) at ../../../src/libvirt-domain.c:6491
#4 0x0000559f219f6dca in remoteDispatchConnectListAllDomains (server=0x559f23795530, msg=0x559f237b78c0, ret=0x7f92d0000920, args=0x7f92d0000ad0, rerr=0x7f931f813bc0, client=<optimized out>) at ../../../daemon/remote_dispatch.h:1244
#5 remoteDispatchConnectListAllDomainsHelper (server=0x559f23795530, client=<optimized out>, msg=0x559f237b78c0, rerr=0x7f931f813bc0, args=0x7f92d0000ad0, ret=0x7f92d0000920) at ../../../daemon/remote_dispatch.h:1220
#6 0x00007f932eb5c5b9 in virNetServerProgramDispatchCall (msg=0x559f237b78c0, client=0x559f237b7bd0, server=0x559f23795530, prog=0x559f237b2a50) at ../../../src/rpc/virnetserverprogram.c:437
#8 0x0000559f21a20268 in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x559f23795530) at ../../../src/rpc/virnetserver.c:148
#9 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x559f23795530) at ../../../src/rpc/virnetserver.c:169
#11 0x00007f932ea37ac8 in virThreadHelper (data=<optimized out>) at ../../../src/util/virthread.c:206
#12 0x00007f932e0ed6ba in start_thread (arg=0x7f931f814700) at pthread_create.c:333
#13 0x00007f932de2341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#4 0x00007f932ea8fd81 in virDomainObjListCopyActiveIDs (payload=0x7f92ec011c70, name=<optimized out>, opaque=0x7f9321817aa0) at ../../../src/conf/virdomainobjlist.c:679
#6 0x00007f932ea90fba in virDomainObjListGetActiveIDs (doms=0x7f92fc82dd50, ids=<optimized out>, maxids=<optimized out>, filter=<optimized out>, conn=<optimized out>) at ../../../src/conf/virdomainobjlist.c:701
#7 0x00007f932eae1c6e in virConnectListDomains (conn=0x7f92e40018b0, ids=0x7f92f4015ae0, maxids=14) at ../../../src/libvirt-domain.c:66
#8 0x0000559f21a0d3c1 in remoteDispatchConnectListDomains (server=<optimized out>, msg=0x559f237beea0, ret=0x7f92f4009eb0, args=0x7f92f4016290, rerr=0x7f9321817bc0, client=<optimized out>) at ../../../daemon/remote_dispatch.h:2108
#9 remoteDispatchConnectListDomainsHelper (server=<optimized out>, client=<optimized out>, msg=0x559f237beea0, rerr=0x7f9321817bc0, args=0x7f92f4016290, ret=0x7f92f4009eb0) at ../../../daemon/remote_dispatch.h:2076
#10 0x00007f932eb5c5b9 in virNetServerProgramDispatchCall (msg=0x559f237beea0, client=0x559f237c0ce0, server=0x559f23795530, prog=0x559f237b2a50) at ../../../src/rpc/virnetserverprogram.c:437
#12 0x0000559f21a20268 in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x559f23795530) at ../../../src/rpc/virnetserver.c:148
#13 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x559f23795530) at ../../../src/rpc/virnetserver.c:169
#15 0x00007f932ea37ac8 in virThreadHelper (data=<optimized out>) at ../../../src/util/virthread.c:206
#16 0x00007f932e0ed6ba in start_thread (arg=0x7f9321818700) at pthread_create.c:333
#17 0x00007f932de2341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Both threads call virHashForEach(table=0x7f92fc69a480). Thread 6 got
there first, so it starts iterating and sets table->iterating; later,
when thread 10 enters the function, an error is reported.
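Simplified, the guard looks like this (a sketch of the pattern, not the
verbatim virhash.c source):

    #include <stdbool.h>
    #include <sys/types.h>

    typedef int (*virHashIterator)(void *payload, const void *name,
                                   void *data);

    typedef struct {
        /* ...buckets elided... */
        bool iterating;   /* set for the duration of an iteration */
    } virHashTableSketch;

    static ssize_t
    virHashForEachSketch(virHashTableSketch *table,
                         virHashIterator iter, void *data)
    {
        if (table->iterating)
            return -1;  /* "Hash operation not allowed during iteration" */
        table->iterating = true;   /* check-then-set: not atomic */

        /* ...walk every bucket, calling iter(payload, name, data)... */
        (void)iter;
        (void)data;

        table->iterating = false;
        return 0;
    }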

I guess we can go with what Dan suggested and after some rework we can
just drop ->iterating completely.

Michal
Vincent Bernat
2018-04-04 13:48:05 UTC
Post by Michal Privoznik
Both threads call virHashForEach(table=0x7f92fc69a480). Thread 6 got
there first, so it starts iterating and sets table->iterating; later,
when thread 10 enters the function, an error is reported.
I guess we can go with what Dan suggested and after some rework we can
just drop ->iterating completely.
I may have missed this suggestion. Maybe Dan only sent it to you? In
the meantime, could I change the locks around virHashForEach() and
similar to read/write locks?
--
Nothing so needs reforming as other people's habits.
-- Mark Twain, "Pudd'nhead Wilson's Calendar"
Michal Privoznik
2018-04-04 13:59:11 UTC
Post by Vincent Bernat
Post by Michal Privoznik
Both threads call virHashForEach(table=0x7f92fc69a480). Thread 6 got
there first, so it starts iterating and sets table->iterating; later,
when thread 10 enters the function, an error is reported.
I guess we can go with what Dan suggested and after some rework we can
just drop ->iterating completely.
I may have missed this suggestion. Maybe Dan only sent it to you?
No, there is another thread where this issue is discussed:

https://www.redhat.com/archives/libvir-list/2018-April/msg00190.html

Post by Vincent Bernat
In the meantime, could I change the locks around virHashForEach() and
similar to read/write locks?
You can do that locally, but as a patch it's very unlikely to be
accepted upstream because we've introduced RW locks to be able to access
the domain list from multiple threads.

Michal
Vincent Bernat
2018-04-04 14:21:21 UTC
Post by Michal Privoznik
Post by Vincent Bernat
I may have missed this suggestion. Maybe Dan only sent it to you?
https://www.redhat.com/archives/libvir-list/2018-April/msg00190.html
Post by Vincent Bernat
In the meantime, could I change the locks around virHashForEach() and
similar to read/write locks?
You can do that locally, but as a patch it's very unlikely to be
accepted upstream because we've introduced RW locks to be able to access
the domain list from multiple threads.
Looking a bit more, the whole "iterating" bit is currently unsafe.
First, it is racy, since it's check-then-set: two threads may both see
the value as false and both start iterating. Moreover, in some places it
is not set at all, for example in virHashAddOrUpdateEntry(), so it would
be possible to iterate over the hash while that function is updating it.
This seems to be prevented only by callers using an RW lock. So, it
seems we can just remove this "iterating" bool and check that all users
of these functions take the appropriate lock.
--
Make sure all variables are initialised before use.
- The Elements of Programming Style (Kernighan & Plauger)
Vincent Bernat
2018-04-04 15:00:48 UTC
Post by Vincent Bernat
Post by Michal Privoznik
You can do that locally, but as a patch it's very unlikely to be
accepted upstream because we've introduced RW locks to be able to access
the domain list from multiple threads.
Looking a bit more, the whole "iterating" bit is currently unsafe.
First, it is racy, since it's check-then-set: two threads may both see
the value as false and both start iterating. Moreover, in some places it
is not set at all, for example in virHashAddOrUpdateEntry(), so it would
be possible to iterate over the hash while that function is updating it.
This seems to be prevented only by callers using an RW lock. So, it
seems we can just remove this "iterating" bool and check that all users
of these functions take the appropriate lock.
My bad, I didn't see the use of table->current...
--
The only way to keep your health is to eat what you don't want, drink what
you don't like, and do what you'd rather not.
-- Mark Twain
Vincent Bernat
2018-04-04 18:38:12 UTC
Post by Vincent Bernat
Post by Vincent Bernat
Post by Michal Privoznik
You can do that locally, but as a patch it's very unlikely to be
accepted upstream because we've introduced RW locks to be able to access
domain list from multiple threads.
Looking a bit more, the whole "iterating" bit is currently unsafe.
First, it is racy, since it's check-then-set: two threads may both see
the value as false and both start iterating. Moreover, in some places it
is not set at all, for example in virHashAddOrUpdateEntry(), so it would
be possible to iterate over the hash while that function is updating it.
This seems to be prevented only by callers using an RW lock. So, it
seems we can just remove this "iterating" bool and check that all users
of these functions take the appropriate lock.
My bad, I didn't see the use of table->current...
It seems it can also go away if callers correctly use an RW lock. I am
currently testing the attached patch on a few servers (the patch is for
master, but applied to our 3.6 version). After checking many locations,
I came to the conclusion that the locks are used correctly.
Michal Privoznik
2018-04-05 15:00:31 UTC
0001-util-don-t-check-for-parallel-iteration-in-hash-rela.patch
From 7bf32825325b124edce58687a8cd2ceff13328b9 Mon Sep 17 00:00:00 2001
Date: Wed, 4 Apr 2018 16:25:37 +0200
Subject: [PATCH] util: don't check for parallel iteration in hash-related
functions
Can you send this as a regular patch? I'd like to review it.

Michal
Vincent Bernat
2018-04-05 15:55:21 UTC
Post by Michal Privoznik
0001-util-don-t-check-for-parallel-iteration-in-hash-rela.patch
From 7bf32825325b124edce58687a8cd2ceff13328b9 Mon Sep 17 00:00:00 2001
Date: Wed, 4 Apr 2018 16:25:37 +0200
Subject: [PATCH] util: don't check for parallel iteration in hash-related
functions
Can you send this as a regular patch? I'd like to review it.
What do you mean by regular? Not an attachment? Through git send-email?

FYI, I am now running this patch on 22 hosts without any detected issues.
--
Let the machine do the dirty work.
- The Elements of Programming Style (Kernighan & Plauger)
Michal Privoznik
2018-04-06 07:23:09 UTC
Post by Vincent Bernat
Post by Michal Privoznik
0001-util-don-t-check-for-parallel-iteration-in-hash-rela.patch
From 7bf32825325b124edce58687a8cd2ceff13328b9 Mon Sep 17 00:00:00 2001
Date: Wed, 4 Apr 2018 16:25:37 +0200
Subject: [PATCH] util: don't check for parallel iteration in hash-related
functions
Can you send this as a regular patch? I'd like to review it.
What do you mean by regular? Not an attachment? Through git send-email?
Yes. git send-email.
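For example (the list address is the one from the archive URL above;
adjust the patch path and your SMTP setup as needed):

    git send-email --to libvir-list@redhat.com \
        0001-util-don-t-check-for-parallel-iteration-in-hash-rela.patch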
Post by Vincent Bernat
FYI, I am now running this patch on 22 hosts without any detected issues.
Oh, that sounds good. We still need to go through all the callbacks and
check whether they access the table, though. This is going to be
slightly more time consuming, but it's feasible.

Michal
