BUG_ON in virtio-ring.c

Discussion:

Rusty Russell

2013-05-27 01:42:24 UTC

Hi Rusty,
current virtio-ring.c has a BUG_ON in virtqueue_add that checks
total_sg > vg->vring.num, however I'm not sure it really is 100%
correct.
If I have an indirect ring and I'm adding sgs to it and the host is
delayed (say I've got a thread consuming things from the vring and its
off doing something interesting),
I'd really like to get ENOSPC back from virtqueue_add. However if the
indirect addition fails due to free_sg being 0, we hit the BUG_ON
before we ever get to the ENOSPC check.

It is correct for the moment: drivers can't assume indirect buffer
support in the transport.

BUT for a new device, we could say "this depends on indirect descriptor
support", put the appropriate check in the device init, and then remove
the BUG_ON().

the BUG_ON is quite valid in the no indirect case, but when we have
indirect buffers it doesn't seem like it always makes sense.
Not sure best way to fix it, I'm just a virtio newbie :)

Mailing me and the list was the right thing, since this raises question
of the spec as well as the Linux implementation.

Good luck!
Rusty.

Rusty Russell

2013-05-27 05:38:25 UTC

Permalink

Post by Rusty Russell

correct.
If I have an indirect ring and I'm adding sgs to it and the host is
delayed (say I've got a thread consuming things from the vring and its
off doing something interesting),
I'd really like to get ENOSPC back from virtqueue_add. However if the
indirect addition fails due to free_sg being 0, we hit the BUG_ON
before we ever get to the ENOSPC check.

It is correct for the moment: drivers can't assume indirect buffer
support in the transport.
BUT for a new device, we could say "this depends on indirect descriptor
support", put the appropriate check in the device init, and then remove
the BUG_ON().

But if the transport has indirect buffer support, can it change its
mind at runtime?

It's a layering violation. The current rule is simple:

A driver should never submit a buffer which can't fit in the
ring.

This has the nice property that OOM (ie. indirect buffer alloction
fail) just slows things down, doesn't break things.

In this case we have vq->indirect set, but the device has run out of
free buffers,
but it isn't a case that in+out would overflow it if it had free
buffers since it would use
an indirect and succeed.

OK, but when do you re-xmit? What if the ring is empty, and you
submitted a buffer that needs indirect? You won't get interrupted
again, because the device has consumed all the buffers. You need to
have your own timer or something equally hackish.

Getting -ENOSPC is definitely what should happen from what I can see,
not a BUG_ON,
I should get a BUG_ON only if the device reports no indirect support.

I think we should return -ENOMEM if we can't indirect because of failed
allocation and it doesn't fit in the ring, ie:

This:
BUG_ON(total_sg > vq->vring.num);
BUG_ON(total_sg == 0);

if (vq->vq.num_free < total_sg) {
pr_debug("Can't add buf len %i - avail = %i\n",
total_sg, vq->vq.num_free);
/* FIXME: for historical reasons, we force a notify here if
* there are outgoing parts to the buffer. Presumably the
* host should service the ring ASAP. */
if (out_sgs)
vq->notify(&vq->vq);
END_USE(vq);
return -ENOSPC;
}

Becomes (untested):

BUG_ON(total_sg == 0);

if (vq->vq.num_free < total_sg) {
if (total_sg > vq->vring.num) {
BUG_ON(!vq->indirect);
/* Return -ENOMEM only if we have nothing else to do */
if (vq->vq.num_free == vq->vring.num)
return -ENOMEM;
}
pr_debug("Can't add buf len %i - avail = %i\n",
total_sg, vq->vq.num_free);
/* FIXME: for historical reasons, we force a notify here if
* there are outgoing parts to the buffer. Presumably the
* host should service the ring ASAP. */
if (out_sgs)
vq->notify(&vq->vq);
END_USE(vq);
return -ENOSPC;
}

Cheers,
Rusty.

Alexey Lapitsky

2014-10-07 15:46:56 UTC

Permalink

Hi,

I'm hitting this bug with both ext4 and btrfs.

Here's an example of the backtrace:
https://gist.github.com/vzctl/e888a821333979120932

I tried raising this BUG only for direct ring and it solved the problem:

- BUG_ON(total_sg > vq->vring.num);
+ BUG_ON(total_sg > vq->vring.num && !vq->indirect);

Shall I submit the patch or is a more elaborate fix required?

--

Alexey

Post by Rusty Russell

It is correct for the moment: drivers can't assume indirect buffer
support in the transport.
BUT for a new device, we could say "this depends on indirect descriptor
support", put the appropriate check in the device init, and then remove
the BUG_ON().

But if the transport has indirect buffer support, can it change its
mind at runtime?

A driver should never submit a buffer which can't fit in the
ring.
This has the nice property that OOM (ie. indirect buffer alloction
fail) just slows things down, doesn't break things.

In this case we have vq->indirect set, but the device has run out of
free buffers,
but it isn't a case that in+out would overflow it if it had free
buffers since it would use
an indirect and succeed.

Getting -ENOSPC is definitely what should happen from what I can see,
not a BUG_ON,
I should get a BUG_ON only if the device reports no indirect support.

I think we should return -ENOMEM if we can't indirect because of failed
BUG_ON(total_sg > vq->vring.num);
BUG_ON(total_sg == 0);
if (vq->vq.num_free < total_sg) {
pr_debug("Can't add buf len %i - avail = %i\n",
total_sg, vq->vq.num_free);
/* FIXME: for historical reasons, we force a notify here if
* there are outgoing parts to the buffer. Presumably the
* host should service the ring ASAP. */
if (out_sgs)
vq->notify(&vq->vq);
END_USE(vq);
return -ENOSPC;
}
BUG_ON(total_sg == 0);
if (vq->vq.num_free < total_sg) {
if (total_sg > vq->vring.num) {
BUG_ON(!vq->indirect);
/* Return -ENOMEM only if we have nothing else to do */
if (vq->vq.num_free == vq->vring.num)
return -ENOMEM;
}
pr_debug("Can't add buf len %i - avail = %i\n",
total_sg, vq->vq.num_free);
/* FIXME: for historical reasons, we force a notify here if
* there are outgoing parts to the buffer. Presumably the
* host should service the ring ASAP. */
if (out_sgs)
vq->notify(&vq->vq);
END_USE(vq);
return -ENOSPC;
}
Cheers,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo at vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Rusty Russell

2014-10-13 05:39:18 UTC

Permalink

Post by Alexey Lapitsky
Hi,
I'm hitting this bug with both ext4 and btrfs.
https://gist.github.com/vzctl/e888a821333979120932
- BUG_ON(total_sg > vq->vring.num);
+ BUG_ON(total_sg > vq->vring.num && !vq->indirect);
Shall I submit the patch or is a more elaborate fix required?

This is wrong. It looks like the block layer is spending down
more sg elements than we have ring entries, yet we tell it not to:

blk_queue_max_segments(q, vblk->sg_elems-2);

If you apply this patch, what happens? Here it prints:

[ 0.616564] virtqueue elements = 128, max_segments = 126 (1 queues)
[ 0.621244] vda: vda1 vda2 < vda5 >
[ 0.632290] virtqueue elements = 128, max_segments = 126 (1 queues)
[ 0.683526] vdb: vdb1 vdb2 < vdb5 >

Cheers,
Rusty.

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 0a581400de0f..aa9d4d313587 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -683,6 +683,13 @@ static int virtblk_probe(struct virtio_device *vdev)
/* We can handle whatever the host told us to handle. */
blk_queue_max_segments(q, vblk->sg_elems-2);

+ printk("virtqueue elements = %u, max_segments = %u (%u queues)",
+ virtqueue_get_vring_size(vblk->vqs[0].vq),
+ vblk->sg_elems-2,
+ vblk->num_vqs);
+
+ BUG_ON(vblk->sg_elems-2 > virtqueue_get_vring_size(vblk->vqs[0].vq));
+
/* No need to bounce any requests */
blk_queue_bounce_limit(q, BLK_BOUNCE_ANY);

Dave Airlie

2013-05-27 03:29:50 UTC

Permalink

Post by Rusty Russell

It is correct for the moment: drivers can't assume indirect buffer
support in the transport.
BUT for a new device, we could say "this depends on indirect descriptor
support", put the appropriate check in the device init, and then remove
the BUG_ON().

But if the transport has indirect buffer support, can it change its
mind at runtime?

In this case we have vq->indirect set, but the device has run out of
free buffers,
but it isn't a case that in+out would overflow it if it had free
buffers since it would use
an indirect and succeed.

Getting -ENOSPC is definitely what should happen from what I can see,
not a BUG_ON,
I should get a BUG_ON only if the device reports no indirect support.

Dave.