doceme / py-spidev

Dies with free(): invalid pointer after ~107.4M calls to xfer()

HowardALandman opened this issue · comments

The attached program, which calls xfer() repeatedly in read mode, dies with free(): invalid pointer after about 107.4 million calls to xfer(). It is a simplified version of code in the read_regs24() method in my tdc7201 library. I'm typically running in Python 3.7.3.
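
For reference, the core loop looks roughly like this (a simplified sketch, not the attached program itself; the bus/device numbers, register address, transfer length, and batch size are placeholders):

import spidev

spi = spidev.SpiDev()
spi.open(0, 0)                       # bus/device numbers are placeholders
spi.max_speed_hz = 1_000_000

batch = 0
while True:
    batch -= 1                       # the batch counter counts downward
    for _ in range(100_000):
        # One read-mode transfer per iteration: a register address byte
        # followed by dummy bytes, as read_regs24() does for 24-bit registers.
        spi.xfer([0x10, 0x00, 0x00, 0x00])
    print("batch", batch)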

This can be hard to debug because abort() never returns control at the C level, so the SIGABRT can't be caught from Python. Also, the spidev library doesn't work with python3-dbg (which might be a separate bug). I recommend running it under gdb with a breakpoint in malloc_printerr(). I suppose valgrind might also help if you have the patience. Here's what I get from gdb:

Breakpoint 1, malloc_printerr (str=0x76e028f8 "free(): invalid pointer")
at malloc.c:5341
5341 malloc.c: No such file or directory.
(gdb) bt
#0 malloc_printerr (str=0x76e028f8 "free(): invalid pointer") at malloc.c:5341
#1 0x76d44d50 in _int_free (av=0x76e1f7d4 <main_arena>,
p=0x43417c <small_ints.lto_priv+72>, have_lock=<optimized out>) at malloc.c:4165
#2 0x001b4f40 in list_dealloc (op=0x766d0f58) at ../Objects/listobject.c:324
#3 0x7669b660 in ?? ()
from /usr/lib/python3/dist-packages/spidev.cpython-37m-arm-linux-gnueabihf.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

On a RPi4 this takes about 57 minutes to hit the bug (usually in batch -1074). But the bug also occurs on a RPi3B+ or a RPi0W. It occurs under both Stretch and Buster (32-bit).

It looks like the error comes from the integer 13 being freed more times than it was allocated (the pointer <small_ints.lto_priv+72> lies inside CPython's statically allocated small-integer cache).
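
Background on that symbol, for anyone not familiar with it: CPython keeps small integers, roughly -5 through 256, as statically allocated singletons in a small_ints array, so an extra Py_DECREF in a C extension can eventually drive one of those shared refcounts to zero and hand free() a pointer that was never malloc'd. The sharing is easy to see from plain Python:

import sys

# Every 13 in a CPython program is the same cached object, so an
# extension module that calls Py_DECREF on it once too often will
# eventually push its refcount to zero and free() gets called on
# statically allocated memory, i.e. "free(): invalid pointer".
x = 13
y = 10 + 3
print(x is y)               # True: both names refer to the cached 13
print(sys.getrefcount(13))  # large, because the object is shared interpreter-wide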

bug_spi2.py.gz

This is possibly another case of the Py_DECREF issue I found under similar circumstances (#96), but make sure you're running the very latest release of spidev, since some of these issues have already been fixed.

I never use a tuple myself when interacting with this library, however, so my stability tests would have overlooked any tuple-specific bugs.

Here's xfer with the unused SPIDEV_SINGLE and Python version < 3 code branches pruned, and some additional comments added. Since I'm useless with valgrind and other leak-tracing techniques, I usually use intuition and a thorough code review to track these things down:

static PyObject *
SpiDev_xfer(SpiDevObject *self, PyObject *args)
{
	uint16_t ii, len;
	int status;
	uint16_t delay_usecs = 0;
	uint32_t speed_hz = 0;
	uint8_t bits_per_word = 0;
	PyObject *obj;
	PyObject *seq;
	struct spi_ioc_transfer xfer;
	memset(&xfer, 0, sizeof(xfer));
	uint8_t *txbuf, *rxbuf;
	char	wrmsg_text[4096];

	if (!PyArg_ParseTuple(args, "O|IHB:xfer", &obj, &speed_hz, &delay_usecs, &bits_per_word))
		return NULL;

	seq = PySequence_Fast(obj, "expected a sequence");       // seq = New reference
	if (!seq) {
		PyErr_SetString(PyExc_TypeError, wrmsg_list0);
		return NULL;
	}

	len = PySequence_Fast_GET_SIZE(seq);
	if (len <= 0) {
		Py_DECREF(seq);                                                       // seq dec'd
		PyErr_SetString(PyExc_TypeError, wrmsg_list0);
		return NULL;
	}

	if (len > SPIDEV_MAXPATH) {
		snprintf(wrmsg_text, sizeof(wrmsg_text) - 1, wrmsg_listmax, SPIDEV_MAXPATH);
		PyErr_SetString(PyExc_OverflowError, wrmsg_text);
		Py_DECREF(seq);                                                       // seq dec'd
		return NULL;
	}

	txbuf = malloc(sizeof(__u8) * len);
	rxbuf = malloc(sizeof(__u8) * len);

	for (ii = 0; ii < len; ii++) {
		PyObject *val = PySequence_Fast_GET_ITEM(seq, ii);
		{
			if (PyLong_Check(val)) {
				txbuf[ii] = (__u8)PyLong_AS_LONG(val);
			} else {
				snprintf(wrmsg_text, sizeof(wrmsg_text) - 1, wrmsg_val, val);
				PyErr_SetString(PyExc_TypeError, wrmsg_text);
				free(txbuf);
				free(rxbuf);
				Py_DECREF(seq);
				return NULL;
			}
		}
	}

	// If the input object is a tuple, then "seq" will also be a tuple.
	// Copy it to a list here for convenience, since the code below expects it to be mutable.
	if (PyTuple_Check(obj)) {
		Py_DECREF(seq);
		seq = PySequence_List(obj);
	}

	xfer.tx_buf = (unsigned long)txbuf;
	xfer.rx_buf = (unsigned long)rxbuf;
	xfer.len = len;
	xfer.delay_usecs = delay_usecs;
	xfer.speed_hz = speed_hz ? speed_hz : self->max_speed_hz;
	xfer.bits_per_word = bits_per_word ? bits_per_word : self->bits_per_word;
#ifdef SPI_IOC_WR_MODE32
	xfer.tx_nbits = 0;
#endif
#ifdef SPI_IOC_RD_MODE32
	xfer.rx_nbits = 0;
#endif

	status = ioctl(self->fd, SPI_IOC_MESSAGE(1), &xfer);
	if (status < 0) {
		PyErr_SetFromErrno(PyExc_IOError);
		free(txbuf);
		free(rxbuf);
		Py_DECREF(seq);
		return NULL;
	}

	for (ii = 0; ii < len; ii++) {
		PyObject *val = PyLong_FromLong((long)rxbuf[ii]);  // val = New reference
		PySequence_SetItem(seq, ii, val);                              // list item gets new reference, does not steal
		Py_DECREF(val);                                                        // PySequence_SetItem does not steal reference, must Py_DECREF(val)
	}

	// WA:
	// in CS_HIGH mode CS isn't pulled low after the transfer, but after a read;
	// reading 0 bytes doesn't matter but brings CS down
	// tomdean:
	// Stop generating an extra CS except in mode CS_HIGH
	if (self->mode & SPI_CS_HIGH) status = read(self->fd, &rxbuf[0], 0);

	free(txbuf);
	free(rxbuf);

	// In cases where the input "obj" is a tuple, a tuple is returned
	if (PyTuple_Check(obj)) {
		// Gymnastics to decrement the reference count for seq but still convert it to a tuple.
		// It's possible that this is failing, and we could probably stand to rewrite it as something simpler.

		// Is this creating a new reference and then immediately shadowing it with seq, giving us a leak?
		PyObject *old = seq;
		seq = PySequence_Tuple(seq);
		Py_DECREF(old);
	}

	return seq;
}

The PyTuple_Check code branch looks deeply suspicious to me, and I'd be tempted to rewrite it into something like:

	if (PyTuple_Check(obj)) {
		PyObject *new;
		new = PySequence_Tuple(seq);
		Py_DECREF(seq);
		return new;
	}

I prefer to return early rather than mutate the value that's about to be returned.

I will run your code and see if I can hit the same error, then try my hunches and see what happens.

$ python3 -m pip show spidev says I have version 3.4 on all machines.

I don't think tuple-vs-list matters. In my original code it was a list and the bug still happened (inside list_ass_item()). I changed it to a tuple to see whether that would fix it; it didn't.
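
For clarity, the two variants look like this from the Python side (a sketch; the device numbers and byte values are placeholders, not my actual test values):

import spidev

spi = spidev.SpiDev()
spi.open(0, 0)

# Per the xfer() source quoted above, a list comes back as a (mutated)
# list and a tuple comes back as a freshly built tuple; the crash
# showed up with both input types.
as_list = spi.xfer([0x10, 0x00, 0x00, 0x00])    # list in, list out
as_tuple = spi.xfer((0x10, 0x00, 0x00, 0x00))   # tuple in, tuple out

spi.close()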

I'm currently on batch = -57400001, which I take to be about 57.4 million calls. I'm running at 80 MHz, but wiggling the pins into thin air, so the speed doesn't matter. We'll see what I run into at the 107.4M mark.

Edit: now at batch = -118600001, so I strongly suspect this has already been fixed.

Nod, I upgraded to 3.5, and on the RPi4 it's passed 227M calls with no problem, and 118M on the RPi3B+. I'll let both machines run a bit longer. If they both look good, I'll run my original full program on the RPi3B+ with the TDC7201 attached, but that will take a day. I'll also bump the tdc7201 module's setup.py to require spidev>=3.5.
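
The setup.py side of that is just a one-line requirement bump, something like the sketch below; all the other metadata for the tdc7201 package is omitted here.

from setuptools import setup

setup(
    name="tdc7201",
    # Require the release that no longer shows the refcount bug above;
    # everything else (version, py_modules, etc.) is omitted from this sketch.
    install_requires=["spidev>=3.5"],
)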

Thanks and good luck! It's handy to have someone else helping bash the rough edges off this library.

The full program has run ~200M full measurement cycles without hitting this error, so I'm marking it fixed as far as my own software is concerned. :-)

Excellent! Thanks for testing so thoroughly. It's good to have some extra confidence on this, since I appreciate this library might have a... few users 😆

Just to follow up: my latest stress test has been running for over a week and has executed over 1.6 billion measurement cycles with no sign of this bug. I'm still seeing a few hardware errors, about 1 per 213,000 measurements, but I don't think that's spidev's fault. And considering that a few months ago it was 1 per thousand, I think I can live with those numbers for now. :-) Let me know if you ever want a beta version tested.