fredizzimo / infinity_ergodox

Unofficial keyboard firmware for Input Club's Infinity Ergodox

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

The LCD screen gets corrupted after a while

fredizzimo opened this issue · comments

When you leave the keyboard on for a while, the graphics on the LCD screen can get corrupted.

I don't really know what's causing it, or what could be done to fix it, so ideas are welcome.

Do other people have the same problem?

It seems like not even putting the keyboard to sleep will fix the issue. In fact it seems like the sleep command is ignored and the graphics is still shown on the LCD.

I have the same issue. It starts with the LED backlight on the right half going dark, then the right LCD turns off. Finally, the LCD on the left half gets corrupted.

I have not personally observed the LED backlight going dark. But @tthayer reported a similar issue when he was testing the LED support #5.

But that problem makes the issue even more strange, since the two has nothing to do with each other, and are handled in completely different parts of the code.

Typically memory corruptions could cause issues like this, but since both systems are handled at completely different places, I have a hard time figuring out what variables could get corrupted and how.

An SWD debugger would be really useful. At least I think it could be used, but it seems like it would use the same pins as the interconnect, so I would only be able to debug with only one half connected. But currently I don't have any, and I don't know which one to buy, and from where. So I guess I will have to use print based debugging.

I'm working on getting a SWD debugger up and running. I was able to find a colleague who has a tagconnect and I have a debugger handy. I'm just working out the pin mapping from tagconnect to the debugger.

It looks like SWD_DIO and SWD_CLK are dedicated lines so there is no reason that debugging can't happen on both at the same time.

I'm seeing the LED backlight going dark on the right half, but not the corruption of the screen.

@bremoran, yes you are right. I believe the trace port would be quite useful for debugging this problem though. But it's connected to tx0, which is used for the UART communication.

Yes, if it's down to needing trace, that will be a problem, however, I am hoping to track the problem down with just SWD debugging.

When the right-hand LCD backlight goes dark, lcd_backlight_color() is called with (hue=64 '@', saturation=176 '\260', intensity=255 '\377'), however, current_brightness is set to 0, so there is no backlight.

lcd_backlight_brightness() is only called once, which points to a memory corruption bug.

The actual point where current_brightness gets set to 0 is in a memcpy()

Here is the traceback:

#0  0x000033bc in memcpy ()
#1  0x00008f32 in transport_recv_frame (from=<optimized out>, 
    data=data@entry=0x1fffa225 <states+5> "\001", size=size@entry=20)
    at ./tmk_serial_link/serial_link/protocol/transport.c:89
#2  0x00008d24 in route_incoming_frame (link=<optimized out>, 
    data=0x1fffa225 <states+5> "\001", size=<optimized out>)
    at ./tmk_serial_link/serial_link/protocol/frame_router.c:44
#3  0x00008dea in validator_recv_frame (link=<optimized out>, 
    data=<optimized out>, size=<optimized out>)
    at ./tmk_serial_link/serial_link/protocol/frame_validator.c:112
#4  0x00008c26 in byte_stuffer_recv_byte (link=link@entry=0 '\000', 
    data=<optimized out>)
    at ./tmk_serial_link/serial_link/protocol/byte_stuffer.c:74
#5  0x00009146 in read_from_serial (driver=<optimized out>, 
    link=link@entry=0 '\000')
    at ./tmk_serial_link/serial_link/system/system.c:68
#6  0x00009196 in serialThread (arg=<optimized out>)
    at ./tmk_serial_link/serial_link/system/system.c:101
#7  0x000022aa in _port_thread_start ()

In my build, current-brightness exists at 0x1fff86e4

I'm still working out the exact conditions for the failure, but it appears to be a flaw in transport_recv_frame()

Thanks a lot,

There's some seriously dangerous code there in the transport protocol, mostly because I'm used to relying on C++ and type safety. I also wanted to keep the system from allocating.

Anyway this information is probably enough for me to find the bug.

The issue occurs for id == 2 with ptr == 0x1fff86d4. Not sure if that helps

Let me know if you need more help with this and I'll take another look.

I'm not able to look into the issue properly right now, since I'm at work.

One thing that might help, is the map file in the build folder, which should specify where all variables are in memory, as I might not get the same memory locations if I build the code locally. At least I think that's file, but it could also be in some other file in that folder.

Ah, one other piece of information:

remote_objects[2] == 0x1fff8648 <remote_object_current_status> and ptr == 0x1fff86d4

This implies that it's something to do with

void * ptr = triple_buffer_begin_write_internal(obj->object_size, tb);

I generally prefer arm-none-eabi-objdump. Anyway, here's the map file
ch.map.txt

The pointer might not be exactly the same as the object pointer, since as the name implies, there's three different buffers, triple_buffer_begin_write_internal, is supposed to return a valid pointer to one of them.

So either it's not returning a wrong pointer, or the size is too big, so that it's writing over the buffer. It could also be some alignment issue, which causes the pointer to be slightly offset from the size that was allocated.

I have now manually verified that the pointer appears to be calculated correctly. However for some reason the size is 20, while it should be 17. The object in question is 16 bytes, and there's one additional byte for the object id.

So now I have to check why the received size is wrong.

I can confirm that I've gotten this context when there's a failure:

from = 0 '\000'
data = 0x1fffa225 <states+5> "\001"
size = 20, id = 2 '\002'
start = 0x1fff86b0 <remote_object_current_status+104> "\030"
tb = 0x1fff86b0 <remote_object_current_status+104>
ptr = 0x1fff86d4

As a temporary measure, I suggest only copying min(object->size,size)

Yes, I will add that sanity check in any case, even if I fix the bug. But I will probably just ignore the data if the size is wrong. The protocol is lossy any way, so it doesn't really matter.

I just added some print statements, and the sender sends the correct size. And for the objects the received size appears to be correct.

Unfortunately I can't add any printing on the right hand side which receives the object in question, so I don't know if the size is always wrong, or just sometimes.

In validator_recv_frame, I have:

link = 0 '\000', data = 0x1fffa225 <states+5> "\001", size = 25

data:

0x1fffa225 <states+5>:  0x01    0x00    0xff    0xa8    0x5c    0x81    0xd3    0x00
0x1fffa22d <states+13>: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x1fffa235 <states+21>: 0x00    0x00    0x00    0x02    0xff    0x42    0x42    0x2f
0x1fffa23d <states+29>: 0x78

I haven't looked into the protocol much yet, so I'm not sure if that's useful information or not

I was able to confirm the bug on the left hand size. The size suddenly went from 18 to 39. But just for one packet.

So there's either some other data corruption, or the byte stuffer and crc check doesn't detect corrupted packets. They are both quite thoroughly unit tested though.

Is it possible that some lost bytes could produce a valid CRC?

It's certainly possible, the crc32 check is not a very strong one, but still the expected error rate is 2.3E-10. And add that we should expect very few errors on the actual physical link. And we are sending maybe 100 packets per second.

So I don't think we would see the error as often as we see it, if that was the case.

One thing that I noticed, is that it always gets corrupted to the same 39 value. So it's probably something else writing it there, rather than a problem with the link.

Moreover since I don't type anything, and the packet in question is the keyboard state, the data should stay the same. Which means that the byte stuffler should generate the same data to be sent over the link.

So it's probably something else corrupting it... Maybe we are only seeing the secondary cause of another memory corruption.

Is this on the left-hand side?

Yes, the left hand side (or the master), receives the keyboard_matrix object.

Ok, I found out that the link is very bad, or I have some bug in my stuffer code. It's constantly receiving the wrong sizes, and generates invalid CRC checks. So it's probably one of those checks that passes as you suspected.

Yes, that's indeed what's happening. The size is wrong, already at the CRC check. But the check passes. It's a little bit strange though, since statistically this shouldn't happen no where as often as it does.

For the bad link quality itself, the baud rate could be dropped. But on the other hand, I'm using the same speed as Haata claimed he had tested without any errors with the official firmware. Also he doesn't use any other checks than the parity bits, so if there are errors, we should probably see wrong characters being typed. I don't remember were he said that anymore.

Ok it turns out that

0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x01

and

0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x01 0x5A 0x9A 0xC9 0x61 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x01

generates the same CRC32 (0x61C99A5A). The data is basically duplicated, with the crc value as a part of the actual data. I think the fix for this would be to add the size as an additional criteria.

I still have to figure out why there's so much corruption though. It could be real physical corruption, but it might also be that there's trouble finding the start of the frame, or some bug in the byte stuffling code.

My suspicion is that you're not losing any data on the wire. I suspect that you're losing data by not reading it fast enough. You check for that by registering for the error UARTx_S1_OR in the serial driver's error field

Yes, I do indeed get overrun errors.

However when debugging this, I found out that I don't receive the events for CHN_INPUT_AVAILABLE at all (but I receive the error events from the same event handler). So there might be a bug in the Chibios serial driver, or I'm using it the wrong way.

This would mean that I'm only reading the data when the keyboard has something to send. So that would explain why so much data get lost.

I also noticed two other things,

  1. I'm only using 16 bytes for SERIAL_BUFFERS_SIZE (the internal Chibios buffer), this should probably be bigger.
  2. And I'm also running my serial thread with too low priority. It's currently running the same priority as the visualizer, which isn't good, since the visualizer is quite slow for some functionality. That must be something I have forgot to do, since I certainly had plans for having that thread priority order. The actual keyboard loop runs with higher priority as it should though, so it can handle the debouncing properly.

CHN_INPUT_AVAILABLE, works but it's only signaled when the queue goes from empty to non-empty state, which makes sense, and doesn't cause any problems with the existing code.

Right now I'm trying to figure out why there's still overrun errors(but much less) after fixing the thread priorities and increasing the buffer. It could be that there's simply too much to do with all visualization stuff going on. But it could also be something else.

In any case, if it turns out that there are just too many things going on, I will decrease the baud rate, to avoid getting errors.

I don't really know how to measure this, but I'm pretty sure that the problem is that the keyboard is generating way too many interrupts.

For example the ChibiOS serial driver for MK20DX256VLH7, does not use any DMA, and not even the hardware FIFO, which could be enabled for the UART. That means that there's an interrupt for every byte sent and received. The official firmware by Haata, uses both DMA, and also the hardware FIFO on the sending side, so that's probably why he is able to deal with the higher speeds.

Also add that we are using SPI to communicate with the LCD driver, which also doesn't use any DMA. Neither does the I2C driver that's used for the LEDs.

So there's a lot of things going on. I think I will have to ask if anyone could add DMA support for those on the ChibiOS forums. Because personally I would rather concentrate on improving the Visualization stuff, and also I have no microcontroller experience before this project.

Some time ago, I also noticed that the LCD updating was very slow, or basically that the pixel plotting took very long. I don't remember the exact times, but I think it was more than 10 ms, to just clear the screen. I didn't really understand the reason then, but I guess it's because of the number of interrupts.

Anyway, I will experiment a little bit with the interrupt priorities, and the speed of the serial link, to see if I get something acceptable. It doesn't really need to be that slow, since the debouncing is done separately on each half. And a matrix state is around 20 bytes. So as long as that can be sent within a ms or so, it should be good. I don't think anyone than maybe very hardcore gamers (perhaps those that play rhythmic based music games) would really notice a delay of even a few ms. Anyway I'll try to make it as fast as I can reliably.

It's likely possible to alter the ChibiOS drivers to add hardware FIFO support with minimal effort. DMA takes more work.

I have been checking it a bit further, and I don't really know what's going on.

I disabled the visualisation stuff, to have only the serial port(and the USB) active. But it still overflows. The software queue remains small, always containing less than 10 bytes, so it means that the overflow is caused by missed interrupts.

I also found that the sleep for 20 microseconds in the matrix scan was the biggest cause of drops. At least it seemed so when I added some prints to different places in the loop, it almost always happened during the sleep. I also get less dropped packets when I removed the sleep completely, at the cost of a non-functional keyboard. For those test I had the serial thread set with a higher priority to make sure it got a chance to process things.

So it appears like the chThdSleep disables the interrupts for long enough, to make some of them getting missed. However I'm not sure, since the profiling tools I have are so low tech, and the prints themselves could mess things up.

I haven't changed the baud rate from 1125000 yet.

Yes, adding hardware Fifo support doesn't seem like a huge task. But I would still like to understand what's going on before doing anything.

The matrix scan could probably also be written in a different way, using hardware timers, to avoid thread sleeping.But that wouldn't get rid of all the overflows.

I have now committed a fix. The serial link protocol checks that the data size matches. I also dropped the baud rate in half to 562500 and increased the software queue sizes, which seems to get rid of most overflows.

I still think that DMA should be used though, but that's another bug, and I will start discussions on the ChibiOS forums about that. Flabbergast already have plans for adding DMA to the I2C.

Thanks for looking into this.

@bremoran, I should thank you instead, you did the hard work of debugging the issue. Without your help I would not have found the issue, at least not this quickly. I doubt I would have guessed that the CRC checks would pass even if the data is not the same, and therefore simply reading the code would not be enough.

I guess I could have found it out by getting defensive, and adding checks to all memory writes. But that would still not have been easy, and add the fact that on my keyboard it often took many hours for the issue to occur, so testing one memory write at a time, would have been very time consuming.

@fredizzimo No problem. This sort of debugging happens a lot in my day job. I was able to find a tagconnect and a j-link, which made things much easier.

I hope that the ChibiOS folks are able to add a DMA variant of the serial driver; that will help a lot.

I'm re-opening this, as it seems like I now get the same issue in the QMK Firmware port, despite having the fix applied there(I have checked many times).

This leads me to believe that the issue never was actually completely fixed, or that there's a new kind of problem there.

It could also be related to the LED support(or the other way around, the LEDs are broken due to this), as according to #5, the LEDs seems to randomly work and randomly not.

Has anyone seen the issue since it was fixed?