kurt-vd / linux

Linux kernel source tree

Large Broadcast Packets Lost

rpkvt opened this issue · comments

commented

Hello,
I'm having some trouble sending large packets using the broadcast protocol. I did modify your driver (4.9.10) to enable/disable the required 50 ms delay between packets.

You can see my patch here:
J1939 Delay Patch

With the delay either enabled or disabled, larger packets seem to be overwritten and/or not sent if the socket is written to before the previous packets have gone out. The following screenshots were taken with the delay disabled (shortened).

Using sendto with MSG_SYN and fcntl(file, F_SETFL, O_NDELAY), it looks like the short packets (8 bytes) get out just fine but the larger packets are nowhere to be seen:
image
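A side note on the non-blocking setup: fcntl(file, F_SETFL, O_NDELAY) replaces the descriptor's whole set of status flags rather than adding one. Below is a minimal sketch of a read-modify-write variant. The helper name set_nonblocking is hypothetical, and O_NONBLOCK is used because it has the same value as O_NDELAY on Linux.

```c
#include <fcntl.h>

/* Put fd into non-blocking mode while keeping its other status flags.
 * Returns 0 on success, -1 on error (errno is set by fcntl). */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);   /* read the current status flags */
    if (flags == -1)
        return -1;
    /* add O_NONBLOCK instead of overwriting everything */
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}
```

Calling set_nonblocking(sock) once after creating the socket should be equivalent to the fcntl call above for a freshly opened descriptor, while being safe to call on one that already has other flags set.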

I can see the larger packets in /proc/../transport:
image

No errors are reported by sendto:
image

I added sleeps after writing each packet to give it a chance to get out. And that seems to improve the situation:
image

I can see in my debug logs that the port returns EAGAIN after transmitting larger packets, which is fine. I just retransmit.
image
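For reference, the "retransmit on EAGAIN" step is roughly the following. This is only a sketch of the idea, not my actual code: send_with_retry, max_retries and delay_ms are hypothetical names, and the retry limit is there so the loop cannot spin forever if the socket never becomes writable again.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper: call sendto() and retry while it reports
 * EAGAIN/EWOULDBLOCK, sleeping delay_ms between attempts.
 * Returns the byte count from sendto(), or -1 with errno set
 * (EAGAIN if the retries ran out, otherwise the hard error). */
ssize_t send_with_retry(int fd, const void *buf, size_t len, int flags,
                        const struct sockaddr *dest, socklen_t destlen,
                        int max_retries, long delay_ms)
{
    struct timespec pause = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };

    for (int attempt = 0; ; attempt++) {
        ssize_t n = sendto(fd, buf, len, flags, dest, destlen);
        if (n >= 0)
            return n;                     /* accepted by the socket */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                    /* hard error, do not retry */
        if (attempt >= max_retries)
            return -1;                    /* give up, errno is still EAGAIN */
        nanosleep(&pause, NULL);          /* back off before the next attempt */
    }
}
```

A return value of -1 with errno set to EAGAIN then means the packet was never accepted, which is different from the silent loss described below, where sendto reports success.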

However, some time into the run, the large packets stop coming through:
image

And it looks like I'm no longer getting the EAGAIN signal from the port:
image

Looking at the transport file in proc, I can see that packets are queuing:
image

Briefly pausing the application (Ctrl+Z and then fg) or completely restarting it seems to fix the issue temporarily.

Is there something I'm doing wrong? I was using an earlier version of your driver (3.8) which doesn't have this problem. I also modified that one to reduce the delay from 50 ms to 1 ms, but it's not configurable. It's hard coded.

I'm going to see if I can figure out what's going wrong but I think you'd be able to diagnose a lot quicker than I can so I'd appreciate it if you could weigh in.

Thanks!!!
Rich

P.S. This driver has been a huge help for me, btw. I greatly appreciate the work you and others have put in to make this available to everyone.

commented

By the way, dmesg starts outputting these messages when this happens:
image

This needs investigation, but may I remark that the latest can-next-j1939 branch has received more testing and has some of such problems resolved.
Can you try that?

commented

Sure thing! I'll let you know how it goes.

commented

Is there a kernel version you would recommend merging can-next-j1939 with? It looks like the branch is based off of 4.13-rc6; but I'd like to use something more stable. 4.13?

commented

Hi @kurt-vd ,

So, I did the merge, but I haven't tried it yet. It seems like there's quite a difference between 4.13 and 4.13-rc6.

Please see the attached patch:

j1939.txt

I merged it using the instructions here: https://elinux.org/J1939

And then created the patch using: git format-patch HEAD^ --stdout > j1939.patch

Do you know if there is an easier way to just merge your j1939 edits with 4.13? It looks like a significant portion of the patch is dedicated to merging unrelated code which is actually obsolete.

Thanks for the help,
Rich

commented

Do you think I can just do an ediff and remove the unrelated changes? For instance, delete everything before line 219164?

commented

Hi @kurt-vd,

I was able to merge can-next-j1939 with 4.13-rc6 and there appear to be several issues.

When setting up the device for j1939, I get:

root:# ip link set can0 up type can bitrate 250000 restart-ms 250
root:# ip link set can0 j1939 on
Error: either "dev" is duplicate, or "j1939" is a garbage.
root:# ip addr add dev can0 j1939 name 8000FF0047200000
Error: inet prefix is expected rather than "j1939".

I believe I received this error in 4.9.10 as well but it didn't seem to cause an issue. This is the device entry in ifconfig:

can0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
UP RUNNING NOARP MTU:16 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:10
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
Interrupt:166

Using the addressing daemon now seems to fail:

root:# jacd -v -r 0x80-0xCF 8000FF0047200000 can0 &
root:# jacd: ready for can0:8000ff0047200000
jacd: bind(): No such device
[1]+ Done(1) jacd -v -r 0x80-0xCF 8000FF0047200000 can0

This wasn't happening in 4.9.10.

So, I tried my program using a pre-assigned address (0x80) and I was able to get traffic over the bus. However it seemed like it was still dropping larger packets.

This is the dmesg printout when my program tries to send J1939 packets:

[ 2269.504851] j1939_priv_get_by_ifindex: ifindex=4
[ 2269.505327] j1939_priv_get_by_ifindex: ifindex=4
[ 2269.506769] j1939xtp_rx_rts: connection exists (4 80 ff)
[ 2269.512460] j1939xtp_rx_dat: last 00
[ 2269.514837] j1939_priv_get_by_ifindex: ifindex=4
[ 2269.518303] j1939xtp_rx_dat: last 00
[ 2269.518387] j1939xtp_rx_dat:no connection found
[ 2269.518409] j1939xtp_rx_dat:no connection found
[ 2269.518429] j1939xtp_rx_dat:no connection found
[ 2269.518450] j1939xtp_rx_dat:no connection found
[ 2269.518470] j1939xtp_rx_dat:no connection found
[ 2269.518489] j1939xtp_rx_dat:no connection found
[ 2269.518509] j1939xtp_rx_dat:no connection found
[ 2269.518528] j1939xtp_rx_dat:no connection found
[ 2269.518548] j1939xtp_rx_dat:no connection found
[ 2269.518568] j1939xtp_rx_dat:no connection found
[ 2269.518588] j1939xtp_rx_dat:no connection found
[ 2269.518608] j1939xtp_rx_dat:no connection found
[ 2269.518850] j1939xtp_rx_rts: I should tx (4 80 ff)

Do you have any thoughts on this?

commented

Hi Kurt,
Well, I think I've tried all of the kernel versions I can and now I am trying to debug the code myself. I know that version 3.8 works but I have to use kernel version >4 due to some other requirements.

I'm working out of: git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next.git
Branch: j1939

That said, could you help me a bit and describe the program flow? The problem I'm debugging is that large (> 8-byte payload), manufacturer-defined broadcast messages (PGNs 0xFF00 through 0xFFFF) are being dropped.

It looks like what happens is that if I send, for example, three large packets one right after the other, each with a different PGN (say, 0xFF00, 0xFF01, 0xFF02), then only the first packet makes it through.
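To put numbers on "one right after the other", here is a small sketch of how long one broadcast (BAM) transfer occupies the bus. It assumes the usual J1939-21 figures: 7 data bytes per TP.DT frame, at most 255 frames (1785 bytes), and a gap between frames that is nominally 50 ms, the same delay I made configurable in my patch. The function names are hypothetical and none of this is taken from the driver source.

```c
#include <stddef.h>

enum {
    J1939_TP_DT_BYTES  = 7,     /* data bytes per TP.DT frame; byte 0 is the sequence number */
    J1939_TP_MAX_BYTES = 1785   /* 255 frames * 7 bytes */
};

/* Number of TP.DT frames needed for a payload.
 * Returns 0 when the payload fits a single CAN frame (<= 8 bytes)
 * or is too large for the transport protocol. */
unsigned bam_frame_count(size_t payload_len)
{
    if (payload_len <= 8 || payload_len > J1939_TP_MAX_BYTES)
        return 0;
    return (unsigned)((payload_len + J1939_TP_DT_BYTES - 1) / J1939_TP_DT_BYTES);
}

/* Lower bound, in ms, on the time from the BAM announce frame to the last
 * TP.DT frame when consecutive frames are spaced gap_ms apart. */
unsigned bam_min_duration_ms(size_t payload_len, unsigned gap_ms)
{
    return bam_frame_count(payload_len) * gap_ms;
}
```

With these assumptions a 20-byte payload needs 3 data frames, so the transfer lasts at least 150 ms at a 50 ms gap and about 3 ms at a 1 ms gap. If the stack allows only one broadcast session per source address at a time, a second large packet written inside that window has to be queued or rejected, which may be what the "connection exists" message below is reporting.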

What is the expected behavior? How should the messages be buffered if the driver is in the middle of something?

In dmesg, I get these messages from the driver:

[ 4984.339692] j1939_xtp_rx_rts: connection exists (4 78 ff)
[ 4985.596175] j1939_tp_rxtimer: timeout on 4

I have a modified copy of transport.c here:
https://github.com/rpkvt/J1939-Driver/blob/master/linux/net/can/j1939/transport.c#L727

Thanks,
Rich

commented

Hi Kurt,
Yes, that actually did help. Thank you.

I was able to get my software to work continuously by distinguishing between short broadcast packets (8 bytes) and large broadcast packets (>8 bytes). Essentially, I set up one program to only send large packets using message flag MSG_SYN and the other program sends short packets using MSG_DONTWAIT.

I'm planning on implementing this functionality into one program.

You can see from the image below that the counts are about the same:

image

A couple more data points:

  • If I use MSG_SYN when sending only long broadcast packets, then I get all of the packets I send.

  • If I use MSG_SYN and send long broadcast packets along with short broadcast packets (8 bytes) then sendto fails to return and the program waits indefinitely. It'll fail on sending a large packet if preceded by sending a small packet.

  • If I use MSG_DONTWAIT then I get the short broadcast packets but only 1 of the large broadcast packets.

  • If I set the message flag to MSG_SYN or MSG_DONTWAIT based on the packet size, then sendto will fail to return like it did when only using MSG_SYN.

  • I ran two copies of the program, one which used MSG_SYN and would exclusively send large packets and the other which used MSG_DONTWAIT and would send exclusively small packets. This setup seems to work.

  • With the message flags set to 0, there were no errors returned and only one large broadcast packet was sent.

  • If I use both MSG_SYN and MSG_DONTWAIT, then I do occasionally get an error returned. It wasn't consistent.
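The size-based choice of flag in these tests boils down to something like the sketch below. pick_send_flags and J1939_SINGLE_FRAME_MAX are hypothetical names, the non-blocking flag is spelled MSG_DONTWAIT in <sys/socket.h>, and the fallback value for MSG_SYN is the one from linux/socket.h, assumed here in case the libc headers do not expose it.

```c
#include <stddef.h>
#include <sys/socket.h>

#ifndef MSG_SYN
#define MSG_SYN 0x400   /* assumption: value taken from linux/socket.h */
#endif

#define J1939_SINGLE_FRAME_MAX 8   /* payloads up to 8 bytes fit one CAN frame */

/* Hypothetical helper: pick the sendto() flag by packet class.
 * Large (multi-frame) packets get MSG_SYN, short ones MSG_DONTWAIT. */
int pick_send_flags(size_t payload_len)
{
    return payload_len > J1939_SINGLE_FRAME_MAX ? MSG_SYN : MSG_DONTWAIT;
}
```

Note that switching flags per packet like this inside a single process is the case where sendto stopped returning, so the helper only captures the classification. The setup that worked used one process per packet class, each with a fixed flag.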

Rich

commented

Hi Kurt,
I'll do as you ask as soon as I find time. :) I have another problem I'm working on so I'll start a new thread for your input.

Rich