dls-controls / pmac

Driver for the Delta Tau PMAC motion controller family.

Intermittent connection dropout, even when all motor channels are completely idle

jwlodek opened this issue · comments

We have been running into an issue with several of our PowerBrick controllers at NSLS-II that we haven't been able to diagnose/resolve. Essentially, every 3-5 days or so, we get the below in the IOC shell:

epics> 2023/08/30 18:28:29.475 pmacHardwareTurbo::parseAxisStatus Failed to parse axis status (24 bit) =>  0
2023/08/30 18:28:29.475 pmacHardwareTurbo::parseAxisStatus Failed to parse axis status (16 bit) =>  0
2023/08/30 18:28:29.475 getAxisStatus: Failed to parse position. Key: #6P  Value:  $0000c10000000000
2023/08/30 18:28:29.475 getAxisStatus: Failed to parse following error. Key: #7F  Value: d
2023/08/31 07:38:39.304 pmacMessageBroker::lowLevelWriteRead Connection to hardware lost
2023/08/31 07:38:41.689296 P0 XF:27IDA-OP:1{Mono:DCLM-Ax:C2R}AxisEnable-Sts: Input "<0a>M139=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.690074 P0 XF:27IDA-OP:1{Mono:DCLM-Ax:C1Y}AxisEnable-Sts: Input "<0a>M239=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.690800 P0 XF:27IDA-OP:1{Mono:DCLM-Ax:C1P}AxisEnable-Sts: Input "<0a>M339=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.691481 P0 XF:27IDA-OP:1{Mono:DCLM-Ax:C2P}AxisEnable-Sts: Input "<0a>M439=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.692103 P0 XF:27IDA-OP:1{Mono:DCLM-Ax:BS}AxisEnable-Sts: Input "<0a>M539=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.692866 P0 XF:27IDA-OP:1{Mono:DCLM-Ax:FS}AxisEnable-Sts: Input "<0a>M639=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.693568 P0 XF:27IDA-OP:1{MC:3-Ax:7}AxisEnable-Sts: Input "<0a>M739=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.694409 P0 XF:27IDA-OP:1{MC:3-Ax:8}AxisEnable-Sts: Input "<0a>M839=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.695198 P0 XF:27IDA-OP:1{Mono:DCLM-Ax:C2R}PhaseCur-I: Input "<0a>M176=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.696033 P0 XF:27IDA-OP:1{Mono:DCLM-Ax:C1Y}PhaseCur-I: Input "<0a>M276=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.696679 P0 XF:27IDA-OP:1{Mono:DCLM-Ax:C1P}PhaseCur-I: Input "<0a>M376=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.697360 P0 XF:27IDA-OP:1{Mono:DCLM-Ax:C2P}PhaseCur-I: Input "<0a>M476=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.698163 P0 XF:27IDA-OP:1{Mono:DCLM-Ax:BS}PhaseCur-I: Input "<0a>M576=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.698870 P0 XF:27IDA-OP:1{Mono:DCLM-Ax:FS}PhaseCur-I: Input "<0a>M676=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.699585 P0 XF:27IDA-OP:1{MC:3-Ax:7}PhaseCur-I: Input "<0a>M776=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.700237 P0 XF:27IDA-OP:1{MC:3-Ax:8}PhaseCur-I: Input "<0a>M876=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.700939 P0 XF:27IDA-CT{MC:03-Ax:1}Cur-I: Input "<0a>M176=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.701610 P0 XF:27IDA-CT{MC:03-Ax:2}Cur-I: Input "<0a>M276=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.702364 P0 XF:27IDA-CT{MC:03-Ax:3}Cur-I: Input "<0a>M376=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.702967 P0 XF:27IDA-CT{MC:03-Ax:4}Cur-I: Input "<0a>M476=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.703614 P0 XF:27IDA-CT{MC:03-Ax:5}Cur-I: Input "<0a>M576=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.704267 P0 XF:27IDA-CT{MC:03-Ax:6}Cur-I: Input "<0a>M676=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.704949 P0 XF:27IDA-CT{MC:03-Ax:7}Cur-I: Input "<0a>M776=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.705593 P0 XF:27IDA-CT{MC:03-Ax:8}Cur-I: Input "<0a>M876=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.709579 P0 XF:27IDA-CT{MC:03}Val:CPULoad-I: Input "<0a>P575=0<0d><0a>" does not match format "%f"
2023/08/31 07:38:41.710145 P0 XF:27IDA-CT{MC:03}Enbl:PLC_W0-Sts: Input "<0a>P592=1280<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.710715 P0 XF:27IDA-CT{MC:03}Val:Pgm-Sts: Input "<0a>P596=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.711276 P0 XF:27IDA-CT{MC:03}Cnt:MACRO_Err-I: Input "<0a>M5035=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.711846 P0 XF:27IDA-CT{MC:03}Val:CPUTemp-I: Input "<0a>Sys.CpuTemp=0<0d><0a>" does not match format "%f"
2023/08/31 07:38:41.712535 P0 XF:27IDA-CT{MC:03}In:GPIO-Sts: Input "<0a>P594=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.713111 P0 XF:27IDA-CT{MC:03}Out:GPIO-Sel: Input "<0a>P595=0<0d><0a>" does not match format "%d"
2023/08/31 07:38:41.721774 P0 XF:27IDA-OP:1{Mono:DCLM-Ax:C2R}Flt:ELoss-Sts: Input "<0a>P465=0<0d><0a>" does not match format "%i"
2023/08/31 07:38:41.722302 P0 XF:27IDA-OP:1{Mono:DCLM-Ax:C1Y}Flt:ELoss-Sts: Input "<0a>P466=0<0d><0a>" does not match format "%i"
2023/08/31 07:38:41.722951 P0 XF:27IDA-OP:1{Mono:DCLM-Ax:C1P}Flt:ELoss-Sts: Input "<0a>P467=0<0d><0a>" does not match format "%i"
2023/08/31 07:38:41.723458 P0 XF:27IDA-OP:1{Mono:DCLM-Ax:C2P}Flt:ELoss-Sts: Input "<0a>P468=0<0d><0a>" does not match format "%i"
2023/08/31 07:38:41.724001 P0 XF:27IDA-OP:1{Mono:DCLM-Ax:BS}Flt:ELoss-Sts: Input "<0a>P469=0<0d><0a>" does not match format "%i"
2023/08/31 07:38:41.724527 P0 XF:27IDA-OP:1{Mono:DCLM-Ax:FS}Flt:ELoss-Sts: Input "<0a>P470=0<0d><0a>" does not match format "%i"
2023/08/31 07:38:41.725059 P0 XF:27IDA-OP:1{MC:3-Ax:7}Flt:ELoss-Sts: Input "<0a>P471=0<0d><0a>" does not match format "%i"
2023/08/31 07:38:41.725674 P0 XF:27IDA-OP:1{MC:3-Ax:8}Flt:ELoss-Sts: Input "<0a>P472=0<0d><0a>" does not match format "%i"
2023/08/31 07:38:41.731 pmacMessageBroker::getConnectedStatus Connection to hardware restored
2023/08/31 07:38:41.733771 P0 XF:27IDA-CT{MC:03}Enbl:PLC_W1-Sts: Input "<0a>P593=0<0d><0a>" does not match format "%d"

Once this crops up, the only fix is to reboot the IOC and, sometimes, the controller as well. Has anyone else seen this behavior? Could it be caused by anything in the IOC or the controller network stack? We verified that the hardware did not power cycle, and we don't believe the network dropped out globally: other IOCs did not lose connection to their instruments at the same time (not even other PowerBricks; this occurs at different times for different controllers).

@ajgdls have you seen this before?

Hello @jwlodek,

To help diagnose the issue, could you please provide some additional information?

  1. What is the version of the pmac driver that is being used?
  2. Could you specify the CPU architecture of the controllers, both the ones that often disconnect and the ones that don't?

The problem seems similar to some issues found when the driver is used with an ARM-based Power Brick.

I would recommend testing with version 2-6-0.

I hope that helps.

Hi, this does look similar to the problem we recently resolved. The initial error looks like the wrong response came in for a request, which was a symptom of the low-level driver problem we discovered. The latest release should resolve that, if it is the same problem.

I believe we are currently using driver version R2-5-16, with mostly ARM-based controllers. The PR 103 fix does seem to describe the behavior we are seeing. I will try switching to the new driver version and will report back.

We moved the IOC to use R2-6-0, but the problem persists:

info_positions.sav: 531 of 531 PV's connected
epics> 2023/10/25 08:25:32.188034 P0 XF:27IDF-OP:1{CMT:1-Ax:X}AxisEnable-Sts: Input "<0a>M139=0<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.189224 P0 XF:27IDF-OP:1{CMT:1-Ax:Yc}AxisEnable-Sts: Input "<0a>M239=0<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.190587 P0 XF:27IDF-OP:1{CMT:1-Ax:Yf}AxisEnable-Sts: Input "<0a>M339=0<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.191768 P0 XF:27IDF-OP:1{Slt:CMT-Ax:Pitch}AxisEnable-Sts: Input "<0a>M439=0<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.192858 P0 XF:27IDF-OP:1{Slt:CMT-Ax:I}AxisEnable-Sts: Input "<0a>M539=0<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.194222 P0 XF:27IDF-OP:1{Slt:CMT-Ax:O}AxisEnable-Sts: Input "<0a>M639=0<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.195313 P0 XF:27IDF-OP:1{Slt:CMT-Ax:B}AxisEnable-Sts: Input "<0a>M739=0<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.196494 P0 XF:27IDF-OP:1{Slt:CMT-Ax:T}AxisEnable-Sts: Input "<0a>M839=0<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.197903 P0 XF:27IDF-CT{MC:06}Val:CPULoad-I: Input "<0a>P575=0<0d><0a>" does not match format "%f"
2023/10/25 08:25:32.198947 P0 XF:27IDF-CT{MC:06}Enbl:PLC_W0-Sts: Input "<0a>P592=1280<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.200060 P0 XF:27IDF-CT{MC:06}Val:Pgm-Sts: Input "<0a>P596=0<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.201288 P0 XF:27IDF-CT{MC:06}Cnt:MACRO_Err-I: Input "<0a>M5035=0<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.202609 P0 XF:27IDF-CT{MC:06}Val:CPUTemp-I: Input "<0a>Sys.CpuTemp=37<0d><0a>" does not match format "%f"
2023/10/25 08:25:32.203858 P0 XF:27IDF-CT{MC:06}In:GPIO-Sts: Input "<0a>P594=0<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.204905 P0 XF:27IDF-CT{MC:06}Out:GPIO-Sel: Input "<0a>P595=0<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.206259 P0 XF:27IDF-OP:1{CMT:1-Ax:X}PhaseCur-I: Input "<0a>M176=0<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.207317 P0 XF:27IDF-OP:1{CMT:1-Ax:Yc}PhaseCur-I: Input "<0a>M276=0<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.208690 P0 XF:27IDF-OP:1{CMT:1-Ax:Yf}PhaseCur-I: Input "<0a>M376=0<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.209992 P0 XF:27IDF-OP:1{Slt:CMT-Ax:Pitch}PhaseCur-I: Input "<0a>M476=0<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.211390 P0 XF:27IDF-OP:1{Slt:CMT-Ax:I}PhaseCur-I: Input "<0a>M576=0<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.212335 P0 XF:27IDF-OP:1{Slt:CMT-Ax:O}PhaseCur-I: Input "<0a>M676=0<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.213303 P0 XF:27IDF-OP:1{Slt:CMT-Ax:B}PhaseCur-I: Input "<0a>M776=0<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.214662 P0 XF:27IDF-OP:1{Slt:CMT-Ax:T}PhaseCur-I: Input "<0a>M876=0<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.215811 P0 XF:27IDF-CT{MC:06-Ax:1}Cur-I: Input "<0a>Motor[1].IdMeas=8<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.216967 P0 XF:27IDF-CT{MC:06-Ax:2}Cur-I: Input "<0a>Motor[2].IdMeas=-22..." does not match format "%d"
2023/10/25 08:25:32.218214 P0 XF:27IDF-CT{MC:06-Ax:3}Cur-I: Input "<0a>Motor[3].IdMeas=4<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.219468 P0 XF:27IDF-CT{MC:06-Ax:4}Cur-I: Input "<0a>Motor[4].IdMeas=10...." does not match format "%d"
2023/10/25 08:25:32.220604 P0 XF:27IDF-CT{MC:06-Ax:5}Cur-I: Input "<0a>Motor[5].IdMeas=12<0d>..." does not match format "%d"
2023/10/25 08:25:32.221958 P0 XF:27IDF-CT{MC:06-Ax:6}Cur-I: Input "<0a>Motor[6].IdMeas=-20..." does not match format "%d"
2023/10/25 08:25:32.223035 P0 XF:27IDF-CT{MC:06-Ax:7}Cur-I: Input "<0a>Motor[7].IdMeas=-4<0d>..." does not match format "%d"
2023/10/25 08:25:32.224191 P0 XF:27IDF-CT{MC:06-Ax:8}Cur-I: Input "<0a>Motor[8].IdMeas=4<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.228 pmacController::prefastUpdate Error reading trajectory status => M4034
2023/10/25 08:25:32.228 pmacController::prefastUpdate     nvals => 0
2023/10/25 08:25:32.228 pmacController::prefastUpdate     response =>  M4034=0
2023/10/25 08:25:32.248868 P0 XF:27IDF-OP:1{CMT:1-Ax:X}Flt:ELoss-Sts: Input "<0a>P465=0<0d><0a>" does not match format "%i"
2023/10/25 08:25:32.250386 P0 XF:27IDF-OP:1{CMT:1-Ax:Yc}Flt:ELoss-Sts: Input "<0a>P466=0<0d><0a>" does not match format "%i"
2023/10/25 08:25:32.251507 P0 XF:27IDF-OP:1{CMT:1-Ax:Yf}Flt:ELoss-Sts: Input "<0a>P467=0<0d><0a>" does not match format "%i"
2023/10/25 08:25:32.252765 P0 XF:27IDF-OP:1{Slt:CMT-Ax:Pitch}Flt:ELoss-Sts: Input "<0a>P468=0<0d><0a>" does not match format "%i"
2023/10/25 08:25:32.254142 P0 XF:27IDF-OP:1{Slt:CMT-Ax:I}Flt:ELoss-Sts: Input "<0a>P469=0<0d><0a>" does not match format "%i"
2023/10/25 08:25:32.255133 P0 XF:27IDF-OP:1{Slt:CMT-Ax:O}Flt:ELoss-Sts: Input "<0a>P470=0<0d><0a>" does not match format "%i"
2023/10/25 08:25:32.256300 P0 XF:27IDF-OP:1{Slt:CMT-Ax:B}Flt:ELoss-Sts: Input "<0a>P471=0<0d><0a>" does not match format "%i"
2023/10/25 08:25:32.257598 P0 XF:27IDF-OP:1{Slt:CMT-Ax:T}Flt:ELoss-Sts: Input "<0a>P472=0<0d><0a>" does not match format "%i"
2023/10/25 08:25:32.268015 P0 XF:27IDF-CT{MC:06}Enbl:PLC_W1-Sts: Input "<0a>P593=6144<0d><0a>" does not match format "%d"
2023/10/25 08:25:32.276 pmacController::fastUpdate Error reading trajectory current index => M4039
2023/10/25 08:25:32.276 pmacController::fastUpdate     nvals => 0
2023/10/25 08:25:32.276 pmacController::fastUpdate     response =>  M4039=0
2023/10/25 08:25:32.276 pmacController::fastUpdate Error reading trajectory current index => M4038
2023/10/25 08:25:32.276 pmacController::fastUpdate     nvals => 0
2023/10/25 08:25:32.276 pmacController::fastUpdate     response =>  M4038=0
2023/10/25 08:25:32.276 pmacController::fastUpdate Error reading trajectory current buffer => M4040
2023/10/25 08:25:32.276 pmacController::fastUpdate     nvals => 0
2023/10/25 08:25:32.276 pmacController::fastUpdate     response =>  M4040=0
2023/10/25 08:25:32.276 pmacController::parseDoubleVariable Read Error [Phase interrupt time] => Sys.FltrPhaseTime
2023/10/25 08:25:32.276 pmacController::parseDoubleVariable     nvals => 0
2023/10/25 08:25:32.276 pmacController::parseDoubleVariable     response =>  Sys.FltrPhaseTime=13.5362070312500009
2023/10/25 08:25:32.276 pmacController::fastUpdate Error reading or setting params.
2023/10/25 08:25:32.277 pmacHardwarePower::parseAxisStatus Failed to parse CS number =>  Motor[1].Coord=0
2023/10/25 08:25:32.277 pmacHardwarePower::parseAxisStatus Failed to parse CS number =>  Motor[2].Coord=0
2023/10/25 08:25:32.277 pmacHardwarePower::parseAxisStatus Failed to parse CS number =>  Motor[3].Coord=0
2023/10/25 08:25:32.277 pmacHardwarePower::parseAxisStatus Failed to parse CS number =>  Motor[4].Coord=0
2023/10/25 08:25:32.277 pmacHardwarePower::parseAxisStatus Failed to parse CS number =>  Motor[5].Coord=4
2023/10/25 08:25:32.277 pmacHardwarePower::parseAxisStatus Failed to parse CS number =>  Motor[6].Coord=4
2023/10/25 08:25:32.277 pmacHardwarePower::parseAxisStatus Failed to parse CS number =>  Motor[7].Coord=5
2023/10/25 08:25:32.277 pmacHardwarePower::parseAxisStatus Failed to parse CS number =>  Motor[8].Coord=5
2023/10/25 08:25:32.277 pmacCSAxis::GetAxisStatus: Failed to parse position. Key: &4Q81  Value:  Q81=0
2023/10/25 08:25:32.277 pmacCSAxis::GetAxisStatus: Failed to parse position. Key: &4Q82  Value:  Q82=0
2023/10/25 08:25:32.277 pmacCSAxis::GetAxisStatus: Failed to parse position. Key: &4Q83  Value:  Q83=0
2023/10/25 08:25:32.277 pmacCSAxis::GetAxisStatus: Failed to parse position. Key: &4Q84  Value:  Q84=0
...

It appears these errors are caused by the echo not being set correctly.

2023/10/25 08:25:32.277 pmacHardwarePower::parseAxisStatus Failed to parse CS number =>  Motor[1].Coord=0
2023/10/25 08:25:32.277 pmacHardwarePower::parseAxisStatus Failed to parse CS number =>  Motor[2].Coord=0
2023/10/25 08:25:32.277 pmacHardwarePower::parseAxisStatus Failed to parse CS number =>  Motor[3].Coord=0
2023/10/25 08:25:32.277 pmacHardwarePower::parseAxisStatus Failed to parse CS number =>  Motor[4].Coord=0
2023/10/25 08:25:32.277 pmacHardwarePower::parseAxisStatus Failed to parse CS number =>  Motor[5].Coord=4
2023/10/25 08:25:32.277 pmacHardwarePower::parseAxisStatus Failed to parse CS number =>  Motor[6].Coord=4
2023/10/25 08:25:32.277 pmacHardwarePower::parseAxisStatus Failed to parse CS number =>  Motor[7].Coord=5
2023/10/25 08:25:32.277 pmacHardwarePower::parseAxisStatus Failed to parse CS number =>  Motor[8].Coord=5
2023/10/25 08:25:32.277 pmacCSAxis::GetAxisStatus: Failed to parse position. Key: &4Q81  Value:  Q81=0
2023/10/25 08:25:32.277 pmacCSAxis::GetAxisStatus: Failed to parse position. Key: &4Q82  Value:  Q82=0
2023/10/25 08:25:32.277 pmacCSAxis::GetAxisStatus: Failed to parse position. Key: &4Q83  Value:  Q83=0
2023/10/25 08:25:32.277 pmacCSAxis::GetAxisStatus: Failed to parse position. Key: &4Q84  Value:  Q84=0

During initialSetup, if cid_ == PMAC_CID_POWER_, the echo is expected to be set to 7.
That makes the controller return only the value, without the variable name, e.g. just 5 instead of Motor[8].Coord=5.
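
To illustrate why the echoed form breaks parsing, here is a minimal Python sketch. It is not code from the pmac driver: both helper names are hypothetical, and the strict parser only approximates a "%d"-style scan. It shows that a reply such as M139=0 fails a bare-integer parse, while the same reply without the echoed name succeeds.

```python
import re

_INT = re.compile(r"[+-]?\d+")


def parse_int_strict(reply: str):
    """Approximate a '%d' scan: after trimming whitespace, the reply
    must be a bare integer. Returns None when it is not."""
    text = reply.strip()
    return int(text) if _INT.fullmatch(text) else None


def parse_int_tolerant(reply: str):
    """Hypothetical tolerant variant: also accept 'Name=value' replies,
    which is what the controller sends when the echo is not 7."""
    # rpartition returns the whole string as the last element when
    # there is no '=', so this handles both reply forms.
    value = reply.strip().rpartition("=")[2]
    return int(value) if _INT.fullmatch(value) else None


if __name__ == "__main__":
    for raw in ["\n0\r\n", "\nM139=0\r\n", " Motor[8].Coord=5"]:
        print(repr(raw), parse_int_strict(raw), parse_int_tolerant(raw))
```

A tolerant parser like this would only mask the symptom; the underlying question of why the echo setting is wrong remains.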

@jwlodek, could you please check what the controller returns when a cid is issued, or verify the cid_ variable?

Besides that, are you still seeing the message pmacMessageBroker::lowLevelWriteRead Connection to hardware lost after moving to R2-6-0?

Since moving to the new version, we no longer seem to see the Connection to hardware lost message. We confirmed that the echo is set to 7, and the CID for one of the affected controllers is 604020. However, the same intermittent connection dropout appears to persist.

@jwlodek

Are the Power Bricks experiencing this issue covered by IOCs with multiple controllers?

As far as I know, every IOC is mapped directly to a single controller. I am going to try and add some additional debug messages to the driver code to see where exactly it is losing communication. We may also try setting up some of these controllers in our lab to have easier debugging.

Ok, found something new that might be helpful. As I mentioned before, on IOC startup the echo parameter is set to 7 as expected. But at some point while the IOC is running, even without any move commands being issued, the echo is being set back to 0. We had a frozen IOC, and when I checked, the echo query returned zero. Switching it back to 7, even without rebooting the IOC, recovered communication, and it seemed that even previously buffered move commands were executed.

I'm going to put together some simple scripts to monitor the echo parameter while the IOCs are running.
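
A monitoring script along those lines could look like the Python sketch below. It is only an outline under stated assumptions: how the echo value is actually read from the controller is not shown, so `query_echo` is a hypothetical callable that the user must supply, and the function name, poll count, and interval are illustrative choices. The loop records a timestamp each time the value changes and flags any value other than the expected 7.

```python
import time
from datetime import datetime
from typing import Callable, List, Tuple


def watch_echo(
    query_echo: Callable[[], int],  # hypothetical: returns current echo value
    expected: int = 7,
    polls: int = 10,
    interval_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[str, int]]:
    """Poll the echo value and return (timestamp, value) for each change."""
    changes: List[Tuple[str, int]] = []
    last = None
    for i in range(polls):
        value = query_echo()
        if value != last:
            stamp = datetime.now().isoformat(timespec="seconds")
            changes.append((stamp, value))
            flag = "" if value == expected else "  <-- unexpected"
            print(f"{stamp} echo={value}{flag}")
            last = value
        if i < polls - 1:
            sleep(interval_s)  # no wait after the final poll
    return changes


if __name__ == "__main__":
    # Simulated readings standing in for a real controller query.
    readings = iter([7, 7, 0, 0, 7])
    watch_echo(lambda: next(readings), polls=5, interval_s=0.0)
```

Correlating the logged change times with the IOC shell timestamps may help narrow down what resets the echo.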