openbmc / linux

OpenBMC Linux kernel source tree

fsi: Unable to handle kernel NULL pointer dereference

wangzqbj opened this issue · comments

root@fp5280g2:~# [ 100.885399] Unable to handle kernel NULL pointer dereference at virtual address 00000060
[ 100.893632] pgd = 7938abee
[ 100.896357] [00000060] *pgd=958a6831, *pte=00000000, *ppte=00000000
[ 100.902799] Internal error: Oops: 17 [#1] ARM
[ 100.907191] CPU: 0 PID: 1422 Comm: openpower-proc- Tainted: G W 5.1.6-2f50135-dirty-2b98b34 #1
[ 100.917100] Hardware name: Generic DT based system
[ 100.921911] PC is at device_del+0x40/0x370
[ 100.926017] LR is at device_del+0x38/0x370
[ 100.930111] pc : [<80411de8>] lr : [<80411de0>] psr: 60000013
[ 100.936375] sp : 946a3d58 ip : 946a3d58 fp : 946a3da4
[ 100.941591] r10: 00000051 r9 : 946a3f60 r8 : 80a07008
[ 100.946811] r7 : 9450d800 r6 : 80523ff8 r5 : 94503634 r4 : 94503600
[ 100.953326] r3 : 9e3a8aa0 r2 : 00000000 r1 : 00000000 r0 : 94503634
[ 100.959845] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[ 100.966970] Control: 00c5387d Table: 9582c008 DAC: 00000051
[ 100.972715] Process openpower-proc- (pid: 1422, stack limit = 0xfe0c27f9)
[ 100.979503] Stack: (0x946a3d58 to 0x946a4000)
[ 100.983865] 3d40: 80256bd0 801f74c0
[ 100.992041] 3d60: 8f40c2a8 80a07008 957f2180 00000000 00000000 6273f3e8 946a3e04 94503600
[ 101.000213] 3d80: 945036f4 80523ff8 80a07008 94c2fc10 946a3f60 00000051 946a3dbc 946a3da8
[ 101.008388] 3da0: 8023ef2c 80411db4 94503600 00000000 946a3dd4 946a3dc0 80524028 8023ef14
[ 101.016559] 3dc0: 94509910 00000000 946a3e04 946a3dd8 8041259c 80524004 946a3e04 94509f00
[ 101.024732] 3de0: 94509910 6273f3e8 94503400 945034fc 94c2fc00 9465f420 946a3e1c 946a3e08
[ 101.032905] 3e00: 805258a0 80412540 00000001 00000000 946a3e34 946a3e20 805258dc 80525878
[ 101.041076] 3e20: 805258c0 00000000 946a3e4c 946a3e38 804103c8 805258cc 804103a0 00000000
[ 101.049249] 3e40: 946a3e64 946a3e50 802aca88 804103ac 00000000 00000000 946a3e9c 946a3e68
[ 101.057423] 3e60: 802abeb8 802aca4c 00000000 00000000 00000100 95076780 80a07008 802abdac
[ 101.065597] 3e80: 946a3f60 00000000 00000001 946a3f60 946a3f24 946a3ea0 80237e10 802abdb8
[ 101.073775] 3ea0: 801ec8cc 95032738 946a3f24 946a3eb8 8021def8 801ec8d8 00000000 80237a60
[ 101.081949] 3ec0: 00000000 9508fd20 00000054 00100cca 000000f7 4c9f7000 9582d320 9582d320
[ 101.090121] 3ee0: 00000000 00000000 00000000 802397b8 00000000 6273f3e8 00000000 00000001
[ 101.098293] 3f00: 95076780 007a03d0 946a3f60 00000000 00000000 00000000 946a3f54 946a3f28
[ 101.106468] 3f20: 802397f4 80237dd8 802458f4 8025936c 946a3f54 95076780 95076780 80a07008
[ 101.114638] 3f40: 007a03d0 00000000 946a3f94 946a3f58 80239aa0 80239750 00000000 00000000
[ 101.122812] 3f60: 00000000 00000000 946a3fac 6273f3e8 00000001 007a03d0 00000003 00000004
[ 101.130987] 3f80: 801011e4 946a2000 946a3fa4 946a3f98 80239b30 80239a3c 00000000 946a3fa8
[ 101.139164] 3fa0: 80101000 80239b24 00000001 007a03d0 00000003 007a03d0 00000001 00000000
[ 101.147338] 3fc0: 00000001 007a03d0 00000003 00000004 0006ac04 0006ac18 7eea1c88 7eea1a1c
[ 101.155510] 3fe0: 00000070 7eea19a8 4c9984ac 4c604fec 60000010 00000003 00000000 00000000
[ 101.163671] Backtrace:
[ 101.166142] [<80411da8>] (device_del) from [<8023ef2c>] (cdev_device_del+0x24/0x3c)
[ 101.173800] r10:00000051 r9:946a3f60 r8:94c2fc10 r7:80a07008 r6:80523ff8 r5:945036f4
[ 101.181623] r4:94503600
[ 101.184175] [<8023ef08>] (cdev_device_del) from [<80524028>] (fsi_master_remove_slave+0x30/0x44)
[ 101.192955] r5:00000000 r4:94503600
[ 101.196548] [<80523ff8>] (fsi_master_remove_slave) from [<8041259c>] (device_for_each_child+0x68/0xa4)
[ 101.205839] r5:00000000 r4:94509910
[ 101.209425] [<80412534>] (device_for_each_child) from [<805258a0>] (fsi_master_rescan+0x34/0x54)
[ 101.218204] r7:9465f420 r6:94c2fc00 r5:945034fc r4:94503400
[ 101.223867] [<8052586c>] (fsi_master_rescan) from [<805258dc>] (master_rescan_store+0x1c/0x28)
[ 101.232473] r5:00000000 r4:00000001
[ 101.236053] [<805258c0>] (master_rescan_store) from [<804103c8>] (dev_attr_store+0x28/0x34)
[ 101.244398] r5:00000000 r4:805258c0
[ 101.247995] [<804103a0>] (dev_attr_store) from [<802aca88>] (sysfs_kf_write+0x48/0x54)
[ 101.255907] r5:00000000 r4:804103a0
[ 101.259496] [<802aca40>] (sysfs_kf_write) from [<802abeb8>] (kernfs_fop_write+0x10c/0x1ec)
[ 101.267754] r5:00000000 r4:00000000
[ 101.271343] [<802abdac>] (kernfs_fop_write) from [<80237e10>] (__vfs_write+0x44/0x18c)
[ 101.279261] r10:946a3f60 r9:00000001 r8:00000000 r7:946a3f60 r6:802abdac r5:80a07008
[ 101.287078] r4:95076780
[ 101.289622] [<80237dcc>] (__vfs_write) from [<802397f4>] (vfs_write+0xb0/0x194)
[ 101.296932] r10:00000000 r9:00000000 r8:00000000 r7:946a3f60 r6:007a03d0 r5:95076780
[ 101.304750] r4:00000001
[ 101.307289] [<80239744>] (vfs_write) from [<80239aa0>] (ksys_write+0x70/0xe8)
[ 101.314420] r8:00000000 r7:007a03d0 r6:80a07008 r5:95076780 r4:95076780
[ 101.321124] [<80239a30>] (ksys_write) from [<80239b30>] (sys_write+0x18/0x1c)
[ 101.328254] r9:946a2000 r8:801011e4 r7:00000004 r6:00000003 r5:007a03d0 r4:00000001
[ 101.335997] [<80239b18>] (sys_write) from [<80101000>] (ret_fast_syscall+0x0/0x54)
[ 101.343563] Exception stack(0x946a3fa8 to 0x946a3ff0)
[ 101.348620] 3fa0: 00000001 007a03d0 00000003 007a03d0 00000001 00000000
[ 101.356795] 3fc0: 00000001 007a03d0 00000003 00000004 0006ac04 0006ac18 7eea1c88 7eea1a1c
[ 101.364962] 3fe0: 00000070 7eea19a8 4c9984ac 4c604fec
[ 101.370022] Code: e50b3030 eb0a20c0 e5942004 e1a00005 (e5d23060)
[ 101.377616] ---[ end trace f6753b63bef46513 ]---
[ 101.382263] Kernel panic - not syncing: Fatal exception
[ 101.387499] ---[ end Kernel panic - not syncing: Fatal exception ]---

obmcutil poweron & journalctl -f
Jun 06 01:15:29 fp5280g2 systemd[1]: Started SSH Per-Connection Server (100.2.56.35:36652).
Jun 06 01:15:29 fp5280g2 dropbear[1421]: Child connection from ::ffff:100.2.56.35:36652
Jun 06 01:15:31 fp5280g2 dropbear[1421]: PAM password auth succeeded for 'root' from ::ffff:100.2.56.35:36652
Jun 06 01:15:31 fp5280g2 dropbear[1421]: Exit (root): Disconnect received
Jun 06 01:15:31 fp5280g2 systemd[1]: dropbear@2-100.2.36.164:22-100.2.56.35:36652.service: Succeeded.
Jun 06 01:15:41 fp5280g2 phosphor-host-state-manager[1388]: Host State transaction request
Jun 06 01:15:42 fp5280g2 systemd[1]: Created slice system-fsi\x2dscan.slice.
Jun 06 01:15:42 fp5280g2 systemd[1]: Created slice system-phosphor\x2dreset\x2dhost\x2dreboot\x2dattempts.slice.
Jun 06 01:15:42 fp5280g2 systemd[1]: Starting Wait for /org/openbmc/mboxd...
Jun 06 01:15:42 fp5280g2 systemd[1]: Created slice system-mboxd\x2dreload.slice.
Jun 06 01:15:42 fp5280g2 systemd[1]: Created slice system-start_host.slice.
Jun 06 01:15:42 fp5280g2 systemd[1]: Starting Wait for /xyz/openbmc_project/watchdog/host0...
Jun 06 01:15:42 fp5280g2 systemd[1]: Starting Wait for /xyz/openbmc_project/led/groups/power_on...
Jun 06 01:15:42 fp5280g2 systemd[1]: Created slice system-op\x2dpower\x2dstart.slice.
Jun 06 01:15:42 fp5280g2 systemd[1]: Created slice system-phosphor\x2dfan\x2dpresence\x2dtach.slice.
Jun 06 01:15:42 fp5280g2 systemd[1]: Created slice system-obmc\x2denable\x2dhost\x2dwatchdog.slice.
Jun 06 01:15:42 fp5280g2 systemd[1]: Created slice system-cfam_override.slice.
Jun 06 01:15:42 fp5280g2 systemd[1]: Created slice system-phosphor\x2dfan\x2dmonitor\x2dinit.slice.
Jun 06 01:15:42 fp5280g2 systemd[1]: Created slice system-op\x2dwait\x2dpower\x2don.slice.
Jun 06 01:15:42 fp5280g2 systemd[1]: Starting Wait for Power0 to turn on...
Jun 06 01:15:42 fp5280g2 systemd[1]: Created slice system-mapper\x2dsubtree\x2dremove.slice.
Jun 06 01:15:42 fp5280g2 systemd[1]: Starting mapper subtree-remove /xyz/openbmc_project/software:xyz.openbmc_project.Software.ActivationBlocksTransition...
Jun 06 01:15:42 fp5280g2 systemd[1]: Created slice system-phosphor\x2dfan\x2dcontrol\x2dinit.slice.
Jun 06 01:15:42 fp5280g2 systemd[1]: Created slice system-phosphor\x2dwatchdog.slice.
Jun 06 01:15:42 fp5280g2 systemd[1]: Created slice system-phosphor\x2dgpio\x2dmonitor.slice.
Jun 06 01:15:43 fp5280g2 systemd[1]: Started Phosphor GPIO checkstop monitor.
Jun 06 01:15:43 fp5280g2 systemd[1]: Starting Reset host reboot counter...
Jun 06 01:15:43 fp5280g2 systemd[1]: Started Phosphor poweron watchdog.
Jun 06 01:15:43 fp5280g2 systemd[1]: Started Wait for /org/openbmc/mboxd.
Jun 06 01:15:43 fp5280g2 systemd[1]: Started Wait for /xyz/openbmc_project/led/groups/power_on.
Jun 06 01:15:43 fp5280g2 phosphor-watchdog[1436]: Action Targets:
Jun 06 01:15:43 fp5280g2 phosphor-watchdog[1436]: xyz.openbmc_project.State.Watchdog.Action.PowerCycle -> obmc-host-timeout@0.target
Jun 06 01:15:43 fp5280g2 phosphor-watchdog[1436]: xyz.openbmc_project.State.Watchdog.Action.HardReset -> obmc-host-timeout@0.target
Jun 06 01:15:43 fp5280g2 phosphor-watchdog[1436]: xyz.openbmc_project.State.Watchdog.Action.PowerOff -> obmc-host-timeout@0.target
Jun 06 01:15:43 fp5280g2 systemd[1]: Started mapper subtree-remove /xyz/openbmc_project/software:xyz.openbmc_project.Software.ActivationBlocksTransition.
Jun 06 01:15:43 fp5280g2 systemd[1]: phosphor-reset-host-reboot-attempts@0.service: Succeeded.
Jun 06 01:15:43 fp5280g2 systemd[1]: Started Reset host reboot counter.
Jun 06 01:15:43 fp5280g2 systemd[1]: Started Wait for /xyz/openbmc_project/watchdog/host0.
Jun 06 01:15:44 fp5280g2 systemd[1]: mapper-subtree-remove@-xyz-openbmc\x5fproject-software\x3Axyz.openbmc_project.Software.ActivationBlocksTransition.service: Succeeded.
Jun 06 01:15:44 fp5280g2 systemd[1]: Starting Assert power_on LED...
Jun 06 01:15:44 fp5280g2 systemd[1]: Starting Reload mboxd during power on...
Jun 06 01:15:44 fp5280g2 mboxctl[1443]: Reset: Success
Jun 06 01:15:44 fp5280g2 systemd[1]: Started Reload mboxd during power on.
Jun 06 01:15:44 fp5280g2 systemd[1]: Reached target Power0 On (Pre).
Jun 06 01:15:44 fp5280g2 systemd[1]: Starting Start Power0...
Jun 06 01:15:44 fp5280g2 systemd[1]: Started Assert power_on LED.
Jun 06 01:15:44 fp5280g2 power_control.exe[1290]: PowerControl: setting power up BMC_CPLD_SOFTWARE_PG_N to 0
Jun 06 01:15:44 fp5280g2 power_control.exe[1290]: PowerControl: setting power up BMC_CPLD_SYS_PWRON to 0
Jun 06 01:15:44 fp5280g2 systemd[1]: Started Start Power0.
Jun 06 01:15:48 fp5280g2 systemd[1]: Started Wait for Power0 to turn on.
Jun 06 01:15:48 fp5280g2 systemd[1]: Reached target Power0 On.
Jun 06 01:15:48 fp5280g2 systemd[1]: Reached target Power0 (On).
Jun 06 01:15:48 fp5280g2 systemd[1]: Started Phosphor Fan Control Initialization.
Jun 06 01:15:48 fp5280g2 systemd[1]: Started Phosphor Fan Monitor Initialization.
Jun 06 01:15:49 fp5280g2 systemd[1]: Started Phosphor Fan Presence Tach Daemon.
Jun 06 01:15:49 fp5280g2 systemd[1]: Starting Scan FSI devices...
Jun 06 01:15:49 fp5280g2 systemd[1]: phosphor-fan-monitor-init@0.service: Succeeded.

It happened when the command obmcutil poweron was executed for the first time after AC power was applied.

  1. AC applied:
    ac_applied_before_poweron.log
  2. obmcutil poweron
    first_poweron.log
  3. kernel panic
    kernel_panic.log
  4. bmc reboot
    bmc_reboot.log
  5. obmcutil poweron
    second_poweron.log

All operations were executed via minicom; the log below was recorded by minicom from the moment AC was applied.
obmc-minicom.log

I logged in to the BMC before and after the kernel panic; these logs were captured with journalctl.
obmc-boot-before-panic.log
obmc-boot-after-panic.log

what I did:

  1. AC applied
  2. Login to bmc
  3. journalctl > /tmp/log # It's obmc-boot-before-panic.log
  4. journalctl -f &
  5. obmcutil poweron
  6. Login to bmc
  7. journalctl > /tmp/log # It's obmc-boot-after-panic.log
  8. obmcutil poweron (host boots successfully)
  9. obmcutil poweroff
  10. obmcutil poweron (boot host successfully)

Sorry, I did not merge these logs because I was worried about losing some information.

To clarify: the issue only happens on the first power-on after an AC cycle.
It does not happen after the BMC reboots.

The issue is related to a NULL pointer dereference in fsi_master_remove_slave.
The related code is:

    static int fsi_slave_remove_device(struct device *dev, void *arg)
    {
        device_unregister(dev);
        return 0;
    }

    static int fsi_master_remove_slave(struct device *dev, void *arg)
    {
        struct fsi_slave *slave = to_fsi_slave(dev);

        device_for_each_child(dev, NULL, fsi_slave_remove_device);
        cdev_device_del(&slave->cdev, &slave->dev);
        put_device(dev);
        return 0;
    }

There is an odd message from the kernel's FSI driver in ac_applied_before_poweron.log:

Jun 06 01:05:42 fp5280g2 kernel:  fsi0: can't set smode on slave:00:00 -5

This happens in fsi_slave_init():

    rc = fsi_slave_set_smode(slave);
    if (rc) {
        dev_warn(&master->dev,
                 "can't set smode on slave:%02x:%02x %d\n",
                 link, id, rc);
        kfree(slave);
        return -ENODEV;
    }

So it could be that the slave device is NULL, and the code then uses that NULL pointer during the FSI rescan.

But it is odd that the above message only appears on the first boot after an AC cycle.

From past experience, I expect the problem is calling device_del for a device that was never added. The device model core tries to WARN that the private data was not initialized, but instead dereferences a NULL pointer when trying to obtain the name to print, because the name was never set.
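
The hypothesis above can be modelled in userspace. This is a minimal sketch, assuming a device whose private state is only allocated when it is added; `model_device`, `model_private`, `model_device_add` and `model_device_del` are hypothetical names, not kernel APIs. The small fault address in the oops (0x60) looks like a field offset from a NULL base, which would fit this picture, but that is an inference and not something confirmed in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for per-device private state. */
struct model_private {
    int dead;
};

struct model_device {
    struct model_private *p; /* NULL until model_device_add() succeeds */
};

/* Models "add": allocates the private state. Returns 0 on success. */
static int model_device_add(struct model_device *dev)
{
    assert(dev != NULL);
    dev->p = calloc(1, sizeof(*dev->p));
    return dev->p ? 0 : -1;
}

/*
 * Models "del" with an explicit guard. An unguarded version would
 * write through dev->p and fault when the device was never added.
 */
static int model_device_del(struct model_device *dev)
{
    assert(dev != NULL);
    if (dev->p == NULL)
        return -1;          /* never added: nothing to tear down */
    dev->p->dead = 1;
    free(dev->p);
    dev->p = NULL;
    return 0;
}
```

Calling `model_device_del()` on a zero-initialized `struct model_device` returns -1 in this model, whereas the same call after a successful `model_device_add()` returns 0.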

The code

    rc = fsi_slave_set_smode(slave);
    if (rc) {
        dev_warn(&master->dev,
                 "can't set smode on slave:%02x:%02x %d\n",
                 link, id, rc);
        kfree(slave);
        return -ENODEV;
    }

frees the slave and returns directly instead of using goto err_free, which seems wrong.
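
For reference, the shared error-path idiom that the comment refers to can be sketched in userspace. This is only an illustration under assumptions: `model_slave`, `model_slave_init`, the `fail_at` parameter, the `err_put_minor` label and the two counters are hypothetical, and the real `err_free` path in the driver may undo different things.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical counters standing in for resources taken during init. */
static int minors_in_use;
static int slaves_allocated;

struct model_slave {
    int minor;
};

/*
 * Models an init routine with one shared error path. fail_at selects
 * the step that fails: 0 = none, 1 = minor allocation, 2 = a later
 * setup step (comparable to the smode failure quoted above).
 */
static int model_slave_init(struct model_slave **out, int fail_at)
{
    struct model_slave *slave;
    int rc;

    assert(out != NULL);

    slave = calloc(1, sizeof(*slave));
    if (!slave)
        return -1;
    slaves_allocated++;

    if (fail_at == 1) {
        rc = -2;
        goto err_free;
    }
    slave->minor = ++minors_in_use;

    if (fail_at == 2) {
        rc = -3;
        goto err_put_minor;   /* undo everything taken so far */
    }

    *out = slave;
    return 0;

err_put_minor:
    minors_in_use--;
err_free:
    free(slave);
    slaves_allocated--;
    return rc;
}
```

Every failure leaves through the labels, so each resource taken before the failing step is released exactly once. A bare free-and-return in the middle of the function would skip the `err_put_minor` step in this model.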

After fixing this piece of code, the issue is gone.


Update:
With the above fix, the original issue is gone, but the kernel reports another error, which indicates that something is used after it has been freed.

Jun 13 08:13:36 fp5280g2 kernel: WARNING: CPU: 0 PID: 1417 at lib/refcount.c:190 refcount_sub_and_test_checked+0x94/0xac
Jun 13 08:13:36 fp5280g2 kernel: refcount_t: underflow; use-after-free.
...

The root cause is not fixed.

It appears that two kobjects are created within the fsi_slave structure: a cdev and a device. That can never work, as they each have their own release method.
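
One way to read this remark is that only one of the two embedded objects may own the container's memory, and the other has to hold a reference on the owner. The userspace sketch below models that arrangement. It is an assumption-laden illustration, not the driver's actual fix: `model_kobj`, `model_slave`, `model_get`, `model_put` and the release functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static int frees; /* counts how often the container's memory was released */

/* Minimal refcounted object with a release callback. */
struct model_kobj {
    int refs;
    void (*release)(struct model_kobj *k);
};

/* Container embedding two refcounted objects. */
struct model_slave {
    struct model_kobj dev;   /* owns the container's lifetime */
    struct model_kobj cdev;  /* pins dev while it is alive */
};

static void model_get(struct model_kobj *k)
{
    k->refs++;
}

static void model_put(struct model_kobj *k)
{
    assert(k->refs > 0);     /* a put below zero would be an underflow */
    if (--k->refs == 0 && k->release)
        k->release(k);
}

/* Only dev's release frees the container (dev is its first member). */
static void dev_release(struct model_kobj *k)
{
    frees++;
    free((struct model_slave *)k);
}

/* cdev's release drops the reference it held on dev and frees nothing. */
static void cdev_release(struct model_kobj *k)
{
    struct model_slave *s = (struct model_slave *)
        ((char *)k - offsetof(struct model_slave, cdev));

    model_put(&s->dev);
}

static struct model_slave *model_slave_new(void)
{
    struct model_slave *s = calloc(1, sizeof(*s));

    if (!s)
        return NULL;
    s->dev.refs = 1;
    s->dev.release = dev_release;
    s->cdev.refs = 1;
    s->cdev.release = cdev_release;
    model_get(&s->dev);      /* reference held on behalf of cdev */
    return s;
}
```

In this model the container is freed exactly once, after both the creator's reference on `dev` and the last reference on `cdev` have been dropped. If both release callbacks freed the container, or if an extra put were issued, the result would be a double free or a refcount underflow like the one in the warning above.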