Bumblebee-Project / Bumblebee

Bumblebee daemon and client rewritten in C

Home Page: http://www.bumblebee-project.org/

Laptop freezes when starting X11 and discrete graphics are OFF

jgkamat opened this issue

[edit by @Lekensteyn]
This issue affects newer laptops (from about 2015-2016) with Skylake and GTX 9xxM/10xx cards.
A workaround exists for some laptops, see #764 (comment)
[/edit]


I'm having a weird issue, and I'm not sure what kind of debug information is necessary, but let me know what to give and I'll supply anything you need.

When I start my graphics (lxdm), I get a freeze (keyboard stops working, no response on monitor at all, even log files stop working), but I can work around this by enabling the graphics card before starting graphics.

System (installed with bumblebee-nvidia in debian testing repos):

Debian Testing
GTX 965M
Nvidia Proprietary Driver: 352.79 
Laptop: SAGER NP7258

Optirun --version:

optirun (Bumblebee) 3.2.1
Copyright (C) 2011 The Bumblebee Project
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

My laptop does not seem to work without optimus: the intel drivers work fine, but trying to run without the intel drivers (nvidia only) results in a frozen screen. Using the workaround works perfectly for me, however.

Steps to Reproduce:

  1. systemctl start bumblebeed
  2. systemctl start lxdm
  3. Freeze occurs

Workaround:

  1. systemctl start bumblebeed
  2. echo "ON" >/proc/acpi/bbswitch
  3. systemctl start lxdm
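
The workaround above can be wrapped in a small helper so it does not have to be typed by hand every time. This is only a sketch: it assumes bbswitch reports its state as a line such as "0000:01:00.0 OFF" in /proc/acpi/bbswitch, and the function names and the BBSWITCH variable are made up for illustration.

```shell
#!/bin/sh
# Path to the bbswitch control file. The variable is hypothetical and only
# exists so the helper can be pointed at a scratch file while testing.
BBSWITCH="${BBSWITCH:-/proc/acpi/bbswitch}"

# Print the power state (ON or OFF) from a bbswitch status line,
# for example "0000:01:00.0 OFF" gives "OFF".
bbswitch_state() {
    printf '%s\n' "${1##* }"
}

# Turn the discrete card on unless it is already reported as ON.
ensure_card_on() {
    state=$(bbswitch_state "$(cat "$BBSWITCH")")
    if [ "$state" != "ON" ]; then
        echo ON > "$BBSWITCH"
    fi
}
```

Run as root, something like "ensure_card_on && systemctl start lxdm" would then replace steps 2 and 3.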

Unfortunately, the X11 log files don't seem to survive the freeze (they show everything completed successfully, probably from the previous successful boot). If you know any way of retrieving them I'd be happy to supply them though! (When the system freezes, even my shell history file gets corrupted).

I did have to make some changes to my config files to get things to work in my situation though, I'll post anything I remember changing below. Let me know if you need any more information, I am happy to supply it! Without bumblebee, my laptop would be unusable 👍

bumblebee.conf

# Configuration file for Bumblebee. Values should **not** be put between quotes

## Server options. Any change made in this section will need a server restart
# to take effect.
[bumblebeed]
# The secondary Xorg server DISPLAY number
VirtualDisplay=:8
# Should the unused Xorg server be kept running? Set this to true if waiting
# for X to be ready is too long and don't need power management at all.
KeepUnusedXServer=false
# The name of the Bumbleblee server group name (GID name)
ServerGroup=bumblebee
# Card power state at exit. Set to false if the card shoud be ON when Bumblebee
# server exits.
TurnCardOffAtExit=false
# The default behavior of '-f' option on optirun. If set to "true", '-f' will
# be ignored.
NoEcoModeOverride=false
# The Driver used by Bumblebee server. If this value is not set (or empty),
# auto-detection is performed. The available drivers are nvidia and nouveau
# (See also the driver-specific sections below)
Driver=nvidia
# Directory with a dummy config file to pass as a -configdir to secondary X
XorgConfDir=/etc/bumblebee/xorg.conf.d

## Client options. Will take effect on the next optirun executed.
[optirun]
# Acceleration/ rendering bridge, possible values are auto, virtualgl and
# primus.
Bridge=auto
# The method used for VirtualGL to transport frames between X servers.
# Possible values are proxy, jpeg, rgb, xv and yuv.
VGLTransport=proxy
# List of paths which are searched for the primus libGL.so.1 when using
# the primus bridge
PrimusLibraryPath=/usr/lib/x86_64-linux-gnu/primus:/usr/lib/i386-linux-gnu/primus:/usr/lib/primus:/usr/lib32/primus
# Should the program run under optirun even if Bumblebee server or nvidia card
# is not available?
AllowFallbackToIGC=false


# Driver-specific settings are grouped under [driver-NAME]. The sections are
# parsed if the Driver setting in [bumblebeed] is set to NAME (or if auto-
# detection resolves to NAME).
# PMMethod: method to use for saving power by disabling the nvidia card, valid
# values are: auto - automatically detect which PM method to use
#         bbswitch - new in BB 3, recommended if available
#       switcheroo - vga_switcheroo method, use at your own risk
#             none - disable PM completely
# https://github.com/Bumblebee-Project/Bumblebee/wiki/Comparison-of-PM-methods

## Section with nvidia driver specific options, only parsed if Driver=nvidia
[driver-nvidia]
# Module name to load, defaults to Driver if empty or unset
KernelDriver=nvidia-current
PMMethod=bbswitch
# colon-separated path to the nvidia libraries
LibraryPath=/usr/lib/x86_64-linux-gnu/nvidia:/usr/lib/i386-linux-gnu/nvidia:/usr/lib/nvidia
# comma-separated path of the directory containing nvidia_drv.so and the
# default Xorg modules path
XorgModulePath=/usr/lib/nvidia,/usr/lib/xorg/modules
XorgConfFile=/etc/bumblebee/xorg.conf.nvidia

## Section with nouveau driver specific options, only parsed if Driver=nouveau
[driver-nouveau]
KernelDriver=nouveau
PMMethod=auto
XorgConfFile=/etc/bumblebee/xorg.conf.nouveau

xorg.conf.nvidia

Section "ServerLayout"
Identifier  "Layout0"
Option      "AutoAddDevices" "false"
Option      "AutoAddGPU" "false"
EndSection

Section "Device"
Identifier  "DiscreteNvidiaj"
Driver      "nvidia"
VendorName  "NVIDIA Corporation"

#   If the X server does not automatically detect your VGA device,
#   you can manually set it here.
#   To get the BusID prop, run `lspci | egrep 'VGA|3D'` and input the data
#   as you see in the commented example.
#   This Setting may be needed in some platforms with more than one
#   nvidia card, which may confuse the proprietary driver (e.g.,
#   trying to take ownership of the wrong device). Also needed on Ubuntu 13.04.
BusID "PCI:01:00:0"

#   Setting ProbeAllGpus to false prevents the new proprietary driver
#   instance spawned to try to control the integrated graphics card,
#   which is already being managed outside bumblebee.
#   This option doesn't hurt and it is required on platforms running
#   more than one nvidia graphics card with the proprietary driver.
#   (E.g. Macbook Pro pre-2010 with nVidia 9400M + 9600M GT).
#   If this option is not set, the new Xorg may blacken the screen and
#   render it unusable (unless you have some way to run killall Xorg).
Option "ProbeAllGpus" "false"

Option "NoLogo" "true"
Option "UseEDID" "false"
Option "UseDisplayDevice" "none"
EndSection

# Section "Screen"
#     Identifier "Default Screen"
#   Device "DiscreteNvidia"
# EndSection

If you run:

sudo update-glx --config glx

What is the selected config? It should be /usr/lib/nvidia/bumblebee.

Does the same problem happen if you choose /usr/lib/mesa-diverted instead?

Finally, do you have another DE to try (Gnome would be best) to help narrow it down?

I've been using /usr/lib/nvidia/bumblebee so far. I tried out mesa-diverted and got the same result. I've tried this by starting lxdm, by manually running startx to start xfce, and with sddm (kde), and all show the same behavior. If you think gdm would help I'll try that out, but I would rather not install all of gnome.

/usr/lib/nvidia/bumblebee is the right one (default) when having bumblebee, I wanted to see if removing all traces of nvidia from the path helped.

It is really strange that X is affected by bumblebee when not running through it. Can you get to another TTY when the screen is frozen?

Don't bother with GDM for now if it's a hassle, was just trying to narrow it down. I'll install xfce on my sid partition and see what happens.

I think this is an issue specific to my hardware setup (as discrete graphics cannot be forced on, optimus must be used). When I say 'the screen is frozen', the TTY I am in (I'm manually starting a display manager) stops responding (the cursor stops blinking). I can't switch to another TTY. Even the keyboard caps lock/numlock lights no longer change when I press them, and the SysRq keys no longer work either. The system has to be force powered off.

I just double checked, but ssh sessions freeze too when this occurs.

A kernel hard-lock then, that's a pain. Have you tried nouveau?

Maybe nouveau is already loaded and causes the hang because something doesn't work and Xorg freezes due to a messed up modesetting DDX?

With the bumblebee-nvidia package nouveau is blacklisted, so it can't be loaded.

and I hope nvidia is also blacklisted, but Xorg freezes and that usually happens for a bad reason.

My guess is: X loads the nvidia DDX, which autoloads the nvidia kernel driver.

Yes, all the kernel modules are blacklisted. And the nvidia libraries are out of the path (hence my question earlier about update-alternatives).

I dealt with so many users where something was messed up, that I wouldn't rely on anything here. And that nvidia gets loaded also explains why turning the GPU off helps.

In fact, for that the nvidia libraries don't need to be in the path, because the nvidia DDX alone is enough, and different paths are used for it.

Anyhow, without logs it will be painful to debug this.

I've tried w/ nouveau and I still see the same issue (but with the workaround (which worked under nouveau) I started to see some weird behavior like some CPU cores sticking at 100%). Also when running optirun I got some permission denied errors with nouveau. I'm not sure if this will help though.

Just to clarify, simply turning the discrete video card ON with bbswitch before starting X11 fixes my issue (but it is a hassle to deal with every time). I'm not sure if there are any ways for me to get logs with this situation, but if there are let me know. When I run startx, the screen freezes before any errors come up, so I'm not sure if there is much I can do.

bumblebee blacklists all the nvidia/nouveau modules by default, and I have nvidia set in bumblebee.conf, so I think nouveau isn't conflicting? If there is any way to test this I would be happy to do so!

well, you don't use bumblebee with nouveau, and that support should be removed from bumblebee

@jgkamat what really would help would be the dmesg output. Maybe you can do "dmesg -w" through ssh while you start X and see if you get enough useful output this way.

If dmesg can write it, so will journalctl. If you haven't, enable persistent journal (create /var/log/journal) and then after the freeze reboot and check the previous boot journal with journalctl -b -1
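
The persistent journal setup described above can be sketched as follows. The function name and its optional prefix argument are made up for illustration; the prefix only exists so the function can be tried against a scratch directory and should be left empty on a real system.

```shell
#!/bin/sh
# Create the directory that lets journald keep logs across reboots.
# $1 is an optional, hypothetical path prefix used for testing only.
enable_persistent_journal() {
    root="${1:-}"
    mkdir -p "$root/var/log/journal"
}

# journald has to be restarted (or the machine rebooted once) before it
# starts writing to the new directory. After the freeze and the forced
# reboot, the kernel messages of the previous boot can be read with:
#   journalctl -k -b -1
```

Whether anything useful reaches the disk before a hard lock is a separate question; this only makes sure the journal is not discarded on reboot.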

@bluca His machine crashes completely. And on a crash, error logs usually can't be written anymore, because the kernel has stopped doing anything. dmesg -w could help us because it displays messages immediately (even before they get written to disk), but if the network dies too fast, he wouldn't get this either and would need to set up netconsole, although this also requires a working network.

@jgkamat maybe you have something inside pstore (/sys/fs/pstore)

check here for pstore information:

https://lwn.net/Articles/434821/
https://www.kernel.org/doc/Documentation/ABI/testing/pstore
https://www.kernel.org/doc/Documentation/ramoops.txt
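
A quick way to see whether pstore kept anything from the crashed boot, as a rough sketch. The function name is invented here, and the optional directory argument is only there so it can be tried on a scratch directory; by default it looks at /sys/fs/pstore as mentioned above.

```shell
#!/bin/sh
# List crash records left in pstore, if any.
# $1 defaults to the real mount point; another path is for testing only.
list_pstore_records() {
    dir="${1:-/sys/fs/pstore}"
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
        ls -l "$dir"
    else
        echo "no pstore records in $dir"
    fi
}
```

Run it as root right after rebooting from a freeze. An empty result does not prove much, since a hard lock may never give the kernel a chance to write a record.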

I tried setting up netconsole (and dmesg -w over ssh) and that doesn't seem to give me any logs before the freeze either. I don't have anything currently inside pstore as far as I can tell. I'm starting to think that this is some sort of race condition where bumblebee tries to turn on the nvidia driver before X starts, but X manages to start before the nvidia card comes online, leading to a lockout (or maybe my hardware can't deal with xorg starting without the nvidia card being on). Running modprobe nvidia before X also makes X start properly, as it also forces the nvidia card on.

@jgkamat could you add a xorg.conf file in /etc/X11 with this content and start X while the gpu is off? https://gist.github.com/karolherbst/1f1bdd1a3822df74097f

and check if your nvidia card also has the 01:00.0 address in lspci. If this works, that means something is loaded which makes your kernel unhappy.

Unfortunately, I'm still seeing the same issue with this config. Just to be sure, I created a new xorg.conf file (as the docs say that none should be present) with that config. My Nvidia card is on that bus. Here's the output of lspci, if that helps:

00:00.0 Host bridge: Intel Corporation Sky Lake Host Bridge/DRAM Registers (rev 07)
00:01.0 PCI bridge: Intel Corporation Sky Lake PCIe Controller (x16) (rev 07)
00:02.0 VGA compatible controller: Intel Corporation Skylake Integrated Graphics (rev 06)
00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
00:17.0 SATA controller: Intel Corporation Sunrise Point-H SATA Controller [AHCI mode] (rev 31)
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #3 (rev f1)
00:1c.3 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #4 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
00:1f.3 Audio device: Intel Corporation Sunrise Point-H HD Audio (rev 31)
00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
01:00.0 VGA compatible controller: NVIDIA Corporation GM206M [GeForce GTX 965M] (rev a1)
02:00.0 Network controller: Intel Corporation Wireless 8260 (rev 3a)
03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. Device 5287 (rev 01)
03:00.1 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 12)

Should that file have gone in /etc/bumblebee/xorg.conf.d instead?

I have a Clevo P650RA/P651RA (and also access to a Clevo P670RA/P671RA) which both have GTX 965M cards as well. This issue could be related to Bumblebee-Project/bbswitch#115

In my case an infinite loop would occur in ACPI. See Bumblebee-Project/bbswitch#115 (comment) for more details if you are interested.

I'm not seeing any issues with suspend to the best of my knowledge (the video card is off before/after a sleep, according to bbswitch, and that works fine for me). These issues could be related though.

I'm honestly pretty stoked at how well this performs (with this workaround in place), but I'm worried that a slight change could break it more. I'm happy to provide any more information if that would help!

EDIT: My laptop is a CLEVO N155RF (sager just rebrands them?)

I've been having the exact same issue with my MSI GE62. If I start X11 with the 960M turned off it will do a hard lock. But if I turn it on first and then start X11 it works fine.

I should also note that with Gnome GDM will start fine with the 960M turned off. But once I enter my password to log in to Gnome then it will do a hard lock. I presume this is because GDM is using Wayland?

@jkehler : I'm having the exact same behavior with the same model, except I have a 970M
Created a script that executes after GDM login that starts bumblebee. However, when manually stopping bumblebee service, half of the time it'll totally freeze the system, like it does when GDM attempts to login with discrete card off.

Actually I had just realized I had never actually tried starting Gnome with Wayland instead of X11 to see if it hard freezes. I just tried it now and when using Wayland it worked fine with the 960M turned off. So it definitely appears to just be an issue with X11.

I've had a couple of random freezes too. Most of the time, they are triggered by some 'low level' operations, or things involving the graphics card (eg: starting steam, modprobes, even lspci once). This is usually accompanied by some audio garbling for some reason (before hard faulting). If I enable the discrete graphics card via bbswitch then I never have this issue, however.

This is my xorg version, if that helps. I've never tried out wayland, and I don't have the time to test this right now, but if I ever do, I'll post an update here. Isn't wayland supposed to eliminate the need for bumblebee? I'm still fuzzy on that topic though...

X.Org X Server 1.18.3
Release Date: 2016-04-04
X Protocol Version 11, Revision 0
Build Operating System: Linux 3.16.0-4-amd64 x86_64 Debian
Current Operating System: Linux laythe 4.5.0-2-amd64 #1 SMP Debian 4.5.3-2 (2016-05-08) x86_64
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.5.0-2-amd64 root=UUID=50a03efa-01f3-4e94-92a9-d4ad458845f0 ro acpi_enforce_resources=lax
Build Date: 05 April 2016  07:00:43AM
xorg-server 2:1.18.3-1 (http://www.debian.org/support) 
Current version of pixman: 0.33.6
    Before reporting problems, check http://wiki.x.org
    to make sure that you have the latest version.

I think X is not much of an issue, but a trigger.

Can you switch to a TTY (Ctrl-Alt-F2), log in and try to power off/on the card manually using bbswitch? Repeat this twice to see if it makes a difference.

sudo tee /proc/acpi/bbswitch <<<OFF
sudo tee /proc/acpi/bbswitch <<<ON
sudo tee /proc/acpi/bbswitch <<<OFF
sudo tee /proc/acpi/bbswitch <<<ON

If that still does not hang, try this (exact output does not matter, only whether it hangs or not):

sudo lspci -vvvs 00:01:0
sudo lspci -vvvs 01:00:0

My guess is that trying to access some PCI configuration registers too fast results in failure. Why exactly this happens is something I have been trying for a week to figure out on a Clevo P651RA/GTX965M. Current key words: PCIe link training failure.

Hello @Lekensteyn
switching gpus manually causes no issue.
Both commands below do not hang either, though second one produces no output at all.

However, I've found that if I disable the discrete card at boot with bbswitch, the system won't boot properly: on loading gnome, it freezes; visual artifacts may appear in the console at the instant of the freeze, and nothing but the power button responds. All this while being on the integrated intel card.

Warp

@Lekensteyn I finally got around to trying what you had suggested above. Switching to a TTY and repeatedly turning the GPU on and off did not result in any sort of hard lock for me.

But when I ran your second set of commands the first one outputted the following.

01:00.0 3D controller: NVIDIA Corporation GM107M [GeForce GTX 960M] (rev ff) (prog-if ff)
    !!! Unknown header type 7f
    Kernel modules: nouveau, nvidia_drm, nvidia

The 2nd command didn't output anything. But then I ran the first command a 2nd time and it resulted in a hard-lock for me.

@jkehler Personally I boot with parameter rdblacklist=nouveau, and I don't have that issue. We don't have the exact same card model though.

Same problems here. It won't boot with discrete graphics OFF: it totally hangs, and running lspci freezes the system in certain situations.
This is a Skylake MSI with a 970M.

I'll check the laptop DSDT/SSDT later to try to find the _OFF/ methods of the nvidia pci.

Is there anything I can do to help with this?

I personally think this is an issue in bbswitch, rather than optimus, but I'm not sure...

I'm also free to test if anyone has any way to debug hard faults 😢

@Warpgamer If you disable nouveau and bbswitch your battery life will drain 2-3x as fast, the fan might spin more often and the heat increases. Writing OFF to bbswitch has no effect if nouveau or nvidia are loaded, check the dmesg output for such events.
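
To check quickly whether a GPU driver is loaded (in which case writing OFF to bbswitch is not expected to do anything, as described above), something along these lines could be used. The function name is invented for this sketch, and the optional file argument only exists so it can be tried on sample data instead of /proc/modules.

```shell
#!/bin/sh
# Print the names of loaded GPU kernel modules (nouveau or nvidia*) found
# in a /proc/modules-style file. $1 is for testing only.
loaded_gpu_modules() {
    modules_file="${1:-/proc/modules}"
    awk '$1 == "nouveau" || $1 ~ /^nvidia/ { print $1 }' "$modules_file"
}
```

If this prints anything, unload those modules first, then look at what bbswitch logged with "dmesg | grep bbswitch".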

@carlinux Can you report the full MSI model and acpidump? If you are affected by the same issue as the Clevo P6xxRxx models, then you can try booting with acpi_osi="!Windows 2015" to disable the faulty firmware code path.

@Lekensteyn
I'll try that, thanks.
The model is MSI GS40 6QE Phantom
And about the acpidump. I guess you're asking for the DSDT and the SSDT with the methods, right?
attached here:
DLS.zip

And as a workaround for having a working laptop..
I already successfully modified a DSDT and injected it with Clover in a Hackintosh installation to deactivate the nvidia card for good.
Attached:
DSDT_noNVidia.dsl.zip

As far as I know the same ACPI methods and fixes should work on a Linux machine, but I don't know how I could inject/execute them.
Is there a way to load my own DSDT on a Linux machine? I'll investigate it, but I'm asking here anyway.
Thanks

@carlinux Patching DSDT like that should not be needed for Linux. It is possible, but your kernel will be marked as tainted.
I found all ACPI tables in the BIOS from https://www.msi.com/Laptop/support/GS40-6QE-Phantom.html (E14A1IMS.10D) and matched those against your DSDT/SSDT files. The methods look like Bumblebee-Project/bbswitch#134.

Have you tried using nouveau instead of bbswitch? If the problem persists with nouveau, could you apply https://lekensteyn.nl/files/linux-v4.6-pcipm-nouveau-pm2.patch on top of Linux 4.6 and try nouveau again?

Same thing here:
1 - GDM + Wayland starts without any freeze
2 - Starting X with "tee /proc/acpi/bbswitch <<<ON" solves the problem
3 - Starting X without that command freezes with a hard lock;
4 - No logs in any output, just hard reset.
5 - Linux arch 4.6.3-1-ARCH #1 SMP PREEMPT Fri Jun 24 21:19:13 CEST 2016 x86_64 GNU/Linux
6 - SchenkerXMG P506 (clevo) = Nvidia GTX 970 + Intel Skylake
7 - Intel microcode loaded: revision 0x8a, date 06.04.2016

Thanks @Lekensteyn ! I have a Clevo P650RE6 with a 970m. Booting with acpi_osi="!Windows 2015" was the thing that fixed the freezes and hard-locks for me with bumblebee. After months of headaches now I'm able to use my optimus laptop without windows 10. My laptop has the latest bios.

I tried it and I've been playing Talos Principle with primusrun inside a Manjaro Live usb session (Manjaro comes with working bumblebee out of the box) without any issues.

@Zipristin That fixed it for me too! You are officially the best person on the internet! 😄

I don't really have any idea how this works, but it would be nice if this could somehow be worked around within bumblebee (but I don't have high hopes, because this is a kernel option). As of now, the ubuntu 16.04 live disk hard locks for me due to this (in default mode).

(as a side note, running nvidia-smi without optirun hard locked for me too, and with acpi_osi="!Windows 2015", that errors properly, so this looks like it will solve my intermittent hard locks as well).

@jgkamat Booting default ubuntu 16.04 live freezes for me if I boot without nouveau.modeset=0. My boot options are nouveau.modeset=0 acpi_osi=Linux acpi_osi="!Windows 2015"

I would also like to know how this acpi_osi option really works, but anyway it looks more like a BIOS/firmware bug in the laptop than a bumblebee bug.

I tried using the acpi_osi="!Windows 2015" option on my laptop and it did not fix the issue for me. I still get hard locks when starting X11 with the Nvidia turned OFF. I presume this fix only works for the Clevo laptops since mine is a MSI GE62 Skylake.

Also, here are some interesting lines I found in my boot log. I'm not sure if they are helpful/relevant for isolating the source of the problem.

Jul 17 15:54:11 arch kernel: ACPI: Video Device [GFX0] (multi-head: yes  rom: no  post: no)
Jul 17 15:54:11 arch kernel: input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/LNXVIDEO:00/input/input11
Jul 17 15:54:11 arch kernel: ACPI Exception: AE_NOT_FOUND, Evaluating _DOD (20160108/video-1241)
Jul 17 15:54:11 arch kernel: ACPI: Video Device [PEGP] (multi-head: no  rom: yes  post: no)
Jul 17 15:54:11 arch kernel: input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:12/LNXVIDEO:01/input/input12

...

Jul 17 15:54:12 arch kernel: bbswitch: version 0.8
Jul 17 15:54:12 arch kernel: bbswitch: Found integrated VGA device 0000:00:02.0: \_SB_.PCI0.GFX0
Jul 17 15:54:12 arch kernel: bbswitch: Found discrete VGA device 0000:01:00.0: \_SB_.PCI0.PEG0.PEGP
Jul 17 15:54:12 arch kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95)
Jul 17 15:54:12 arch kernel: bbswitch: detected an Optimus _DSM function
Jul 17 15:54:12 arch kernel: pci 0000:01:00.0: enabling device (0006 -> 0007)
Jul 17 15:54:12 arch kernel: bbswitch: Succesfully loaded. Discrete card 0000:01:00.0 is on
Jul 17 15:54:12 arch bumblebeed[565]: [    4.778685] [INFO]/usr/bin/bumblebeed 3.2.1 started
Jul 17 15:54:12 arch kernel: bbswitch: disabling discrete graphics
Jul 17 15:54:12 arch kernel: ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95)

@Zipristin The Ubuntu 16.04 kernel might be a bit too old for nouveau and your new hardware. The acpi_osi="!Windows 2015" line works around a firmware incompatibility with Linux (still investigating how to solve this).

@jkehler All messages look normal (the type mismatch for DSM can be ignored). Since you mentioned a GTX 960M and MSI GE62, I take it you are referring to the MSI GE62 Apache Pro (6th gen, GTX 960M). If you have not already, can you post the tar.gz following the instructions on https://bugs.launchpad.net/lpbugreporter/+bug/752542?

@Lekensteyn I've uploaded it to the launchpad page and I will also upload it here.

The exact model page is https://www.msi.com/Laptop/GE62-6QD-APACHE-PRO.html#hero-overview

Micro-Star_International_Co.,_Ltd.-GE62_6QD.tar.gz

@Lekensteyn @jkehler I also have a MSI ge62 with a 960M experiencing the exact same behaviour as jkehler. I have tried the same fixes with the same failed results as well and I get the same lockup behaviour running lspci twice (hard lock on 2nd time).

@Zipristin Thank you very much for your fix. It worked for me after I realized that the syntax should be different in Arch Linux. The system was refusing to build the grub configuration. My system is a Clevo, maybe it only fixes it for this line.
I would like to share my grub cmdline to show the syntax that worked for me:

GRUB_CMDLINE_LINUX_DEFAULT="acpi_osi="!Windows 2015" rcutree.rcu_idle_gp_delay=1 intel_iommu=on"

The last part is just for VT-X processors and virtualization. I also enabled the latest revision of Intel microcode.

Thank you very much,
All the best,

@jgkamat This issue gets a bit overloaded with different laptops... sorry for that. Can you also follow the instructions on https://bugs.launchpad.net/lpbugreporter/+bug/752542 for obtaining the required information?

@jkehler It looks like your MSI GE62 Apache Pro has a similar PGON function definition, except that it does not run into an infinite loop. You shouldn't be seeing AML_INFINITE_LOOP, but I expect that your card will stay off (causing lock ups in the nouveau and possibly nvidia drivers). Can you provide a full dmesg when this occurs? And unfortunately the Clevo workaround does not work for you, the other code could be triggered if you somehow force that Windows 2009 (Win7) is the highest reported value for acpi_osi. Possibly by acpi_osi="!Windows 2012" acpi_osi="!Windows 2013" acpi_osi="!Windows 2015" (disabling Win 8, 8.1 and 10).

Here is the file output, let me know if you need anything else!
Notebook-N15_17RF.tar.gz

Also regarding quotes, I used single quotes around double quotes: GRUB_CMDLINE_LINUX_DEFAULT='acpi_osi="!Windows 2015"'.
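
The quoting matters because /etc/default/grub is read by a shell, so the inner double quotes have to survive shell quoting. A small sketch showing that the single-quoted form keeps them. ESCAPED_FORM is a made-up variable that only exists here to compare against the backslash-escaped spelling.

```shell
#!/bin/sh
# Single quotes around the whole value keep the inner double quotes literal.
GRUB_CMDLINE_LINUX_DEFAULT='acpi_osi="!Windows 2015"'

# Same value written with escaped double quotes instead (illustration only).
# In an interactive bash session the "!" here may trigger history expansion;
# in a file read by sh it does not.
ESCAPED_FORM="acpi_osi=\"!Windows 2015\""

# Prints: acpi_osi="!Windows 2015"
printf '%s\n' "$GRUB_CMDLINE_LINUX_DEFAULT"
```

After editing /etc/default/grub the configuration still has to be regenerated, for example with update-grub on Debian or grub-mkconfig -o /boot/grub/grub.cfg on Arch (assuming the default grub.cfg location).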

@Lekensteyn I'm not exactly sure how I can provide you a full dmesg if the laptop is doing a hard lock.

However, I tried your suggestion of forcing acpi_osi to Windows 2009 only by using acpi_osi=! acpi_osi=Windows 2009 and I am no longer getting a hard lock when starting X11 with the Nvidia turned OFF.

Thanks a bunch for the suggestion! Is there any major caveats to this workaround though that you know of?

@jkehler This just worked for me as well! I am so happy! Also @sylvio-neto ignore what I said earlier about the ! in grub I was completely misunderstanding the documentation. (english not my first language). ! means remove things but without it adds them or something.

Thank you @Lekensteyn !!!

@jgkamat Based on your acpidump, I can confirm that the same issue exists on your Clevo N155RF laptop and that the workaround acpi_osi="!Windows 2015" will also work for you (as you reported before). I took the liberty to post it to the DSDT bug as well.

@jkehler Do you also have a hard lockup when logging into a console with the nouveau driver (not the nvidia blob)? Can you try to reproduce it without X as that touches so many things in the stack that a hard lockup is more likely to happen. So, switch to a console, modprobe nouveau, wait for at least five secs for the runtime PM to kick in. Then execute lspci -d10de: (which might cause a hang of the command, but executing, say, dmesg > dmesg.txt; sync should still be possible).
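
The console test described above could be scripted roughly like this. It is a sketch only: the run wrapper, the RUN variable and the function name are invented here. By default the commands are just printed, so nothing touches the hardware unless RUN=1 is set.

```shell
#!/bin/sh
# Print the command by default; execute it only when RUN=1 (needs root).
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Reproduce the lockup from a console, without X.
nouveau_lockup_test() {
    run modprobe nouveau
    run sleep 5                          # let runtime PM suspend the card
    run lspci -d 10de:                   # 10de = NVIDIA vendor ID; may hang
    run sh -c 'dmesg > dmesg.txt; sync'  # save the kernel log to disk
}
```

If lspci hangs, the last step will not be reached from the same shell, so the dmesg command would have to be run from a second console.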

As for side-effects, maybe there are some other code paths that are less efficient, but normally it should not be too bad. This is really a workaround until the root cause is found (currently comparing PCI config space dumps from Windows with Linux, hopefully that yields something).

I'm having the exact same issue on a different make and model, a Gigabyte P35W V5.

It's a similar situation with Skylake integrated graphics and nVidia Geforce 970M dedicated and if I turn the nvidia card off I get a hard-lock on login via GDM. I'm running Fedora 24 with the latest Kernel, 4.6.4.

I've tried excluding Windows 10 with acpi_osi="!Windows 2015" but I got the exact same lockup with the nvidia card OFF at login so I suspect it's a slightly different issue.

I've included the generated dump from the launchpad bug page in hope that it helps shine more light on this issue.

GIGABYTE-P35V5.zip

Oh hey @DewaldV, I was processing my mail queue in LIFO order :-p I replied to your Launchpad post, you have to use acpi_osi=! acpi_osi="Windows 2009" for your machine. Thanks for reporting this, it is really helpful to see affected machines from different manufacturers.

All three affected cases have one thing in common, they use AMI BIOSes:

  • Clevo P651RA (i7-6700HQ/GTX 965M, part of the P6xxRx family which are also affected). Workaround: acpi_osi="!Windows 2015"
  • MSI GE62 Apache Pro (i7-6700HQ/GTX 960M). Workaround: acpi_osi=! acpi_osi="Windows 2009"
  • MSI GS60 Ghost Pro (i7-6700/GTX 970M). Reported in #764 (comment). Workaround: acpi_osi=! acpi_osi="Windows 2009" info
  • Gigabyte P35V5 (i7-6700HQ/GTX 970M). Workaround: acpi_osi=! acpi_osi="Windows 2009"
  • Razer Blade 14" (2016) (i7-6700HQ/GTX 970M) (BIOS 5.11, 04/07/2016). Workaround: acpi_osi=! acpi_osi="Windows 2009"

(2016-08-26) Now there is also a report for a Dell laptop in Bumblebee-Project/bbswitch#137 (comment):

  • Dell Inspiron 7559 (i7-6700HQ/GTX 960M) (BIOS 1.1.3, 11/05/2015). Workaround: acpi_osi="!Windows 2015" (acpi_osi=! acpi_osi="Windows 2013" and acpi_osi=! acpi_osi="Windows 2009" should also work) info

(2016-10-20) Got a friend who has this problematic HP laptop:

  • HP ZBook Studio G3 (i7-6700HQ/Quadro M1000M) (BIOS N82 Ver. 01.07, 04/27/2016). Workaround: acpi_osi=! acpi_osi="Windows 2009" info
  • (added 2016-10-23) Asus X550VX (i7-6700HQ/GTX 950M). Reported in #764 (comment), workaround: acpi_osi=! acpi_osi="Windows 2009" info
  • (added 2016-11-01) Asus N501VW (i7-6700HQ/GTX 960M). Workaround: acpi_osi=! acpi_osi="Windows 2009" info
  • (added 2016-11-11) Asus GL552VW (i7-6700HQ/GTX 960M). Workaround: acpi_osi=! acpi_osi="Windows 2009" info

With no acpi_osi workaround available:

  • (added 2016-11-11) MSI GE72 2QE Apache Pro (i7-5950HQ or i7-5700HQ / GTX 965M). BIOS up to E1791IMS.113 is likely affected. info

Linux kernel bugreport: https://bugzilla.kernel.org/show_bug.cgi?id=156341
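
After rebooting with one of the workarounds listed above, it is worth checking that the parameter actually reached the kernel. A minimal sketch: the function name is made up, and quotes are ignored in the comparison on the assumption that the boot loader may place them differently than they were typed.

```shell
#!/bin/sh
# Succeed if the kernel command line contains the given parameter.
# Double quotes are stripped from both sides before comparing.
# $2 defaults to /proc/cmdline; another file can be passed for testing.
cmdline_has() {
    want=$(printf '%s' "$1" | tr -d '"')
    cmdline_file="${2:-/proc/cmdline}"
    tr -d '"' < "$cmdline_file" | grep -qF -- "$want"
}
```

For example, cmdline_has 'acpi_osi="Windows 2009"' && echo "workaround active" tells whether the Windows 2009 variant is in effect for the running kernel.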

@Lekensteyn That worked perfectly! Thanks for the quick response. :)

Looks like we have at least some common cause then. Let me know if I can help in any way in the future, I'd be happy to learn something new, hehe.

Hi,
I have the same issue on a Razer Blade 2016 with an Intel Core i7–6700HQ and a NVIDIA GeForce GTX 970M. I am running Arch linux with the latest 4.7 kernel and I am only using bbswitch without the bumblebee daemon to switch the discrete card off. Xorg hangs if the discrete card is switched off before starting it. If I use Wayland it works, but hangs later on during random actions... Right now I have to manually disable/enable the discrete card. Hope we can get this fixed soon.

EDIT 1: The latest 4.7 kernel fixes my issues with the open source nouveau driver. I removed the nouveau blacklist entry, stopped bbswitch from controlling the discrete card and loaded the nouveau driver during boot. Now nouveau takes care of the power management of the discrete card and switches it off after a short delay. Xorg does not freeze anymore, even if the discrete card is switched off before starting. Somehow the kernel does not like it if the discrete card is not managed by a kernel module...

EDIT 2: After a suspend action the nouveau driver mixes something up and the system freezes again. So this is not a possible solution...

I encountered the same problem on an HP ZBook Studio G3 running Arch Linux 4.7.0-1. A freeze occurs when the bumblebee daemon is started before xserver while bbswitch is installed.

Bumblebee automatically turns off the discrete graphics card if bbswitch is installed; this is normal and documented behaviour. Starting xserver with the graphics card turned off results in a freeze.

Two workarounds have worked for me:

  1. Starting bumblebee daemon after xserver.
  2. Changing the ACPI OS identification to Windows 2009 with the kernel parameters acpi_osi=! acpi_osi="Windows 2009". This caused brightness control to stop working for me; adding acpi_backlight=native fixed this.

I will test the workaround provided by @Lekensteyn and report back.
Edit: acpi_osi="!Windows 2015" did not work for me.
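The three acpi_osi forms used in this thread can be modelled roughly as follows. This is only a sketch, assuming the documented kernel semantics (acpi_osi=! drops the built-in vendor strings, acpi_osi="X" adds X, acpi_osi="!X" removes X). DEFAULT_OSI is an illustrative subset, not the kernel's real list, and the function names are made up for this example.

```python
import shlex
from typing import List, Set

# Illustrative subset of the strings the kernel reports as supported by
# default (assumption: the real list is longer and kernel-version dependent).
DEFAULT_OSI = {"Windows 2009", "Windows 2012", "Windows 2013", "Windows 2015"}


def acpi_osi_args(cmdline: str) -> List[str]:
    """Return the value of every acpi_osi= parameter, in command-line order."""
    values = []
    for token in shlex.split(cmdline):  # shlex keeps "Windows 2009" together
        key, sep, value = token.partition("=")
        if key == "acpi_osi" and sep:
            values.append(value)
    return values


def supported_osi(cmdline: str, defaults: Set[str] = DEFAULT_OSI) -> Set[str]:
    """Apply the acpi_osi parameters to a copy of the default string set.

    Only the three forms that appear in this thread are modelled.
    """
    supported = set(defaults)
    for value in acpi_osi_args(cmdline):
        if value == "!":
            supported.clear()                # acpi_osi=!
        elif value.startswith("!"):
            supported.discard(value[1:])     # acpi_osi="!Windows 2015"
        else:
            supported.add(value)             # acpi_osi="Windows 2009"
    return supported


if __name__ == "__main__":
    for line in ('acpi_osi=! acpi_osi="Windows 2009"', 'acpi_osi="!Windows 2015"'):
        print(line, "->", sorted(supported_osi(line)))
```

On a real system the input would be the contents of /proc/cmdline. The sketch shows why the two workarounds are not equivalent: the first leaves only "Windows 2009", while the second still answers yes to everything except "Windows 2015".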

@m4ng0squ4sh Please post and link your acpidump+lspci+dmidecode information per https://bugs.launchpad.net/lpbugreporter/+bug/752542 for further investigation. If the issue is unrelated to this one, you can try Linux 4.8-rc1 or newer to see if the PR3 improvements in nouveau/PCI help you.

@LahayeChris Same question, please post the details and optionally try 4.8 kernel with nouveau and no bbswitch.

@Lekensteyn Here is my tarball containing the requested information.

I will try the 4.8 kernel soon...

@m4ng0squ4sh Guess what, your laptop also has AMI BIOS. Try acpi_osi=! acpi_osi="Windows 2009" to work around the lockups (found in SSDT5). I have added your laptop to #764 (comment)

system-manufacturer   : Razer
system-product-name   : Blade
system-version        : 4.04
bios-vendor           : American Megatrends Inc.
bios-version          : 5.11
bios-release-date     : 04/07/2016

@Lekensteyn Thanks! The workaround fixes the lockups, but my touchpad does not work anymore. Any hints why acpi_osi=! causes this? Can you explain why this command line fixes the lockups? Would it be possible to fix this upstream? Where should I start digging to try to get this fixed?

Edit: I tried acpi_osi=! acpi_osi="Windows 2015" and my touchpad worked, but the system froze again.

Hi,

I've got an ASUS GL502VT laptop with an NVIDIA GTX 970M GPU and I'm on Debian Stretch. I have the same issue: I've installed the nvidia drivers and bumblebee, but when I try to log in with X11 the laptop freezes. The GPU fans also spin at max speed as soon as I install the nvidia drivers. If I log in to GNOME with Wayland I don't have this issue, and if I disable the bumblebee service before logging in with X11 it doesn't freeze either. I've tried the solutions posted here, but I'm still experiencing this issue and I haven't found any other solution on the web. I'm not that experienced with Linux, so if there is other info that could help resolve this problem, I'll be happy to share it. Thanks.

@nicokeet Quote: Please post and link your acpidump+lspci+dmidecode information per https://bugs.launchpad.net/lpbugreporter/+bug/752542 for further investigation.

@Lekensteyn I got it fixed for the Razer Blade 14 2016 models. I read and tried a lot and finally fixed it by patching the ACPI DSDT firmware. Here is a link to the source.

However I still think we should fix this in the kernel. If you have any hints where to start digging, please let me know.

@m4ng0squ4sh I also agree that it should be fixed in the kernel, but it would need more debugging. I wrote some notes at https://github.com/Lekensteyn/acpi-stuff/blob/master/Clevo-P651RA/notes.txt

Probably should ask the linux-pci list for assistance.

@Lekensteyn Thanks! I'll do that. Hope this gets fixed soon. I'll report back if there are any news.

@m4ng0squ4sh Thanks, I posted my tar. Hope this gets fixed soon.

MSI GS60 6QE Ghost Pro (i7 6700HQ/GTX 970M/AMI bios) is also affected by this issue.
Running Arch Linux and the system works as expected after applying the workaround.
Thanks!

Didn't get a response so far from the Linux PCI mailing list. Mails get easily lost if not addressed directly to the sub-maintainer... I'll dig into the Linux kernel source myself as soon as I have some spare time.

I see your post is at https://www.spinics.net/lists/linux-pci/msg53521.html
Maybe people are still on holiday or do not know yet what pointers to give. I have meanwhile updated the post above; it seems that there is also a Dell machine affected.

@Lekensteyn might be... Thanks, I updated the affected machines list and sent it to the mailing list.

Hi everyone,

I am using the Dell laptop mentioned in #764 (comment). @Lekensteyn, I have checked the workaround you provided. Adding acpi_osi="!Windows 2015" worked and the Xorg server started properly. Bumblebee works and switches graphics cards properly when using optirun (after installing the version from git).

Just discussing this with some devs from the mailing list and we discovered some more details...

Steps to reproduce this bug:

  1. Load nouveau.
  2. Wait for it to runtime suspend.
  3. Invoke 'lspci', this resumes the Nvidia PCI device via nouveau.
  4. lspci never returns; a few moments later an AML_INFINITE_LOOP is reported.
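To watch for step 2 without triggering step 3, the runtime PM state can be read from sysfs instead of calling lspci. A minimal sketch, assuming the card sits at 0000:01:00.0 (the address shown in the bbswitch output elsewhere in this thread) and that reading power/runtime_status does not itself wake the device; the helper names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def runtime_status_path(pci_address: str) -> Path:
    """Build the sysfs path of the runtime PM status attribute for a PCI device."""
    return Path("/sys/bus/pci/devices") / pci_address / "power" / "runtime_status"


def read_runtime_status(pci_address: str) -> Optional[str]:
    """Return e.g. 'active' or 'suspended', or None if the attribute is unreadable."""
    try:
        return runtime_status_path(pci_address).read_text().strip()
    except OSError:
        # Device not present, or not a Linux system with sysfs mounted.
        return None


if __name__ == "__main__":
    # Assumed address of the discrete GPU; adjust for your machine.
    print(read_runtime_status("0000:01:00.0"))
```

Polling this until it reports "suspended" and only then running lspci should reproduce the hang in a controlled way on an affected machine.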

I have a Dell 7559 laptop and was experiencing the same problem, but acpi_osi="!Windows 2015" solved it.

Thanks, everyone.

Fixed the freeze on starting an X GNOME session on my Clevo P651RP6-G (2016, GTX 1060, originally shipped with Windows 10) via the acpi_osi=! acpi_osi="Windows 2009" boot options.
Thanks everyone!! <3

EDIT: Could not change the display backlight brightness after removing all the options via acpi_osi=!, but this can be fixed by adding acpi_backlight=vendor.

I have the same issue on an ASUS GL552VW. I use Debian stretch, kernel 4.6.0.

The hack acpi_osi=! acpi_osi="Windows 2009" works for me, but it seems the Nvidia card isn't being used. Hope it helps!

00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)
    Subsystem: ASUSTeK Computer Inc. HD Graphics 530
    Kernel driver in use: i915
    Kernel modules: i915
--
01:00.0 3D controller: NVIDIA Corporation GM107M [GeForce GTX 960M] (rev ff)
    Kernel modules: nvidia

ASUSTeK_COMPUTER_INC.-GL552VW.tar.gz

Clevo P640RE (Nvidia GTX970M + Intel Skylake i7-6700HQ)

I was having a similar issue:

  1. lspci # everything is ok
  2. modprobe bbswitch
  3. echo OFF > /proc/acpi/bbswitch
  4. lspci # system freeze

I have added acpi_osi=! acpi_osi="Windows 2009" to boot parameters and the system does not freeze anymore when I run lspci.
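Since the dangerous step is touching the device while bbswitch reports OFF, a script can check the state first. A minimal sketch, assuming /proc/acpi/bbswitch contains a single line of the form "0000:01:00.0 OFF" as quoted elsewhere in this thread; the function names are made up for illustration.

```python
from typing import Optional, Tuple


def parse_bbswitch(text: str) -> Optional[Tuple[str, str]]:
    """Split a bbswitch status line into (pci_address, state), or None if malformed."""
    parts = text.split()
    if len(parts) == 2 and parts[1] in ("ON", "OFF"):
        return parts[0], parts[1]
    return None


def card_is_off(text: str) -> bool:
    """True only when the status line clearly reports the card as OFF."""
    parsed = parse_bbswitch(text)
    return parsed is not None and parsed[1] == "OFF"


if __name__ == "__main__":
    status = "0000:01:00.0 OFF\n"  # sample text in the format shown in this thread
    if card_is_off(status):
        print("card is off: avoid lspci / starting X on affected machines")
```

On a real system the text would come from reading /proc/acpi/bbswitch, which only exists while the bbswitch module is loaded.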

@greggy Your video card seems off based on rev ff, if you are happy enough with Intel graphics, do not bother loading the proprietary nvidia driver, it will break stuff.

@vrobolab acpi_osi="!Windows 2015" should already be sufficient for your model.
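The "rev ff" hint can be checked mechanically. A small sketch, assuming the usual reading that a powered-down PCI device returns all ones for config space reads, so lspci prints its revision as ff; the helper names are hypothetical and the sample lines are the two lspci lines quoted in this thread.

```python
import re
from typing import Optional

# Matches the trailing "(rev xx)" that lspci appends to a device line.
_REV = re.compile(r"\(rev ([0-9a-fA-F]{2})\)\s*$")


def lspci_revision(line: str) -> Optional[str]:
    """Return the two-digit hex revision from an lspci device line, if present."""
    match = _REV.search(line)
    return match.group(1).lower() if match else None


def looks_powered_off(line: str) -> bool:
    """Heuristic: a revision of ff suggests the device is not responding."""
    return lspci_revision(line) == "ff"


if __name__ == "__main__":
    off = "01:00.0 3D controller: NVIDIA Corporation GM107M [GeForce GTX 960M] (rev ff)"
    on = "01:00.0 3D controller: NVIDIA Corporation GM107M [GeForce GTX 960M] (rev a2)"
    print(looks_powered_off(off), looks_powered_off(on))
```

This is only a heuristic for reading saved lspci output; running lspci itself on an affected machine with the card off is exactly what triggers the freeze described above.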

@Lekensteyn if I run something like optirun glxgears -info the output changes to:

01:00.0 3D controller: NVIDIA Corporation GM107M [GeForce GTX 960M] (rev a2)
    Subsystem: ASUSTeK Computer Inc. GM107M [GeForce GTX 960M]
    Flags: bus master, fast devsel, latency 0, IRQ 133
    Memory at de000000 (32-bit, non-prefetchable) [size=16M]
    Memory at c0000000 (64-bit, prefetchable) [size=256M]
    Memory at d0000000 (64-bit, prefetchable) [size=32M]
    I/O ports at e000 [size=128]
    [virtual] Expansion ROM at df000000 [disabled] [size=512K]
    Capabilities: [60] Power Management version 3
    Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [78] Express Endpoint, MSI 00
    Capabilities: [100] Virtual Channel
    Capabilities: [250] Latency Tolerance Reporting
    Capabilities: [258] L1 PM Substates
    Capabilities: [128] Power Budgeting <?>
    Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900] #19
    Kernel driver in use: nvidia
    Kernel modules: nvidia

Seems it works. Btw I tested my machine with the 4.8 kernel from the Ubuntu guys, and nothing changes. After startx the system freezes.

@greggy ssdt4.dsl from your tarball suggests that acpi_osi="!Windows 2015" is also sufficient. You should not run startx when the video card is off. Having nvidia loaded when the card is off is also upsetting your machine. Just don't do that.

@Lekensteyn do you mean don't start X server (startx) when

# cat /proc/acpi/bbswitch 
0000:01:00.0 OFF

? But what is the best way to do it?

@greggy Just run X with Intel graphics, you won't gain anything from running X on the Nvidia GPU since you cannot see it (unless you use PRIME). If you want to use PRIME, disable bbswitch (echo ON > /proc/acpi/bbswitch optionally followed by modprobe -r bbswitch).

(If you consider using nouveau, note that kernel 4.6 is too old to support your hardware. 4.7 is needed for nouveau to recognize your card, 4.8 has improved further on stability and was released yesterday.)

I have a Razer Blade 14" 2016 and I tried the acpi_osi=! acpi_osi="Windows 2009" workaround. It doesn't crash anymore, but the touchpad stops working. It also causes some other weird malfunctions, e.g. when I tried to compile vlc, I got a segfault when it was running vlc-cache-gen; it's probably related to the intel-microcode stuff being broken by the acpi_osi flag. So the workaround is not really usable for me...
Changing Windows 2009 to Windows 2015 doesn't work for me; with Windows 2015 it crashes. I have Ubuntu 16.10 with kernel 4.8.

I checked what happens if I have only bbswitch installed, without bumblebee, nouveau and nvidia drivers.
After turning off the card with bbswitch and running glxinfo two or three times in quick succession, the whole system dies; so it seems like it's bbswitch's issue...

Btw I would really like to try @m4ng0squ4sh's solution: https://github.com/m4ng0squ4sh/razer_blade_14_2016_acpi_dsdt, but I can't get it to work with the usual grub. Any help would be appreciated.

Asus GL552VW (Nvidia GTX960M, Intel Skylake i7-6700HQ), BIOS - American Megatrends, Fedora 24, kernel 4.7.5-200.fc24.x86_64
acpi_osi="!Windows 2015" did not help, but acpi_osi=! acpi_osi="Windows 2009" did the trick, thank you.
ASUSTeK_COMPUTER_INC.-GL552VW.tar.gz

@avico I have the same laptop and behavior. I think we need to patch the DSDT using link. The only problem is that I can't find a DSDT editor to apply the patches automatically, and doing it manually I might miss something. Maybe we should cooperate?

@lemourin Here is a link which shows an example with Grub. The DSDT firmware fix has a small problem with screen brightness... I have to fix this soon.

Edit: Why do you use Grub on the Razer Blade?

@m4ng0squ4sh I'm using Grub2 because I'm using Ubuntu 16.10.
According to this bug report: http://savannah.gnu.org/bugs/?35238
Grub2 doesn't have support for multiple initrd images ;_;
Guess I'll have to install arch / use grub legacy ;d

Edit: installed systemd-boot under ubuntu and managed to load the acpi_override.img file.
Unfortunately, my Razer can't boot with those fixes, here I have some photos of what happens, can't do a screenshot while system is stuck D;
image
The whole log repeats itself every ~20 seconds. I'm sure I have a Razer Blade 14" 2016, so I guess your fixes work only for Arch ;x. I also tried recompiling the linux kernel with your dsdt.hex file, but the result was exactly the same.
After next reboot I was able to retrieve kinda more valuable logs:
image

In case someone was wondering, I'm having my Razer plugged in to the external monitor.

@lemourin I just fixed the problem with the screen brightness. It is a completely new fix. Instead of overwriting the DSDT, the broken SSDT is now overwritten. Please check out the new repository. This fix is not Arch dependent. If this doesn't work, I'll need your SSDT files to compare our firmwares...

EDIT: Please post and link your acpidump+lspci+dmidecode information per https://bugs.launchpad.net/lpbugreporter/+bug/752542 for further investigation.

@m4ng0squ4sh great news, congrats! Which patches did you use, or did you do it manually?

@greggy Thanks. I didn't use a patch... I disassembled the firmware, fixed it and recompiled it ^^

@m4ng0squ4sh could you please share diffs for both changes?

@m4ng0squ4sh With the new fix my Razer can't find my root partition and then shows me (initramfs) shell prompt. At least it doesn't hang xd. I uploaded acpidumps here: https://bugs.launchpad.net/lpbugreporter/+bug/752542/comments/797

@greggy Please have a look at my source repository. The original and the patched files are present. The fix is quite simple and it's all about replacing OSYS with a constant value. Just dig into your DSDT and SSDT files and start searching for OSYS. You also have to find the SSDT file which is responsible for the discrete GPU stuff...

@lemourin Could you post your systemd-boot entry?

@Lekensteyn Do you have any hint, why acpidump and acpixtract create 13 SSDT files, although the system lists only 10 in /sys/firmware/acpi/tables?

@m4ng0squ4sh systemd-boot's ubuntu.conf:

title Ubuntu
linux /vmlinuz
initrd /razer_acpi_fix.img
initrd /initrd.img
options root=/dev/nvme0n1p6 rw

I had to copy vmlinuz, initrd.img and razer_acpi_fix.img files to /boot/efi because otherwise systemd-boot couldn't find them.
I bought my Razer in August this year.
dmidecode: dmidecode.txt

nvme0n1p6 is a little weird, but that's what the disk is called in my Razer.

@lemourin We have the exact same Razer Blade. Your boot problem is caused by something different. I recommend using partition UUIDs:

title Ubuntu
linux /vmlinuz
initrd /razer_acpi_fix.img
initrd /initrd.img
options root=PARTUUID=YOUR_UUID rw

You can extract the UUID with

sudo blkid /dev/nvme0n1p6
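Turning the blkid output into the options line can be sketched like this. The sample line is made up to show the usual blkid format: the UUID and TYPE values are placeholders, and the PARTUUID is the one that appears in the boot entry posted further down. The function names are hypothetical.

```python
import re
from typing import Dict


def parse_blkid_line(line: str) -> Dict[str, str]:
    """Turn one line of blkid output into a {TAG: value} dictionary."""
    return dict(re.findall(r'(\w+)="([^"]*)"', line))


def root_option(line: str) -> str:
    """Build the systemd-boot 'options' value; raises KeyError without a PARTUUID."""
    tags = parse_blkid_line(line)
    return "root=PARTUUID={} rw".format(tags["PARTUUID"])


if __name__ == "__main__":
    # Hypothetical sample output; only the PARTUUID comes from this thread.
    sample = ('/dev/nvme0n1p6: UUID="11111111-2222-3333-4444-555555555555" '
              'TYPE="ext4" PARTUUID="ac0d4bf4-6dfd-4a92-b795-1a1ac4d96722"')
    print("options " + root_option(sample))
```

Note that blkid reports both a filesystem UUID and a partition PARTUUID; the entry above uses root=PARTUUID=, so the PARTUUID value is the one to copy.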

@m4ng0squ4sh It doesn't help. It is worth mentioning that with this systemd-boot entry file:

title Ubuntu
linux /vmlinuz
initrd /initrd.img
options root=PARTUUID=ac0d4bf4-6dfd-4a92-b795-1a1ac4d96722 rw acpi_backlight=native acpi_osi=! acpi_osi="Windows 2009"

Ubuntu boots just fine (however, this workaround sucks because the touchpad doesn't work then). With razer_acpi_fix.img, the initramfs says that it can't find my disk, and when I type blkid in its shell, it shows an empty list...

nvme0n1p6 stands for NVMe SSD.
Maybe the required nvme kernel module does not load.

Possibly a wrong place to report but I have no exact information to open a kernel bug report.
For me, kernel version 4.8 has introduced major problems with graphics (i915 driver?) which were not resolved by 4.8.1.
First of all: a hard system freeze after some time of inactivity (10-30 minutes). It does not even respond to REISUB, so I have to power off by long-pressing the power button. It looks like some problem with power saving of the Intel graphics, because I sometimes saw error messages about drm or i915 in dmesg, but I cannot remember the exact text and did not manage to take a photo. I never saw that before updating to 4.8.
Second: graphic glitches. Hard to describe; it looks like "common" video tearing but appears randomly when I move the mouse cursor, open windows, etc. I think I saw something like this before the 4.8 update, but not so often and not so disturbing.
Third: lspci on 4.8 takes 0.3 to 1 second, while on 4.7 it takes ~0.01 second as it should. Not disturbing, but still worth pointing out.
So I've rolled back to the 4.7 tree.

@vrobolab 4.8 is running fine here with i915. What laptop do you have? There is a known issue with some laptops that require an acpi_osi workaround, see https://bugzilla.kernel.org/show_bug.cgi?id=156341

#764 (comment)
I've tried both acpi_osi=! acpi_osi="Windows 2009" and acpi_osi="!Windows 2015".

WOW I have pinned down the screen flickering/tearing issue - it's not about the Linux kernel version or nvidia/bbswitch but about the IOMMU! I forgot that I had enabled the IOMMU for 4.8 but had it disabled in 4.7; that's why I thought it was a kernel 4.8 problem.
There are 2 lines on the screen where the glitches appear when I put the mouse cursor there - at the 885th and 1019th pixel from the top. They appear with both the Linux 4.7.x and 4.8.x trees. When I disable the IOMMU at boot the problem disappears. I will upload videos soon.
Ultra strange I'd say.

@m4ng0squ4sh I went over your branch and got a "clean" diff ignoring whitespace. It seems you just hardcode 0x07D9 in place of OSYS. That is the same as using the workaround acpi_osi=! acpi_osi="Windows 2009". These lines look especially funny:

-                ElseIf ((OSYS > 0x07D9) && PEGS ())
+                ElseIf ((0x07D9 > 0x07D9) && PEGS ())
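Why the patched line looks funny can be spelled out with a few lines of arithmetic. This is a sketch, assuming OSYS simply holds the Windows "year" as an integer (0x07D9 is 2009 in decimal); treating 0x07DF (2015) as the value set for "Windows 2015" is an assumption consistent with that encoding, not something confirmed in this thread. PEGS is modelled as a plain boolean and the function names are made up.

```python
WINDOWS_2009 = 0x07D9  # constant from the diff; equals 2009 in decimal
WINDOWS_2015 = 0x07DF  # assumption: same year encoding, equals 2015


def original_condition(osys: int, pegs: bool) -> bool:
    """Model of: ElseIf ((OSYS > 0x07D9) && PEGS ())"""
    return osys > WINDOWS_2009 and pegs


def patched_condition(pegs: bool) -> bool:
    """Model of: ElseIf ((0x07D9 > 0x07D9) && PEGS ()), which can never be true."""
    return WINDOWS_2009 > WINDOWS_2009 and pegs


if __name__ == "__main__":
    print(WINDOWS_2009, WINDOWS_2015)                # the two years in decimal
    print(original_condition(WINDOWS_2015, True))    # branch reachable with a newer OSYS
    print(original_condition(WINDOWS_2009, True))    # not reachable when OSYS is 2009
    print(patched_condition(True))                   # dead branch after the patch
```

Under these assumptions, hardcoding the constant and booting with acpi_osi=! acpi_osi="Windows 2009" both keep the firmware out of the branch guarded by this condition; the patch just does so only in the tables it touches.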