Constant GPU hangs / "Resetting chip for stopped heartbeat on rcs0"

My computer does not come back from a black screen (this is not S3 sleep) because of a GPU hang.

I used to get this crash constantly on Manjaro. I assumed it was an Arch-related issue, so I went back to Zorin, which didn't crash. Now Zorin is crashing as well.

This bug has been a thorn in my side for months and nothing I've found online has helped so far.

I can still SSH to the machine and retrieve the following information:

$ uname -r
5.11.0-37-generic

$ lspci | grep -i --color 'vga\|3d\|2d'
00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller (rev 06)
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde XT [Radeon HD 7770/8760 / R7 250X]

$ journalctl --reverse --lines=5
Dec 01 16:04:11 REXTRON-Z kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
Dec 01 16:04:11 REXTRON-Z kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 7:0:00000000
Dec 01 16:04:08 REXTRON-Z kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
Dec 01 16:04:08 REXTRON-Z kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 7:0:00000000
Dec 01 16:04:05 REXTRON-Z kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0

Can you please navigate to /sys/class/drm/card0 and examine the contents of the error file there?
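
On a machine with two GPUs, the card number under /sys/class/drm depends on probe order, so the Intel device is not necessarily card0. A minimal sketch for checking which driver is bound to each card; the function name list_drm_cards and its optional root argument are only illustrative, not part of any tool:

```shell
#!/bin/sh
# Print each DRM card together with the kernel driver bound to it.
# The optional argument exists only so the function can be pointed at a
# copy of the tree; on a real system, call it with no argument.
list_drm_cards() {
    root="${1:-/sys/class/drm}"
    for card in "$root"/card[0-9]*; do
        # Skip connector entries such as card1-HDMI-A-1.
        case "${card##*/}" in
            *-*) continue ;;
        esac
        [ -e "$card/device/driver" ] || continue
        # The driver entry is a symlink; its target's last component is the driver name.
        driver=$(basename "$(readlink -f "$card/device/driver")")
        printf '%s %s\n' "${card##*/}" "$driver"
    done
}

list_drm_cards
```

The error file to read is the one under whichever card reports i915.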

Here you go:

$ cat /sys/class/drm/card1/error
GPU HANG: ecode 7:0:00000000
Kernel: 5.11.0-37-generic x86_64
Driver: 20201103
Time: 1638386153 s 811104 us
Boottime: 15087 s 919762 us
Uptime: 15087 s 35052 us
Capture: 4298664256 jiffies; 11730684 ms ago
Reset count: 0
Suspend count: 0
Platform: HASWELL
Subplatform: 0x0
PCI ID: 0x0412
PCI Revision: 0x06
PCI Subsystem: 1458:d000
IOMMU enabled?: 0
RPM wakelock: yes
PM suspended: no
GT awake: yes
EIR: 0x00000000
IER: 0xfc080421
GTIER[0]: 0x00401821
PGTBL_ER: 0x00000000
FORCEWAKE: 0x00000000
DERRMR: 0xffffffff
  fence[0] = 00000000
  fence[1] = 00000000
  fence[2] = 00000000
  fence[3] = 00000000
  fence[4] = 00000000
  fence[5] = 00000000
  fence[6] = 00000000
  fence[7] = 00000000
  fence[8] = 00000000
  fence[9] = 00000000
  fence[10] = 00000000
  fence[11] = 00000000
  fence[12] = 00000000
  fence[13] = 00000000
  fence[14] = 00000000
  fence[15] = 00000000
  fence[16] = 00000000
  fence[17] = 00000000
  fence[18] = 00000000
  fence[19] = 00000000
  fence[20] = 00000000
  fence[21] = 00000000
  fence[22] = 00000000
  fence[23] = 00000000
  fence[24] = 00000000
  fence[25] = 00000000
  fence[26] = 00000000
  fence[27] = 00000000
  fence[28] = 00000000
  fence[29] = 00000000
  fence[30] = 00000000
  fence[31] = 00000000
ERROR: 0x00000000
DONE_REG: 0xffffffff
ERR_INT: 0x00000000
available engines: 47
slice total: 1, mask=0001
subslice total: 2
slice0: 2 subslices, mask=00000003
EU total: 20
EU per subslice: 10
has slice power gating: no
has subslice power gating: no
has EU power gating: no
slice0: 2 subslice(s) (0x00000003):
	subslice0: 10 EUs (0x3ff)
	subslice1: 10 EUs (0x3ff)
Num Pipes: 3
PWR_WELL_CTL2: 00000000
Pipe [0]:
  Power: on
  SRC: 077f0437
  STAT: 00000000
Plane [0]:
  CNTR: 41000000
  STRIDE: 00005a00
  SURF: 00000000
  TILEOFF: 00000000
Cursor [0]:
  CNTR: 00000000
  POS: 00000000
  BASE: 00000000
Pipe [1]:
  Power: off
  SRC: 00000000
  STAT: 00000000
Plane [1]:
  CNTR: 00000000
  STRIDE: 00000000
  SURF: 00000000
  TILEOFF: 00000000
Cursor [1]:
  CNTR: 00000000
  POS: 00000000
  BASE: 00000000
Pipe [2]:
  Power: off
  SRC: 00000000
  STAT: 00000000
Plane [2]:
  CNTR: 00000000
  STRIDE: 00000000
  SURF: 00000000
  TILEOFF: 00000000
Cursor [2]:
  CNTR: 00000000
  POS: 00000000
  BASE: 00000000
CPU transcoder: A
  Power: off
  CONF: 00000000
  HTOTAL: 00000000
  HBLANK: 00000000
  HSYNC: 00000000
  VTOTAL: 00000000
  VBLANK: 00000000
  VSYNC: 00000000
CPU transcoder: A
  Power: off
  CONF: 00000000
  HTOTAL: 00000000
  HBLANK: 00000000
  HSYNC: 00000000
  VTOTAL: 00000000
  VBLANK: 00000000
  VSYNC: 00000000
CPU transcoder: A
  Power: off
  CONF: 00000000
  HTOTAL: 00000000
  HBLANK: 00000000
  HSYNC: 00000000
  VTOTAL: 00000000
  VBLANK: 00000000
  VSYNC: 00000000
CPU transcoder: EDP
  Power: on
  CONF: 00000000
  HTOTAL: 00000000
  HBLANK: 00000000
  HSYNC: 00000000
  VTOTAL: 00000000
  VBLANK: 00000000
  VSYNC: 00000000
gen: 7
gt: 2
iommu: disabled
memory-regions: 5
page-sizes: 1000
platform: HASWELL
ppgtt-size: 31
ppgtt-type: 1
dma_mask_size: 40
is_mobile: no
is_lp: no
require_force_probe: no
is_dgfx: no
has_64bit_reloc: no
gpu_reset_clobbers_display: no
has_reset_engine: no
has_fpga_dbg: yes
has_global_mocs: no
has_gt_uc: no
has_l3_dpf: yes
has_llc: yes
has_logical_ring_contexts: no
has_logical_ring_elsq: no
has_logical_ring_preemption: no
has_master_unit_irq: no
has_pooled_eu: no
has_rc6: yes
has_rc6p: no
has_rps: yes
has_runtime_pm: yes
has_snoop: no
has_coherent_ggtt: yes
unfenced_needs_alignment: no
hws_needs_physical: no
cursor_needs_physical: no
has_csr: no
has_ddi: yes
has_dp_mst: yes
has_dsb: no
has_dsc: no
has_fbc: yes
has_gmch: no
has_hdcp: no
has_hotplug: yes
has_hti: no
has_ipc: no
has_modular_fia: no
has_overlay: no
has_psr: yes
has_psr_hw_tracking: yes
overlay_needs_physical: no
supports_tv: no
rawclk rate: 125000 kHz
CS timestamp frequency: 12500000 Hz
Has logical contexts? yes
scheduler: 0
i915.vbt_firmware=(null)
i915.modeset=-1
i915.lvds_channel_mode=0
i915.panel_use_ssc=-1
i915.vbt_sdvo_panel_type=-1
i915.enable_dc=-1
i915.enable_fbc=0
i915.enable_psr=-1
i915.psr_safest_params=no
i915.enable_psr2_sel_fetch=no
i915.disable_power_well=1
i915.enable_ips=1
i915.invert_brightness=0
i915.enable_guc=0
i915.guc_log_level=-1
i915.guc_firmware_path=(null)
i915.huc_firmware_path=(null)
i915.dmc_firmware_path=(null)
i915.mmio_debug=0
i915.edp_vswing=0
i915.reset=3
i915.inject_probe_failure=0
i915.fastboot=-1
i915.enable_dpcd_backlight=-1
i915.force_probe=
i915.fake_lmem_start=0
i915.enable_hangcheck=yes
i915.load_detect_test=no
i915.force_reset_modeset_test=no
i915.error_capture=yes
i915.disable_display=no
i915.verbose_state_checks=yes
i915.nuclear_pageflip=no
i915.enable_dp_mst=yes
i915.enable_gvt=no
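
The full error state is long, so for a quick comparison between hangs it can help to pull out just the identifying lines. A rough sketch, assuming the dump has been saved to a file or is read straight from sysfs; error_summary is a made-up helper name:

```shell
#!/bin/sh
# Print only the header lines that identify a hang in an i915 error state.
# $1 is the path to the error file (for example /sys/class/drm/card1/error).
error_summary() {
    grep -E '^(GPU HANG|Kernel|Platform|PCI ID|Reset count):|^Active process' "$1"
}

# Usage (reading the sysfs file usually needs root):
#   sudo cat /sys/class/drm/card1/error > gpu-error.txt
#   error_summary gpu-error.txt
```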

Please download this firmware from my cloud:

Extract the file.
Using either a root-elevated file manager or the terminal, copy just the files (not the folder that contains them) to /lib/firmware/i915/
In a terminal, run:

sudo update-initramfs -u -k all

sudo apt update && sudo apt full-upgrade

Reboot and test...
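
For anyone doing the copy step from the terminal, here is a minimal sketch. The helper name copy_firmware_files and the example source path are assumptions; the destination defaults to the directory named in the steps above:

```shell
#!/bin/sh
# Copy only the regular files (not the containing folder or any subfolders)
# from an extracted firmware directory into the i915 firmware directory.
copy_firmware_files() {
    src="$1"
    dest="${2:-/lib/firmware/i915}"
    find "$src" -maxdepth 1 -type f -exec cp -- {} "$dest"/ \;
}

# Usage, from a root shell (the source path here is hypothetical):
#   copy_firmware_files /home/user/Downloads/extracted-firmware
```

Afterwards, continue with the update-initramfs and apt commands as listed.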

I followed your instructions and installed the firmware you provided; however, I got another GPU hang today while using my system. Thank you for your help so far, by the way. It is appreciated.

Troubleshooting info:
$ journalctl --reverse
Dec 03 10:23:21 REXTRON-Z kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 7:1:00dfffff, in Xorg [3074]
Dec 03 10:23:18 REXTRON-Z kernel: i915 0000:00:02.0: [drm] Xorg[3074] context reset due to GPU hang
Dec 03 10:23:18 REXTRON-Z kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
Dec 03 10:23:18 REXTRON-Z kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 7:1:00dfffff, in Xorg [3074]
Dec 03 10:23:17 REXTRON-Z NetworkManager[1921]: <info>  [1638548597.2404] device (wlp116s0): supplicant interface state: scanning -> inactive
Dec 03 10:23:16 REXTRON-Z kernel: i915 0000:00:02.0: [drm] Xorg[3074] context reset due to GPU hang
Dec 03 10:23:15 REXTRON-Z kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
Dec 03 10:23:15 REXTRON-Z kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 7:1:00dfffff, in Xorg [3074]
Dec 03 10:23:12 REXTRON-Z kernel: i915 0000:00:02.0: [drm] Xorg[3074] context reset due to GPU hang
Dec 03 10:23:12 REXTRON-Z NetworkManager[1921]: <info>  [1638548592.9052] device (wlp116s0): supplicant interface state: inactive -> scanning
Dec 03 10:23:12 REXTRON-Z kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
Dec 03 10:23:12 REXTRON-Z kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 7:1:00dfffff, in Xorg [3074]
Dec 03 10:23:09 REXTRON-Z kernel: i915 0000:00:02.0: [drm] Xorg[3074] context reset due to GPU hang
Dec 03 10:23:09 REXTRON-Z kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
Dec 03 10:23:09 REXTRON-Z kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 7:1:00dfffff, in Xorg [3074]
Dec 03 10:23:08 REXTRON-Z kernel: Asynchronous wait on fence 0000:00:02.0:gnome-shell[3256]:118f32 timed out (hint:intel_atomic_commit_ready [i915])
Dec 03 10:23:08 REXTRON-Z kernel: Asynchronous wait on fence 0000:00:02.0:gnome-shell[3256]:118f32 timed out (hint:intel_atomic_commit_ready [i915])
Dec 03 10:23:08 REXTRON-Z NetworkManager[1921]: <info>  [1638548588.2567] device (wlp116s0): supplicant interface state: scanning -> inactive
Dec 03 10:23:07 REXTRON-Z kernel: i915 0000:00:02.0: [drm] Xorg[3074] context reset due to GPU hang
Dec 03 10:23:06 REXTRON-Z kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
Dec 03 10:23:06 REXTRON-Z kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 7:1:00dfffff, in Xorg [3074]
Dec 03 10:23:03 REXTRON-Z kernel: i915 0000:00:02.0: [drm] Xorg[3074] context reset due to GPU hang
Dec 03 10:23:03 REXTRON-Z NetworkManager[1921]: <info>  [1638548583.9061] device (wlp116s0): supplicant interface state: inactive -> scanning
Dec 03 10:23:03 REXTRON-Z kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
Dec 03 10:23:03 REXTRON-Z kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 7:1:00dfffff, in Xorg [3074]
Dec 03 10:22:59 REXTRON-Z NetworkManager[1921]: <info>  [1638548579.2686] device (wlp116s0): supplicant interface state: scanning -> inactive
Dec 03 10:22:54 REXTRON-Z NetworkManager[1921]: <info>  [1638548574.9052] device (wlp116s0): supplicant interface state: inactive -> scanning
Dec 03 10:22:50 REXTRON-Z NetworkManager[1921]: <info>  [1638548570.2843] device (wlp116s0): supplicant interface state: scanning -> inactive
Dec 03 10:22:45 REXTRON-Z NetworkManager[1921]: <info>  [1638548565.9052] device (wlp116s0): supplicant interface state: inactive -> scanning
Dec 03 10:22:41 REXTRON-Z NetworkManager[1921]: <info>  [1638548561.2390] device (wlp116s0): supplicant interface state: scanning -> inactive
Dec 03 10:22:36 REXTRON-Z NetworkManager[1921]: <info>  [1638548556.9051] device (wlp116s0): supplicant interface state: inactive -> scanning
Dec 03 10:22:32 REXTRON-Z NetworkManager[1921]: <info>  [1638548552.2458] device (wlp116s0): supplicant interface state: scanning -> inactive
$ cat /sys/class/drm/card1/error

GPU HANG: ecode 7:1:00dfffff, in Xorg [3074]
Kernel: 5.11.0-37-generic x86_64
Driver: 20201103
Time: 1638548583 s 879016 us
Boottime: 6535 s 983559 us
Uptime: 6535 s 103193 us
Capture: 4296526272 jiffies; 317792 ms ago
Active process (on ring rcs0): Xorg [3074]
Reset count: 0
Suspend count: 0
Platform: HASWELL
Subplatform: 0x0
PCI ID: 0x0412
PCI Revision: 0x06
PCI Subsystem: 1458:d000
IOMMU enabled?: 0
RPM wakelock: yes
PM suspended: no
GT awake: yes
EIR: 0x00000000
IER: 0xfc080421
GTIER[0]: 0x00401821
PGTBL_ER: 0x00000000
FORCEWAKE: 0x00000000
DERRMR: 0xffffffff
  fence[0] = 00000000
  fence[1] = 00000000
  fence[2] = 00000000
  fence[3] = 00000000
  fence[4] = 00000000
  fence[5] = 00000000
  fence[6] = 00000000
  fence[7] = 00000000
  fence[8] = 00000000
  fence[9] = 00000000
  fence[10] = 00000000
  fence[11] = 00000000
  fence[12] = 3b340b300b0b001
  fence[13] = 95f50b3065cc001
  fence[14] = cf450b309f1c001
  fence[15] = 00000000
  fence[16] = 00000000
  fence[17] = 00000000
  fence[18] = 00000000
  fence[19] = 00000000
  fence[20] = 00000000
  fence[21] = 00000000
  fence[22] = 00000000
  fence[23] = 00000000
  fence[24] = 00000000
  fence[25] = 00000000
  fence[26] = 00000000
  fence[27] = 00000000
  fence[28] = 00000000
  fence[29] = 00000000
  fence[30] = 00000000
  fence[31] = 00000000
ERROR: 0x00000000
DONE_REG: 0xffffffff
ERR_INT: 0x00000000
rcs0 command stream:
  CCID:  0x7ffa9109
  START: 0x00301000
  HEAD:  0xbaa0097c [0x00000978]
  TAIL:  0x00001208 [0x00000a78, 0x00000a90]
  CTL:   0x00003001
  MODE:  0x00004000
  HWS:   0x7fffe000
  ACTHD: 0x00000000 baa0097c
  IPEIR: 0x00000000
  IPEHR: 0xff000000
  ESR:   0x00000001
  INSTDONE: 0xffdfffff
  SC_INSTDONE: 0xffffffff
  SAMPLER_INSTDONE[0][0]: 0xffffffff
  ROW_INSTDONE[0][0]: 0xffffffff
  batch: [0x00000000_13e2b000, 0x00000000_13e2c000]
  BBADDR: 0x00000000_13e2b32c
  BB_STATE: 0x00000020
  INSTPS: 0x80000101
  INSTPM: 0x00006280
  FADDR: 0x00000000 00301b40
  RC PSMI: 0x00000010
  FAULT_REG: 0x00000000
  GFX_MODE: 0x00002a00
  PP_DIR_BASE: 0x7fda0000
  hung: 1
  engine reset count: 0
  Active context: Xorg[3074] prio 0, guilty 0 active 0, runtime total 0ns, avg 0ns
rcs0 --- WA context = 0x00000000 7ffba000

# garbage text removed - edison

available engines: 47
slice total: 1, mask=0001
subslice total: 2
slice0: 2 subslices, mask=00000003
EU total: 20
EU per subslice: 10
has slice power gating: no
has subslice power gating: no
has EU power gating: no
slice0: 2 subslice(s) (0x00000003):
	subslice0: 10 EUs (0x3ff)
	subslice1: 10 EUs (0x3ff)
Num Pipes: 3
PWR_WELL_CTL2: c0000000
Pipe [0]:
  Power: on
  SRC: 077f0437
  STAT: 00000000
Plane [0]:
  CNTR: d9000400
  STRIDE: 00005a00
  SURF: 09f2b000
  TILEOFF: 00000000
Cursor [0]:
  CNTR: 00000000
  POS: 00000000
  BASE: 00000000
Pipe [1]:
  Power: on
  SRC: 077f0437
  STAT: 00000000
Plane [1]:
  CNTR: d9000400
  STRIDE: 00005a00
  SURF: 09f3a000
  TILEOFF: 00000000
Cursor [1]:
  CNTR: 00000000
  POS: 00000000
  BASE: 00000000
Pipe [2]:
  Power: on
  SRC: 00000000
  STAT: 00000000
Plane [2]:
  CNTR: 00000000
  STRIDE: 00000000
  SURF: 00000000
  TILEOFF: 00000000
Cursor [2]:
  CNTR: 00000000
  POS: 00000000
  BASE: 00000000
CPU transcoder: A
  Power: on
  CONF: c0000000
  HTOTAL: 0897077f
  HBLANK: 0897077f
  HSYNC: 080307d7
  VTOTAL: 04640437
  VBLANK: 04640437
  VSYNC: 0440043b
CPU transcoder: B
  Power: on
  CONF: c0000000
  HTOTAL: 0897077f
  HBLANK: 0897077f
  HSYNC: 080307d7
  VTOTAL: 04640437
  VBLANK: 04640437
  VSYNC: 0440043b
CPU transcoder: C
  Power: on
  CONF: 00000000
  HTOTAL: 00000000
  HBLANK: 00000000
  HSYNC: 00000000
  VTOTAL: 00000000
  VBLANK: 00000000
  VSYNC: 00000000
CPU transcoder: EDP
  Power: on
  CONF: 00000000
  HTOTAL: 00000000
  HBLANK: 00000000
  HSYNC: 00000000
  VTOTAL: 00000000
  VBLANK: 00000000
  VSYNC: 00000000
gen: 7
gt: 2
iommu: disabled
memory-regions: 5
page-sizes: 1000
platform: HASWELL
ppgtt-size: 31
ppgtt-type: 1
dma_mask_size: 40
is_mobile: no
is_lp: no
require_force_probe: no
is_dgfx: no
has_64bit_reloc: no
gpu_reset_clobbers_display: no
has_reset_engine: no
has_fpga_dbg: yes
has_global_mocs: no
has_gt_uc: no
has_l3_dpf: yes
has_llc: yes
has_logical_ring_contexts: no
has_logical_ring_elsq: no
has_logical_ring_preemption: no
has_master_unit_irq: no
has_pooled_eu: no
has_rc6: yes
has_rc6p: no
has_rps: yes
has_runtime_pm: yes
has_snoop: no
has_coherent_ggtt: yes
unfenced_needs_alignment: no
hws_needs_physical: no
cursor_needs_physical: no
has_csr: no
has_ddi: yes
has_dp_mst: yes
has_dsb: no
has_dsc: no
has_fbc: yes
has_gmch: no
has_hdcp: no
has_hotplug: yes
has_hti: no
has_ipc: no
has_modular_fia: no
has_overlay: no
has_psr: yes
has_psr_hw_tracking: yes
overlay_needs_physical: no
supports_tv: no
rawclk rate: 125000 kHz
CS timestamp frequency: 12500000 Hz
Has logical contexts? yes
scheduler: 0
i915.vbt_firmware=(null)
i915.modeset=-1
i915.lvds_channel_mode=0
i915.panel_use_ssc=-1
i915.vbt_sdvo_panel_type=-1
i915.enable_dc=-1
i915.enable_fbc=0
i915.enable_psr=-1
i915.psr_safest_params=no
i915.enable_psr2_sel_fetch=no
i915.disable_power_well=1
i915.enable_ips=1
i915.invert_brightness=0
i915.enable_guc=0
i915.guc_log_level=-1
i915.guc_firmware_path=(null)
i915.huc_firmware_path=(null)
i915.dmc_firmware_path=(null)
i915.mmio_debug=0
i915.edp_vswing=0
i915.reset=3
i915.inject_probe_failure=0
i915.fastboot=-1
i915.enable_dpcd_backlight=-1
i915.force_probe=
i915.fake_lmem_start=0
i915.enable_hangcheck=yes
i915.load_detect_test=no
i915.force_reset_modeset_test=no
i915.error_capture=yes
i915.disable_display=no
i915.verbose_state_checks=yes
i915.nuclear_pageflip=no
i915.enable_dp_mst=yes
i915.enable_gvt=no
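
The journal excerpt in this post is interleaved with NetworkManager scan messages. A small sketch for cutting it down to the i915-related kernel lines and counting the hangs; the two function names are invented for illustration, and both simply read journal text on stdin:

```shell
#!/bin/sh
# Keep only kernel lines that mention i915 (this also keeps the
# "Asynchronous wait on fence ... [i915]" lines).
i915_lines() {
    grep -E ' kernel: .*i915' || true
}

# Count how many hang reports appear in the input.
count_hangs() {
    grep -c 'GPU HANG: ecode' || true
}

# Usage:
#   journalctl --dmesg --boot --reverse | i915_lines
#   journalctl --dmesg --boot | count_hangs
```

journalctl --dmesg already restricts the output to kernel messages, so the grep is mostly useful when working from a saved log.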

If you are comfortable recompiling the kernel, you could use drm-tip.
https://cgit.freedesktop.org/drm-tip

Alternatively, you might try using a non-HWE kernel or even the 5.8 LTS kernel:

sudo apt install linux-headers-5.8.0-63-generic linux-modules-5.8.0-63-generic linux-modules-extra-5.8.0-63-generic linux-image-5.8.0-63-generic
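
Installing an older kernel does not by itself make the system boot it; on Ubuntu-based systems GRUB normally picks the newest installed kernel, so the older one usually has to be chosen from the "Advanced options" entry at boot. A quick sketch for confirming what is installed and what is actually running; installed_kernels is an illustrative helper, and it assumes kernel images are named vmlinuz-<version> under /boot:

```shell
#!/bin/sh
# List installed kernel versions, oldest first, based on the vmlinuz-* images.
# The optional argument exists only so another directory can be inspected.
installed_kernels() {
    for img in "${1:-/boot}"/vmlinuz-*; do
        [ -e "$img" ] || continue
        name=${img##*/}
        printf '%s\n' "${name#vmlinuz-}"
    done | sort -V
}

installed_kernels
# The kernel currently running, to compare against the list above:
uname -r
```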

To anyone reading this in the future: I never solved this problem. However, I did revert to kernel 4.4.297-1-MANJARO and it works, though that is on Manjaro. Perhaps an even older kernel would work on Zorin as well.
