Yesterday I purchased “Diablo 2 Resurrected”. Unfortunately, sooner or later the game causes a hard system crash. No alt+tab, no switching TTY; I can only push the reset button. Sometimes it happens after a few seconds in-game, sometimes it takes longer, but it always happens.
It’s important to note that amdgpu for me is not the same as amdgpu for you: there are different code paths for different GPU architectures. With Polaris10 it is 100% rock solid for me, with no driver malfunctions (driver crashes, system lockups); the worst that happens is that I have to flip to another TTY to kill a crashed/frozen game if I can’t cut through with my keyboard. Not so for all GPUs.
The driver itself is in the distro’s kernel (well, that + /lib/firmware/amdgpu) and the graphics APIs are in userspace (Mesa OpenGL + Vulkan, X11/Wayland etc.) Arch is no slouch for any of those.
Onboard R9 3900X + RX 6700 XT card on the bus. Power management nonsense, buffer timeouts in kernel log, failed GPU resets etc.
My system is up to date as far as the official repos allow. Today the new Mesa (22.1) was released, but the problem is the same with it.
D2R is the only game so far that causes any problems. I have just played 3 hours of Deep Rock Galactic, before I played Dying Light 2, Diablo 3, and Killing Floor 2 - no problems anywhere at all, everything runs fine. Only D2R behaves this way.
I also tried the same with X11, everything is the same there. D2R crashes the PC, everything else works flawlessly.
I also downgraded to two older Mesa releases (April and February) and tried at least six different Proton versions; no change.
Take a look at dmesg output on a fresh boot, and after gaming without the lockup happening. See how much of that is non-showstopping noise (some of those errors could be worked around by the driver and not actually be part of the problem).
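One way to do that comparison without eyeballing two long logs is sketched below. It assumes you have saved the kernel log of a clean session and of a crashing session as text (for example from `journalctl -k -b` and `journalctl -k -b -1`). The function names `normalize` and `new_messages` are made up for this example, and the prefix pattern only covers the journal-style and plain dmesg timestamps, so treat it as a starting point rather than a finished tool.

```python
import re

# Matches a journal prefix like "May 24 11:23:22 archpc kernel: "
# or a plain dmesg prefix like "[   12.345678] ".
_PREFIX = re.compile(r"^(?:\w{3}\s+\d+\s[\d:]+\s\S+\skernel:\s|\[\s*\d+\.\d+\]\s)")
_NUMBERS = re.compile(r"\d+")


def normalize(line):
    """Drop the timestamp prefix and mask numbers so repeated messages compare equal."""
    return _NUMBERS.sub("N", _PREFIX.sub("", line.strip()))


def new_messages(baseline_log, crash_log, keyword="amdgpu"):
    """Return messages containing `keyword` that occur in crash_log but not in baseline_log."""
    baseline = {normalize(l) for l in baseline_log.splitlines() if keyword in l}
    found = []
    for line in crash_log.splitlines():
        if keyword not in line:
            continue
        message = normalize(line)
        if message not in baseline and message not in found:
            found.append(message)  # keep first-seen order, skip repeats
    return found
```

Anything this returns appeared only in the crashing session; messages that also show up in the clean session are more likely to be the harmless noise mentioned above.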
That’s good then, but when that happens, something bad is happening on your PCI Express bus. Also note the USB xhci resets.
May 24 11:23:22 archpc kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out! May 24 11:23:22 archpc kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out! May 24 11:23:22 archpc kernel: usb 1-1: reset high-speed USB device number 2 using xhci_hcd May 24 11:23:26 archpc kernel: usb 1-1: reset high-speed USB device number 2 using xhci_hcd May 24 11:23:27 archpc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=701219, emitted seq=701221 May 24 11:23:27 archpc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Main pid 4696 thread Main pid 4773 May 24 11:23:27 archpc kernel: amdgpu 0000:0d:00.0: amdgpu: GPU reset begin! May 24 11:23:30 archpc kernel: usb 1-1: reset high-speed USB device number 2 using xhci_hcd
Or do those USB device resets happen at intervals irrespective of the gpu crash and reset? A malfunctioning device (or something winking in and out of connectivity) is not a good thing to have, if that’s the case.
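To answer that question from the log itself, a rough sketch like the following could flag, for each USB reset, whether an amdgpu error sits within a few seconds of it. It assumes journal-style lines as in the excerpt above; the 10-second window and the function names are arbitrary choices for this example, and the year is a placeholder because the short journal format does not print one.

```python
import re
from datetime import datetime

# "May 24 11:23:22 archpc kernel: <message>"
_LINE = re.compile(r"^(\w{3}\s+\d+\s[\d:]+)\s\S+\skernel:\s(.*)$")


def parse(log):
    """Return a list of (timestamp, message) tuples for kernel lines in the log."""
    events = []
    for line in log.splitlines():
        m = _LINE.match(line.strip())
        if m:
            # Placeholder year: only the gaps between events matter here.
            ts = datetime.strptime("2000 " + m.group(1), "%Y %b %d %H:%M:%S")
            events.append((ts, m.group(2)))
    return events


def usb_resets_near_gpu_errors(log, window_s=10):
    """For each USB reset, report (time, True/False) for 'amdgpu error within window_s seconds'."""
    events = parse(log)
    gpu_times = [
        t for t, msg in events
        if "amdgpu" in msg and ("*ERROR*" in msg or "GPU reset" in msg)
    ]
    report = []
    for t, msg in events:
        if "reset high-speed USB device" in msg:
            near = any(abs((t - g).total_seconds()) <= window_s for g in gpu_times)
            report.append((t.strftime("%H:%M:%S"), near))
    return report
```

If most resets come back as `False`, the device is dropping out on its own schedule; if they are all `True`, the resets are more likely a side effect of the GPU hang.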
P.S. Sorry, the line breaks turn to garbage when you paste into this form and use the backticks to embed it as “code”. It dutifully shows exactly what the form does with it. It’s easier to read in the pastebin in the previous post.
I have identified the device as my USB mic. But it works as intended, I use it every day without any issues. While gaming, for work. Never had any problem with that. Could it be a problem?
As for the crash, I don’t know what I can try or where to begin, since this GPU crash only happens in this case and nowhere else.
Another thing: I saw that disabling screen space reflections fixes a similar issue in Cyberpunk 2077.
I think you should try the lowest settings with all the post-processing disabled. Play a full session like that, then gradually increase the settings.
What I was suggesting with the USB device is that if it’s losing connectivity constantly like that, there could be something wrong with it. It could just be a symptom of the GPU crash on the bus (that’s why I was asking). It could also just be a poorly fitting USB connection if it’s doing that constantly.
Want to rule out Mesa/RADV? Consider testing this with AMDVLK. It’s painful on initial shader compiles (built in LLVM backend vs. RADV’s ACO) but may work for some games you’re having trouble with.
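If both drivers are installed side by side, the Vulkan loader can be pointed at one of them through the `VK_ICD_FILENAMES` environment variable (newer loader versions also accept `VK_DRIVER_FILES`). The sketch below only lists the ICD manifests and builds an environment for a launch. The directory `/usr/share/vulkan/icd.d` and manifest names such as `amd_icd64.json` (AMDVLK) or `radeon_icd.x86_64.json` (RADV) are assumptions about a typical install, so check what `list_icds()` actually finds on your system.

```python
import glob
import os


def list_icds(icd_dir="/usr/share/vulkan/icd.d"):
    """Return the Vulkan ICD manifest files found in icd_dir (the path is an assumption)."""
    return sorted(glob.glob(os.path.join(icd_dir, "*.json")))


def env_for_icd(icd_path, base_env=None):
    """Copy the environment and force the Vulkan loader to use a single ICD manifest."""
    env = dict(os.environ if base_env is None else base_env)
    env["VK_ICD_FILENAMES"] = icd_path
    return env


if __name__ == "__main__":
    for path in list_icds():
        print(path)
```

The resulting dictionary can be passed as `env=` to `subprocess.run` for a test program such as `vulkaninfo`. For a Steam game, the equivalent is a launch option of the form `VK_ICD_FILENAMES=<path to manifest> %command%`.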
In the process of finding out what it is, I did many things. Of course I reverted my undervolting of the GPU; that was one of my first guesses. So I put everything back to stock values, but it still crashed, so I forgot about that. But then I deleted the profile from CoreCtrl completely, more by accident. I created a new one but left everything on “automatic” instead of “manual”. Then it worked! I don’t know what it is because, as I said, I had put every value back to stock before and it didn’t help, but it had to be on “Automatic”! As soon as I switch to manual, even without changing anything, it crashes.
I reproduced it by just creating a new profile and setting it to manual without changing any value…it crashed again. So eh…yeah. I don’t really understand it. I tested my undervolt from before for many hours in nonstop benchmarks and FurMark, and played a lot of games for many, many hours. No problems at all. But D2R crashes even if there is just a manual profile present with no value changed in it.
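For anyone who wants to check what the driver is actually set to, independent of what the CoreCtrl UI shows, amdgpu exposes the current performance level in sysfs as `power_dpm_force_performance_level` (typical values include `auto` and `manual`). The sketch below just reads that file. It is an assumption that CoreCtrl’s “manual” profile maps to this setting, and the card may be `card1` rather than `card0` on some systems.

```python
from pathlib import Path


def read_perf_level(card="card0", sysfs_root="/sys/class/drm"):
    """Return the amdgpu performance level for the given card, or None if unreadable."""
    path = Path(sysfs_root) / card / "device" / "power_dpm_force_performance_level"
    try:
        return path.read_text().strip()
    except OSError:
        # File missing (non-amdgpu card, wrong card index) or not readable.
        return None


if __name__ == "__main__":
    print(read_perf_level())
```

Running it once with the automatic profile active and once with the untouched manual profile would show whether the two profiles really differ at the driver level.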
Well. I guess I can play now. Thank you all for your time and thoughts about it!