You should try disabling Lutris Runtime, or enable it if it is deactivated.
I have looked in journalctl but I can’t find anything regarding the crash. It looks very clean.
Disabling the Lutris Runtime unfortunately didn’t change anything
I hope these journalctl logs help, it’s all that happened while and shortly after the freeze
I checked it quickly; it seems to be something with your amdgpu driver. Do you have the latest and greatest with all the bells and whistles?
It’s important to note that amdgpu for me is not the same as amdgpu for you. There are different code paths for different GPU architectures. With Polaris10 it is 100% rock solid for me: no driver malfunctions (driver crashes, system lockups). The worst that happens is that I have to flip to another TTY to kill a crashed/frozen game if I can’t cut through with my keyboard. That is not so for all GPUs.
The driver itself is in the distro’s kernel (well, that + /lib/firmware/amdgpu) and the graphics APIs are in userspace (Mesa OpenGL + Vulkan, X11/Wayland etc.) Arch is no slouch for any of those.
Onboard R9 3900X + RX 6700 XT card on the bus. Power management nonsense, buffer timeouts in kernel log, failed GPU resets etc.
Does this manifest itself with all gaming?
Hey there!
My system is up to date as far as the official repos allow. Today the new mesa (22.1) was released, but the problem is the same with it.
D2R is the only game so far that causes any problems. I have just played 3 hours of Deep Rock Galactic, before I played Dying Light 2, Diablo 3, and Killing Floor 2 - no problems anywhere at all, everything runs fine. Only D2R behaves this way.
I also tried the same with X11, everything is the same there. D2R crashes the PC, everything else works flawlessly.
I also downgraded to 2 older mesa releases before (April and February) and tried at least 6 different Proton versions, no change.
Take a look at dmesg output on a fresh boot, and after gaming without the lockup happening. See how much of that is non-showstopping noise (some of those errors could be worked around by the driver and not actually part of the problem)
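If it helps to make that comparison less of an eyeball job, here is a minimal sketch in Python. It assumes you have saved two captures of the kernel log as text (for example one right after boot and one after a gaming session) and only looks for the kinds of amdgpu messages that have come up in this thread. The function names and the patterns are my own picks, not anything official, so extend them as needed.

```python
import re

# Patterns based on the kinds of messages seen in this thread (assumption:
# GPU-related lines mention "amdgpu" or start with a "[drm:" / "[drm]" tag).
GPU_PATTERNS = re.compile(r"amdgpu|\[drm[:\]]")
ERROR_HINTS = re.compile(r"\*ERROR\*|timeout|timed out|GPU reset", re.IGNORECASE)


def gpu_error_lines(log_text):
    """Return kernel log lines that mention the GPU driver and look like errors."""
    return [
        line
        for line in log_text.splitlines()
        if GPU_PATTERNS.search(line) and ERROR_HINTS.search(line)
    ]


def new_since_baseline(baseline_text, later_text):
    """Return GPU error lines in the later capture that the baseline lacks."""

    def message(line):
        # Drop a leading "[  123.456] " dmesg timestamp so that identical
        # messages logged at different times compare equal.
        return re.sub(r"^\[\s*\d+\.\d+\]\s*", "", line)

    seen = {message(line) for line in gpu_error_lines(baseline_text)}
    return [
        line for line in gpu_error_lines(later_text) if message(line) not in seen
    ]
```

Anything that shows up in both captures is more likely to be the non-showstopping noise; what only appears after a session is where I would look first.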
I played some games now, and none of the messages regarding the GPU and scheduling appeared there. It also looks clean after a fresh boot.
That’s good then, but when that happens, something bad is happening on your PCI Express bus. Also note the USB xhci resets.
May 24 11:23:22 archpc kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
May 24 11:23:22 archpc kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
May 24 11:23:22 archpc kernel: usb 1-1: reset high-speed USB device number 2 using xhci_hcd
May 24 11:23:26 archpc kernel: usb 1-1: reset high-speed USB device number 2 using xhci_hcd
May 24 11:23:27 archpc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=701219, emitted seq=701221
May 24 11:23:27 archpc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Main pid 4696 thread Main pid 4773
May 24 11:23:27 archpc kernel: amdgpu 0000:0d:00.0: amdgpu: GPU reset begin!
May 24 11:23:30 archpc kernel: usb 1-1: reset high-speed USB device number 2 using xhci_hcd
Or do those USB device resets happen at intervals irrespective of the gpu crash and reset? A malfunctioning device (or something winking in and out of connectivity) is not a good thing to have, if that’s the case.
P.S. Sorry, this form tends to mangle line breaks when you paste a log and use the backticks to embed it as “code”. What you see is whatever the form made of it. It’s easier to read in the pastebin in the previous post.
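One way to answer the question about the USB resets is to count how many of them fall close to an amdgpu error and how many happen on their own. Below is a rough Python sketch that does this for journalctl-style lines like the ones quoted above. The function names and the 10 second window are arbitrary choices of mine, and since the log lines carry no year, the year is a parameter I am assuming (it only matters for building timestamps).

```python
import re
from datetime import datetime

# Matches lines such as "May 24 11:23:22 archpc kernel: <message>".
LINE_RE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: (.*)$")


def parse_events(log_text, year=2022):
    """Extract (timestamp, kind) pairs for USB resets and amdgpu errors.

    The year is an assumption because the log lines do not include one.
    """
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        message = match.group(2)
        if "reset high-speed USB device" in message:
            events.append((stamp, "usb_reset"))
        elif "amdgpu" in message and "*ERROR*" in message:
            events.append((stamp, "gpu_error"))
    return events


def usb_resets_near_gpu_errors(events, window_seconds=10):
    """Return (near, far): USB resets within / outside the window of a GPU error."""
    gpu_times = [stamp for stamp, kind in events if kind == "gpu_error"]
    near = far = 0
    for stamp, kind in events:
        if kind != "usb_reset":
            continue
        if any(abs((stamp - g).total_seconds()) <= window_seconds for g in gpu_times):
            near += 1
        else:
            far += 1
    return near, far
```

If nearly all resets land in the "near" bucket, the USB device is more likely a bystander of the GPU hang. A large "far" count over a longer log would point at the device or its connection instead.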
I have identified the device as my USB mic. But it works as intended, I use it every day without any issues. While gaming, for work. Never had any problem with that. Could it be a problem?
As for the crash, I don’t know what I can try or where to begin, since this GPU crash only happens in this one case and nowhere else.
Have you tried this?
https://wiki.archlinux.org/title/AMDGPU#System_freeze_or_crash_when_gaming_on_Vega_cards
I have seen that the 5700xt is a bit problematic, is that your card?
As I wrote in the first post, I have a 6700 XT, which should be fine? But maybe it’s worth a try.
Another thing, I saw that disabling screen space reflection fixes a similar issue for Cyberpunk 2077.
I think you should try the lowest settings with all the post-processing disabled. Keep playing for a full session like that, then increase the settings step by step.
I think someone has a patch for the mesa lib, check the latest comments, you might want to try this.
What I was suggesting with the USB device is, that if it’s losing connectivity constantly like that there could be something wrong with it. It could just be a symptom of the GPU crash on the bus (that’s why I was asking). It could just be a poorly fitting USB connection if it’s doing that constantly.
Want to rule out Mesa/RADV? Consider testing this with AMDVLK. It’s painful on initial shader compiles (built in LLVM backend vs. RADV’s ACO) but may work for some games you’re having trouble with.
It coexists with Mesa/RADV, switchable using the vulkan loader, and can be forced with environment variables per invocation.
https://wiki.archlinux.org/title/Vulkan#Switching_between_AMD_drivers
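To illustrate the "per invocation" part, here is a small Python sketch that builds an environment with the Vulkan loader pointed at one driver's ICD manifest. Treat the details as assumptions to check on your own system: the manifest paths are what I would expect from the Arch packages, the `VK_ICD_FILENAMES` variable is the one I know the loader to honor (newer loaders also accept `VK_DRIVER_FILES`), and the `env_for_driver` helper is just a name I made up.

```python
import os

# Assumed ICD manifest locations on Arch; verify with:
#   ls /usr/share/vulkan/icd.d/
ICD_FILES = {
    "radv": "/usr/share/vulkan/icd.d/radeon_icd.x86_64.json",
    "amdvlk": "/usr/share/vulkan/icd.d/amd_icd64.json",
}


def env_for_driver(driver, base_env=None):
    """Return a copy of the environment that restricts the loader to one driver."""
    env = dict(os.environ if base_env is None else base_env)
    env["VK_ICD_FILENAMES"] = ICD_FILES[driver]
    return env


# Example use (not run here): check which driver the loader picks up.
#   import subprocess
#   subprocess.run(["vulkaninfo", "--summary"], env=env_for_driver("amdvlk"))
```

For a game started through Lutris, the same variable could go into the game's environment variable settings rather than a wrapper script.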
There’s news! Finally able to play.
In the process of finding out what it is, I did many things. Of course I reverted my undervolting of the GPU, that was one of my first guesses. So I put everything back to stock values but it still crashed, so I forgot about that. But then I deleted the profile from CoreCtrl completely, more by accident. I created a new one but left everything on “automatic” instead of “manual”. Then it worked! I don’t know what it is, because as I said I had put every value back to stock before and it didn’t work, but it had to be on “automatic”! As soon as I switch to manual, even without changing anything, it crashes.
I reproduced it by just creating a new profile and setting it to manual without changing any value…it crashed again. So eh…yeah. I don’t really understand it. I tested my undervolt from before for many hours in nonstop benchmarks and FurMark, and played a lot of games for many, many hours. No problems at all. But D2R crashes even if there is just a manual profile present with no value changed in it.
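For anyone who wants to see what the driver itself reports in each case: amdgpu exposes its current performance level through the sysfs file `power_dpm_force_performance_level`, with values such as `auto` and `manual`. The Python sketch below just reads that file. I am assuming, without having verified it, that the CoreCtrl "automatic"/"manual" choice ends up in this file. The card name (`card0`) is also an assumption and may differ on your system, and the helper name is mine.

```python
from pathlib import Path


def read_performance_level(card="card0", sysfs_root="/sys/class/drm"):
    """Return the amdgpu performance level string, or None if it cannot be read.

    "card0" is an assumption; the GPU may be card1 on some systems.
    """
    path = Path(sysfs_root) / card / "device" / "power_dpm_force_performance_level"
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    print(read_performance_level())
```

Reading it once with the automatic profile and once with the untouched manual profile would at least show whether the two states differ at the driver level.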
Well. I guess I can play now. Thank you all for your time and thoughts about it!
Hey, good news, happy gaming.
That is odd, how just that one game is triggering such a fault, due to such a BIOS bug. I had thought something in it was hitting a driver fault (and a failed gpu reset is always fatal, or soon to be… I’ve had cards/drivers that did that, even in Windows).
And you can reproduce it too. That’s fortunate. I’m glad you discovered it.
But weird things can happen. Years ago I bought an AMD R9 380 card. At the time I had to use the proprietary driver (“fglrx”) on Linux, but I did most of my gaming on Windows back then (and Wine was an abomination to me back then, it wasn’t until Proton that I started to change my mind). I had no end of trouble with that card in Windows, due to “clever” power management, dynamic clocking and stuff. I had a coincidental utility that just happened to completely override that and overclock to the specified profile. The “MSI Gaming App” of all things.
That settled down with catalyst drivers and amdgpu was viable for it on Linux, but a few years later I was having odd lockups (full, hard power off kind). It was happening in Linux, and Windows. One place it was never happening was while gaming. It was always doing desktop stuff, click a menu etc. It would happen a few times a day most days. Never reproducible (it wasn’t the things I was doing that were causing it).
Know what it was? Gkrellm polling sensors 10 times a second. I had everything on there, voltages, fans, temperatures (it87 chip), core temps, memory, disk i/o, and shit I don’t remember. I don’t know if it was the GPU sensor or what (the it87 motherboard driver in kernel was known to be starting to rot) or ultimately a BIOS bug. In Windows, I was running a free sensor app probably using similar code (lmsensors project etc.) and loadable driver and the same problem was happening there with similar circumstances (never while gaming etc.) The common denominator. It took me a long time to connect those dots.
I stopped using sensor displays, to this day. If I want to have a peek, I just type “sensors”.
Ah that’s also very interesting but I had no issues regarding sensor readings so far.
I also thought it was the driver or Wine/Proton. Yesterday it happened once more while there was a lot going on onscreen, but that is the only time since I wrote that it works now, and it hasn’t happened again today. I still think there is something that is not working so well. It may be something that will be fixed in future versions, but after playing a lot in the last few days I have noticed that the game is really sluggish from time to time. It definitely has huge FPS drops where I can’t really do anything for 1-2 seconds, and it scales with how many effects are shown at once or how many players are present at the same time. I don’t know how it would run on Windows with the same hardware, but I think my system should meet the requirements easily. These problems also persist through any detail settings, even the lowest. So I guess this is an optimization issue, but it could be the game itself. I’ve read that some people have problems sometimes even with high-end hardware.