Lots of system crashes :(

Hello,

First my setup:

  • Ryzen 1700X
  • MSI X370 Gaming Pro Carbon
  • 16 GB RAM, 3200 MHz
  • AMD Vega 56 FE
  • Dual boot of Windows 10 and Zorin OS 16.2, on two different SSDs

For a few months now, I have been experiencing a lot of horrible crashes, which make me lose my unsaved work :frowning:

The crash always happens when I'm watching YouTube on my second screen with Firefox, or sometimes with a YouTube Music video minimized to the taskbar.
For hours it works fine, and then suddenly the system becomes unresponsive, the GPU fan starts to blow like crazy, and then both screens display this: https://user-images.githubusercontent.com/89231731/178824679-b6c82dc8-10cd-4e04-90f2-eefa834290b3.png

Only the sound of the video keeps playing in the background, and the mouse cursor is still visible, but I can't do anything except a hard reset. None of the shortcuts for reaching a terminal work (Alt+F2, Ctrl+Alt+F2... nothing works, not even Ctrl+Alt+Del).

I've already disabled hardware acceleration in Firefox. As a result I can watch Netflix without crashes; before, with Firefox hardware acceleration ON, it was crashing just like YouTube.

I'm desperate, because I really enjoy Zorin and I don't use Windows anymore, but I can't keep losing work and time to these random crashes.

I'm not a Linux expert at all, but I have long experience with Windows. I suspect my issue is caused by the current Mesa drivers provided by Zorin: Mesa 21.2.6. I didn't have issues in the first months after adopting Zorin.
How can I install the latest Mesa driver, v22.3.4, and why isn't it pushed directly by Zorin OS updates?

I've seen a lot of tutorials about how to install it on Ubuntu, but I don't want to mess up my Zorin install. I've spent hours and days setting up my OS for multimedia and music production (and damn, I enjoy creating with free tools on Linux far more ;)).

Help appreciated.

Zorin OS relies on stability, not cutting edge drivers that are unstable.

Mesa driver v22.3.4 is the most cutting edge version available. It relies on system dependencies that are also cutting edge.
Given your specs, it is hard to say whether getting that latest driver would help you... But I suspect it won't. AMD's Vega is... faulty off the factory line.

Many of them often cannot handle clock speeds greater than about 1620Mhz. Many users set that as their maximum clocking speed in order to work around the issue.

Perhaps, and yet your troubleshooting steps and disabling Acceleration in Firefox were expert. :wink:
I believe the trouble you are having is the card - and trickier than a simple setting adjustment to solve.

Hello, thanks for the quick answer. I know Vega has a lot of issues... but maybe I should revert my drivers to a previous version then?
How can I use a newer or older version of the Mesa drivers without "breaking" the distro install?

I really had no issues at all for months last year. I have been facing the issue since a Mesa driver update was pushed by Zorin (but I can't remember when). The problem has only been there for 2 or 3 months, I think.

How can I set a maximum clock speed for my GPU? I know how to do it on Windows, because tons of tools allow voltage/frequency manipulation, but on Zorin??

If you cannot find the settings for the clock speed, then yes. Roll back to the previous drivers that still contain the clock settings.
This is another example of why it helps to be cautious about upgrading to the latest: the latest drivers often contain regressions or removal of features and user accessibility.

I am not sure of the steps - I have never used AMD graphics...

You should be able to get the previous or proper AMD Drivers here:
https://repo.radeon.com/amdgpu-install/22.40/ubuntu/focal/amdgpu-install_5.4.50401-1_all.deb

It is a self installing .deb package, you only need to run it to install. This should contain the settings you need; I do not know how to access them. We can try searching the web.

Have you also checked if you are using the hardware enablement stack kernel? (HWE)

sudo apt install linux-generic-hwe-20.04 && sudo apt install mesa-utils
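To see whether the HWE kernel is already the one running, a small check along these lines might help. It is only a sketch: it assumes that the original (non-HWE) kernel of Ubuntu 20.04, which Zorin OS 16 is based on, is the 5.4 series, so anything newer reported by `uname -r` is treated as HWE. The function name `is_newer_than_ga` is made up for this example.

```shell
# Succeeds (exit 0) if the given kernel release is newer than the 5.4 series.
# Assumption: 5.4 is the original Ubuntu 20.04 kernel, so newer means HWE.
is_newer_than_ga() {
    ver=${1%%-*}        # "5.15.0-58-generic" -> "5.15.0"
    major=${ver%%.*}    # "5.15.0" -> "5"
    rest=${ver#*.}      # "5.15.0" -> "15.0"
    minor=${rest%%.*}   # "15.0"   -> "15"
    [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -gt 4 ]; }
}

# Usage on the running system:
#   if is_newer_than_ga "$(uname -r)"; then
#       echo "already on an HWE kernel"
#   else
#       echo "still on the original 5.4 kernel"
#   fi
```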

https://www.reddit.com/r/Amd/comments/enktj3/potential_fix_to_people_with_vega_5664_crashes/

After setting my memory run at around 500mhz constantly and my core clock at 1337mhz, I have yet to have a black screen or any crash so far.

So after testing with Metro Exodus for about an hour or so, I feel like it's safe to say that my GPU for some reason cannot handle core clock speeds higher than 1622Mhz, although VRAM stays stable at 920Mhz. I find this really odd, but I believe I finally found a solution to the issue via a slightly lower clock speed with slightly more voltage than I would expect to need. Now temps stay steady at around 75 degrees and the core clock stays at about 1543Mhz. I have only seen about a 3 FPS drop in Metro Exodus, and I honestly don't mind if that's what it takes to not crash.

I'm running a vega 56, after much trouble with new world, I discovered it stopped when I turned off the 'surface format optimization' setting(in the games tab, new world settings, in the advanced category).

Ok so i seen so many with vega 56 and 64 having problems.. me also. so here is what i did that fixed all of this. i changed thermal paste..... i opened up the GPU just to check. and it was really crusty and bad paste on it.. this was my finaly try before Buying new Card.. but it works great now. no problem what so ever. i recommend changing the thermal paste even if it not an very old card... ...they put very cheap and bad paste on it.. its stupid.. i wasted so many months trying to figure this out...

This is for a Gigabyte Vega 56, but it may be worth checking whether a newer firmware update exists for your own GPU:
https://www.reddit.com/r/Amd/comments/g4asqi/psa_how_to_fix_your_unstable_gigabyte_vega_56/

Apparently, after you've changed your thermal paste, you should install the card, warm it up, power off and take it out while it's still hot, and alternately torque (don't just torque one screw then the next, successively tighten each of the screws a bit at a time) the mounting screws to get an even pull and thus even energy transfer from chip to cooler.
https://www.reddit.com/r/Amd/comments/a5ldj1/comment/ebor11f/

So, this technique brought my temps down from 105°C max to about 85°C max on the Hotspot. The most important parts are the application of thermal paste and the order of screwing in your bracket.

Not exactly sure if you have the AMD 1700x or AMD Pro 1700x:
Pro 1700x BIOS:
https://www.msi.com/Motherboard/X370-GAMING-PRO-CARBON/support

1700x BIOS:
https://download.msi.com/bos_exe/mb/7A32v10.zip

And I'm not sure exactly what video card you've got. You stated AMD Vega 56 FE, where I'm assuming FE = Frontier Edition:
https://www.amd.com/en/graphics/workstations-radeon-pro-vega-frontier-edition

The Vega FE has two BIOSs on it, keep one stock, use the other for testing. Some people have put a Vega64 BIOS on the Vega56 GPU. Doing so allows one to raise the voltage (apparently the Vega56 runs about 31 mV too low to be stable under heavy load).

@Mr_Magoo keep looking around and you can also find users asking on this forum before as well as on the AMD helpdesk with the same problem.
AMD's 56 is a fun one...

Yeah, those cards seem to be problems from the factory... mismatched chip height (on some cards) necessitating potting the whole thing so the cooler fits, buggy firmware, thermal paste that dries out and stops efficiently transferring energy from chip to cooler, irregular reporting of chip temperatures...

OP, if you do re-paste your GPU, I recommend diamond paste. Diamonds have an incredibly high heat transfer coefficient, making diamond paste more efficient at moving energy from chip to cooler. Back when I did custom builds for people, it's all I used. And even if the paste dries out, the diamonds in the paste are doing the transfer of energy, not the paste itself. Just remember, you want the paste thin and uniform. I used to polish the tops of CPUs (and of course, the bottom of the CPU cooler) before applying the paste, on an industrial flat lapping block, to get it as flat as humanly possible, then apply a layer of paste so thin you could see through it... just enough to fill in any surface irregularities. Ideally you'd want perfect chip-to-cooler contact, and if that were possible you wouldn't need any paste. It's not possible.

Oh! This card has a switch! It's used to switch between "Stock" mode and "Performance" mode. It switches between the two BIOSs.

The OP should check to see which position that switch is in, and if it's in "Performance" mode, try the "Stock" mode.

@Aravisian :

  • I don't use the official AMD Linux drivers; they have a reputation for being even worse than on Windows for gaming. So I'd like to use a previous version of the Mesa drivers.
  • mesa-utils is installed, but "linux-generic-hwe-20.04" is not. Is it safe to install? I mean, it's a kernel, so would I have to pick it manually at startup (like the realtime kernel I use for music production)? Even if it is what I think, I don't think it would help?

@Mr_Magoo: I have absolutely no issues with gaming. All my games work like a charm (native Linux, or with Proton). Most of the time, performance is even better than on Windows 10 (same hardware, same computer), thanks to Vulkan and Vega ;). I've never experienced a black screen, so it's not like the usual memory leak we could face on Windows. It's more a colorful screen with tons of 4px*8px rectangles, like in the link I provided in my first post.
The GPU was never repasted, and temps never go higher than 85°.
The mobo BIOS is up to date too.

And "FE" stands for Founders Edition: it's produced directly by AMD, with no custom OC or custom cooling system. I know a few people flash their Vega 56 with a Vega 64 BIOS, but I don't want to try this. It might cause even more trouble.

Lastly, you mention a switch between performance and stock mode. Where/what is that switch? Is it possible to manually force it into one mode?

So my final question is: what was the previous version of the Mesa drivers shipped by Zorin OS, and how do I install it in place of my current problematic version?

EDIT: more trouble:

  • the glxinfo command tells me I have Mesa 21.2.6
  • I've installed GPU-Viewer 2.0, a GUI to check GPU information... and it tells me I'm using the Mesa 22.3.4 drivers!! I'm lost, what the ... hell?

I've tried both BIOSes using the switch, and ran into the same problem.

From reading of others with the same problem, it appears to be largely fixed if you increase the voltage by 31 mV and slow the clock to below 1622 MHz, and by disabling 'surface format optimization'. As a last resort, if none of the above fixes it, you might be having a hotspot problem (where there is a spot on the cooler not properly contacting the chip) which isn't picked up by the thermistor because it's just a small spot (but it's still messing up the transistors in that spot, causing crashes)... in which case, try repasting.

I'm unsure how to go about changing the settings, as I don't have that card, but if we can find the configuration files, we should be able to figure it out.
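For reference, the amdgpu kernel driver exposes its clock states as plain files under sysfs, so capping the core clock may not need a configuration file at all. The sketch below rests on several assumptions: that the card shows up as `card0` (it can be `card1` on some systems), that the file `pp_dpm_sclk` lists states in the usual `7: 1630Mhz *` form, and that 1600 MHz is a sensible cap for this particular card. The helper name `states_at_or_below` is invented for this example. It only limits the clock; it does not touch the voltage.

```shell
# Print the indices of all core-clock states at or below a given MHz value.
# Reads text in the pp_dpm_sclk format ("7: 1630Mhz *") on stdin.
states_at_or_below() {
    awk -v max="$1" '
        {
            idx = $1; sub(/:/, "", idx)              # "7:"      -> "7"
            mhz = $2; sub(/[Mm][Hh][Zz]/, "", mhz)   # "1630Mhz" -> "1630"
            if (mhz + 0 <= max + 0)
                out = out (n++ ? " " : "") idx       # space-separated list
        }
        END { print out }
    '
}

# Usage on the real card (assumed to be card0; needs root):
#   dev=/sys/class/drm/card0/device
#   cat "$dev/pp_dpm_sclk"                           # current states, "*" marks the active one
#   keep=$(states_at_or_below 1600 < "$dev/pp_dpm_sclk")
#   echo manual  | sudo tee "$dev/power_dpm_force_performance_level"
#   echo "$keep" | sudo tee "$dev/pp_dpm_sclk"       # allow only the listed states
#   echo auto    | sudo tee "$dev/power_dpm_force_performance_level"   # undo
```

As far as I know the setting is lost at reboot, which makes it fairly safe to experiment with.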

The problem is getting worse and worse; today I had about 10 crashes without warning.

I was not watching videos, but half of the time the crash happened when I got Thunderbird notifications for incoming email.

Can someone give advice on how to downgrade the Mesa drivers please? I'd like to try the previous version for a while.

I do not believe that anyone can safely give such a guide. Mesa is heavily integrated and trying to downgrade it can cause worse failures. Even if it worked on an earlier version, it did so in combination with packages that have also since been upgraded to later versions.
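Before changing anything, it may help to find out when Mesa was actually upgraded, to see whether the date lines up with the first crashes. apt keeps a history log in which each transaction is a block of lines beginning with `Start-Date:`. The sketch below assumes that standard layout and the usual location `/var/log/apt/history.log`; the function name `mesa_upgrade_dates` is invented for this example.

```shell
# Print the Start-Date line of every apt transaction that mentions mesa.
# Reads apt history log text on stdin; transactions are blank-line separated.
mesa_upgrade_dates() {
    awk 'BEGIN { RS = ""; FS = "\n" }   # paragraph mode: one record per transaction
         /mesa/ { print $1 }'           # first line of the block is the Start-Date
}

# Usage:
#   mesa_upgrade_dates < /var/log/apt/history.log
#   zcat /var/log/apt/history.log.*.gz | mesa_upgrade_dates    # older, rotated logs
```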

Have you checked your system logs?

You may have another cause than just minor desktop crashes such as segfaults.

You can readily check whether there is an obvious hardware fault by running Zorin 16 from a LiveUSB and seeing whether it crashes at all. If it does not, it is likely your bare-metal install that has an issue, not the hardware.

Hello, which system logs should I check? Please tell me which files, and where I can find them.

I know perfectly well what to look for on Windows, but when I search for clear information about a Linux distro, everything is mixed between Ubuntu (and older versions), Arch distros, or pure Debian distros... Where should I look in Zorin?

With options and variety, so too can you find confusion.
You can readily limit your searches for "Ubuntu 20.04" for any search on Zorin OS 16 - and this should help exclude irrelevant results.
In other areas we are used to this... You would narrow your search to your Computer year make and model if looking for hardware for it.

There is a directory specifically for logging crash events:
/var/crash

But it can be helpful to check boot logs, as well. Perhaps something not initializing also leads toward something cascading to a crash.
And you are not on your own - keep in mind that you can pastebin.com your logs here on the forum for help reviewing them. They can be lengthy and vague.
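As a starting point for a GPU freeze like this one, the kernel messages from the boot that crashed are usually the most telling. The filter below is a rough sketch, not an exhaustive one: the keywords it looks for are my own guess at what a graphics driver failure tends to log, and the name `gpu_errors` is invented. It also assumes the journal is kept across reboots, which I believe is the default on Ubuntu 20.04 based systems.

```shell
# Keep only log lines that mention the graphics driver AND look like a failure.
# Reads log text on stdin.
gpu_errors() {
    grep -iE 'amdgpu|drm' | grep -iE 'error|timeout|reset|fault|hang'
}

# Usage right after rebooting from a crash:
#   journalctl -k -b -1 | gpu_errors     # kernel messages from the previous boot
#   journalctl -b -1 -p err              # everything logged at error level or worse
#   ls -l /var/crash                     # crash reports, if any were written
```

If `journalctl` shows nothing for the previous boot, it may need `sudo`, or the journal may not be persistent on that install.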

Hello, my crashes are gone.

What did I do? I saw that I had multiple versions of the Mesa drivers installed. Don't ask me how that happened, but I was confused by some tools giving one version number or another.

So I deleted everything and reinstalled only the basic version provided by the GNOME store. No more issues :crossed_fingers:
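For anyone who lands here with the same symptoms, a check for mixed Mesa versions could look roughly like this. It is a sketch under assumptions: the package names in the usage lines are the core Mesa driver packages as I know them on Ubuntu 20.04 based systems, and the helper name `distinct_versions` is invented. Packages such as mesa-utils carry their own, unrelated version numbers, which is why they are left out.

```shell
# Reduce "package version" lines on stdin to the distinct versions present.
# More than one line of output means the packages are not all at the same version.
distinct_versions() {
    awk '{ print $2 }' | sort -u
}

# Usage:
#   dpkg -l \
#     | awk '/^ii/ && $2 ~ /^(libgl1-mesa-dri|libglx-mesa0|libegl-mesa0|libglapi-mesa|mesa-vulkan-drivers)/ { print $2, $3 }' \
#     | distinct_versions
#   glxinfo | grep "OpenGL version"      # the version actually in use by the desktop
```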
