Sporadic shutdown adventure — day 1

Starting yesterday, my main workstation at home has begun to suffer from random power loss (abrupt system shutdown). The system “reboots” (powers back up) on its own due to my BIOS configuration having a power-on setting of “Last State”.

I wanted to document how I went about troubleshooting this issue and what I did to solve it. At the time of this writing I have no confirmation of what the problem part is, but over time I hope to figure it out.

The problem is directly related to GPU use, and more specifically to 3D games. memtestG80 and memtestCL, which test VRAM/GDDR RAM on video cards (and to some degree stress-test the GPU itself), run with flying colours for hours, yet within 1-15 minutes of loading any sort of 3D game the system abruptly loses power. memtest86 (for testing system RAM) also runs fine.

Very important things to note:

  • My flat has no air conditioning. Yesterday it was around 92-93F outside (it’s been like this all week). Lack of airflow in my flat means that during the early evening my thermostat reads 85-86F (really!).
  • The system is hooked up to a UPS. The UPS also has other (minor, low-power-draw) devices on it which do not lose power or go down. This means the UPS is not at fault.
  • The Windows kernel is not crashing. There are no blue-screens, memory mini-dumps, or evidence of driver crashes in the Event Log.
  • Voltages and temperatures for my MSI N560GTX-TI Twin Frozr II card are excellent; there’s no sign of any anomalies shortly before the system loses power.
  • The same goes for my system CPU: temperatures are higher than usual, but well within permitted parameters (e.g. I am not anywhere near TjunctionMax for my CPU). The same system has run fine for a few months now, barring bad RAM on my Gigabyte GTX 560 Ti card.
  • I cleaned out the entire PC using an extremely reliable duster and there was no difference (aside from some lower temperatures).
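
For anyone who wants to repeat the "no anomalies shortly before the power loss" check, here is a minimal Python sketch. It assumes the monitoring tool writes a periodic CSV log; the three-column format, the 30-second gap threshold, and the sample numbers are all made up for illustration and are not my real readings. The idea is that an abrupt power loss shows up as a hole in the timestamps, so the samples just before the hole are the ones worth inspecting.

```python
from datetime import datetime, timedelta

# Hypothetical log format: "ISO timestamp,gpu_temp_c,cpu_temp_c", one sample
# every 5 seconds. The values below are invented for illustration only.
SAMPLE_LOG = """\
2000-01-01T18:00:00,61,52
2000-01-01T18:00:05,62,53
2000-01-01T18:00:10,62,53
2000-01-01T18:03:40,41,38
2000-01-01T18:03:45,43,39
"""


def parse_log(text):
    """Turn the CSV text into a list of (timestamp, gpu_c, cpu_c) tuples."""
    samples = []
    for line in text.strip().splitlines():
        stamp, gpu, cpu = line.split(",")
        samples.append((datetime.fromisoformat(stamp), float(gpu), float(cpu)))
    return samples


def find_gaps(samples, max_gap=timedelta(seconds=30)):
    """Return (index of last sample before the gap, gap length) pairs.

    A gap longer than max_gap is treated as a likely power loss, since the
    logger stops writing the moment the machine goes down.
    """
    gaps = []
    for i in range(1, len(samples)):
        delta = samples[i][0] - samples[i - 1][0]
        if delta > max_gap:
            gaps.append((i - 1, delta))
    return gaps


def readings_before(samples, index, count=3):
    """Return up to `count` samples ending at `index` (inclusive)."""
    start = max(0, index - count + 1)
    return samples[start:index + 1]


samples = parse_log(SAMPLE_LOG)
for index, length in find_gaps(samples):
    print(f"Logging stopped for {length} after {samples[index][0]}")
    for stamp, gpu, cpu in readings_before(samples, index):
        print(f"  {stamp}  GPU {gpu:.0f}C  CPU {cpu:.0f}C")
```

If the last few samples before each gap look like every other sample taken under load, that points away from a thermal trip and toward something that cuts power without warning.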

With this information, I’m under the impression I have a piece of hardware (transistor, capacitor, who knows what) “somewhere” within the system that is flaking out intermittently; almost like a short. It’s probably power-related given the symptoms. If I changed the BIOS power-on setting away from “Last State”, the system would power down and stay off instead of coming back up.

The first piece of hardware I’m choosing to replace is my PSU (an Antec TruePower New 650W). My logic is that the PSU may have some sort of internal damage due to the recent extreme heat. The replacement is a Corsair CMPSU-850AX. It should be here later this week.

I’m really hoping the issue turns out to be PSU-related, as the next part to replace is the mainboard. Sadly the Asus P5Q SE is no longer sold/manufactured, which means I’d need to go with another brand/model (probably a Supermicro C2SBX). The problem with switching mainboards is that I’d have to redo my entire Windows XP slipstreaming/driver integration process. Windows, making my life miserable as usual. :-)

Update: I attempted an “extreme” system stress test by running Prime95 in “blend” mode with 4 threads, Furmark 3D in “Xtreme burn-in” mode with PostFX enabled, and Winamp playing a series of MP3s — all simultaneously. This ran for about 30 full minutes. CPU core temps reached 58C, and GPU temps reached 70C… all with full stability. No shutdown.
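
The room temperatures earlier in this post are in Fahrenheit while the component temperatures are in Celsius, so here is a quick Python sketch that puts them on the same scale. It assumes (and this is only an assumption) that the room was at roughly the 85-86F evening thermostat reading during the stress test; the helper names are just for illustration.

```python
def f_to_c(deg_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32) * 5 / 9


def rise_over_ambient(component_c, ambient_f):
    """How far a component sits above room temperature, in degrees Celsius."""
    return component_c - f_to_c(ambient_f)


# Figures quoted in this post. Using the 86F thermostat reading as the
# ambient temperature during the test is an assumption, not a measurement.
AMBIENT_F = 86
CPU_PEAK_C = 58
GPU_PEAK_C = 70

print(f"Room: {f_to_c(AMBIENT_F):.1f}C")
print(f"CPU rise over ambient: {rise_over_ambient(CPU_PEAK_C, AMBIENT_F):.1f}C")
print(f"GPU rise over ambient: {rise_over_ambient(GPU_PEAK_C, AMBIENT_F):.1f}C")
```

With 86F working out to 30C, the CPU peaked about 28C above the room and the GPU about 40C above it. The same rise in a cooler room would give lower absolute temperatures, which is why the heat wave is worth keeping in mind even though the absolute numbers look fine.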

One thing to note is that Furmark3D doesn’t actually exercise every single DirectX or video card feature, so a stable Furmark3D run doesn’t guarantee stability in a game such as Rift or World of Warcraft. There’s also the possibility that (despite my drive looking healthy) something is wrong at the filesystem or disk level, such as corrupted files of some sort (possibly even DirectX itself).

Before the PSU replacement gets here I’m inclined to rebuild the system (full format, OS reinstall, software reinstall) to see if that fixes it. If not, that’s just one more piece I can rule out.