Sometime last year, my PC started failing to boot the first time I turned it on each day. It would not POST, and the CPU debug light would be lit. Since we were in the middle of a pandemic and PC component prices were at an all-time high due to chip shortages, I decided to work around the issue by hard-resetting the machine until it POSTed and booted into Windows.
This was fine for a few months. The hard resets were a minor inconvenience. That was until BSODs started happening. What made the BSODs tricky is that they would produce random stop codes (WHEA_UNCORRECTABLE_ERROR being one of them), would not produce dump files, would only happen during low load and, weirdest of all, would only happen once a day, the first time I got into Windows. After resetting, it would not happen again until the next day.
The hard resets for the failed POST were one thing. But the random BSODs, especially when I was in the middle of something, were something else. I could not ignore them. I just had to fix this one.
Ruling out components one by one
So I did some research and found some valuable leads.
The first possibility was a corrupted installation from a failed update. We'd been getting odd power outages in the middle of the night, and Windows loves to update itself around that same time. But a filesystem scan and even a fresh OS install ruled that out. It could also have been a storage hardware failure, since the machine boots from an SSD. But for a build that's just two years old, with a drive that's almost never full and at 95% health, I'd say that's highly unlikely.
Other potential points of failure were bad drivers, bad devices, or both. A lot of online articles point at webcams specifically, including ones manufactured by reputable brands. I wouldn't have been surprised, since I had recently installed a cheap USB microphone and webcam for work. But this one was also ruled out after I did a fresh install of Windows and ran with only a keyboard and mouse attached for a while.
Next up was defective RAM. This one's easy to test, as I only needed to run a memory tester like MemTest86. After running the tool twice for a total of about 6 hours, I got a pass on both runs. I also removed the RAM sticks, cleaned the contacts, and put them back in. Still nothing.
Although a defective CPU or GPU was highly unlikely at this point, given that I have never overclocked the machine and the BSODs only happened during low load, I ran some stress tests anyway. I downloaded and ran OCCT, Cinebench R23, and Unigine Heaven, thinking that if there was a part of the CPU or GPU that was defective, one of these tests should snag and crash the system. After several hours of testing, nothing came out of it.
Before buying replacement parts for the things I thought were broken, I decided to try reseating components. For those of you who aren't familiar with the term, it's the act of removing the part, maybe cleaning the electrical contacts, and putting the part back in. You know, just like how you pop a game cartridge out of your game console, blow away the dust, and pop the cartridge back in.
In the case of PC components, oxidation is a problem. Electrical contacts corrode over time and form a layer of oxidation. This layer can prevent conduction of electricity by acting as an insulator between the two contacts. Reseating works because the act of unslotting and reslotting scrapes off that oxide layer, restoring conductivity between the contacts. Cleaning the contacts with alcohol and a cotton swab does the same thing.
Losing time and patience, I didn't go methodical this time to find out which remaining parts were to blame. I went all in and just reseated everything at this point: the SSD, the RAM, the CPU, the GPU, the GPU riser cable, all the power cables, even the CMOS battery and front panel cabling just for funsies. Afterwards, I plugged everything back in and hoped that I wouldn't have to buy replacements.
Two weeks later...
There was one instance where the machine did not POST and lit the CPU debug light. But that was about it, just that one incident. I don't want to jinx it, but I think reseating the components fixed the issue, all without sacrificing an arm and a leg*. While I was surprised that reseating fixed it, I wasn't that surprised that it happened in the first place. It has been very humid the last few months, and that might have contributed just enough for the components to reach a tipping point.
So the next time you get frequent BSODs with your PC, try reseating components and cleaning electrical contacts. It might just save you from a very distressing situation.
Update #1: I may have just found the cause of the issue... sort of.
I forgot to mention one step during my debugging. I didn't put much thought into this step because it was a workaround, so I initially left it out of the article. But given the other pieces of debugging information, I might have just found the cause of the issue. This is why you present all the information during an investigation, kids. 😉
At some point during my debugging, I came across interesting articles about my CPU. Due to its revolutionary power efficiency, it was reported that it would become unstable during low load. It would even go under the PSU's idle threshold, making the PSU think it powered down, and causing the PSU to prematurely cut off power. Now this is highly unlikely, given that it's been two years since that CPU's launch. There have been plenty of optimizations on the OS, drivers, and even the BIOS for this CPU's behavior.
However, thinking that this was the cause of my issues, I tried one of the workarounds that was mentioned: put a permanent +25mV offset on the CPU. The idea is taken from the overclocking playbook where, if the CPU is unstable, you pump a bit more voltage into it to make it stable. The offset was also supposed to raise the voltage to just above idle voltage, preventing the PSU from prematurely cutting off power. While it didn't resolve the CPU debug light issue during POST, it did consistently stop the BSODs.
CPUs are connected to the motherboard via pins. Some of these pins serve the same purpose for redundancy and load balancing, while others have no redundancies and map 1:1 to the connected hardware. This is why some people walk away with broken pins (usually voltage lines) and luckily still have a working CPU, while others aren't so lucky.
In my case, it must have been corroded CPU pins. My best guess is that corrosion added resistance to some of the pins, and that the voltage offset was just enough to get over that resistance. However, with the information I have, I don't know for certain which kind of pin was affected. It could have been multiple voltage lines, which is highly unlikely, or a single data, debug, or sensor line, which is somewhat unrelated to the bumped voltage. That's still up in the air until I recall more information and get another light bulb moment.
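To put rough numbers on that guess, here is a minimal sketch using Ohm's law. It is a simplification: it treats the affected contact as a plain resistor in series with the load and ignores how the motherboard's voltage regulation actually behaves. The function names and the 0.5 A current are hypothetical values I picked for illustration, not measurements from my machine. Only the +25mV offset comes from the workaround above.

```python
# Illustrative only: every number except the 25 mV offset is an assumption.

def contact_voltage_drop(current_amps: float, contact_resistance_ohms: float) -> float:
    """Voltage lost across a resistive contact, by Ohm's law (V = I * R)."""
    return current_amps * contact_resistance_ohms


def resistance_covered_by_offset(offset_volts: float, current_amps: float) -> float:
    """Largest contact resistance whose voltage drop the offset would make up for."""
    if current_amps <= 0:
        raise ValueError("current must be positive")
    return offset_volts / current_amps


OFFSET_VOLTS = 0.025            # the +25mV offset from the workaround
ASSUMED_PIN_CURRENT_AMPS = 0.5  # hypothetical current through the affected contact

if __name__ == "__main__":
    max_ohms = resistance_covered_by_offset(OFFSET_VOLTS, ASSUMED_PIN_CURRENT_AMPS)
    print(f"Offset covers up to {max_ohms * 1000:.0f} milliohms of extra contact resistance")
```

Under those assumptions, the takeaway is only that a small offset can cancel out a small amount of added contact resistance, which is at least consistent with the workaround stopping the BSODs.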
Update #2: It might actually be the power cabling
The electrical contacts of very important components are colored gold because they're actually plated with gold. Gold does not corrode, so it does not form an oxide layer. This makes it the perfect element for electronics, especially for electrical contacts. This rules out the CPU, the RAM, the SSD, and the GPU entirely from an electrical contact perspective. They could still be functionally defective though.
Now remember that I mentioned that I ran out of patience and just reseated everything? If we look at the list of things I reseated, it includes the CPU, GPU, SSD, and RAM (whose contacts we know won't corrode), and the fans and panel I/O (which are unrelated). Well, guess what's left?
I reseated the 8-pin and the 24-pin motherboard power cables. Those aren't typically plated with gold, and are probably just some corrosion-prone metal like copper. I've seen those before, where the copper would form white to green patches of corrosion. It could also be that the cabling has been damaged. The PC is a small form-factor build after all, so the tight bends might have damaged the wiring.
That still doesn't explain why it would BSOD on idle but not on load. One theory is that during load, the robust cabling allows one or more of the redundant wires (e.g. one of the many 12V wires) to cover for the faulty line. But on idle, the load is so low that the voltage dips are more pronounced, dropping low enough to trigger BSODs.
These things become really hard to test just by ruling out possibilities. The true problem can only be uncovered with specialized tools, which I don't have and don't know how to use.
So far, still no BSODs. Fingers crossed.
* While all this happened, I had a sprained ankle. So a sacrificed leg technically? Who knows?