We were in the process of upgrading our AD 2003 domain to AD 2012 R2, and as part of that, we were promoting some new virtual Windows 2012 R2 DCs into the 2003 domain. All the new DCs were virtual machines on VMware hosts with plenty of capacity, and they had been running for at least a week prior to promotion with no problems.
Within a week of promoting the new virtual DCs, one server in one site, and then another in a second site, spiked to 100% CPU. The thing flatlined, and you couldn’t even log on to the server (via RDP or the VMware console). The only fix was to reboot.
We noticed that the problem executable was svchost.exe, and the actual problem service was WinRM. The spikes happened in the middle of the night, around the time the backups were kicked off.
We don’t get monitoring alerts sent to us for these events (yes, we know that’s not good), so we only became aware of the problem when we got into the office in the morning and found a DC had been effectively dead for 8–10 hours. (Or more – the first one died over a weekend.)
Naturally we checked each VM instance thoroughly – the affected virtuals were on different hosts, in different cities, with different network trunking and different storage. The VM hosts themselves had tons of memory, storage and CPU capacity. No ballooning or similar things were going on.
There was nothing much in the Windows event logs, other than a few mystery warnings about WinRM, the CPU and so on. Nothing obvious (to us) about what had caused WinRM to spike: no backup failures, no permissions problems, nothing.
We did log a call with Microsoft Premier Support, who apparently could not help us with diagnostics after the event, but who wanted us to gather logs while the event was going on. Frankly, gathering logs when we couldn’t even log on to the box while it was dying seemed a bit tricky, and I had even more doubts about doing remote logging when WinRM was fully occupied!
But as it happened, we never got a recurrence of the problem, due to the aid of Dr Google.
Identifying the culprit
Just before logging the PSS call, I’d found this TechNet blog on WinRM causing “timeouts or delays”. Nothing about complete system freezes, but it seemed to best reflect our issue; we couldn’t find anything else more specific. And to add to the fun, we were on a short time-frame to remove our legacy DCs for our domain and forest functional level upgrades.
When you install a Windows 2012 R2 (or Windows 2008 SP2 (?) and up) server, it creates a local group, WinRMRemoteWMIUsers__ (note the two trailing underscores). As the group name implies, if you want to connect to WinRM remotely on the server, you must be a member of this group (Administrators are by default).
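On a standalone server, a quick way to confirm the group exists and see who is in it is the built-in `net localgroup` command (this is a general check, not something from the TechNet article):

```shell
REM List the local WinRM remote-WMI group and its members.
REM Errors with "The specified local group does not exist" if the group is missing.
net localgroup WinRMRemoteWMIUsers__
```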
Now, what happens to this local group when you promote a domain controller into the (legacy 2003) domain? That’s right, absolutely nothing – the new DC inherits the domain’s BUILTIN groups, and its own local groups are blown away. In a 2003 domain, there’s no BUILTIN group for WinRMRemoteWMIUsers__ … so the new DC no longer has that group. So – possibly – when the system state backup was taking place (which includes the registry and AD database), WinRM was being called, the remote WMI group didn’t exist, and the whole thing fell over.
Unfortunately, we couldn’t confirm any of this via the logs – there were no permissions failures that we could see, for example. Just the 100% spike within 5 minutes of certain backups commencing. It wasn’t every backup (which made it trickier to troubleshoot), and one thing I didn’t check was whether it was only the full backups.
I simply created a new group in the domain’s BUILTIN container called WinRMRemoteWMIUsers__ – do not forget the two underscores at the end – and the problem entirely went away. Never to be seen again.
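If you’d rather script it than click through ADUC, one possible way on a 2003-era DC is the built-in `dsadd` tool – a sketch only, and the `DC=example,DC=com` portion of the DN is a placeholder for your own domain:

```shell
REM Create a domain-local security group named WinRMRemoteWMIUsers__
REM in the domain's Builtin container (adjust the DN for your domain).
dsadd group "CN=WinRMRemoteWMIUsers__,CN=Builtin,DC=example,DC=com" -secgrp yes -scope l
```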
We didn’t bother confirming with Microsoft before putting in the group, or waiting for another server to fail so we could try the logging in the TechNet article (or the logs PSS subsequently suggested). Fortunately, it worked first time.
I meanly left the PSS case open with Microsoft for another 2 weeks in case we had a recurrence, along with some correspondence to see if we could confirm the missing group as the root cause (no), and whether we’d taken the right action for the symptoms (yes).
According to the article, LSASS can also cause these issues, which would make sense from a DC point of view. But what we saw was WinRM.