Slow Domain Logon Analysis – Root Cause Identified – Solved!
"Root Cause Identified, Application Performance Optimized" are the words CIOs and end users enjoy hearing. Network slow, application slow: these trouble tickets take more time and are rarely diagnosed definitively down to the true root cause.
Transcript for this Video
So the first thing I’m going to talk about is an application logon that was taking a long time. As a matter of fact, it was into the minutes, and sometimes the user would get disconnected and not even be able to log on to the application. Well, the first thing we did was take a capture. We took a trace; everybody says, let’s take a trace. Of course you take a trace, and you look at this thing, and it’s like, where do I start? It’s a bunch of protocol goulash, and it’s the same thing for me: until I zero in on the TCP session, or on the particular application, or on the particular client or server, it’s protocol goulash. So we have to zero in on these things.
Now we found that there was a problem with slow domain logon, and so we kind of suspected that it might have something to do with the domain controllers. So I loaded a Qcheck agent on the server and on the workstation. Now Qcheck is a freebie from Ixia Communications, and you can go out on the web and find Qcheck out there. You can download it, put it on your server, put it on your workstation, and then you can do a throughput or a latency test to find out how it’s performing at layer four. Well, we did that, and we found that it was giving us very poor performance with other tasks.
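To make the idea of a layer-four throughput test concrete, here is a minimal sketch in Python. It is not Qcheck and does not reproduce how Qcheck works; the function names (`run_sink`, `measure_throughput`) and the 4 MB payload size are assumptions chosen for illustration. It runs sender and receiver on the loopback address so it is self-contained; for a real measurement the receiving side would have to run on the server under test.

```python
import socket
import threading
import time


def run_sink(server_sock, result):
    """Accept one connection and count every byte received until the peer closes."""
    conn, _ = server_sock.accept()
    total = 0
    with conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:  # empty read means the sender closed the connection
                break
            total += len(chunk)
    result["bytes"] = total


def measure_throughput(payload_size=4 * 1024 * 1024, host="127.0.0.1"):
    """Send payload_size bytes over TCP; return (bytes_received, seconds, mbit_per_s)."""
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.bind((host, 0))  # port 0: let the OS pick a free port
    server_sock.listen(1)
    port = server_sock.getsockname()[1]

    result = {}
    sink = threading.Thread(target=run_sink, args=(server_sock, result))
    sink.start()

    payload = b"\x00" * payload_size
    start = time.perf_counter()
    with socket.create_connection((host, port)) as client:
        client.sendall(payload)
    sink.join()  # wait until the receiver has drained everything
    elapsed = time.perf_counter() - start
    server_sock.close()

    mbit_per_s = (result["bytes"] * 8) / (elapsed * 1_000_000)
    return result["bytes"], elapsed, mbit_per_s


if __name__ == "__main__":
    received, seconds, rate = measure_throughput()
    print(f"{received} bytes in {seconds:.3f} s = {rate:.1f} Mbit/s")
```

Running the same transfer before and after a configuration change gives a rough before-and-after comparison, which is the role the throughput test plays in this story.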
Now, a domain controller usually isn’t doing file copies and those sorts of things. It’s basically doing security authentication. When we went to do a throughput test, we found that it was very poor for copying files or doing the Qcheck. Now what we found when we went in and looked at that poor performance was, if you take a look here, a checksum error. You’ll see here: TCP checksum incorrect. I blew this up so you could see it. TCP checksum error, or checksum incorrect. Now what that means is that the TCP checksum calculated by the server was wrong when it went out on the wire.
Now a packet can have a good data link layer checksum and a good IP header checksum, but have a bad TCP checksum, and that’s exactly what we found in this case. Now we wanted to find more occurrences of this and see how often this was occurring, but we still wanted to see all the other packets, so we used Ethereal, which is a free network analyzer, public domain, that you can find at Ethereal.com. By going in and doing a find by string, you can search the decode for the TCP checksum, and I just put "TCP check". Then I click find, and then boom, I would find several TCP checksum errors. So TCP checksum incorrect, checksum incorrect, checksum incorrect: not all, but many of these packets had bad TCP checksums.
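A packet can pass the IP header checksum and still fail the TCP checksum because the two cover different bytes: the IP header checksum covers only the IP header, while the TCP checksum covers a pseudo-header (source address, destination address, protocol, TCP length) plus the whole TCP header and payload. The sketch below shows that calculation in Python for IPv4. The helper `make_segment` and the addresses, ports, and payload in it are made up for illustration and are not taken from the trace discussed here.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words, then complemented (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold any carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def pseudo_header(src_ip: str, dst_ip: str, tcp_length: int) -> bytes:
    """IPv4 pseudo-header that is prepended for the TCP checksum."""
    return struct.pack(
        "!4s4sBBH",
        socket.inet_aton(src_ip),
        socket.inet_aton(dst_ip),
        0,                    # reserved zero byte
        socket.IPPROTO_TCP,   # protocol number 6
        tcp_length,
    )


def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum to place in the header; the checksum field in `segment` must be zero."""
    return internet_checksum(pseudo_header(src_ip, dst_ip, len(segment)) + segment)


def tcp_checksum_ok(src_ip: str, dst_ip: str, segment: bytes) -> bool:
    """A segment carrying a correct checksum sums to zero after complementing."""
    return internet_checksum(pseudo_header(src_ip, dst_ip, len(segment)) + segment) == 0


def make_segment(payload: bytes, checksum: int = 0) -> bytes:
    """Hypothetical 20-byte TCP header (illustrative values) followed by payload."""
    header = struct.pack(
        "!HHIIBBHHH",
        49152, 445,   # source and destination ports (made up)
        1000, 2000,   # sequence and acknowledgment numbers (made up)
        5 << 4,       # data offset: 5 words, no options
        0x18,         # flags: PSH + ACK
        65535,        # window
        checksum,
        0,            # urgent pointer
    )
    return header + payload


if __name__ == "__main__":
    src, dst = "192.0.2.10", "192.0.2.20"  # documentation addresses, not real hosts
    good = make_segment(b"hello", tcp_checksum(src, dst, make_segment(b"hello")))
    bad = good[:-1] + b"!"  # alter one payload byte, leave the checksum untouched
    print("good segment verifies:", tcp_checksum_ok(src, dst, good))
    print("altered segment verifies:", tcp_checksum_ok(src, dst, bad))
```

This is the check an analyzer performs when it flags a packet as having an incorrect TCP checksum: it recomputes the sum over the captured bytes and compares it with the value carried in the header.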
I thought to myself, what could possibly be causing that to occur? And then I went to the server’s network interface card, and I found that checksum offload was enabled for both transmit and receive on the network interface card. Now the reason why this has started to be popular is because Microsoft announced in 2001 that they did some tests with network interface cards that did the offload of the TCP checksum in hardware instead of in their operating system. So by using the NIC hardware as the TCP checksum offload device, instead of the system using its processor, the server had much better performance, because it didn’t have to check all these checksums and calculate them.
So everyone beat a path to this particular door. All network interface card manufacturers started performing TCP checksum offloading in the actual NIC hardware. Well, what we found was that any NIC in a Windows 2000 machine, a Windows 2000 server, that had TCP checksum offload enabled for receive and transmit was a perpetrator of the bad TCP checksums. And we exercised those machines by using the Qcheck agent, so that we could truly see the machine under an offered load. And we did find, and corroborate, that other domain controllers suffered too. But we didn’t find it on Windows 2003 servers; we only found it on Windows 2000 servers, but we found it on pretty much every one that we looked at. So it was kind of my belief that perhaps we found a little problem here with Microsoft’s operating system, in regard to how Windows 2000 allows the TCP checksum offload to the network interface card, because we found multiple of these.
Now after we turned the TCP checksum offload off, in the case of both an HP network interface card and a 3Com network interface card, the performance went up, so it didn’t matter which network interface card vendor it was. The performance went up dramatically when we turned TCP checksum offload off on Windows 2000 servers. And we verified that by doing a Qcheck measurement, and we got much better performance, orders of magnitude better, and an absence of TCP checksum errors. So we very meticulously changed network interface card vendors. We turned TCP checksum offload on and off, and every single time we found that if you’re using a Windows 2000 server with TCP checksum offload, at least in the case of Hewlett-Packard and 3Com network interface cards, they occasionally cause TCP checksum miscalculations to occur.
So it might be a good idea for you to go out and check all your Windows 2000 Active Directory servers and find out if they have the same thing turned on, and you might want to see if you’re getting TCP checksum errors. If you are, turn off TCP checksum offload, and I’ll bet your problem goes away, your authentication occurs faster, and the users of these applications are much happier.