Tuesday, December 21, 2010

6 Tips for Troubleshooting Active Directory

Tip 1: Determining DNS Health
The first thing we want to determine when assessing AD's overall health is the state of DNS. Failing DNS can cause problems such as client authentication failures, application failures, Exchange failures with e-mail or GAL lookups, LDAP query failures, replication failures ... you get the picture. DNS is critical. There's a very powerful option for DCDiag.exe: DCdiag /Test:DNS /e /v, whose output can be redirected to a file. The /e option indicates the test will be run on all DNS servers and /v is for verbose output. In a large environment, this may take a while to run, but it's worth the wait. I always read the report starting at the bottom, which holds a summary table like the one shown in Table 1. DCdiag runs six different tests: Authentication (Auth), Basic Connectivity (Basc), Forwarders (Forw), Delegation (Del), Dynamic registration enabled (Dyn) and Resource Record registration (RReg). The table also lists the External (Ext) test (connection to the Internet), but this command doesn't perform that test.
                              Auth  Basc  Forw  Del   Dyn   RReg  Ext
Domain: Corp.net
  Corp-DC02                   PASS  WARN  n/a   n/a   PASS  FAIL  n/a
  Corp-DC03                   PASS  WARN  n/a   n/a   PASS  FAIL  n/a
  Corp-DC01                   PASS  WARN  n/a   n/a   PASS  FAIL  n/a
Domain: EMEA.Corp.net
  EMEA-DC03                   FAIL  FAIL  n/a   n/a   n/a   n/a   n/a
  EMEA-DC02                   FAIL  FAIL  n/a   n/a   n/a   n/a   n/a
  EMEA-DC01                   FAIL  FAIL  n/a   n/a   n/a   n/a   n/a
Domain: Americas.Corp.net
  AM-DC10                     PASS  PASS  PASS  PASS  PASS  FAIL  n/a
  AM-DC11                     PASS  PASS  PASS  PASS  PASS  FAIL  n/a
  AM-DC12                     PASS  PASS  PASS  PASS  PASS  FAIL  n/a
  AM-DC13                     PASS  PASS  PASS  PASS  PASS  FAIL  n/a
Table 1. Enterprise DNS Infrastructure test results.
In the sample output in Table 1, every DNS server -- which is usually also every domain controller (DC) -- in the forest is listed by domain. The cool thing about that is that it shows the domain configuration of the forest, which is very handy if you're a consultant or support engineer and not familiar with the environment. In reading the data in the table, the results are:
  • PASS: The DNS server passed this particular test
  • FAIL: The DNS server failed the test
  • N/A: The test was not run. This is usually due to a previous test failing, so it makes no sense to test a dependent function, which will fail anyway.
In Table 1, we see the value of this test. At a glance, I can see where my trouble spots are. In a multiple-domain forest, you must run this command with Enterprise Admin credentials, or you will get FAIL results on all tests for all DNS servers in domains for which you don't have privileges. This is what happened for the EMEA domain in Table 1.
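In a large forest the summary table can run to many rows, so it can help to pull out only the WARN and FAIL cells. The following is a minimal Python sketch of that idea, not part of DCDiag itself. It assumes the summary has been saved as text in which each server row is a name followed by seven whitespace-separated results, in the column order shown in Table 1; the function name find_problems and the sample text are made up for illustration, and a real report may need the parsing adjusted.

```python
# Column order taken from the summary header shown in Table 1.
TESTS = ["Auth", "Basc", "Forw", "Del", "Dyn", "RReg", "Ext"]
RESULTS = {"PASS", "WARN", "FAIL", "n/a"}


def find_problems(summary_text):
    """Return (domain, server, test, result) for every WARN or FAIL cell."""
    problems = []
    domain = None
    for line in summary_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Domain:" and len(fields) > 1:
            domain = fields[1]
            continue
        # A server row is a name followed by exactly one result per test.
        is_row = len(fields) == len(TESTS) + 1 and all(
            f in RESULTS for f in fields[1:]
        )
        if is_row:
            server = fields[0]
            for test, result in zip(TESTS, fields[1:]):
                if result in ("WARN", "FAIL"):
                    problems.append((domain, server, test, result))
    return problems


# Hypothetical excerpt laid out like Table 1 (assumed layout).
SAMPLE = """\
              Auth Basc Forw Del  Dyn  RReg Ext
Domain: Corp.net
  Corp-DC02   PASS WARN n/a  n/a  PASS FAIL n/a
Domain: EMEA.Corp.net
  EMEA-DC03   FAIL FAIL n/a  n/a  n/a  n/a  n/a
"""

for row in find_problems(SAMPLE):
    print(*row)
```

Each printed line names the domain, the server, the test and the result, which gives a short list of sections to look up in the detailed part of the report.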


For further help, a complete, detailed list of the test results is available earlier in the report. For instance, I can go to the top of the report and search for Corp-DC02, and get details as shown in Figure 1.
I've cut a lot of good information for brevity, but you can reconstruct the DNS resolver configuration for each DNS server just from the data here. The point is that the full version of this section shows why the forwarders test had an N/A in the summary in Table 1. Using this method, we can pick our way through all the warnings, failures and N/A results in the summary table. And, of course, the beauty is that you have all DNS servers in the forest in one nice text file generated from one command. This can be run even from a client that has DCDiag on it.
Figure 1. Test results for domain controllers:
DC: Test-DC1.Wtec.adapps.com
   Domain: Wtec.adapps.com       
    TEST: Authentication (Auth)
     Authentication test: Successfully completed
    TEST: Basic (Basc)
     Microsoft(R) Windows(R) Server 2003,
     Enterprise Edition (Service Pack level: 2.0) is supported
    NETLOGON service is running
    
     IP address is static
     IP address: 10.13.62.95
    
     DNS servers:
     10.13.62.105 () [Valid]
     10.13.62.95 () [Valid]
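
Searching a long report by hand gets tedious, so here is a small Python sketch that pulls out the detail block for one DC. It assumes each block begins with a line starting with "DC:" as in Figure 1 and runs until the next such line. The function name dc_section and the second DC in the sample (Test-DC2) are hypothetical.

```python
def dc_section(report_text, dc_name):
    """Return the lines of the detail block for dc_name, or [] if absent.

    dc_name may be the full DNS name or just the host part.
    """
    wanted = dc_name.lower()
    section = []
    capturing = False
    for line in report_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("DC:"):
            # Every "DC:" header starts a new block and ends the previous one.
            found = stripped[len("DC:"):].strip().lower()
            capturing = found == wanted or found.split(".")[0] == wanted
        if capturing:
            section.append(line)
    return section


# Hypothetical excerpt in the layout of Figure 1.
SAMPLE = """\
DC: Test-DC1.Wtec.adapps.com
   Domain: Wtec.adapps.com
    TEST: Authentication (Auth)
     Authentication test: Successfully completed
DC: Test-DC2.Wtec.adapps.com
   Domain: Wtec.adapps.com
"""

print("\n".join(dc_section(SAMPLE, "Test-DC1")))
```

Because the match is on the whole host name, asking for Test-DC1 would not also return a block for a DC named Test-DC10.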

Tip 2: Determining AD Replication Health
The Support tools for Windows 2003 Service Pack 1 (SP1) include a new Repadmin option called /replsum. Similar to the DCdiag /Test:DNS command in Tip No. 1, /replsum collects replication information for every DC in every domain in the forest. It will report the last time replication occurred between the DC the command was run on and each other DC in the forest. While there are a number of different options, I've only used these:
Repadmin /replsum /bysrc /bydest /sort:delta
  • /bysrc collects data for DCs that have replicated from the DC this command is run on
  • /bydest collects data for DCs that have replicated to the DC this command is run on
  • /sort:delta sorts the results by largest delta, in descending order
A sample output is shown in Table 2. This shows six DCs in the domain and the delta since their last replication. Here we can easily see that the domain is healthy except for WTEC-DC2, which has not replicated for five days with an error 1722. I know this DC is down due to a planned move in the data center. In addition, if a DC has not replicated for longer than the tombstone lifetime, it will be flagged in this report so an administrator can immediately see the danger and take steps to remove it from the network.
Source          Largest Delta     Fails/Total  %%   Error
WTEC-DC2        05d.13h:39m:15s   5 / 5        100  (1722) The RPC server is...
WTEC-DC1        41m:26s           0 / 20       0
DDMCWIN2K8      39m:00s           0 / 4        0
GSE-EXCH3       08m:59s           0 / 6        0
MRNVMWTEC       08m:56s           0 / 4        0
WTEC-DC6        08m:34s           0 / 6        0

Destination DC  Largest Delta     Fails/Total  %%   Error
WTEC-DC1        05d.13h:39m:39s   5 / 25       20   (1722) The RPC server is...
DDMCWIN2K8      41m:50s           0 / 4        0
GSE-EXCH3       13m:35s           0 / 6        0
MRNVMWTEC       07m:24s           0 / 4        0
WTEC-DC6        06m:25s           0 / 6        0
Table 2.
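
If you save this summary regularly, the Largest Delta column is easy to check in a script. Below is a minimal Python sketch that converts delta strings of the form seen in Table 2 (for example 05d.13h:39m:15s or 41m:26s) into durations and lists the DCs over a limit. The function names and the 24-hour limit are assumptions for illustration only; any other delta format raises ValueError rather than being guessed at.

```python
import re
from datetime import timedelta

# Matches deltas such as "05d.13h:39m:15s", "13h:39m:15s" or "41m:26s".
DELTA_RE = re.compile(r"^(?:(\d+)d\.)?(?:(\d+)h:)?(\d+)m:(\d+)s$")


def parse_delta(text):
    """Convert a Largest Delta string into a timedelta."""
    match = DELTA_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized delta: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def stale_dcs(rows, limit):
    """rows is an iterable of (dc_name, delta_text); return names over limit."""
    return [name for name, delta in rows if parse_delta(delta) > limit]


# Values copied from the Source section of Table 2.
ROWS = [
    ("WTEC-DC2", "05d.13h:39m:15s"),
    ("WTEC-DC1", "41m:26s"),
    ("WTEC-DC6", "08m:34s"),
]

# The 24-hour limit is an arbitrary example, not a recommended threshold.
print(stale_dcs(ROWS, timedelta(hours=24)))
```

The same function could be called with the forest's tombstone lifetime as the limit to catch the dangerous case described above.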


Tip 3: Replication Details for All DCs in the Forest
This technique -- very similar to the method used in Tip No. 2 -- will provide more detail. The command is Repadmin /showrepl * /csv >showrepl.csv. This puts the output in .CSV format, as shown in Table 3.
I like this command because it frequently turns up errors in more detail than the Repadmin /replsum command. Additionally, it will often report different errors -- or additional errors, error codes and so on -- and provide the naming context and specific data that /replsum doesn't provide.
Naming Context                  Source DC  Number of Failures  Last Failure Time  Last Success Time  Last Failure Status
DC=Wtec,DC=adapps,DC=hp,DC=com  WTEC-DC2   535                 2/18/2009 21:36    2/13/2009 7:50     1722
DC=Wtec,DC=adapps,DC=hp,DC=com  GSE-EXCH3  0                   0                  2/18/2009 21:37    0
DC=Wtec,DC=adapps,DC=hp,DC=com  WTEC-DC6   0                   0                  2/18/2009 21:37    0
DC=Wtec,DC=adapps,DC=hp,DC=com  MRNVMWTEC  0                   0                  2/18/2009 21:37    0
Table 3.
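
Because the output is a .CSV file, it can be filtered with a few lines of script instead of a spreadsheet. The Python sketch below keeps only the rows with a non-zero failure count. It assumes the column headings shown in Table 3; the headings in a real showrepl.csv may differ, which is why the column name is a parameter. The function name failing_links is made up for this example.

```python
import csv
import io


def failing_links(csv_file, failures_column="Number of Failures"):
    """Yield each CSV row (as a dict) whose failure count is above zero."""
    for row in csv.DictReader(csv_file):
        count = (row.get(failures_column) or "").strip()
        if count.isdigit() and int(count) > 0:
            yield row


# Two rows from Table 3, using the table's headings (assumed, see above).
SAMPLE_TEXT = (
    "Naming Context,Source DC,Number of Failures,"
    "Last Failure Time,Last Success Time,Last Failure Status\n"
    '"DC=Wtec,DC=adapps,DC=hp,DC=com",WTEC-DC2,535,'
    "2/18/2009 21:36,2/13/2009 7:50,1722\n"
    '"DC=Wtec,DC=adapps,DC=hp,DC=com",GSE-EXCH3,0,0,2/18/2009 21:37,0\n'
)

for row in failing_links(io.StringIO(SAMPLE_TEXT)):
    print(row["Source DC"], row["Number of Failures"], row["Last Failure Status"])
```

To run it against a saved file, pass an open file object in place of the in-memory sample, for example open("showrepl.csv", newline="").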

Tip 4: NTDS Diagnostics
This tip is an absolute essential for getting additional data on Directory Service (DS) events. It's enabled per DC in the registry at HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Diagnostics. It's fairly straightforward. There are a variety of values that, when enabled, will dump additional events into the event log to assist with troubleshooting. The valid data for these values is an integer from zero to five, inclusive. The default value is zero, meaning minimal verbosity, and a setting of five will dump more than you want. Normally I set it at three and see if I need more. For instance, if I need more verbose details on replication, I'd set the "5 Replication Events" value to three and then reproduce the problem. Make sure to reset the value to zero when troubleshooting is concluded. These settings will fill up the event log quickly.
The most common values I use include:
  • 1 Knowledge Consistency Checker
  • 2 Security Events
  • 5 Replication Events
  • 8 Directory Access
  • 9 Internal Processing
  • 10 Performance Counters
  • 13 Name Resolution (this is DNS related)
  • 15 Field Engineering
  • 18 Global Catalog
The 9 Internal Processing value is handy for getting additional details for DS events that indicate an internal error has occurred. This will often cause additional events that will aid in diagnosing the problem. It's common to set more than one of these values. For instance, in replication troubleshooting, it would be reasonable to enable 1 Knowledge Consistency Checker and 5 Replication Events.
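When several values need to be raised and later reset, generating a .reg file keeps the two steps consistent. The Python sketch below only builds the text of such a file for the Diagnostics key named above; it does not touch the registry, and the function name build_reg_file is hypothetical. It rejects any level outside the zero-to-five range described earlier. The output would still have to be saved and imported on the DC by whatever means you normally use.

```python
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Diagnostics"


def build_reg_file(levels):
    """Build .reg file text; levels maps a value name to a level from 0 to 5."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{KEY}]"]
    for name, level in levels.items():
        if not isinstance(level, int) or not 0 <= level <= 5:
            raise ValueError(f"{name}: level must be an integer from 0 to 5")
        # DWORD values are written as eight hexadecimal digits.
        lines.append(f'"{name}"=dword:{level:08x}')
    return "\n".join(lines) + "\n"


# The replication troubleshooting pair mentioned in the text, at level three.
enable = {"1 Knowledge Consistency Checker": 3, "5 Replication Events": 3}
print(build_reg_file(enable))

# The matching reset file, for when troubleshooting is concluded.
print(build_reg_file({name: 0 for name in enable}))
```

Building the reset file from the same dictionary makes it harder to forget one of the values that was raised.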
The 15 Field Engineering value will dump several additional events to the DS log. Unlike the other diagnostics, this one needs to be set to five to provide relevant data. Specifically, it will produce events 1644 and 1643, which report inefficient LDAP queries including the client that was the source of the query, the query string and the root of the query. This is important because one of the headaches related to AD is the Local Security Authority Subsystem Service (LSASS) process using up enough resources to hang or crash a DC and cause client log-on delays. Inefficient LDAP queries by a user or by an application -- or even a Linux client log-on -- will put a heavier load on LSASS. Enabling this diagnostic will quickly identify the guilty party by name or IP address. Some admins leave this diagnostic permanently enabled to monitor a busy environment, but again, it will fill up the event logs and possibly hide or overwrite other important events in the DS log.

Tip 5: Group Policy Management Console and HTML Reports

I'm sure nearly every AD admin alive uses this tool, but I thought it would be worth mentioning the value of HTML reports. There are two types of reports I use very frequently because I'm dealing with environments I'm not familiar with, and I usually want proof of the settings of a Group Policy Object (GPO) as well as the results from a particular client or clients.

Getting a report of a GPO is valuable even if you're the admin because it shows exactly what settings are defined -- in fact, it shows only the settings that are defined -- so you don't have to wade through the GP editor to find which ones are set. This is a quick way to see if the GPO is defined as you think it is. It also shows links, filters applied and other details. HTML reports for the Default Domain Policy are easy to read and can be expanded and closed by sections as needed, because they're in HTML format. To get this report, just right-click on any GPO in the domain tree and select "Save Report."
One of the problems with solving a GPO-related issue at a client is pestering the user, who may be hundreds of miles away, to log in and get a GPResult. If the user has logged in at least once on a workstation, Group Policy Management Console (GPMC) can provide you with an HTML-formatted GPResult that is produced when the user logs on. This is obtained in the GPMC console by right-clicking the "Group Policy Results" node and selecting the Group Policy Results Wizard. Of course, GPResult is a necessity in diagnosing client-side issues.

Tip 6: Active Directory Performance Diagnosis
While there are many other troubleshooting tips I could have elaborated on here, this is one that probably isn't well known. In troubleshooting server performance, there's a standard set of objects, including processor, LogicalDisk, Server, Memory, System and so on. However, there's an NTDS object that provides us with relevant AD counters such as DRA, Kerberos, LDAP and even NTLM-related counters. In addition, we can collect valuable AD data by monitoring the LSASS process. I recommend enabling the following:
  • Object: Process (LSASS instance)
    • Counters: % Processor Time, Working Set, Working Set Peak
  • Object: NTDS
    • Counters: (all counters)
Unfortunately, there's little information available on what acceptable thresholds are. The only one I've found that even addresses this is Microsoft's Branch Office Deployment guide. While there are many counters that may or may not be familiar, I've only found a few that are significant:
  • DRA Pending Replication Synchronizations: These are the directory synchronizations that are queued and are essentially replication backlog. Microsoft only says these values should be "as low as possible" and that "hardware is slowing replication." These could be indications that DC resources are at high utilization.
  • LDAP Client Sessions: This is the number of sessions opened by LDAP clients at the time the data is taken. This is helpful in determining LDAP client activity and if the DC is able to handle the load. Of course, spikes during normal periods of authentication -- such as first thing in the morning -- are not necessarily a problem, but long sustained periods of high values indicate an overworked DC.
  • LDAP Bind Time: This is the time in milliseconds needed to complete the last successful LDAP binding. Documentation says that this should be "as low as possible," but if you run the perfmon output through the Performance Analysis of Logs (PAL) tool, it will flag 15 milliseconds as a warning threshold and 30 milliseconds as an error threshold. The fix is more resources: processor, memory and so on. (Note: PAL is an excellent performance-analysis tool, and is available online.)
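If PAL isn't at hand, the same two LDAP Bind Time thresholds are easy to apply to a list of samples yourself. This Python sketch uses the 15 and 30 millisecond figures quoted above. Treating a value exactly on a threshold as over it is my assumption, and the sample values and function names are made up for illustration.

```python
WARN_MS = 15   # warning threshold quoted above
ERROR_MS = 30  # error threshold quoted above


def classify_bind_time(ms):
    """Classify one LDAP Bind Time sample, given in milliseconds."""
    if ms >= ERROR_MS:
        return "error"
    if ms >= WARN_MS:
        return "warning"
    return "ok"


def summarize(samples):
    """Count how many samples fall into each class."""
    counts = {"ok": 0, "warning": 0, "error": 0}
    for ms in samples:
        counts[classify_bind_time(ms)] += 1
    return counts


# Hypothetical samples, in milliseconds.
print(summarize([2, 9, 16, 31, 45]))
```

A handful of warnings during the morning log-on rush is probably not worth chasing; a long run of errors is the pattern to look for.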
In diagnosing the LSASS process, as in any performance analysis, a baseline must be established. A note on Microsoft's DS blog indicates that if a baseline is not available, use 80 percent. That is, the LSASS counters shouldn't indicate more than 80 percent consumption. Above 80 percent consumption indicates an overload condition, which could be a high LDAP query demand (see Tip No. 4) or general lack of server resources. The resolution is to increase resources or reduce demand, but be advised this has the potential to cause a performance hit in the domain.
If you really want to solve your LSASS resource issues, put your DCs on x64 platforms with several processors and 32GB of RAM. You might be surprised at how much memory LSASS really can use.
