Introduction
Active Directory (AD) is the heartbeat of identity and access management in most organizations. Every login, policy enforcement, and group membership request depends on it. Yet, AD’s complexity often hides problems until they spill into production — logins start failing, group policies stop applying, or replication between domain controllers breaks.
That’s why regular health checks and proactive FSMO role management are essential. This article provides a step-by-step guide to monitoring AD health, explains FSMO roles in depth, and offers practical troubleshooting examples from the field.
1. Why Active Directory Health Checks Matter
Think of AD as a nervous system: it only takes one broken connection for the entire body to react poorly.
If you skip regular checks, you risk:
- Replication failures → Users in one office may not exist in another until replication catches up.
- Authentication delays → Logons take minutes instead of seconds.
- FSMO failures → Schema updates, time synchronization, or RID allocation could fail, stopping account creation.
- Security blind spots → Replication delays can expose accounts that should have been disabled or deleted.
Real-world example: At one company, the RID Master failed silently. New user accounts couldn’t be created, leading to a week-long backlog in onboarding — all because no one had checked FSMO health in months.
2. Tools Every Admin Should Know
Built-in Windows Tools
- dcdiag → Domain controller diagnostic tool. Runs dozens of checks against DNS, replication, and AD services.
- repadmin → Provides replication summaries and latency details.
- netdom → Quickly queries FSMO role ownership and domain trust status.
- Event Viewer → Logs all AD-related issues, from failed logons to replication errors.
Cloud/Enterprise Tools
- Azure Monitor → Monitors hybrid AD environments with alerting.
- Microsoft Defender for Identity → Detects unusual AD behavior like golden ticket attacks.
- SCOM (System Center Operations Manager) → Large-scale enterprise monitoring.
3. FSMO Roles Explained (with Scenarios)
FSMO Role | Scope | Responsibility | What Happens If It Fails? |
---|---|---|---|
Schema Master | Forest | Manages schema changes (object types, attributes). | You won’t be able to extend schema (e.g., for Exchange). Normal ops continue. |
Domain Naming Master | Forest | Adds/removes domains. | You can’t add/remove domains. No daily impact. |
RID Master | Domain | Allocates unique Relative IDs for new security principals. | You can’t create new accounts once DCs run out of RIDs. |
PDC Emulator | Domain | Authoritative time source for the domain, priority target for password changes, default DC for GPO edits. | Logon failures, Kerberos issues, clock drift across systems. |
Infrastructure Master | Domain | Maintains references to objects across domains. | Group memberships may display incorrectly in multi-domain setups. |
Pro tip: Spread FSMO roles across at least two DCs for redundancy. Don’t leave them all on one server.
4. Running Health Checks (Step-by-Step)
Check FSMO Roles
powershell
netdom query fsmo
This lists which DCs hold the roles. Save this in your documentation.
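For documentation, the same ownership data can also be captured as a structured object. The following is a minimal sketch, assuming the ActiveDirectory (RSAT) PowerShell module is available; `Get-FsmoRoleHolder` is an illustrative name, not a built-in cmdlet.

```powershell
# Sketch: collect all five FSMO role holders in one object.
# Get-FsmoRoleHolder is a hypothetical helper name.
function Get-FsmoRoleHolder {
    param(
        # The defaults query the current forest and domain (needs the ActiveDirectory module).
        $Forest = (Get-ADForest),
        $Domain = (Get-ADDomain)
    )

    [pscustomobject]@{
        SchemaMaster         = $Forest.SchemaMaster
        DomainNamingMaster   = $Forest.DomainNamingMaster
        RIDMaster            = $Domain.RIDMaster
        PDCEmulator          = $Domain.PDCEmulator
        InfrastructureMaster = $Domain.InfrastructureMaster
        CheckedOn            = (Get-Date).ToString('yyyy-MM-dd')
    }
}
```

Piping the result to `Export-Csv -Append` with a path of your choice would give a dated history of role ownership.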
Run Diagnostics on Domain Controllers
powershell
dcdiag /v /c /d /e > C:\ADHealthCheck.txt
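- /v → Verbose output.
- /c → Comprehensive mode: runs all tests, including non-default ones, except DcPromo and RegisterInDNS.
- /d → Commonly described as debug-level output. It does not select the DNS tests; to test DNS explicitly, add /test:dns.
- /e → Tests every domain controller in the forest (enterprise).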
The output can be hundreds of lines — review carefully for failures and warnings.
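Because the report is long, it can help to filter it down to the failures first. This is a rough sketch, assuming English-language dcdiag output in which result lines read like `......... DC1 failed test Replications`; `Get-DcdiagFailure` is an illustrative name.

```powershell
# Sketch: extract "failed test" result lines from dcdiag output.
# Get-DcdiagFailure is a hypothetical helper; the pattern assumes English output.
function Get-DcdiagFailure {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [AllowEmptyString()]
        [string]$Line
    )
    process {
        # Capture the tested target (usually a DC name) and the name of the failed test.
        if ($Line -match '(?<Target>\S+) failed test (?<Test>\S+)') {
            [pscustomobject]@{
                Target = $Matches['Target']
                Test   = $Matches['Test']
            }
        }
    }
}
```

For example, `Get-Content C:\ADHealthCheck.txt | Get-DcdiagFailure` would list one row per failed test. Warnings and the detail behind each failure still need to be read in the full report.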
Verify Replication Health
powershell
repadmin /replsummary
repadmin /showrepl
- /replsummary → Summarizes replication failures.
- /showrepl → Detailed per-DC replication partners.
Example Output Snippet:
text
Source DSA largest delta fails/total %% error
DC1 00h:00m:15s 0 / 30 0
DC2 15h:30m:05s 3 / 40 7
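Interpretation: DC2 shows 3 failed replication attempts out of 40 (about 7%), and its largest delta of more than 15 hours means at least one partner has not replicated successfully in that time. Investigate immediately.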
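To find which links are failing, repadmin can also emit CSV, which PowerShell can filter. The sketch below assumes the English column headers (`Number of Failures`, `Source DSA`, and so on) that `repadmin /showrepl * /csv` is expected to produce; check them against your own output. `Select-FailingReplLink` is an illustrative name.

```powershell
# Sketch: keep only replication links that report at least one failure.
# Select-FailingReplLink is a hypothetical helper; column names are assumed.
function Select-FailingReplLink {
    param(
        [Parameter(Mandatory)]
        [AllowEmptyString()]
        [string[]]$CsvLine
    )

    $CsvLine |
        Where-Object { $_ } |          # drop blank lines before parsing
        ConvertFrom-Csv |
        Where-Object { [int]$_.'Number of Failures' -gt 0 } |
        Select-Object 'Destination DSA', 'Source DSA', 'Naming Context',
                      'Number of Failures', 'Last Success Time'
}
```

A possible invocation is `Select-FailingReplLink -CsvLine (repadmin /showrepl * /csv)`, run from a DC or a workstation with the AD tools installed.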
Check Event Logs for Warnings/Errors
Open Event Viewer → Applications and Services Logs → Directory Service.
Common Event IDs:
- 1311 → Replication failure between sites.
- 2087 → DNS lookup failure for a DC.
- 13568 → FRS journal wrap error (a serious SYSVOL replication issue). This one is logged under File Replication Service rather than Directory Service.
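The same events can be pulled from the command line instead of browsing Event Viewer. This is a sketch under a few assumptions: 1311 and 2087 are read from the Directory Service log, 13568 from the File Replication Service log (relevant only where SYSVOL still replicates over FRS), and both function names are illustrative.

```powershell
# Sketch only: function names are hypothetical, not built-in cmdlets.
# Assumed log names: 1311 and 2087 in "Directory Service",
# 13568 (an FRS event) in "File Replication Service".
function New-AdHealthEventFilter {
    param([int]$Days = 7)

    $since = (Get-Date).AddDays(-$Days)
    @{ LogName = 'Directory Service';        Id = 1311, 2087; StartTime = $since }
    @{ LogName = 'File Replication Service'; Id = 13568;      StartTime = $since }
}

function Get-AdHealthEvent {
    param(
        [string]$ComputerName = 'localhost',
        [int]$Days = 7
    )

    New-AdHealthEventFilter -Days $Days | ForEach-Object {
        # Get-WinEvent raises an error when nothing matches, so suppress it here.
        Get-WinEvent -ComputerName $ComputerName -FilterHashtable $_ -ErrorAction SilentlyContinue
    } | Select-Object TimeCreated, MachineName, Id, LogName, Message
}
```

Running `Get-AdHealthEvent -ComputerName DC2 -Days 1` would return any of the three events logged on DC2 in the last day. An empty result can mean no events, but it can also mean a connection or permission error, because errors are suppressed.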
5. Common Issues & Fixes (Troubleshooting Matrix)
Event ID / Symptom | Cause | Fix |
---|---|---|
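1311 (Replication failure) | Misconfigured site links or firewalls blocking ports. | Check AD Sites & Services, and verify that TCP 135, TCP/UDP 389, TCP 636, TCP 3268, and the dynamic RPC range (TCP 49152-65535 by default) are open between DCs. |
2087 (DNS lookup failure) | DNS misconfigured or missing SRV records. | Run ipconfig /registerdns on the DC, restart the Netlogon service to re-register its SRV records, and verify the DNS zones. |
13568 (Journal wrap error) | FRS lost its place in the NTFS change journal, typically after the DC was offline or overloaded for too long. | Perform a non-authoritative restore of SYSVOL (BurFlags D2) on the affected DC, or rebuild the DC if that fails. |
RID exhaustion | RID Master offline, RIDs depleted. | Bring the RID Master back online or transfer the role. Seize it only if the original holder will never return. DCs can then request new RID pools. |
Time sync errors | PDC Emulator not syncing. | Configure an NTP source on the PDC Emulator (w32tm /config /manualpeerlist:time.windows.com /syncfromflags:manual /reliable:yes /update), then run w32tm /resync. |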
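For the DNS row above, one way to verify SRV records is to generate the core DC locator names for a domain and resolve each one. This is a small sketch: `Get-DcLocatorSrvName` is an illustrative name, and the list covers only a few common records, not everything Netlogon registers.

```powershell
# Sketch: build a few well-known DC locator SRV record names.
# Get-DcLocatorSrvName is a hypothetical helper.
function Get-DcLocatorSrvName {
    param(
        [Parameter(Mandatory)]
        [string]$DomainDnsName,
        # In a single-domain forest the forest name equals the domain name.
        [string]$ForestDnsName = $DomainDnsName
    )

    "_ldap._tcp.dc._msdcs.$DomainDnsName"       # any DC in the domain
    "_kerberos._tcp.dc._msdcs.$DomainDnsName"   # KDCs in the domain
    "_ldap._tcp.pdc._msdcs.$DomainDnsName"      # the PDC Emulator
    "_ldap._tcp.gc._msdcs.$ForestDnsName"       # global catalogs in the forest
}
```

On Windows, `Get-DcLocatorSrvName -DomainDnsName example.com | ForEach-Object { Resolve-DnsName -Name $_ -Type SRV }` would then show which records resolve (`example.com` is a placeholder).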
6. Proactive Monitoring Best Practices
- Automate health checks
  - Use scheduled PowerShell tasks to run dcdiag and repadmin.
  - Save results to logs and compare week-to-week.
- Baseline your metrics
  - Example: Replication latency under 15 minutes might be normal in your environment. Anything above that should trigger alerts.
- Alerting & Notifications
  - Configure Azure Monitor to email you if a DC stops replicating.
  - Use SCOM for dashboards showing domain health in real time.
- Distribute FSMO roles
  - Place at least one forest-level role on a different DC from your domain-level roles.
  - Test transferring roles with:
    ntdsutil roles connections "connect to server DC2" quit "transfer pdc" quit quit
- Document & Report
  - Keep a living document with:
    - Current FSMO role holders
    - Last health check date
    - Known issues and resolutions
  - This becomes invaluable during audits or staff turnover.
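The baseline idea above can be turned into a simple check that a scheduled task could run. This is a minimal sketch, assuming input objects with `Server`, `Partner`, and `LastReplicationSuccess` properties, as returned by `Get-ADReplicationPartnerMetadata`. `Find-StaleReplication` is an illustrative name, and the 15-minute default is only the example figure from the list, not a recommended value.

```powershell
# Sketch: flag replication links whose last success is older than a baseline.
# Find-StaleReplication is a hypothetical helper. Links that have never
# replicated (no LastReplicationSuccess value) are not handled here.
function Find-StaleReplication {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        $Link,
        [int]$MaxMinutes = 15,        # example baseline; tune to your environment
        [datetime]$Now = (Get-Date)
    )
    process {
        $age = $Now - $Link.LastReplicationSuccess
        if ($age.TotalMinutes -gt $MaxMinutes) {
            [pscustomobject]@{
                Server     = $Link.Server
                Partner    = $Link.Partner
                AgeMinutes = [math]::Round($age.TotalMinutes, 1)
            }
        }
    }
}
```

One way to feed it is `Get-ADReplicationPartnerMetadata -Target example.com -Scope Domain | Find-StaleReplication -MaxMinutes 15`, with `example.com` standing in for your domain. Any rows returned are candidates for an alert. Inter-site links often replicate on a longer schedule, so they may need a higher threshold.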
FAQs
Q: How often should I run health checks?
A: Weekly is the minimum. High-security environments benefit from daily checks.
Q: Can FSMO roles move automatically?
A: No. An administrator must transfer them while the current holder is online, or seize them if the DC holding them is down permanently.
Q: What’s the most critical FSMO role?
A: The PDC Emulator — time sync and password resets rely heavily on it.