Case Summary: DHCP 服务因系统资源耗尽停止工作导致楼宇用户无法获取 IP

Product/Service: Windows Server 2016 — DHCP Server Role (Failover 配置)

1. 症状 (Symptoms)

使用该组 DHCP 服务的楼宇用户无法获取到新 IP 地址
检查备份 DHCP 服务器（dhcp-srv01.contoso.com）显示地址池使用率 95%，但实际地址池中并未分配那么多 IP
检查故障 DHCP 服务器（dhcp-srv02.contoso.com），发现 DHCP 服务处于停止状态
手动启动 DHCP 服务一直显示“启动中”，无法恢复
重启故障服务器后，DHCP 服务恢复正常，用户可正常获取 IP，地址池使用率也回归正常

2. 背景 (Background / Environment)

项目	详情
故障服务器	dhcp-srv02.contoso.com
Failover 伙伴	dhcp-srv01.contoso.com
操作系统	Windows Server 2016 Standard (Build 14393)
虚拟化平台	VMware VM
DHCP Scope	10.51.141.0, 10.51.143.0, 172.17.81.0
Failover 关系	dhcp-srv01 与 dhcp-srv02 互为备份
服务器 Uptime	463 天（上次重启 2024-11-29，直到故障时未重启）
安装的第三方软件	Titan Agent Service for Windows, Symantec Endpoint Protection

3. Troubleshooting 过程 (Investigation & Troubleshooting)

3.1 现场初步排查

检查备份 DHCP 服务器 dhcp-srv01
- 发现地址池使用率显示 95%
- 但实际活跃租约数远低于该比例
- 结论： 使用率虚高，可能与 Failover 状态异常有关
检查故障 DHCP 服务器 dhcp-srv02
- DHCP 服务处于 Stopped 状态
- 手动启动服务，卡在”启动中”无法完成
- 结论： 服务无法恢复，需要进一步排查原因
重启故障服务器
- 管理员于 2026-03-11 10:08 通过 explorer.exe 发起重启
- 系统因响应缓慢未能正常关机（Event 6008 非正常关机）
- 重启后 DHCP 服务在 10:16:01 成功启动
- Failover 状态在 10:16:23 恢复 NORMAL
- 用户恢复正常获取 IP

3.2 Event Log 深度分析

通过分析 dhcp-srv02 的三个日志文件（system.evtx、DHCP-SERVER.evtx、application.evtx），发现了完整的故障链。

3.2.1 System Log — DHCP 相关事件统计

Event ID	含义	次数	首次出现	最后出现
1010	DHCP 数据库访问错误	470	2026-01-11	2026-03-07 05:50
1016	DHCP 数据库清理错误	905	2026-01-11	2026-03-07 05:50
1020	DHCP 判定自身未授权	3,502	2025-12-27	2026-03-11 13:16
1376	运行在 DC 上但无法确认授权	3,504	2025-12-27	2026-03-11 13:16
1054	DHCP 致命错误，服务自行终止	1	2026-03-07 06:39	—
1059	DNS 动态更新失败	15	2026-01-15	2026-03-07 05:39

Event 1010/1016 的错误码为 0x4E2D (20013)，对应 JET 数据库操作失败。

3.2.2 System Log — Titan Agent Service 崩溃循环

Event ID	含义	次数
7000	Titan Agent 启动失败	21,764
7031	Titan Agent 意外终止	123（日志中编号 #227→#349）

Event 7000 错误码分布：

错误码	含义	次数	首次出现
%%1053	SERVICE_REQUEST_TIMEOUT（启动超时）	20,948	2025-12-27
%%1455	NO_SYSTEM_RESOURCES（无系统资源）	711	2026-01-09
%%1450	NO_SYSTEM_RESOURCES（无系统资源）	38	2026-01-13
%%8	NOT_ENOUGH_MEMORY（内存不足）	18	2026-01-13
%%1054	SERVICE_ALREADY_RUNNING	42	2026-01-11

Titan Agent 每日失败频率：

2025-12-27 首日出现，368 次
2025-12-28 至 2026-01-10：连续 14 天每天 576 次（≈ 每 2.5 分钟崩溃重启一次）
2026-01-28 至 2026-02-07：再次连续 11 天每天 576 次
SCM 配置：每次崩溃后 120 秒自动重启

错误类型按周演进——资源耗尽的渐进过程：

周次	总失败	%%1053 超时	%%1455 无资源	%%8 内存不足
W52 (12/27)	944	944 (100%)	0	0
W01 (1月初)	2,304	2,304 (100%)	0	0
W02 (1/9)	4,031	4,023	6 ⚠️ 首现	0
W03 (1/13)	2,557	2,312	211 📈	7 ⚠️ 首现
W10 (3月初)	218	179	32	2

3.2.3 System Log — 系统级资源耗尽

事件	次数	时间范围	说明
Kernel-General Event 6（注册表刷新失败）	43	2026-01-17 → 2026-03-07	OS 内核无法将注册表写入磁盘
DCOM Event 10010（服务超时）	~880	持续	DCOM 服务无法在规定时间注册

3.2.4 Application Log — ESENT/JET 数据库错误

Event ID	含义	次数	关键数据
215	DHCP 数据库备份中止	875	svchost PID 1904, dhcp.mdb
217	数据库备份失败 -1011 (JET_errOutOfMemory)	3	2/25, 2/26, 2/28
482	Windows Update DB 写入失败	1	error 8 (NOT_ENOUGH_MEMORY), 3/1
104	Windows Update JET 引擎停止	1	error -1011, 3/3

ESENT Event 217 的 XML 明确记录：

进程: svchost (PID 1904)
错误: -1011 (JET_errOutOfMemory)
数据库: C:\Windows\system32\dhcp\dhcp.mdb

3.2.5 DHCP-SERVER Log — Failover 状态

DHCP-SERVER.evtx 仅包含 4 种事件，共 1,951 条：

Event ID	含义	次数
20322	DNS 动态更新失败	1,946
20251	Failover 状态转换	3
20254	Failover 连接建立	1
20259	Partner 状态通知	1

关键发现：所有 Failover 事件仅出现在重启后（2026-03-11 10:16:23）

重启后的 Failover 状态转换：

STARTUP → COMMUNICATION_INT → NORMAL （数秒内完成）

整个故障期间（3/7 → 3/11），Failover 没有被设置为 “Partner Down”。 dhcp-srv01 一直处于 COMMUNICATION_INT 状态 → 这就是为什么地址池显示 95% 使用率。

3.3 DHCP 停止的直接机制

DHCP 服务是自行停止的（非崩溃），证据：

有 Event 1054（DHCP 主动决定停止）→ 紧接 Event 7036（进入 stopped 状态）
没有 Event 7031/7034（无意外终止或异常崩溃记录）

Event 1054 含义：”The DHCP/BINL service has determined that it is not the valid/authorized DHCP server, or has encountered an error that prevents it from continuing to serve DHCP clients.”

这是 Windows DHCP Server 的内置安全机制：如果 DHCP 无法在 AD 中验证自身授权状态，会主动停止以防成为”流氓 DHCP 服务器”。

崩溃前最后 2 小时的事件序列：

50:44  Event 1016/1010 — DHCP 数据库访问失败 (0x4E2D)
50:44  Event 1020/1376 — DHCP 无法验证 AD 授权
39:33  Event 1059 — DNS 动态更新失败（最后一次）
40:42  Kernel Event 6 — 注册表刷新失败（最后一次）
39:33  Event 1054 — DHCP 致命错误，决定自行终止 💀
40:28  Event 7036 — "DHCP Server service entered the stopped state"

3.4 备份服务器地址池 95% 使用率的原因

当 dhcp-srv02 停止工作后，dhcp-srv01 检测到通信丢失，自动进入 COMMUNICATION_INT（通信中断）状态：

特征	COMMUNICATION_INT 状态	Partner Down 状态
触发方式	自动（检测到通信丢失）	需管理员手动设置
可用地址	仅自己分配的那部分（~50%）	可使用全部地址池
租约清理	不清理对方的租约	MCLT 后可回收对方租约
本次情况	✅ 正是此状态	❌ 未执行

4. Blockers 与解决 (Blockers & How They Were Resolved)

Blocker	影响	如何解决
DHCP 服务无法手动启动（卡在”启动中”）	服务 4 天无法恢复（3/7 → 3/11）	重启服务器后恢复
系统资源全面耗尽，无法创建新线程/分配内存	不仅 DHCP，多个系统服务均受影响	重启服务器释放所有资源
备份服务器未设置为 Partner Down	存活服务器地址池受限，部分用户无法获取 IP	故障服务器重启后 Failover 自动恢复 NORMAL

5. 根因与解决方案 (Root Cause & Resolution)

Root Cause

根本原因：Titan Agent Service for Windows 从 2025-12-27 起持续崩溃重启，75 天内产生 21,764 次启动失败，以每 2.5 分钟一次的频率反复消耗系统资源，最终导致服务器全面资源耗尽。

完整因果链：

Titan Agent 崩溃循环（75天, 21,764次, 每2.5分钟一次）
  ↓ 每次循环：分配进程内存 → 启动超时 → SCM 终止 → 资源未完全释放
系统资源逐步耗尽（内存/线程/页面文件/句柄）
  ↓ 1/9 首现 NO_SYSTEM_RESOURCES, 1/13 首现 NOT_ENOUGH_MEMORY
DHCP JET 数据库无法操作（dhcp.mdb, error -1011 JET_errOutOfMemory）
  ↓ 1/11 开始，累计 1,375 次数据库访问错误 + 875 次备份失败
DHCP 无法查询 AD 验证授权（LDAP/RPC 调用因资源不足失败）
  ↓ Event 1020 × 3,502 + Event 1376 × 3,504
DHCP 内置安全机制触发 → 服务自行终止（Event 1054）
  ↓ 2026-03-07 06:39
DHCP 手动重启也失败（系统连创建新线程都做不到）
  ↓ 4 天停机
重启服务器后一切恢复（内存释放，JET 数据库正常打开）

加剧因素：服务器 463 天（2024-11-29 起）未重启，泄漏的资源永远无法被回收。

Resolution

管理员于 2026-03-11 10:08 重启故障服务器 dhcp-srv02
服务器于 10:15:51 完成启动
DHCP 服务于 10:16:01 成功启动
Failover 于 10:16:23 恢复 NORMAL 状态，两台服务器同步租约数据库
用户恢复正常获取 IP，地址池使用率回归正常

后续建议

优先级	操作
🔴 紧急	修复或移除 Titan Agent Service — 该第三方 Agent 是根因，从未成功启动，需联系 Titan 供应商排查或禁用该服务
🔴 紧急	建立定期重启计划 — 463 天未重启导致资源泄漏累积，建议每月或每季度维护重启
🟠 高	配置 DHCP 服务监控告警 — 监控 Event 1010/1016/1054，在数据库错误阶段即发出预警
🟠 高	制定 Failover “Partner Down” 操作文档 — 当确认 partner 宕机时，应在存活服务器上执行 `Set-DhcpServerv4Failover -PartnerDown` 释放全部地址池
🟡 中	增大页面文件 — 多次出现 “paging file too small” 错误
🟡 中	考虑升级操作系统 — Server 2016 即将结束扩展支持

6. 经验教训 (Lessons Learned)

技术知识

DHCP Event 1054 的真正含义：不仅仅是”未授权”，当 DHCP 无法访问 AD（因为资源不足导致 LDAP/RPC 失败）时，也会触发此安全机制自行终止
DHCP Failover COMMUNICATION_INT 状态：存活服务器只能使用约 50% 的地址池，不会自动扩展。必须手动设置 Partner Down 才能释放全部地址
JET_errOutOfMemory (-1011)：当系统内存不足时，DHCP 使用的 JET（ESENT）数据库引擎无法正常操作 dhcp.mdb
Event 7000 错误码演进：%%1053（超时）→ %%1455/%%1450（无资源）→ %%8（内存不足）的渐进变化，是系统资源逐步耗尽的明确信号

排查方法

Event Log XML 解析：当 Message 字段为空时，通过 $event.ToXml() 解析 <EventData> 获取完整参数
Binary 字段解码：Event 1010/1016 的 Binary 字段 2D4E0000 为小端序，实际错误码为 0x00004E2D = 20013
关联分析：DHCP 停止本身只是表象，需关联 SCM 事件（7000/7031）、ESENT 事件（215/217）、内核事件（Kernel-General 6）还原完整过程
按周聚合错误类型：通过观察错误码的”演进”精确判断资源耗尽的时间节点

预防措施

为 DHCP 服务器设置定期维护重启窗口
监控系统资源（内存、句柄、线程数），设置阈值告警
对第三方 Agent 的 SCM 恢复策略设置最大重试次数（如 3 次后停止重试）
定期审查 Event Log 中的 7000/7031 事件，及时发现服务崩溃循环

7. 参考文档 (References)

暂无可验证的参考文档。

Case Summary: DHCP Service Stopped Due to System Resource Exhaustion Causing IP Allocation Failure

Product/Service: Windows Server 2016 — DHCP Server Role (Failover Configuration)

1. Symptoms

Users in the building served by this DHCP pair could not obtain new IP addresses
The backup DHCP server (dhcp-srv01.contoso.com) showed 95% address pool utilization, but actual lease count did not match
The faulty DHCP server (dhcp-srv02.contoso.com) had its DHCP service in a Stopped state
Manual attempts to start the DHCP service stuck at “Starting” and never completed
After rebooting the faulty server, DHCP service recovered, users could obtain IPs, and pool utilization returned to normal

2. Background / Environment

Item	Details
Faulty Server	dhcp-srv02.contoso.com
Failover Partner	dhcp-srv01.contoso.com
Operating System	Windows Server 2016 Standard (Build 14393)
Virtualization	VMware VM
DHCP Scopes	10.51.141.0, 10.51.143.0, 172.17.81.0
Failover Relationship	dhcp-srv01 and dhcp-srv02 in Hot Standby/Load Balance
Server Uptime	463 days (last reboot 2024-11-29)
Third-Party Software	Titan Agent Service for Windows, Symantec Endpoint Protection

3. Investigation & Troubleshooting

3.1 Initial On-Site Investigation

Checked backup DHCP server dhcp-srv01
- Address pool showed 95% utilization
- Actual active leases were far fewer
- Conclusion: Inflated utilization, likely related to abnormal failover state
Checked faulty DHCP server dhcp-srv02
- DHCP service was in Stopped state
- Manual start attempt hung at “Starting” indefinitely
- Conclusion: Service unrecoverable, deeper investigation needed
Rebooted the faulty server
- Administrator initiated reboot at 2026-03-11 10:08 via explorer.exe
- System was too unresponsive for clean shutdown (Event 6008)
- DHCP service started successfully at 10:16:01
- Failover returned to NORMAL at 10:16:23

3.2 Deep Event Log Analysis

3.2.1 Titan Agent Service Crash Loop — The Root Cause

The Titan Agent Service for Windows had been in a continuous crash-restart loop since 2025-12-27:

21,764 startup failures (Event 7000) over 75 days
123 unexpected terminations (Event 7031, numbered #227→#349)
Restart interval: 120 seconds → 576 failures per day (every 2.5 minutes)

Error code evolution showing progressive resource exhaustion:

Week	Total	%%1053 Timeout	%%1455 No Resources	%%8 No Memory
W52 (12/27)	944	944 (100%)	0	0
W02 (1/9)	4,031	4,023	6 ⚠️ First	0
W03 (1/13)	2,557	2,312	211 📈	7 ⚠️ First
W10 (Mar)	218	179	32	2

3.2.2 DHCP Database Failures

Event ID	Meaning	Count	Key Data
1010/1016	Database access/cleanup error	1,375	Error 0x4E2D, from 2026-01-11
ESENT 215	DHCP backup halted	875	from 2026-01-11
ESENT 217	JET_errOutOfMemory (-1011)	3	2/25, 2/26, 2/28

3.2.3 DHCP Service Self-Termination

The DHCP service stopped itself gracefully (not a crash):

Event 1054 present → followed by Event 7036 (stopped state)
No Event 7031/7034 (no unexpected termination)
This is DHCP’s built-in safety mechanism: stops when it cannot verify AD authorization

3.2.4 Failover State

No “Partner Down” state was ever set during the 4-day outage
dhcp-srv01 remained in COMMUNICATION_INT → explaining 95% pool utilization
All failover events only appeared after reboot: STARTUP → COMMUNICATION_INT → NORMAL

3.3 How DHCP Service Stopped — The Mechanism

50:44  Event 1016/1010 — Database access failed (0x4E2D)
50:44  Event 1020/1376 — Cannot verify AD authorization
39:33  Event 1054 — Fatal error, self-terminate decision 💀
40:28  Event 7036 — "DHCP Server service entered the stopped state"

4. Blockers & How They Were Resolved

Blocker	Impact	Resolution
DHCP service stuck at “Starting”	4-day outage (3/7 → 3/11)	Server reboot
System-wide resource exhaustion	Multiple services affected	Server reboot released all resources
Backup server not set to Partner Down	Address pool restricted to ~50%	Failover auto-recovered after reboot

5. Root Cause & Resolution

Root Cause

The Titan Agent Service for Windows had been in a continuous crash-restart loop since 2025-12-27, generating 21,764 startup failures over 75 days at a rate of once every 2.5 minutes, progressively exhausting all system resources.

Titan Agent crash loop (75 days, 21,764 failures, every 2.5 minutes)
  ↓ Each cycle leaks memory/handles
System resource exhaustion (memory/threads/page file/handles)
  ↓
DHCP JET database fails (dhcp.mdb, JET_errOutOfMemory -1011)
  ↓
DHCP cannot verify AD authorization
  ↓
DHCP safety mechanism → self-terminates (Event 1054)
  ↓
Manual restart fails (no resources to create threads)
  ↓
Server reboot restores everything

Aggravating factor: 463 days without reboot — leaked resources never reclaimed.

Resolution

Server rebooted at 2026-03-11 10:08
DHCP service started at 10:16:01
Failover recovered to NORMAL at 10:16:23
Users resumed normal IP acquisition

Recommendations

Priority	Action
🔴 Critical	Fix or remove Titan Agent Service — never successfully started; contact vendor or disable
🔴 Critical	Establish regular reboot schedule — prevent resource leak accumulation
🟠 High	Configure DHCP monitoring — alert on Event 1010/1016/1054
🟠 High	Document Failover “Partner Down” procedure
🟡 Medium	Increase page file size, consider OS upgrade

6. Lessons Learned

Technical Knowledge

Event 1054 triggers not only for unauthorized servers, but also when resource starvation prevents AD authorization checks
Failover COMMUNICATION_INT limits surviving server to ~50% of address pool; manual Partner Down is required
JET_errOutOfMemory (-1011) directly caused by system memory exhaustion
Event 7000 error code progression (%%1053 → %%1455 → %%8) signals gradual resource depletion

Troubleshooting Methods

Parse Event Log XML ($event.ToXml()) when Message fields are empty
Decode Binary fields (little-endian): 2D4E0000 → 0x4E2D = 20013
Cross-log correlation: SCM + ESENT + Kernel events reveal complete picture
Weekly error aggregation identifies resource exhaustion timeline

Prevention

Regular maintenance reboot windows for infrastructure servers
System resource monitoring with threshold alerts
Set maximum retry count on third-party agent SCM recovery actions
Audit Event Log 7000/7031 events regularly to detect crash loops early

7. References

No verified reference documents available.