Deep Dive: Windows Failover Clustering Complete Guide
Deep Dive: Windows Server Failover Clustering
Topic: Windows Server Failover Clustering
Category: Infrastructure / High Availability
Level: Intermediate to Advanced
Last Updated: 2026-03-13
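Detailed Version (translated from Chinese)
1. Overview: What Is a Failover Cluster
A failover cluster is a group of independent servers that work together as a single system to provide high availability (HA) for business-critical applications.
What a Cluster Provides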
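| Capability | Description |
|---|---|
| High availability (HA) | Eliminates single points of failure so that services keep running |
| Failover | Automatically moves services to a healthy node when a node fails |
| Load balancing | Shares the workload across nodes in Active/Active mode |
| Zero-downtime maintenance | Patches and upgrades are applied through rolling updates without taking the service down |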
Failover Clustering vs NLB
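| Dimension | Failover Clustering | Network Load Balancing (NLB) |
|---|---|---|
| Purpose | Application-level high availability | Network traffic load balancing |
| Failure detection | Active health checks | Heartbeat detection |
| Shared storage | Required (traditional model) | Not required |
| Use cases | Databases, file servers, Exchange | Web servers, IIS |
| Max nodes | 64 (2012+) | 32 |
| IP address | Virtual IP moves with the resource | All nodes share the virtual IP |
Analogy: Failover Clustering is like an operating room with a standby surgeon. If the lead surgeon collapses, the standby takes over immediately and the operation continues. NLB is like several service counters open at the same time to spread out the customers.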
2. Core Terminology
2.1 Basic Concepts
| Term | Description |
|---|---|
| Cluster | A group of independent servers working together |
| Node | Each server in the cluster |
| Resource | The smallest unit the cluster manages (for example an IP address, disk, network name, or service) |
| Resource Group | A set of related resources that fail over as a unit |
| Failover | Automatic migration of resources from a failed node to a healthy node |
| Failback | Return of resources to the original node after it recovers |
2.2 Key Objects
| Term | Description |
|---|---|
| Client Access Point (CAP) | The network name and IP address clients use to reach a clustered service |
| Cluster Name Object (CNO) | The computer object for the cluster itself in Active Directory |
| Virtual Computer Object (VCO) | The computer object created in AD for each cluster role/service |
| Witness | An extra voting resource (disk or file share) that helps decide quorum |
| Shared disk | Storage that at least two nodes can access |
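2.3 Service Components
| Component | Description |
|---|---|
| Cluster Service (ClusSvc) | The core service running on every node; manages cluster operations |
| RHS (Resource Host Subsystem) | Replaces the older Resource Monitor; hosts and monitors resource DLLs |
| NetFT | The cluster virtual network adapter driver; provides fault-tolerant communication |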
3. Cluster Architecture (Deep Dive)
3.1 Three-Tier Architecture
┌────────────────────────────────────────────────────────────────┐
│ Top tier: cluster abstractions                                 │
│   Nodes, resource groups, resources, failover policies,        │
│   dependencies, resource state management, failure response    │
├────────────────────────────────────────────────────────────────┤
│ Middle tier: cluster operation                                 │
│   Membership management, regroup operations                    │
│   Cross-node configuration consistency                         │
│   GUM (Global Update Manager) serialized updates               │
├────────────────────────────────────────────────────────────────┤
│ Bottom tier: OS interaction                                    │
│   Partition manager, cluster disk driver, NetFT network driver │
│   File system, security, network interface management          │
└────────────────────────────────────────────────────────────────┘
3.2 Core Components in Detail
Cluster service component relationships
| Component | Responsibilities |
|---|---|
| Resource Control Manager (RCM) | Failover policies; resource dependency trees |
| Topology Manager | Discovers and maintains the network topology; calls clnet to enumerate all network interfaces |
| Object Manager | In-memory relational database; cluster object management |
| Database Manager | Highly available, fault-tolerant database; registry replica management; CLFS log files; Paxos consensus algorithm |
| Membership Manager (MM) | Regroup algorithm; Gossip protocol; node join/failure |
| Global Update Manager (GUM) | Serialized atomic updates; Locker node model; GUM lock mechanism |
| Host Manager | TCP port 3343; node connection management; security handshake; NetFT route updates |
| Quorum Manager | Quorum decisions; vote counting; Paxos tags; configuration replica tracking |
| Security Manager + NetFT Driver | SSPI handshake → key negotiation → secure channel; message signing/encryption → fault-tolerant multi-path communication |
Relationships shown in the original diagram:
- Client request → RCM → Object Manager
- Object Manager ↔ Database Manager; Topology Manager → Database Manager
- Object Manager → Membership Manager; Database Manager → GUM
- Membership Manager → Host Manager; Host Manager ↔ Quorum Manager
- Host Manager → Security Manager + NetFT Driver
3.3 Key Component Notes
Messaging:
- Provides all intra-cluster communication primitives
- Unicast plus multicast (GEM, Good Enough Multicast)
- Reliable point-to-point communication
Membership Manager:
- Runs the multi-stage Regroup algorithm to reach consensus
- Uses the Gossip protocol
- Scenarios handled: node join, node failure, communication problems, full-mesh topology failure
Regroup process:
Opening → Closing → Pruning → PruneAck → GemRepair → Cleanup → Stable
- Opening: start negotiating the new membership
- Closing: close the old membership view
- Pruning: remove failed nodes
- PruneAck: acknowledge the pruning
- GemRepair: repair multicast
- Cleanup: clean up old state
- Stable: stable state
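Global Update Manager (GUM):
- Guarantees that all configuration updates are serialized and atomic
- An update is applied to the Locker node first, and only sent to the other nodes after it succeeds
- A node must acquire the GUM lock before it can publish an update
Resource Control Manager (RCM):
- Implements failover mechanisms and policies
- Tracks resource state: Online / Offline / Failed / Online Pending / Offline Pending
- Tracks resource group state: Online / Offline / Partial Online / Failed
- Builds and maintains the resource dependency tree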
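4. Quorum: The Brain of the Cluster
4.1 What Is Quorum?
Quorum is the consensus mechanism the cluster uses to decide whether it has enough "votes" to keep services online.
Analogy: quorum is like a board vote. More than half of the directors must be present (a quorum) for a resolution to be valid. A cluster works the same way: enough nodes must "vote" in favor before services can come online.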
4.2 Quorum Models
| Model | Votes | Best For |
|---|---|---|
| Node Majority | Nodes only | Odd number of nodes |
| Node + Disk Majority | Nodes + witness disk | Even number of nodes + shared disk |
| Node + File Share Majority | Nodes + file share witness | Multi-site clusters |
| No Majority (Disk Only) | Disk only | Not recommended |
Basic formula:
The total number of votes should be odd
Even number of nodes → add one witness (disk or file share)
Odd number of nodes → a witness is not required
Cluster stays available when: surviving votes > total votes / 2
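The majority rule above can be checked with a few lines of arithmetic. This is only an illustrative sketch: `has_quorum` and `needs_witness` are hypothetical helper names, not cluster APIs, and the model assumes every node and the witness carry exactly one vote.

```python
def has_quorum(surviving_votes: int, total_votes: int) -> bool:
    """Return True when the surviving votes are a strict majority of all votes."""
    return surviving_votes > total_votes / 2


def needs_witness(node_count: int) -> bool:
    """An even node count gives an even vote total, so a witness is added to make it odd."""
    return node_count % 2 == 0


# Four nodes plus one witness: 5 votes in total.
total_votes = 4 + (1 if needs_witness(4) else 0)
print(total_votes)                   # 5
print(has_quorum(3, total_votes))    # True: 3 of 5 is a majority
print(has_quorum(2, total_votes))    # False: 2 of 5 is not
```

With four nodes and no witness, a clean 2/2 split leaves neither half with a majority, which is why the witness matters for even node counts.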
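4.3 Dynamic Quorum
Introduced in Windows Server 2012 and enabled by default.
- The cluster continuously monitors membership and adjusts vote weights dynamically
- Allows the cluster to keep running after more than 50% of nodes have failed, provided they fail one after another so votes can be recalculated between failures
- In theory the cluster can keep serving with only one node running
- Known as the "Last Man Standing" model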
# Check Dynamic Quorum status
(Get-Cluster).DynamicQuorum # 1 = enabled
# Show the current quorum configuration
Get-ClusterQuorum
# Set the quorum model
Set-ClusterQuorum -NodeAndDiskMajority "Cluster Disk 1"
Set-ClusterQuorum -NodeAndFileShareMajority "\\fileserver\share"
4.4 Force Quorum and Prevent Quorum
| Option | Command | Purpose |
|---|---|---|
| Force Quorum (/FQ) | net start clussvc /FQ | Forces the cluster service to start even without enough votes (DR site scenario) |
| Prevent Quorum (/PQ) | net start clussvc /PQ | Starts the cluster service but prevents it from forming a cluster; the node can only join an existing cluster |
4.5 Node Weighting
Used in multi-site clusters to control which nodes take part in the quorum calculation:
# Check a node's weight
(Get-ClusterNode "Node1").NodeWeight
# Set the node weight to 0 (the node does not vote)
(Get-ClusterNode "DRNode1").NodeWeight = 0
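To see how a weight of 0 changes the vote count, the sketch below models a two-site cluster. It is a simplified illustration, not cluster code: the function names are hypothetical, `Node2` and `DRNode2` are assumed node names, and the witness is assumed to hold one vote.

```python
def total_votes(node_weights, witness=False):
    """Sum of all node weights, plus one vote if a witness is configured."""
    return sum(node_weights.values()) + (1 if witness else 0)


def surviving_votes(node_weights, nodes_up, witness_up=False):
    """Votes held by nodes that are still up, plus the witness vote if reachable."""
    votes = sum(weight for node, weight in node_weights.items() if node in nodes_up)
    return votes + (1 if witness_up else 0)


def has_quorum(node_weights, nodes_up, witness=False, witness_up=False):
    """Strict majority of all configured votes must survive."""
    surviving = surviving_votes(node_weights, nodes_up, witness_up)
    return surviving > total_votes(node_weights, witness) / 2


# Primary-site nodes vote; DR-site nodes have NodeWeight = 0.
weights = {"Node1": 1, "Node2": 1, "DRNode1": 0, "DRNode2": 0}

# Losing the whole DR site removes no votes.
print(has_quorum(weights, {"Node1", "Node2"}, witness=True, witness_up=True))
# The DR site alone, even with the witness, holds 1 of 3 votes.
print(has_quorum(weights, {"DRNode1", "DRNode2"}, witness=True, witness_up=True))
```

Under these assumptions the DR site can never win quorum by itself, which is why bringing it up after a primary-site loss typically involves the /FQ option described in 4.4.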
4.6 Paxos Tags
The cluster uses the Paxos consensus algorithm to track configuration replicas:
<PaxosTag> 2026/03/13-14:25:22.523_41:2026/03/13-14:25:22.523_41:13192 </PaxosTag>
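5. Cluster Networking
5.1 NetFT Virtual Adapter
- A PnP driver that appears as a network adapter in Device Manager
- Its MAC address is based on the first physical NIC
- Provides fault-tolerant multi-path communication
- Interoperates with IPsec, DHCP, and the host firewall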
5.2 Heartbeat Mechanism
Node A                                   Node B
  │                                        │
  │ ───── UDP 3343 (heartbeat) ──────────► │
  │ ◄──── UDP 3343 (heartbeat) ─────────── │
  │                                        │
  If N consecutive heartbeats are missed:
    → mark the peer as Unreachable
    → trigger the Regroup process
Heartbeat parameters:
| Parameter | Default | Description |
|---|---|---|
| SameSubnetDelay | 1000ms | Heartbeat interval within a subnet |
| CrossSubnetDelay | 1000ms | Heartbeat interval across subnets |
| SameSubnetThreshold | 10 | Missed-heartbeat threshold within a subnet |
| CrossSubnetThreshold | 20 | Missed-heartbeat threshold across subnets |
# Check heartbeat settings
(Get-Cluster).SameSubnetThreshold
(Get-Cluster).CrossSubnetThreshold
# Raise the heartbeat thresholds (more tolerant)
(Get-Cluster).SameSubnetThreshold = 20
(Get-Cluster).CrossSubnetThreshold = 40
5.3 Network Roles
| Role | Value | Description |
|---|---|---|
| Private (cluster only) | Role 1 | Internal cluster communication only; no cluster IP is bound; carries CSV and Live Migration traffic |
| Public (client + cluster) | Role 3 | Client access plus internal cluster communication |
| Not Used | Role 0 | Not used by the cluster; health is not monitored |
⚠️ Joining the cluster uses TCP; the runtime heartbeat uses UDP 3343.
6. Cluster Storage
6.1 Shared Storage Requirements
- Accessible from at least two nodes
- Must support SCSI-3 Persistent Reservations
- Supported types: basic disks (MBR) and GPT
Supported storage interfaces:
| Interface | Description |
|---|---|
| Fibre Channel (FC) | The most traditional SAN connection |
| Serial Attached SCSI (SAS) | Direct-attached shared storage |
| iSCSI | Storage over an IP network |
| Storage Spaces Direct (S2D) | Hyper-converged solution using local disks |
| Shared VHDX | Shared disk for virtual machines |
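6.2 Disk Fencing
Disk fencing is the process of controlling access to disks:
- The cluster disk driver receives a new-disk notification from the PnP manager
- Once the disk is confirmed as a cluster disk, PartMgr.sys performs the fencing
- It uses the DISK_ONLINE and DISK_OFFLINE IOCTLs
- The disk is Online on the active node and Offline on passive nodes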
6.3 Cluster Shared Volumes (CSV)
CSV is the core technology of cluster storage:
CSV architecture
Node 1 (coordinator/owner)            Node 2
┌─────────────────────────┐           ┌─────────────────────────┐
│ Metadata operations     │    SMB    │ Direct I/O              │
│ (NTFS metadata goes     │◄──────────│ (VHD reads and writes   │
│ through this node)      │           │ go straight to disk)    │
└────────────┬────────────┘           └─────────────────────────┘
             ▼
┌─────────────────────────┐
│ Shared storage (LUN)    │
│ C:\ClusterStorage\xxx   │
└─────────────────────────┘
Characteristics:
- All nodes have concurrent access
- Metadata operations are routed to the coordinator node
- Data I/O goes directly to disk (high performance)
- Based on the NTFS file system
- Mount point: C:\ClusterStorage\
CSV management commands:
Add-ClusterSharedVolume -Name "Cluster Disk 2"
Get-ClusterSharedVolume
Move-ClusterSharedVolume -Name "Cluster Disk 2" -Node "Node2"
Remove-ClusterSharedVolume -Name "Cluster Disk 2"
CSV maintenance notes:
- Running CHKDSK, Defrag, or Shrink requires putting the volume into Redirected Access mode
- These operations must be started from the coordinator node
7. Resource Management
7.1 Resource States
              ┌───────────┐
         ┌───►│  Online   ├────┐
         │    └─────┬─────┘    │
         │          │ failure  │ recovery
         │          ▼          │
    ┌────┴────┐  ┌──────┐  ┌───┴─────┐
    │ Online  │  │Failed│  │ Offline │
    │ Pending │  └──┬───┘  │ Pending │
    └─────────┘     │      └─────────┘
                    │ restart/failover
                    ▼
              ┌───────────┐
              │  Offline  │
              └───────────┘
7.2 Resource Dependencies
A resource can depend on other resources in the same resource group; the AND and OR operators are supported:
Example dependency tree for a file server:
─────────────────────
File Server Role
       │
   ┌───┴───┐
   │  AND  │
   ├───────┤
   │       │
Network  Physical
Name     Disk
   │
   │ (OR)
 ┌─┴──────┐
 │        │
IPv4     IPv6
Address  Address
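The AND/OR tree above can be evaluated recursively. The sketch below is a toy model under stated assumptions: `can_come_online` and the `DEPENDENCIES` table are hypothetical, and real dependency expressions in the cluster can be more elaborate than one operator per resource.

```python
def can_come_online(resource, dependencies, online):
    """Evaluate a dependency tree recursively.

    dependencies maps a resource name to ("AND" | "OR", [child names]).
    Resources without an entry are leaves whose state comes from `online`.
    """
    if resource not in dependencies:
        return resource in online
    operator, children = dependencies[resource]
    results = [can_come_online(child, dependencies, online) for child in children]
    return all(results) if operator == "AND" else any(results)


# Mirrors the example: the role needs name AND disk; the name needs IPv4 OR IPv6.
DEPENDENCIES = {
    "File Server Role": ("AND", ["Network Name", "Physical Disk"]),
    "Network Name": ("OR", ["IPv4 Address", "IPv6 Address"]),
}

# One IP address is enough for the network name, but the disk is mandatory.
print(can_come_online("File Server Role", DEPENDENCIES, {"IPv4 Address", "Physical Disk"}))  # True
print(can_come_online("File Server Role", DEPENDENCIES, {"IPv4 Address", "IPv6 Address"}))   # False
```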
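7.3 Resource Policies (Defaults)
| Policy | Default | Description |
|---|---|---|
| Restart count | 1 | Number of restarts on the current node after a resource failure |
| Restart window | 15 minutes | If the resource fails again within this period, fail over |
| Failover action | Move the entire resource group | To another node |
| Retry interval | 1 hour | How long to wait before retrying after all nodes have failed |
| Pending timeout | 3 minutes | Maximum time a resource may stay in a Pending state |
| Basic health check | Every 5 seconds | LooksAlive check |
| Thorough health check | Every 60 seconds | IsAlive check |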
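7.4 RHS Deadlock Detection and Recovery
RHS deadlock recovery flow:
─────────────────────
1. RHS calls an entry point in the resource DLL
        │
        ▼
2. RHS waits DeadlockTimeout (5 minutes) for the resource to respond
        │ timeout
        ▼
3. The cluster service terminates the RHS process to recover
        │
        ▼
4. The cluster service waits DeadlockTimeout × 4 (20 minutes)
   for the RHS process to terminate
        │ timeout
        ▼
5. NetFT triggers a bugcheck, STOP 0x9E, to recover
   (because the RHS process could not be terminated)
Improvements in 2012:
- Resource Re-attach: healthy resources are re-attached to the new RHS without a restart
- Core resource isolation: built-in resources → Core RHS, Physical Disk → Storage RHS, third-party → separate RHS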
7.5 Resource Group Failover Properties
| Property | Default | Description |
|---|---|---|
| Preferred Owners | None (any node) | When set, controls the failover target |
| Failover count | Number of nodes - 1 | Each node is tried once |
| Failover period | 6 hours | If the failover count is exceeded within this period, the group stays Failed |
| Automatic failback | Disabled | The group does not return automatically after the node recovers |
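The failover count and period in the table combine into a sliding-window check. The sketch below is a simplified model, not the cluster's implementation: `group_may_fail_over` is a hypothetical helper that only encodes the defaults listed above.

```python
from datetime import datetime, timedelta


def group_may_fail_over(previous_failovers, now, node_count, period=timedelta(hours=6)):
    """Return True if one more failover stays within the default threshold.

    The threshold is node_count - 1 failovers inside the failover period;
    once it is reached, the group is left in the Failed state.
    """
    threshold = node_count - 1
    recent = [t for t in previous_failovers if now - t <= period]
    return len(recent) < threshold


now = datetime(2026, 3, 13, 12, 0, 0)
one_hour_ago = now - timedelta(hours=1)
two_hours_ago = now - timedelta(hours=2)

# Three-node cluster: two failovers are allowed within six hours.
print(group_may_fail_over([one_hour_ago], now, node_count=3))                 # True
print(group_may_fail_over([two_hours_ago, one_hour_ago], now, node_count=3))  # False
```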
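8. Clusters and Active Directory
8.1 How the CNO and VCOs Relate
Active Directory
├── CNO (Cluster Name Object)
│   ├── Computer object for the cluster itself
│   │     e.g. CLUSTER01$
│   ├── Created automatically when the cluster is created
│   ├── Used to update VCO passwords
│   └── Must have SELF Full Control permission
│
└── VCO (Virtual Computer Object)
    ├── Computer object for a cluster role/service
    │     e.g. FILESERVER01$, SQLCLUSTER01$
    ├── Created automatically when a cluster role is created
    ├── The CNO must have permission to update the VCO password
    └── Clients reach the clustered service through the VCO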
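8.2 Event 1207: AD Permission Problems
Symptom: the cluster cannot update the computer object password
Troubleshooting steps:
- Read all of the error information in the event and note the error code
- The CNO must have SELF Full Control
- The CNO must have full permission on the VCO (so it can update the VCO password)
- Check DC replication and network connectivity
- Reference: Event ID 1207, Active Directory Permissions for Cluster Accounts
Why are the resources still online? A failed password update does not affect resources immediately, but a password that stays stale for a long time can cause Kerberos authentication failures.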
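9. Management Tools and Operations
9.1 Management Tools
| Tool | Description |
|---|---|
| Failover Cluster Manager | GUI management console |
| PowerShell | The recommended command-line tool, with the most complete feature set |
| cluster.exe | Legacy command-line tool (not installed by default on 2012+) |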
9.2 Key PowerShell Commands
# ===== Cluster management =====
New-Cluster -Name "Cluster01" -Node "Node1","Node2" -StaticAddress 10.0.0.10
Get-Cluster | Format-List *
Test-Cluster -Node "Node1","Node2"
# ===== Node management =====
Get-ClusterNode
Suspend-ClusterNode -Name "Node1" -Drain # Pause the node and drain its workloads
Resume-ClusterNode -Name "Node1" # Resume the node
# ===== Resource group management =====
Get-ClusterGroup
Move-ClusterGroup "MyFileServer" -Node "Node2"
Get-ClusterNode "Node3" | Get-ClusterGroup | Move-ClusterGroup # Move everything off the node
# ===== Resource management =====
Get-ClusterResource
Get-ClusterGroup "GroupName" | Get-ClusterResource | Get-ClusterResourceDependency
# ===== Storage management =====
Get-ClusterGroup "Available Storage" | Get-ClusterResource
Add-ClusterSharedVolume -Name "Cluster Disk 2"
# ===== Log collection =====
Get-ClusterLog -Destination "C:\temp" -UseLocalTime
Set-ClusterLog -Size 300 # Set the log size (MB)
Set-ClusterLog -Level 3 # Set the log level
# ===== Quorum management =====
Get-ClusterQuorum
Set-ClusterQuorum -NodeAndDiskMajority "Cluster Disk 1"
# ===== Add a highly available role =====
Add-ClusterFileServerRole -Storage "Cluster Disk 3" -Name "FS01" -StaticAddress 10.0.0.31
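9.3 Patching and Rolling Updates
Manual rolling update steps:
1. Move all resources off the node to be updated
2. Pause the node in Failover Cluster Manager
3. Apply the patches/hotfixes and reboot if required
4. Resume the node
5. Repeat steps 1-4 for the remaining nodes
💡 Best practice: reboot the server before patching to make sure it starts cleanly. This avoids blaming the patch for a problem that already existed.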
9.4 Cluster Aware Updating (CAU)
CAU is the cluster-aware automatic updating feature:
- Drains workloads automatically (Live/Quick Migration)
- Puts the node into maintenance mode
- Downloads and installs patches, rebooting if required
- Takes the node out of maintenance mode and moves workloads back
- Repeats node by node until every node is updated
- Supports WU/MU or WSUS
# Trigger a CAU run manually
Invoke-CauRun -ClusterName "Cluster01"
# Collect CAU logs
Save-CauDebugTrace -ClusterName "Cluster01"
10. Troubleshooting
10.1 Diagnostic Tools
| Tool | Purpose | Location/Command |
|---|---|---|
| Cluster log | The primary troubleshooting tool | Get-ClusterLog -Destination C:\temp |
| Event logs | System-level events | Application and Services Logs → Microsoft → Windows → FailoverClustering |
| Cluster validation | Checks that the configuration is correct | Test-Cluster |
| SDP | Collects diagnostic data automatically | Replaces MPSreport |
| Network monitor | Packet capture analysis | Network Monitor / Wireshark |
| Performance counters | Performance monitoring | Performance Monitor |
10.2 The Cluster Log in Detail
Log location: %systemroot%\Cluster\Reports
Default size: 300 MB (WS2012)
Default level: 3
Time format: UTC by default; use -UseLocalTime to show local time
# Collect the logs
Get-ClusterLog -Destination C:\temp -UseLocalTime
# Adjust the log size and level
Set-ClusterLog -Size 500
Set-ClusterLog -Level 5 # More verbose
# Log format example:
# PID.TID::YYYY/MM/DD-HH:MM:SS.mmm LEVEL [Component] Message
00003750.00003150::2026/03/13-16:39:30.381 INFO [IM] got event: Remote endpoint 10.1.1.71:~3343~ unreachable
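When a cluster log runs to hundreds of megabytes, it helps to split each line into fields before filtering. The parser below is a minimal sketch based only on the format line shown above; `parse_cluster_log_line` is a hypothetical helper, and it assumes the PID and TID are hexadecimal and that the level is a single word.

```python
import re
from datetime import datetime

# PID.TID::YYYY/MM/DD-HH:MM:SS.mmm LEVEL [Component] Message
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]+)\.(?P<tid>[0-9a-fA-F]+)::"
    r"(?P<timestamp>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>\w+)\s+\[(?P<component>[^\]]+)\]\s+(?P<message>.*)$"
)


def parse_cluster_log_line(line):
    """Split one cluster log line into its fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["timestamp"] = datetime.strptime(fields["timestamp"], "%Y/%m/%d-%H:%M:%S.%f")
    return fields


SAMPLE = (
    "00003750.00003150::2026/03/13-16:39:30.381 INFO [IM] "
    "got event: Remote endpoint 10.1.1.71:~3343~ unreachable"
)
entry = parse_cluster_log_line(SAMPLE)
print(entry["component"], entry["level"], entry["timestamp"].isoformat())
```

Filtering parsed entries by `component` (for example `IM` or `NDP`) is a quick way to isolate heartbeat and routing messages around the time of an Event 1135.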
10.3 Common Problems and Troubleshooting
Event 1069: Cluster Resource Failure
Symptom: a resource in a clustered service or application has failed
Troubleshooting:
# 1. Find out which resource failed
Get-ClusterResource | Where-Object {$_.State -eq "Failed"}
# 2. Show the resource details
Get-ClusterResource "ResourceName" | Format-List *
# 3. Check its dependencies
Get-ClusterResourceDependency "ResourceName"
# 4. Look for the detailed error for this resource in the cluster log
Get-ClusterLog -Destination C:\temp
# Search for the resource name to find the relevant entries
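Event 1135: Node Removed / Heartbeat Loss
This is one of the most common cluster problems.
First decide whether this is the original error or a consequence of one of the following:
- Node restart
- Cluster service restart
- Server hang
If it is heartbeat loss (the most typical case):
Action plan:
─────────────────────
1. Confirm whether UDP 3343 is blocked
   └── Check the firewall rules
2. Check for network problems
   └── Packet loss, latency, disconnections
3. NIC offload settings (especially in virtualized environments)
   └── Try disabling RSS, VMQ, and TCP Chimney
4. If NIC Teaming is in use
   └── Try breaking the team to isolate the problem
5. Install the latest cluster patches
   └── Those related to clussvc, tcpip, and ndis
6. Tune the heartbeat thresholds
   └── Increase SameSubnetThreshold / CrossSubnetThreshold
7. Analyze network captures
   └── Check whether heartbeat packets are being lost
Cluster log example: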
[IM] got event: Remote endpoint 10.1.1.71:~3343~ unreachable from 10.1.1.72:~3343~
[IM] Marking Route from 10.1.1.72:~3343~ to 10.1.1.71:~3343~ as down
[NDP] All routes for route (virtual) local 169.254.2.213:~0~ to remote 169.254.1.57:~0~ are down
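Event 1207: Cannot Update the Computer Account Password
Troubleshooting steps:
1. Read all of the information in the event carefully and note the error code
2. The CNO must have SELF Full Control
3. The CNO must have full permission on the VCO
4. Look for DC replication and network problems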
Event 5120: CSV Problems
Troubleshooting steps:
1. Note the reason code in the event log
   e.g. STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR(c0130021)
2. Install the latest cluster patches
3. Check for HBA, SAN, and storage problems
   └── Update drivers and firmware
4. Did it happen while a backup was running?
5. Check storage connectivity
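Event 1146: RHS Stopped Unexpectedly
Troubleshooting steps:
1. Cluster log → review the details
2. RHS process dump → analyze the cause of the crash
3. System memory dump → deeper analysis (but it causes a bugcheck)
   └── The bugcheck may be acceptable because the resources fail over
Configuring dump collection:
- 2008/2008R2: configure WER user dumps and the OS dump
- 2012/2012R2: can be configured through the registry
Bugcheck 0x9E
This bugcheck is triggered by NetFT, usually because an RHS deadlock could not be recovered:
Reference: Decoding Bugcheck 0x0000009E
https://blogs.msdn.com/b/clustering/archive/2013/11/13/10467483.aspx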
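10.4 General Storage Troubleshooting
Storage troubleshooting decision tree:
─────────────────────
1. What is the underlying storage architecture?
   └── iSCSI? SAN (FC/SAS)? Virtualized? Geographically dispersed?
2. Is the problem caused inside the OS or outside it?
   └── Method: disable the cluster disk driver to isolate
3. Has the disk signature changed?
4. Run cluster validation (Test-Cluster)
5. Contact the storage vendor
   └── Use vendor tools to check disk state (such as PowerPath)
6. Take special care with geographically dispersed clusters
   └── Some symptoms may be expected behavior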
11. Multi-Site Clusters
Key Considerations
| Dimension | Considerations |
|---|---|
| Network | Inter-site latency, bandwidth, routing |
| Quorum | Place a file share witness in a third site; use node weights to control failover behavior |
| Storage | Requires storage-level virtualization/replication |
| Heartbeat | Increase CrossSubnetThreshold and CrossSubnetDelay |
| Failover | Manual vs automatic; use /FQ and /PQ for control |
English Version
1. Overview: What Is Failover Clustering
Failover Clustering is a group of independent servers working together as a single system, providing High Availability (HA) for mission-critical applications.
What a Cluster Provides
| Capability | Description |
|---|---|
| High Availability | Eliminates single points of failure |
| Failover | Automatically migrates services to healthy nodes on failure |
| Load Balancing | Distributes workloads in Active/Active configurations |
| Zero Downtime Maintenance | Rolling updates without service interruption |
Failover Clustering vs NLB
| Dimension | Failover Clustering | Network Load Balancing (NLB) |
|---|---|---|
| Purpose | Application-level HA | Network traffic load balancing |
| Shared Storage | Required (traditional) | Not required |
| Use Cases | Databases, File Servers, Exchange | Web Servers, IIS |
| Max Nodes | 64 (2012+) | 32 |
2. Core Terminology
| Term | Description |
|---|---|
| Cluster | Group of independent servers working as one system |
| Node | Each server in the cluster |
| Resource | Smallest unit managed by the cluster (IP, disk, network name, service) |
| Resource Group | Collection of related resources that fail over as a unit |
| Failover | Automatic migration of resources from failed to healthy node |
| Failback | Migration of resources back to original node after recovery |
| CNO | Cluster Name Object: the cluster's AD computer account |
| VCO | Virtual Computer Object: each cluster role's AD computer account |
| Witness | Extra voting resource (disk or file share) for quorum |
| RHS | Resource Host Subsystem â hosts and monitors resource DLLs |
| NetFT | Cluster virtual network adapter providing fault-tolerant communications |
3. Cluster Architecture
Three-Tier Architecture
┌────────────────────────────────────────────┐
│ Top Tier: Cluster Abstractions             │
│   Nodes, Groups, Resources, Policies       │
├────────────────────────────────────────────┤
│ Middle Tier: Cluster Operation             │
│   Membership, Regroup, GUM, Consistency    │
├────────────────────────────────────────────┤
│ Bottom Tier: OS Interaction                │
│   PartMgr, ClusDisk, NetFT, NTFS, Security │
└────────────────────────────────────────────┘
Key Components
| Component | Role |
|---|---|
| Messaging | All intra-cluster communication (Unicast + GEM Multicast) |
| Membership Manager | Regroup algorithm, Gossip protocol, node join/failure |
| Global Update Manager (GUM) | Serialized atomic updates, Locker node model |
| Resource Control Manager | Failover policies, resource dependency trees, state management |
| Database Manager | Fault-tolerant registry-based database, Paxos consensus |
| Quorum Manager | Determines if cluster has quorum, tracks all replicas |
| Host Manager | TCP 3343 connections, security handshake, NetFT routing |
| Topology Manager | Network topology discovery and maintenance |
| Security Manager | SSPI handshake, message signing/encryption |
| NetFT Driver | Fault-tolerant multi-path communication between nodes |
Regroup Process
Opening → Closing → Pruning → PruneAck → GemRepair → Cleanup → Stable
This multi-stage algorithm ensures all nodes reach consensus on cluster membership after any topology change.
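The stage order can be written down as a small lookup. This is a deliberately simplified sketch: `REGROUP_STAGES` and `next_stage` are hypothetical names, and the strictly linear progression is an assumption taken from the arrow diagram, not a description of how the Membership Manager handles retries or restarts.

```python
REGROUP_STAGES = [
    "Opening",    # start negotiating the new membership
    "Closing",    # close the old membership view
    "Pruning",    # remove failed nodes
    "PruneAck",   # acknowledge the pruning
    "GemRepair",  # repair multicast
    "Cleanup",    # clean up old state
    "Stable",     # membership agreed
]


def next_stage(current):
    """Return the stage that follows `current`, or None once the view is Stable."""
    index = REGROUP_STAGES.index(current)
    if index + 1 < len(REGROUP_STAGES):
        return REGROUP_STAGES[index + 1]
    return None


# Walk the whole sequence from the first stage.
stage = "Opening"
path = [stage]
while next_stage(stage) is not None:
    stage = next_stage(stage)
    path.append(stage)
print(" -> ".join(path))
```

When reading a cluster log, seeing these stage names in order around a node loss is a sign that a regroup ran to completion.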
4. Quorum: The Brain of the Cluster
Quorum Models
| Model | Votes | Best For |
|---|---|---|
| Node Majority | Nodes only | Odd number of nodes |
| Node + Disk Majority | Nodes + witness disk | Even nodes + shared disk |
| Node + File Share Majority | Nodes + file share witness | Multi-site clusters |
| No Majority (Disk Only) | Disk only | Not recommended |
Dynamic Quorum (2012+)
- Continuously adjusts vote weights based on active membership
- Allows the cluster to survive with more than 50% of nodes down, provided the nodes fail one after another
- "Last Man Standing": can theoretically run on a single node
- Enabled by default: (Get-Cluster).DynamicQuorum returns 1
Force Quorum and Prevent Quorum
| Option | Use Case |
|---|---|
| /FQ (Force Quorum) | Force the cluster to start when there are insufficient votes (DR scenario) |
| /PQ (Prevent Quorum) | Start the service but only allow joining an existing cluster |
5. Cluster Networking
Heartbeat Mechanism
- Protocol: UDP port 3343
- Controlled by: NetFT driver
- Node join uses TCP; runtime heartbeat uses UDP
| Parameter | Default | Description |
|---|---|---|
| SameSubnetDelay | 1000ms | Same-subnet heartbeat interval |
| CrossSubnetDelay | 1000ms | Cross-subnet heartbeat interval |
| SameSubnetThreshold | 10 | Missed heartbeats before marking unreachable |
| CrossSubnetThreshold | 20 | Cross-subnet missed heartbeat threshold |
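Multiplying the delay by the threshold gives a rough idea of how long a silent node goes unnoticed. The helper below is a back-of-the-envelope sketch: `detection_time_seconds` is a hypothetical name, and it ignores network jitter and the time the subsequent regroup takes.

```python
def detection_time_seconds(delay_ms, threshold):
    """Approximate time before a peer is marked unreachable: interval x missed heartbeats."""
    return delay_ms * threshold / 1000


# Defaults from the table above.
print(detection_time_seconds(1000, 10))  # same subnet: 10.0 seconds
print(detection_time_seconds(1000, 20))  # cross subnet: 20.0 seconds

# Doubling the same-subnet threshold to 20 doubles the tolerance.
print(detection_time_seconds(1000, 20))  # 20.0 seconds
```

This is the trade-off behind threshold tuning: higher values ride out short network blips but delay failover when a node is truly down.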
Network Roles
| Role | Value | Description |
|---|---|---|
| Private | 1 | Internal cluster communication only, carries CSV/Live Migration traffic |
| Public | 3 | Client access + internal cluster communication |
| Not Used | 0 | Cluster ignores this network, no health monitoring |
6. Cluster Storage
Requirements
- Shared by at least 2 nodes
- SCSI-3 Persistent Reservations required
- Supported: FC, SAS, iSCSI, S2D, Shared VHDX
Cluster Shared Volumes (CSV)
- Concurrent access from all cluster nodes
- Metadata routed to coordinator (owner) node
- Data I/O written directly to disk (high performance)
- Mount point: C:\ClusterStorage\
- Based on NTFS
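The split between metadata and data I/O can be summarized as a routing decision. The function below is a toy model of the behavior listed above, not the CSV file system itself: `csv_io_path` and the operation labels are hypothetical, and Redirected Access mode is left out.

```python
def csv_io_path(operation, local_node, coordinator):
    """Describe where an operation is serviced in the simplified CSV model.

    operation is "metadata" or "data"; node names are plain strings.
    """
    if operation == "metadata" and local_node != coordinator:
        # Metadata changes are routed to the coordinator (owner) node.
        return f"routed to coordinator {coordinator}"
    # Data I/O, and metadata on the coordinator itself, goes straight to disk.
    return "direct to disk"


print(csv_io_path("data", "Node2", "Node1"))      # direct to disk
print(csv_io_path("metadata", "Node2", "Node1"))  # routed to coordinator Node1
print(csv_io_path("metadata", "Node1", "Node1"))  # direct to disk
```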
Disk Fencing
- Controls disk access (online/offline per node)
- Handled by PartMgr.sys via DISK_ONLINE/DISK_OFFLINE IOCTLs
- Prevents data corruption from simultaneous uncoordinated access
7. Resource Management
Resource States
Online → Failed → Offline (with Online Pending / Offline Pending transitions)
Default Resource Policies
| Policy | Default |
|---|---|
| Restart count | 1 time |
| Restart window | 15 minutes |
| Failover action | Move entire group |
| Retry interval | 1 hour |
| Pending timeout | 3 minutes |
| Basic health check | Every 5 seconds |
| Thorough health check | Every 60 seconds |
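The restart count and restart window together decide whether a failed resource is restarted in place or the group is moved. The sketch below models just those two defaults; `action_on_failure` is a hypothetical helper and the real Resource Control Manager considers more inputs.

```python
from datetime import datetime, timedelta


def action_on_failure(previous_failures, now, max_restarts=1, window=timedelta(minutes=15)):
    """Pick the recovery action for a failure happening at `now`.

    previous_failures holds the times of earlier failures, each of which
    consumed one restart attempt on the current node.
    """
    recent = [t for t in previous_failures if now - t <= window]
    if len(recent) < max_restarts:
        return "restart on current node"
    return "fail over the group"


now = datetime(2026, 3, 13, 12, 0, 0)

print(action_on_failure([], now))                              # first failure
print(action_on_failure([now - timedelta(minutes=5)], now))    # second failure inside the window
print(action_on_failure([now - timedelta(minutes=20)], now))   # earlier failure has aged out
```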
RHS Deadlock Recovery
- RHS waits 5 minutes for a resource response
- On timeout, the cluster service terminates the RHS process
- The cluster service then waits 20 minutes for RHS to terminate
- If RHS won't terminate, NetFT triggers STOP 0x9E (bugcheck)
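Adding the two waits gives the worst-case time between a resource call hanging and the bugcheck. The sketch below assumes the second wait is four times DeadlockTimeout, as the detailed flow states; `rhs_deadlock_timeline` is a hypothetical helper.

```python
from datetime import timedelta


def rhs_deadlock_timeline(deadlock_timeout=timedelta(minutes=5)):
    """Return elapsed time, measured from the hung call, for each recovery step."""
    terminate_wait = deadlock_timeout * 4  # wait for the RHS process to exit
    return {
        "rhs_terminate_requested_after": deadlock_timeout,
        "bugcheck_after": deadlock_timeout + terminate_wait,
    }


timeline = rhs_deadlock_timeline()
print(timeline["rhs_terminate_requested_after"])  # 0:05:00
print(timeline["bugcheck_after"])                 # 0:25:00
```

Under the defaults, a STOP 0x9E therefore points to a resource call that hung roughly 25 minutes earlier, which narrows the cluster log window worth reading.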
8. Troubleshooting
Key Diagnostic Tools
| Tool | Command |
|---|---|
| Cluster Log | Get-ClusterLog -Destination C:\temp -UseLocalTime |
| Validation | Test-Cluster -Node Node1,Node2 |
| Event Logs | FailoverClustering channels in Event Viewer |
| Log Size/Level | Set-ClusterLog -Size 500, Set-ClusterLog -Level 5 |
Common Issues
Event 1135: Node Removed (Heartbeat Loss)
Action Plan:
- Check if UDP 3343 is blocked (firewall)
- Check network issues (packet loss, latency)
- Review NIC offload settings (disable RSS, VMQ in virtual environments)
- Break NIC teaming to isolate
- Install latest cluster patches (clussvc, tcpip, ndis)
- Tune heartbeat thresholds (increase SameSubnetThreshold/CrossSubnetThreshold)
- Analyze network captures
Event 1069: Resource Failure
- Identify which resource failed
- Check dependencies
- Review cluster log for detailed error
Event 1207: AD Permission Issues
- CNO needs SELF full control
- CNO needs full permission on VCO
- Check DC sync and network
Event 5120: CSV Issues
- Note the reason code in event log
- Install latest patches
- Check HBA/SAN/storage
- Check if backup was running
Event 1146: RHS Crash
- Cluster log for details
- RHS process dump for crash analysis
- System memory dump for thorough analysis (causes bugcheck)
Bugcheck 0x9E
- Caused by NetFT when RHS deadlock cannot be recovered
- RHS process failed to terminate within 20 minutes
Storage Troubleshooting Checklist
- Identify storage infrastructure (iSCSI? SAN? Virtual? Geo?)
- Isolate: OS internal vs external (disable cluster disk driver to test)
- Check disk signature changes
- Run cluster validation (Test-Cluster)
- Involve storage vendor
- Special care for geo-clusters (some symptoms may be expected)
9. References
- Windows Server Failover Clustering training materials (M01-M15, S01-S09)
- Failover Clustering Overview