达梦数据库集群分裂(SPLIT)修复文档
show2026-06-08 17:08:17#================================================================================#GROUP OGUID MON_CONFIRM MODE MPP_FLAGGRP1 45331 TRUE MANUAL FALSEGROUP SPLIT:1: DATABASE(DMSERVER2):<<DATABASE GLOBAL INFO:>>DW_IP MAL_DW_PORT WTIME WTYPE WCTLSTAT WSTATUS INAME INST_OK N_EP N_OK ISTATUS IMODE DSC_STATUS RTYPE RSTAT DETACHED192.168.139.68 52141 2026-06-08 17:08:17 GLOBAL VALID OPEN DMSERVER2 OK 1 1 OPEN PRIMARY DSC_OPEN REALTIME VALID FALSEEP INFO:INST_IP INST_PORT INST_OK INAME ISTATUS IMODE DSC_SEQNO DSC_CTL_NODE RTYPE RSTAT FSEQ FLSN CSEQ CLSN DW_STAT_FLAG192.168.139.68 5236 OK DMSERVER2 OPEN PRIMARY 0 0 REALTIME VALID 8626 1573852 8626 1573852 NONE2: DATABASE(DMSERVER):<<DATABASE GLOBAL INFO:>>DW_IP MAL_DW_PORT WTIME WTYPE WCTLSTAT WSTATUS INAME INST_OK N_EP N_OK ISTATUS IMODE DSC_STATUS RTYPE RSTAT DETACHED192.168.139.115 52141 2026-06-08 17:08:16 GLOBAL SPLIT STARTUP DMSERVER OK 1 1 MOUNT STANDBY DSC_OPEN REALTIME INVALID FALSEEP INFO:INST_IP INST_PORT INST_OK INAME ISTATUS IMODE DSC_SEQNO DSC_CTL_NODE RTYPE RSTAT FSEQ FLSN CSEQ CLSN DW_STAT_FLAG192.168.139.115 5236 OK DMSERVER MOUNT STANDBY 0 0 REALTIME INVALID 8584 1573639 8584 1573639 NONE
同时,备库的 ARCH_IS_VALID 值为 N,表示实时归档配置无效。
这个环境之前通过手动把备库用alter database primary 强制打开,导致备库数据与主库不一致,被集群标记为分裂状态。
修复方法其实都一样,适合如下:
-
备库处于
SPLIT分裂状态 -
备库
RSTAT = INVALID,ARCH_IS_VALID = N -
备库 LSN 落后于主库,无法自动恢复
-
集群无法自动修复,需要人工干预
环境信息
| 角色 | 实例名 | IP地址 | MAL_DW端口 | 数据库端口 |
|---|---|---|---|---|
| 主库 | DMSERVER2 | 192.168.139.68 | 52141 | 5236 |
| 备库 | DMSERVER | 192.168.139.115 | 52141 | 5236 |
| 参数 | 值 |
|---|---|
| 集群组名 | GRP1 |
| OGUID | 45331 |
| 数据目录(备库) | /dm/data/dmdb |
| 备份目录 | /dm/bak |
修复步骤
第一步:停止备库服务
#停止watcher 进程systemctl stop DmWatcherServicewatcher# 2. 确认守护进程已停止ps -ef | grep dmwatcher | grep -v grep# 3. 停止数据库服务systemctl stop DmServicedmdb# 4. 确认数据库进程已停止ps -ef | grep dmserver | grep -v grep
# 查找进程PIDps -ef | grep dmserver | grep -v grep | awk '{print $2}'# 正常终止kill <PID># 等待5秒后检查,如仍存在则强制终止(最后手段)kill -9 <PID>
第二步:清理分裂标记文件
分裂状态由 dmwatcher.ctl 标记文件引起(第一未清理,导致恢复之后还是为分裂状态),需要删除或重命名:
# 进入数据目录(根据实际路径调整)cd /dm/data/dmdb# 查看标记文件ls -la dmwatcher.ctl# 备份并删除标记文件mv dmwatcher.ctl dmwatcher.ctl.bak# 确认删除ls -la dmwatcher.ctl
第三步:在主库执行联机备份
SYSDBA 身份执行备份:-- 登录主库disql SYSDBA/SYSDBA@192.168.139.68:5236-- 执行全库联机备份BACKUP DATABASE BACKUPSET '/dm/bak/db_full_bak_02';-- 查看备份文件SELECT SF_BAKSET_BACKUP_DIR_ADD('DISK', '/dm/bak');SELECT * FROM V$BACKUPSET;
第四步:传输备份文件到备库
# 在主库服务器上执行scp -r /dm/bak/db_full_bak_02 dmdba@192.168.139.115:/dm/bak# 在备库服务器上确认文件已传输ls -la /dm/bak/db_full_bak_02
第五步:在备库执行脱机恢复
dmrmanRMAN> RESTORE DATABASE '/dm/data/dmdb/dm.ini' FROM BACKUPSET '/dm/bak/db_full_bak_02';RESTORE DATABASE '/dm/data/dmdb/dm.ini' FROM BACKUPSET '/dm/bak/db_full_bak_02';file dm.key not found, use default license![Percent:100.00%][Speed:0.00M/s][Cost:00:00:21][Remaining:00:00:00]restore successfully.time used: 00:00:21.601RMAN>RMAN> RECOVER DATABASE '/dm/data/dmdb/dm.ini' FROM BACKUPSET '/dm/bak/db_full_bak_02';RECOVER DATABASE '/dm/data/dmdb/dm.ini' FROM BACKUPSET '/dm/bak/db_full_bak_02';[Percent:100.00%][Speed:0.00PKG/s][Cost:00:00:00][Remaining:00:00:00]recover successfully!time used: 00:00:04.592RMAN> RECOVER DATABASE '/dm/data/dmdb/dm.ini' UPDATE DB_MAGIC;RECOVER DATABASE '/dm/data/dmdb/dm.ini' UPDATE DB_MAGIC;recover successfully!time used: 00:00:01.525RMAN> exit
第六步:配置备库归档并启动
以 Mount 模式启动备库
/dm/bin/dmserver path=/dm/data/dmdb/dm.ini mount
打开新终端窗口,执行以下配置:
配置 OGUID 和数据库模式
-- 在新终端中登录备库disql SYSDBA/SYSDBA@192.168.139.115:5236-- 开启手动模式修改权限SP_SET_PARA_VALUE(1, 'ALTER_MODE_STATUS', 1);-- 设置OGUID,必须与集群一致(此处为45331)SP_SET_OGUID(45331);-- 将数据库角色设置为STANDBYALTER DATABASE STANDBY;-- 关闭手动模式修改权限SP_SET_PARA_VALUE(1, 'ALTER_MODE_STATUS', 0);-- 查看配置结果SELECT NAME, OGUID FROM V$INSTANCE;SELECT STATUS$ FROM V$INSTANCE;
验证归档配置
-- 查看归档配置SELECT ARCH_NAME, ARCH_TYPE, ARCH_DEST, ARCH_IS_VALID FROM V$DM_ARCH_INI;-- 如果ARCH_IS_VALID为N,需要检查dmarch.ini文件-- 查看dm.ini中归档配置是否开启SELECT * FROM V$DM_INI WHERE PARA_NAME = 'ARCH_INI';
正常关闭备库
-- 关闭数据库SHUTDOWN IMMEDIATE;-- 退出disqlEXIT;
以服务方式启动备库
systemctl start DmServicedmdb# 启动守护进程systemctl start DmWatcherServicewatcher# 检查进程状态systemctl status DmServicedmdbsystemctl status DmWatcherServicewatcher# 查看进程ps -ef | grep -E "dmserver|dmwatcher" | grep -v grep
第七步:验证集群状态
在监视器所在服务器上执行:RSTAT 状态都为VALID
show2026-06-08 17:15:06#================================================================================#GROUP OGUID MON_CONFIRM MODE MPP_FLAGGRP1 45331 TRUE MANUAL FALSE<<DATABASE GLOBAL INFO:>>DW_IP MAL_DW_PORT WTIME WTYPE WCTLSTAT WSTATUS INAME INST_OK N_EP N_OK ISTATUS IMODE DSC_STATUS RTYPE RSTAT DETACHED192.168.139.68 52141 2026-06-08 17:15:06 GLOBAL VALID OPEN DMSERVER2 OK 1 1 OPEN PRIMARY DSC_OPEN REALTIME VALID FALSEEP INFO:INST_IP INST_PORT INST_OK INAME ISTATUS IMODE DSC_SEQNO DSC_CTL_NODE RTYPE RSTAT FSEQ FLSN CSEQ CLSN DW_STAT_FLAG192.168.139.68 5236 OK DMSERVER2 OPEN PRIMARY 0 0 REALTIME VALID 8628 1573852 8628 1573852 NONE<<DATABASE GLOBAL INFO:>>DW_IP MAL_DW_PORT WTIME WTYPE WCTLSTAT WSTATUS INAME INST_OK N_EP N_OK ISTATUS IMODE DSC_STATUS RTYPE RSTAT DETACHED192.168.139.115 52141 2026-06-08 17:15:05 GLOBAL VALID OPEN DMSERVER OK 1 1 OPEN STANDBY DSC_OPEN REALTIME VALID FALSEEP INFO:INST_IP INST_PORT INST_OK INAME ISTATUS IMODE DSC_SEQNO DSC_CTL_NODE RTYPE RSTAT FSEQ FLSN CSEQ CLSN DW_STAT_FLAG192.168.139.115 5236 OK DMSERVER OPEN STANDBY 0 0 REALTIME VALID 8628 1573852 8628 1573852 NONE
验证归档有效性:
-- 在主库查询SELECT ARCH_NAME, ARCH_TYPE, ARCH_DEST, ARCH_IS_VALID FROM V$DM_ARCH_INI;LINEID ARCH_NAME ARCH_TYPE ARCH_DEST ARCH_IS_VALID---------- ---------------- --------- --------- -------------1 ARCHIVE_REALTIME REALTIME DMSERVER Y2 ARCHIVE_LOCAL1 LOCAL /dm/arch Y
插入数据验证同步正常。
夜雨聆风