"incorrect resource manager data checksum in record at 2/XYZ" + "terminating walreceiver process due to administrator command"

Problem description:

I am running a streaming replication environment with PostgreSQL 9.1 (1 master, 3 slaves). Everything worked fine for approx. 2 months. Yesterday, replication to one of the slaves failed, and the log on that slave showed:

LOG: incorrect resource manager data checksum in record at 61/DA2710A7 
FATAL: terminating walreceiver process due to administrator command 
LOG: incorrect resource manager data checksum in record at 61/DA2710A7 
LOG: incorrect resource manager data checksum in record at 61/DA2710A7 
LOG: incorrect resource manager data checksum in record at 61/DA2710A7 
LOG: incorrect resource manager data checksum in record at 61/DA2710A7 
LOG: incorrect resource manager data checksum in record at 61/DA2710A7 
LOG: incorrect resource manager data checksum in record at 61/DA2710A7 
LOG: incorrect resource manager data checksum in record at 61/DA2710A7 

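The slave was no longer in sync with the master. Two hours later, during which the log gained a new line like the ones above every 5 seconds, I restarted the slave database server: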

LOG: incorrect resource manager data checksum in record at 61/DA2710A7 
LOG: received fast shutdown request 
LOG: aborting any active transactions 
LOG: incorrect resource manager data checksum in record at 61/DA2710A7 
FATAL: terminating connection due to administrator command 
FATAL: terminating connection due to administrator command 
LOG: shutting down 
LOG: database system is shut down 
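
To see how often the slave repeats the same message, and whether it is always the same WAL location, the log can be grouped by location. The following is only a rough sketch: the function name `count_checksum_errors` and the sample lines are illustrative, and it assumes the message text matches the excerpts above.

```python
import re
from collections import Counter

# Matches the slave's message and captures the WAL location (e.g. 61/DA2710A7).
CHECKSUM_RE = re.compile(
    r"incorrect resource manager data checksum in record at "
    r"([0-9A-Fa-f]+/[0-9A-Fa-f]+)"
)


def count_checksum_errors(lines):
    """Return a Counter mapping WAL location -> number of checksum errors seen."""
    counts = Counter()
    for line in lines:
        match = CHECKSUM_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Illustrative input; in practice, pass an open log file instead.
    sample = [
        "LOG: incorrect resource manager data checksum in record at 61/DA2710A7",
        "FATAL: terminating walreceiver process due to administrator command",
        "LOG: incorrect resource manager data checksum in record at 61/DA2710A7",
    ]
    for location, n in count_checksum_errors(sample).most_common():
        print(f"{location}: {n} occurrence(s)")
```

If every hit reports the same location, the slave is repeatedly failing on one record rather than hitting new corrupt records.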

The new log file on the slave contains:

LOG: database system was shut down in recovery at 2016-02-29 05:12:11 CET 
LOG: entering standby mode 
LOG: redo starts at 61/D92C10C9 
LOG: consistent recovery state reached at 61/DA2710A7 
LOG: database system is ready to accept read only connections 
LOG: incorrect resource manager data checksum in record at 61/DA2710A7 
LOG: streaming replication successfully connected to primary 
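
The restart log gives two WAL locations: redo starts at 61/D92C10C9 and the failing record is at 61/DA2710A7. Because both share the same high part (61), the amount of WAL replayed before the bad record is simply the difference of the low parts. A minimal sketch, with hypothetical helper names; locations with different high parts are deliberately rejected, since how the two parts combine depends on the PostgreSQL version.

```python
def parse_wal_location(text):
    """Split a location such as '61/DA2710A7' into two integers (high, low)."""
    high, low = text.strip().split("/")
    return int(high, 16), int(low, 16)


def offset_gap_same_high(start, end):
    """Byte distance between two locations that share the same high part."""
    start_high, start_low = parse_wal_location(start)
    end_high, end_low = parse_wal_location(end)
    if start_high != end_high:
        # Assumption: cross-file arithmetic is version dependent, so not handled.
        raise ValueError("locations have different high parts")
    return end_low - start_low


if __name__ == "__main__":
    gap = offset_gap_same_high("61/D92C10C9", "61/DA2710A7")
    print(f"{gap} bytes ({gap / (1024 * 1024):.1f} MiB) replayed before the bad record")
```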

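Now the slave is in sync with the master again, but the checksum entry is still there. The other thing I checked was the network logs -> the network was available.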

My questions are:

  1. Does anyone know why the walreceiver was terminated?
  2. Why did PostgreSQL not retry the replication?
  3. What can I do to prevent this?

Thanks.

EDIT:

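The database servers are running on SLES 11 with EXT3. I found an article about low performance on SLES 11, but I am not sure whether it applies, because my machines have only 8 GB RAM (https://www.novell.com/support/kb/doc.php?id=7010287).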

Any help would be appreciated.

EDIT(2):

The PostgreSQL version is 9.1.5. It seems that PostgreSQL 9.1.6 provides a fix for a similar problem?

Fix persistence marking of shared buffers during WAL replay (Jeff Davis) 

This mistake can result in buffers not being written out during checkpoints, resulting in data corruption if the server later crashes without ever having written those buffers. Corruption can occur on any server following crash recovery, but it is significantly more likely to occur on standby slave servers since those perform much more WAL replay. 

Source: http://www.postgresql.org/docs/9.1/static/release-9-1-6.html

Maybe this is the fix? Should I upgrade to PostgreSQL 9.1.6, and will everything then run smoothly?

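In case someone stumbles upon this question: I ended up reinstalling the database from backup data and setting up replication again. I never really understood what went wrong.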


I ran into the same error, except that in my case the slave never fully synced from the very beginning.

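Then the master server had a kernel error (possibly a heat problem with the server case?). Since it did not shut down completely, the server had to be powered off. Already during the shutdown, the slave showed: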

LOG: incorrect resource manager data checksum in record at 1/63663CB0 

After restarting both the master and the slave, the situation did not change: the same log entry every 5 seconds.