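My production cluster has had the repair service enabled since April 16, using the default repair time of 9 days, and repairs were completing normally. Since May 22, however, OpsCenter has been automatically disabling the service:
From /var/log/opscenter/opscenterd.log: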
[...]
2014-06-03 21:13:47-0400 [zs_prod] ERROR: Repair task (<Node 10.1.0.22='6417880425364517165'>, (-4019838962446882275L, -4006140687792135587L), set(['zs_logging', 'OpsCenter'])) timed out after 3600 seconds.
2014-06-03 22:16:44-0400 [zs_prod] ERROR: Repair task (<Node 10.1.0.22='6417880425364517165'>, (-4006140687792135587L, -4006140687792135586L), set(['zs_logging', 'OpsCenter'])) timed out after 3600 seconds.
2014-06-03 22:16:44-0400 [zs_prod] ERROR: More than 100 errors during repair service, shutting down repair service
2014-06-03 22:16:44-0400 [zs_prod] INFO: Stopping repair service
[...]
From /var/log/opscenter/repair_service/zs_prod.log:
[...]
2014-06-03 22:16:44-0400 [zs_prod] ERROR: Repair task (<Node 10.1.0.22='6417880425364517165'>, (-4006140687792135587L, -4006140687792135586L), set(['zs_logging', 'OpsCenter'])) timed out after 3600 seconds.
2014-06-03 22:16:44-0400 [zs_prod] ERROR: Task (<Node 10.1.0.22='6417880425364517165'>, (-4006140687792135587L, -4006140687792135586L), set(['zs_logging', 'OpsCenter'])) has failed 1 times.
2014-06-03 22:16:44-0400 [zs_prod] ERROR: 101 errors have ocurred out of 100 allowed.
2014-06-03 22:16:44-0400 [zs_prod] ERROR: More than 100 errors during repair service, shutting down repair service
2014-06-03 22:16:44-0400 [zs_prod] INFO: Stopping repair service
On the node where the repair failed, from /var/log/cassandra/system.log:
ERROR [RMI TCP Connection(93502)-10.1.0.22] 2014-06-03 20:12:28,858 StorageService.java (line 2560) Repair session failed:
java.lang.IllegalArgumentException: Requested range intersects a local range but is not fully contained in one; this would lead to imprecise repair
at org.apache.cassandra.service.ActiveRepairService.getNeighbors(ActiveRepairService.java:164)
at org.apache.cassandra.repair.RepairSession.<init>(RepairSession.java:128)
at org.apache.cassandra.repair.RepairSession.<init>(RepairSession.java:117)
at org.apache.cassandra.service.ActiveRepairService.submitRepairSession(ActiveRepairService.java:97)
at org.apache.cassandra.service.StorageService.forceKeyspaceRepair(StorageService.java:2620)
at org.apache.cassandra.service.StorageService$5.runMayThrow(StorageService.java:2556)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
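These errors occur only while the repair service is running, and they are the only errors these nodes report. Apart from the repair tasks, the Cassandra cluster is running normally.
I am running OpsCenter 4.1.2 with a 6-node DSE 4.0.2 cluster installed on Linux virtual machines. The nodes run a vanilla install of Ubuntu Server 12.04 64-bit, and DSE was installed and secured according to the provided installation documentation.
For some time I have also been seeing this problem on my development cluster (with DSE 4.0.0, 4.0.1, and 4.0.2), but I assumed that was because I had misconfigured something. There, too, the problem appeared spontaneously at some point.
The Cassandra cluster itself runs very smoothly, with good write throughput. It is stable and has ample resources. We have not seen any problems in the applications that depend on it.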
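Answer 1
This is a known bug in OpsCenter that was fixed in version 4.1.3 (see the last issue listed at http://www.datastax.com/documentation/opscenter/4.1/opsc/release_notes/opscReleaseNotes413.html).
I don't think there is any workaround other than upgrading OpsCenter (which should be easy to do).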
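If you want to check whether your logs show the same symptoms before upgrading, here is a rough Python sketch that tallies the "timed out" repair tasks per node and reports the width of each token range. The regular expression is only an assumption based on the log lines quoted in the question, and the names `TIMEOUT_RE` and `summarize_timeouts` are made up for illustration; they are not part of OpsCenter.

```python
import re
from collections import Counter

# Pattern modeled on the "Repair task (...) timed out" lines quoted in the
# question. This is an assumption about the log format, not an official spec.
TIMEOUT_RE = re.compile(
    r"Repair task \(<Node (?P<ip>[\d.]+)='[^']*'>, "
    r"\((?P<start>-?\d+)L?, (?P<end>-?\d+)L?\), .*\) "
    r"timed out after (?P<secs>\d+) seconds"
)


def summarize_timeouts(lines):
    """Return (timeouts per node IP, list of (start, end, width) token ranges)."""
    per_node = Counter()
    ranges = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match is None:
            continue  # not a repair timeout line
        start = int(match.group("start"))
        end = int(match.group("end"))
        per_node[match.group("ip")] += 1
        # Width is simply end - start; a range that wraps around the ring
        # would show up as a negative number here.
        ranges.append((start, end, end - start))
    return per_node, ranges


# Two lines copied from the opscenterd.log excerpt in the question.
sample = [
    "2014-06-03 21:13:47-0400 [zs_prod] ERROR: Repair task (<Node 10.1.0.22='6417880425364517165'>, (-4019838962446882275L, -4006140687792135587L), set(['zs_logging', 'OpsCenter'])) timed out after 3600 seconds.",
    "2014-06-03 22:16:44-0400 [zs_prod] ERROR: Repair task (<Node 10.1.0.22='6417880425364517165'>, (-4006140687792135587L, -4006140687792135586L), set(['zs_logging', 'OpsCenter'])) timed out after 3600 seconds.",
]

per_node, ranges = summarize_timeouts(sample)
for ip, count in per_node.items():
    print(ip, count)
for start, end, width in ranges:
    print(start, end, width)
```

To run it against a real log, pass an open file instead of `sample`, for example `summarize_timeouts(open("/var/log/opscenter/opscenterd.log"))`. This only summarizes what is already in the log; it does not fix anything, and the upgrade is still the actual remedy.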