We are trying to move MongoDB from an AWS VM to an on-premises machine.
The MongoDB data directory is over 2.5 TB in size.
The idea is to add the on-premises VM to the replica set alongside the AWS VM, so it syncs as a secondary.
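A minimal sketch of how such a member might be added from the mongo shell on the current primary (the host address and the priority/votes options below are illustrative assumptions, not the exact command we ran; 192.168.10.145 appears to be the on-premises node based on the log further down):

```
// Illustrative sketch: add the on-premises node as a priority-0, non-voting
// member so it cannot trigger elections while the ~2.5 TB initial sync runs.
rs.add({ host: "192.168.10.145:27017", priority: 0, votes: 0 })

// After the node reaches SECONDARY state (check with rs.status()),
// rs.reconfig() can restore priority/votes if the member should be electable.
```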
The problem is that after a few days the replication process crashes and all the data on our on-premises machine is wiped:
2020-06-22T01:08:02.850+0200 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host 192.168.26.122:27017 because the pool meets constraints; 2 connections to that host remain open
2020-06-22T01:08:07.911+0200 I ASIO [NetworkInterfaceASIO-RS-0] Ending connection to host 192.168.26.122:27017 due to bad connection status; 1 connections to that host remain open
2020-06-22T01:08:07.911+0200 I REPL [replication-341] Restarting oplog query due to error: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback: Operation timed out. Last fetched optime (with hash): { ts: Timestamp(1592780789, 1), t: 25 }[6723760701176776417]. Restarts remaining: 3
2020-06-22T01:08:07.912+0200 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 192.168.26.122:27017
2020-06-22T01:08:07.912+0200 I REPL [replication-341] Scheduled new oplog query Fetcher source: 192.168.26.122:27017 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp(1592780789, 1) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, batchSize: 13981010, term: 25, readConcern: { afterClusterTime: Timestamp(1592780789, 1) } } query metadata: { $replData: 1, $oplogQueryData: 1, $readPreference: { mode: "secondaryPreferred" } } active: 1 findNetworkTimeout: 65000ms getMoreNetworkTimeout: 35000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 839529 -- target:192.168.26.122:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp(1592780789, 1) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, batchSize: 13981010, term: 25, readConcern: { afterClusterTime: Timestamp(1592780789, 1) } } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
2020-06-22T01:08:07.944+0200 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to 192.168.26.122:27017, took 33ms (2 connections now open to 192.168.26.122:27017)
2020-06-22T01:09:07.985+0200 I REPL [replication-341] Restarting oplog query due to error: ExceededTimeLimit: error in fetcher batch callback: operation exceeded time limit. Last fetched optime (with hash): { ts: Timestamp(1592780789, 1), t: 25 }[6723760701176776417]. Restarts remaining: 2
2020-06-22T01:09:07.986+0200 I REPL [replication-341] Scheduled new oplog query Fetcher source: 192.168.26.122:27017 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp(1592780789, 1) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, batchSize: 13981010, term: 25, readConcern: { afterClusterTime: Timestamp(1592780789, 1) } } query metadata: { $replData: 1, $oplogQueryData: 1, $readPreference: { mode: "secondaryPreferred" } } active: 1 findNetworkTimeout: 65000ms getMoreNetworkTimeout: 35000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 839534 -- target:192.168.26.122:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp(1592780789, 1) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, batchSize: 13981010, term: 25, readConcern: { afterClusterTime: Timestamp(1592780789, 1) } } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
2020-06-22T01:10:12.986+0200 I ASIO [NetworkInterfaceASIO-RS-0] Ending connection to host 192.168.26.122:27017 due to bad connection status; 1 connections to that host remain open
2020-06-22T01:10:12.986+0200 I REPL [replication-341] Restarting oplog query due to error: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback: Operation timed out. Last fetched optime (with hash): { ts: Timestamp(1592780789, 1), t: 25 }[6723760701176776417]. Restarts remaining: 1
2020-06-22T01:10:12.987+0200 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to 192.168.26.122:27017
2020-06-22T01:10:12.987+0200 I REPL [replication-341] Scheduled new oplog query Fetcher source: 192.168.26.122:27017 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp(1592780789, 1) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, batchSize: 13981010, term: 25, readConcern: { afterClusterTime: Timestamp(1592780789, 1) } } query metadata: { $replData: 1, $oplogQueryData: 1, $readPreference: { mode: "secondaryPreferred" } } active: 1 findNetworkTimeout: 65000ms getMoreNetworkTimeout: 35000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 839538 -- target:192.168.26.122:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp(1592780789, 1) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, batchSize: 13981010, term: 25, readConcern: { afterClusterTime: Timestamp(1592780789, 1) } } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
2020-06-22T01:10:13.019+0200 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to 192.168.26.122:27017, took 33ms (2 connections now open to 192.168.26.122:27017)
2020-06-22T01:11:17.986+0200 I ASIO [NetworkInterfaceASIO-RS-0] Ending connection to host 192.168.26.122:27017 due to bad connection status; 1 connections to that host remain open
2020-06-22T01:11:17.986+0200 I REPL [replication-341] Error returned from oplog query (no more query restarts left): NetworkInterfaceExceededTimeLimit: error in fetcher batch callback: Operation timed out
2020-06-22T01:11:17.986+0200 I REPL [replication-341] Finished fetching oplog during initial sync: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback: Operation timed out. Last fetched optime and hash: { ts: Timestamp(1592780789, 1), t: 25 }[6723760701176776417]
2020-06-22T01:11:31.818+0200 I REPL [replication-341] CollectionCloner ns:chainanalytics.rawdata_ETH_byhash finished cloning with status: IllegalOperation: AsyncResultsMerger killed
2020-06-22T01:11:35.200+0200 W REPL [replication-341] collection clone for 'chainanalytics.rawdata_ETH_byhash' failed due to IllegalOperation: While cloning collection 'chainanalytics.rawdata_ETH_byhash' there was an error 'AsyncResultsMerger killed'
2020-06-22T01:11:35.200+0200 I REPL [replication-341] CollectionCloner::start called, on ns:chainanalytics.rawdata_ICC_0x027b6094ac3DA754FCcC7C088BE04Ca155782A66
2020-06-22T01:11:35.200+0200 W REPL [replication-341] database 'chainanalytics' (2 of 4) clone failed due to ShutdownInProgress: collection cloner completed
2020-06-22T01:11:35.200+0200 I REPL [replication-341] Finished cloning data: ShutdownInProgress: collection cloner completed. Beginning oplog replay.
2020-06-22T01:11:35.200+0200 I REPL [replication-341] Initial sync attempt finishing up.
[...]
2020-06-22T01:11:35.291+0200 E REPL [replication-341] Initial sync attempt failed -- attempts left: 4 cause: NetworkInterfaceExceededTimeLimit: error fetching oplog during initial sync :: caused by :: error in fetcher batch callback: Operation timed out
2020-06-22T01:11:35.332+0200 I NETWORK [thread12] Successfully connected to 192.168.26.122:27017 (212 connections now open to 192.168.26.122:27017 with a 0 second timeout)
2020-06-22T01:11:35.332+0200 I NETWORK [thread12] scoped connection to 192.168.26.122:27017 not being returned to the pool
2020-06-22T01:11:35.362+0200 I NETWORK [thread12] Starting new replica set monitor for rs0/192.168.10.145:27017,192.168.26.122:27017
2020-06-22T01:11:35.411+0200 I NETWORK [thread12] Successfully connected to 192.168.26.122:27017 (213 connections now open to 192.168.26.122:27017 with a 0 second timeout)
2020-06-22T01:11:35.411+0200 I NETWORK [thread12] scoped connection to 192.168.26.122:27017 not being returned to the pool
2020-06-22T01:11:35.411+0200 I NETWORK [thread12] Starting new replica set monitor for rs0/192.168.10.145:27017,192.168.26.122:27017
2020-06-22T01:11:35.460+0200 I NETWORK [thread12] Successfully connected to 192.168.26.122:27017 (214 connections now open to 192.168.26.122:27017 with a 0 second timeout)
2020-06-22T01:11:35.460+0200 I NETWORK [thread12] scoped connection to 192.168.26.122:27017 not being returned to the pool
2020-06-22T01:11:35.461+0200 I NETWORK [thread12] Starting new replica set monitor for rs0/192.168.10.145:27017,192.168.26.122:27017
2020-06-22T01:11:35.509+0200 I NETWORK [thread12] Successfully connected to 192.168.26.122:27017 (215 connections now open to 192.168.26.122:27017 with a 0 second timeout)
2020-06-22T01:11:35.509+0200 I NETWORK [thread12] scoped connection to 192.168.26.122:27017 not being returned to the pool
2020-06-22T01:11:35.533+0200 I NETWORK [thread12] Starting new replica set monitor for rs0/192.168.10.145:27017,192.168.26.122:27017
2020-06-22T01:11:35.583+0200 I NETWORK [thread12] Successfully connected to 192.168.26.122:27017 (216 connections now open to 192.168.26.122:27017 with a 0 second timeout)
2020-06-22T01:11:35.583+0200 I NETWORK [thread12] scoped connection to 192.168.26.122:27017 not being returned to the pool
2020-06-22T01:11:36.291+0200 I REPL [replication-342] Starting initial sync (attempt 7 of 10)
2020-06-22T01:11:36.293+0200 I STORAGE [replication-342] Finishing collection drop for local.temp_oplog_buffer (no UUID).
2020-06-22T01:11:36.366+0200 I STORAGE [replication-342] createCollection: local.temp_oplog_buffer with no UUID.
2020-06-22T01:11:36.438+0200 I REPL [replication-342] sync source candidate: 192.168.26.122:27017
2020-06-22T01:11:36.438+0200 I STORAGE [replication-342] dropAllDatabasesExceptLocal 3
2020-06-22T01:13:18.957+0200 I COMMAND [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after logicalSessionRecordCache: 0, after network: 0, after opLatencies: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 0, after security: 0, after storageEngine: 0, after tcmalloc: 0, after transactions: 0, after wiredTiger: 101957, at end: 101957 }
Any ideas?
If this kind of error occurs, is there a way to avoid losing all the data already copied and instead have the replication process resume from the point where it was interrupted?
Answer 1
MongoDB 4.4 introduced major improvements to replication and the initial sync process, including resumable initial sync: after a transient network error the syncing node retries against its sync source instead of being penalized with a full restart from scratch.
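As an illustration (not part of the original answer), and assuming both the syncing node and its sync source have been upgraded to 4.4 or later, the retry window for resumable initial sync is controlled by the server parameter initialSyncTransientErrorRetryPeriodSeconds (default 86400 seconds, i.e. 24 hours). A minimal mongo-shell sketch to inspect it:

```
// Sketch, assuming MongoDB 4.4+: read the retry window that resumable
// initial sync uses before giving up and restarting from scratch.
db.adminCommand({
  getParameter: 1,
  initialSyncTransientErrorRetryPeriodSeconds: 1
})

// The parameter is normally set at mongod startup, e.g.:
//   mongod --setParameter initialSyncTransientErrorRetryPeriodSeconds=86400 ...
```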