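I run Ubuntu 18.04 on my server machine. About once a week the filesystem reports invalid block checksums, so Ubuntu remounts it read-only; processes then fail and the machine has to be rebooted. During boot, the log shows "Invalid checksum recovering block" messages, and after I run fsck the filesystem recovers. For me this is a complete outage, because the website does not work until I step in and manually execute st.. The relevant stack: mongod for the database, and Python with forked multiprocessing for web scraping.
The error occurs while the Python web-scraping job is running, either shortly before it finishes or immediately after it finishes. The mongod log shows some connections ending; then mongod restarts on its own and cannot acquire port 27017, and because it needs to change some files, it fails. Sometimes the end of the connection entries looks fine: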
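To notice the read-only remount before the website goes down, a periodic check along these lines might help. It is only a minimal sketch: the function names and the `fstypes` filter are my own, and it assumes the affected filesystem is ext4 and that `/proc/mounts` lists `ro` as a mount option once the kernel has remounted it.

```python
def read_only_mounts(mounts_text, fstypes=("ext4",)):
    """Return mount points of the given filesystem types that are mounted read-only.

    mounts_text is the content of /proc/mounts, whose fields are:
    device, mount point, filesystem type, options, dump, pass.
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fstype, options = fields[1], fields[2], fields[3].split(",")
        # "errors=remount-ro" is a different option and does not match "ro"
        if fstype in fstypes and "ro" in options:
            result.append(mount_point)
    return result


def check_system(path="/proc/mounts"):
    """Read the live mount table and report read-only mounts (Linux only)."""
    with open(path) as f:
        return read_only_mounts(f.read())
```

Running `check_system()` from cron and sending an alert when the result is non-empty would at least shorten the time until someone steps in; it does nothing about the cause.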
2020-07-09T05:08:29.477+0200 I NETWORK [conn399] end connection 127.0.0.1:46798 (10 connections now open)
2020-07-09T05:08:29.477+0200 I NETWORK [conn398] end connection 127.0.0.1:46796 (9 connections now open)
2020-07-09T05:53:36.207+0200 I CONTROL [main] ***** SERVER RESTARTED *****
2020-07-09T05:53:36.244+0200 I CONTROL [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledPr$
2020-07-09T05:53:36.380+0200 I CONTROL [initandlisten] MongoDB starting : pid=963 port=27017 dbpath=/var/lib/mongodb 64-bit ho$
2020-07-09T05:53:36.380+0200 I CONTROL [initandlisten] db version v4.0.10
And sometimes it does not (note the ^@ characters):
2020-07-04T22:42:57.252+0200 I NETWORK [conn602] end connection 127.0.0.1:39808 (14 connections now open)
2020-07-04T22:42:57.252+0200 I NETWORK [conn601] end connection 127.0.0.1:398^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^$
2020-07-05T00:53:30.611+0200 I CONTROL [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'
2020-07-05T00:53:30.703+0200 I CONTROL [initandlisten] MongoDB starting : pid=930 port=27017 dbpath=/var/lib/mongodb 64-bit host=mojtiketServerComputerName
2020-07-05T00:53:30.703+0200 I CONTROL [initandlisten] db version v4.0.10
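The ^@ characters are how an editor displays NUL bytes (0x00). To find every place where this happened in a log, a scan like the following could be used. This is a sketch; the function name and the `min_len` threshold are my own choices, not part of any MongoDB tooling.

```python
import re


def find_nul_runs(data, min_len=4):
    """Return (offset, length) for each run of at least min_len NUL bytes in data."""
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


def scan_file(path, min_len=4):
    """Read a file in binary mode and report its NUL runs."""
    with open(path, "rb") as f:
        return find_nul_runs(f.read(), min_len)
```

For example, `scan_file("/var/log/mongodb/mongod.log")` would list the offsets of the zero-filled regions, assuming that is where the mongod log lives on the machine; the path is an assumption.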
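Another possible culprit, though a less likely one, is that Python spawns new processes with the fork method, and each process has its own file to which it writes the errors that occur.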
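For reference, the per-process logging pattern looks roughly like the sketch below. This is not my actual scraper code: the file naming, `log_error` and `run_workers` are hypothetical, and the explicit `os.fsync` is something one could add so that each entry reaches the disk before the worker continues. It would not cure filesystem corruption, only narrow the window in which a log entry can be lost.

```python
import multiprocessing
import os


def error_log_path(log_dir, pid=None):
    """Build the per-process log file name (hypothetical naming scheme)."""
    pid = os.getpid() if pid is None else pid
    return os.path.join(log_dir, "errors-%d.log" % pid)


def log_error(log_dir, message):
    """Append one message to this process's own log file and force it to disk."""
    path = error_log_path(log_dir)
    with open(path, "a", encoding="utf-8") as f:
        f.write(message.rstrip("\n") + "\n")
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to write the data to the device
    return path


def run_workers(log_dir, messages):
    """Fork one worker per message; each worker writes to its own file."""
    ctx = multiprocessing.get_context("fork")  # fork is not available on Windows
    procs = [ctx.Process(target=log_error, args=(log_dir, m)) for m in messages]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return sorted(os.listdir(log_dir))
```

Because every worker opens a file named after its own PID, no two processes share a file descriptor or write to the same file.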
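Once I ran fsck in a terminal while Ubuntu was in read-only mode, without rebooting the OS; mongod could not recover afterwards and I had to reinstall it. I have reported this bug to the MongoDB team. The lines below are the output of that run.
Can you help me with how to triage this problem, how to mitigate it, and so on?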
e2fsck 1.44.1 (24-Mar-2018)
/dev/sda2: recovering journal
JBD2: Invalid checksum recovering block 95420432 in log
JBD2: Invalid checksum recovering block 46 in log
JBD2: Invalid checksum recovering block 95420455 in log
JBD2: Invalid checksum recovering block 95420456 in log
JBD2: Invalid checksum recovering block 95945036 in log
JBD2: Invalid checksum recovering block 95420432 in log
myUserName@myComputerName:~$ sudo fsck -f /
fsck from util-linux 2.31.1
e2fsck 1.44.1 (24-Mar-2018)
Pass 1: Checking inodes, blocks, and sizes
Inode 23855152, end of extent exceeds allowed value
(logical block 83, physical block 95588851, len 5)
Clear<y>? yes
Inode 23855152, i_blocks is 688, should be 672. Fix<y>? yes
Inode 23855358, i_blocks is 1968, should be 1960. Fix<y>? yes
Inode 23855668, i_blocks is 5712, should be 5704. Fix<y>? yes
Inode 23865002, end of extent exceeds allowed value
(logical block 6, physical block 47764987, len 3)
Clear<y>? yes
Inodes that were part of a corrupted orphan linked list found. Fix<y>? yes
Inode 29097997 was part of the orphaned inode list. FIXED.
Deleted inode 29098009 has zero dtime. Fix<y>? yes
Pass 2: Checking directory structure
Entry 'job.cache.N' in /var/cache/cups (23855431) has deleted/unused inode 23855386. Clear<y>? yes
Entry 'metrics.interim' in /var/lib/mongodb/diagnostic.data (23865008) has deleted/unused inode 23855125. Clear<y>? yes
Pass 3: Checking directory connectivity
Unconnected directory inode 13238821 (/home/mtp/project/prod/webscraper/Logging/???)
Connect to /lost+found ('a' enables 'yes' to all) <y>? yes to all
Pass 4: Checking reference counts
Inode 12191776 ref count is 1305, should be 1306. Fix? yes
Unattached inode 13238738
Connect to /lost+found? yes
Inode 13238738 ref count is 2, should be 1. Fix? yes
Unattached inode 13238755
Connect to /lost+found? yes
Inode 13238755 ref count is 2, should be 1. Fix? yes
Unattached inode 13238788
Connect to /lost+found? yes
Inode 13238788 ref count is 2, should be 1. Fix? yes
Unattached inode 13238812
Connect to /lost+found? yes
Inode 13238812 ref count is 2, should be 1. Fix? yes
Inode 13238821 ref count is 5, should be 4. Fix? yes
Unattached inode 13238825
Connect to /lost+found? yes
Inode 13238825 ref count is 2, should be 1. Fix? yes
Unattached inode 13238826
Connect to /lost+found? yes
Inode 13238826 ref count is 2, should be 1. Fix? yes
Unattached inode 13238827
Connect to /lost+found? yes
Inode 13238827 ref count is 2, should be 1. Fix? yes
Unattached inode 13238828
Connect to /lost+found? yes
Inode 13238828 ref count is 2, should be 1. Fix? yes
Unattached inode 23855247
Connect to /lost+found? yes
Inode 23855247 ref count is 2, should be 1. Fix? yes
Inode 23855279 ref count is 1, should be 2. Fix? yes
Inode 23855334 ref count is 1, should be 2. Fix? yes
Pass 5: Checking group summary information
Block bitmap differences: -2060872 +(47755985--47755986) +(47755992--47756020) +47756082 -47756083 -47764445 -(47765295--47765296) -(47765515--47765528) -(47765575--47765589) +(52961514--52961515) -(52986152--52986154) +(75544092--75544097) -(95453700--95453703) -(95588851--95588855) -(95621149--95621161) +(95651808--95651825) +(95662080--95663552) -(95664128--95665599) -(95666112--95666143) -95713396 +(95775716--95775718)
Fix? yes
Free blocks count wrong for group #43 (17234, counted=17235).
Fix? yes
Free blocks count wrong for group #1457 (31366, counted=31367).
Fix? yes
Free blocks count wrong for group #1617 (32410, counted=32413).
Fix? yes
Free blocks count wrong for group #2305 (29062, counted=29056).
Fix? yes
Free blocks count wrong for group #2913 (14362, counted=14366).
Fix? yes
Free blocks count wrong for group #2917 (15501, counted=15506).
Fix? yes
Free blocks count wrong for group #2918 (11211, counted=11224).
Fix? yes
Free blocks count wrong for group #2920 (6686, counted=6687).
Fix? yes
Free blocks count wrong (112183187, counted=112183209).
Fix? yes
Inode bitmap differences: +1312455 -1312466 -23855125 +23855247 -23855386 -29097997 -29098009
Fix? yes
Free inodes count wrong for group #160 (5233, counted=5234).
Fix? yes
Free inodes count wrong for group #2912 (1772, counted=1774).
Fix? yes
Free inodes count wrong for group #3552 (8162, counted=8164).
Fix? yes
Free inodes count wrong (29020534, counted=29020538).
Fix? yes
/dev/sda2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda2: ***** REBOOT SYSTEM *****
/dev/sda2: 257670/29278208 files (0.2% non-contiguous), 4898135/117081344 blocks
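To see at a glance which directories were touched by the repair, the Pass 2 and Pass 3 lines can be pulled out of the e2fsck output. The sketch below matches only the two message shapes that appear in the output above; the function name is my own, and other e2fsck messages are ignored.

```python
import re

# "Entry 'name' in /some/dir (inode) has deleted/unused inode ..."
ENTRY_RE = re.compile(r"Entry '[^']*' in (\S+) \(\d+\)")
# "Unconnected directory inode N (/some/path)"
UNCONNECTED_RE = re.compile(r"Unconnected directory inode \d+ \(([^)]*)\)")


def affected_paths(fsck_output):
    """Return the directories named in the fsck output, in order, without duplicates."""
    seen = []
    for line in fsck_output.splitlines():
        match = ENTRY_RE.search(line) or UNCONNECTED_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen
```

Applied to the output above, this names the CUPS cache, mongod's diagnostic.data directory and the web scraper's Logging directory, which are all places that were being written to around the time of the failure.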
smartctl has finished without errors on multiple runs:
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.3.0-53-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: KINGSTON SA400S37480G
Serial Number: 50026B7682672C22
LU WWN Device Id: 5 0026b7 682672c22
Firmware Version: SBFKB1C2
User Capacity: 480,103,981,056 bytes [480 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 4
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Jul 10 05:29:42 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (65535) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 30) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x0032 000 100 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 1163
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 25
148 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
149 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
167 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
168 Unknown_Attribute 0x0012 100 100 000 Old_age Always - 0
169 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 14
170 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 10
172 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
173 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 131087
181 Program_Fail_Cnt_Total 0x0032 100 100 000 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0000 100 100 000 Old_age Offline - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0012 100 100 000 Old_age Always - 14
194 Temperature_Celsius 0x0022 069 066 000 Old_age Always - 31 (Min/Max 22/34)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
218 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 0
231 Temperature_Celsius 0x0000 001 001 000 Old_age Offline - 99
233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 586
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 108
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 36
244 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 2
245 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 15
246 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 38240
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 1050 -
# 2 Extended offline Completed without error 00% 1043 -
# 3 Short offline Completed without error 00% 1042 -
Selective Self-tests/Logging not supported
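Since a single smartctl run looks clean, it may be more telling to compare the raw counters between incidents (for example 192 Power-Off_Retract_Count or 199 UDMA_CRC_Error_Count). The sketch below parses the attribute table in the layout shown above; the function names are my own, and it assumes the column layout printed by `smartctl -A` on this machine.

```python
import re

# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value
ATTR_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(.*)$"
)


def parse_smart_attributes(text):
    """Map attribute ID to (name, raw value string) for each table row found."""
    attrs = {}
    for line in text.splitlines():
        m = ATTR_RE.match(line)
        if m:
            attrs[int(m.group(1))] = (m.group(2), m.group(9).strip())
    return attrs


def changed_attributes(before, after):
    """Return IDs whose raw value differs between two parsed snapshots."""
    return sorted(i for i in after if i in before and before[i][1] != after[i][1])
```

Saving the output of `smartctl -A /dev/sda` after each crash and passing two saved snapshots through `parse_smart_attributes` and `changed_attributes` would show which counters moved between incidents.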