Ubuntu reports invalid block checksums about once a week


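I run Ubuntu 18.04 on my server machine. About once a week a block checksum turns out to be invalid, Ubuntu switches to read-only mode, processes then fail, and Ubuntu has to be restarted. During boot, the log shows "Invalid checksum recovering block" messages, and the system recovers after I run the fsck command. For me this is a complete outage, because the website does not work until I step in and manually do st.. The relevant stack I use: mongod for the database, and Python with forked multiprocessing for web scraping.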

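The error occurs while the Python web-scraping job is running, either shortly before it finishes or immediately after it finishes. The mongod log shows some connections ending; mongod then restarts by itself and fails to acquire port 27017 because it needs to change some files, so it fails. Sometimes these end-of-connection entries look fine: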

2020-07-09T05:08:29.477+0200 I NETWORK  [conn399] end connection 127.0.0.1:46798 (10 connections now open)
2020-07-09T05:08:29.477+0200 I NETWORK  [conn398] end connection 127.0.0.1:46796 (9 connections now open)
2020-07-09T05:53:36.207+0200 I CONTROL  [main] ***** SERVER RESTARTED *****
2020-07-09T05:53:36.244+0200 I CONTROL  [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledPr$
2020-07-09T05:53:36.380+0200 I CONTROL  [initandlisten] MongoDB starting : pid=963 port=27017 dbpath=/var/lib/mongodb 64-bit ho$
2020-07-09T05:53:36.380+0200 I CONTROL  [initandlisten] db version v4.0.10

Sometimes they do not (note the ^@ characters):

2020-07-04T22:42:57.252+0200 I NETWORK  [conn602] end connection 127.0.0.1:39808 (14 connections now open)
2020-07-04T22:42:57.252+0200 I NETWORK  [conn601] end connection 127.0.0.1:398^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^$
2020-07-05T00:53:30.611+0200 I CONTROL  [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'
2020-07-05T00:53:30.703+0200 I CONTROL  [initandlisten] MongoDB starting : pid=930 port=27017 dbpath=/var/lib/mongodb 64-bit host=mojtiketServerComputerName
2020-07-05T00:53:30.703+0200 I CONTROL  [initandlisten] db version v4.0.10
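The `^@` sequences are how many pagers and editors display NUL (zero) bytes. A minimal sketch for locating such runs in a log file, assuming the file is small enough to read into memory; the function names and the 8-byte threshold are illustrative choices and not part of any MongoDB tooling:

```python
import re
import sys
from pathlib import Path

# Runs of 8 or more NUL bytes; the threshold is an arbitrary choice for this sketch.
NUL_RUN = re.compile(rb"\x00{8,}")


def find_nul_runs(data: bytes):
    """Return a list of (offset, length) pairs, one per run of NUL bytes."""
    return [(m.start(), m.end() - m.start()) for m in NUL_RUN.finditer(data)]


def scan_log(path):
    """Read a log file as raw bytes and report the NUL runs found in it."""
    return find_nul_runs(Path(path).read_bytes())


if __name__ == "__main__":
    # Log paths are passed on the command line; no default location is assumed.
    for log_path in sys.argv[1:]:
        for offset, length in scan_log(log_path):
            print(f"{log_path}: {length} NUL bytes at offset {offset}")
```

Comparing the offsets with the timestamps of the surrounding log lines may help narrow down when the damaged write happened.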

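Another possible culprit (though less likely) is that Python spawns new processes with the fork method, and each process writes the errors it encounters to its own file.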
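For reference, the setup described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual scraper: the `worker` and `run_jobs` names, the file naming scheme, and the simulated failure are all hypothetical. Each forked process appends to its own file and calls `os.fsync` before exiting, which asks the kernel to push the data to the device; it does not by itself address the filesystem corruption.

```python
import multiprocessing as mp
import os
import tempfile


def worker(log_dir, job_id):
    """Hypothetical scrape job that records errors in a file owned by this process."""
    path = os.path.join(log_dir, f"worker-{job_id}-{os.getpid()}.log")
    with open(path, "a", encoding="utf-8") as log:
        try:
            if job_id % 2:
                # Stand-in for a real scraping failure.
                raise ValueError(f"job {job_id} failed")
        except ValueError as exc:
            log.write(f"{exc}\n")
        # Flush Python's buffer, then ask the kernel to write the data out.
        log.flush()
        os.fsync(log.fileno())


def run_jobs(log_dir, job_ids):
    """Start one forked process per job, wait for all, and list the log files."""
    ctx = mp.get_context("fork")  # the fork start method is only available on Unix
    procs = [ctx.Process(target=worker, args=(log_dir, j)) for j in job_ids]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return sorted(os.listdir(log_dir))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_jobs(tmp, [1, 2, 3]))
```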

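Once I ran fsck in a terminal while Ubuntu was in read-only mode, without rebooting the OS; mongod could not recover and had to be reinstalled. I have reported that error to the MongoDB team. The lines below are the output of that run.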

Can you help me with how to triage this problem, how to mitigate it, and so on?

e2fsck 1.44.1 (24-Mar-2018)
/dev/sda2: recovering journal
JBD2: Invalid checksum recovering block 95420432 in log
JBD2: Invalid checksum recovering block 46 in log
JBD2: Invalid checksum recovering block 95420455 in log
JBD2: Invalid checksum recovering block 95420456 in log
JBD2: Invalid checksum recovering block 95945036 in log
JBD2: Invalid checksum recovering block 95420432 in log

myUserName@myComputerName:~$ sudo fsck -f /
fsck from util-linux 2.31.1
e2fsck 1.44.1 (24-Mar-2018)
Pass 1: Checking inodes, blocks, and sizes
Inode 23855152, end of extent exceeds allowed value
        (logical block 83, physical block 95588851, len 5)
Clear<y>? yes
Inode 23855152, i_blocks is 688, should be 672.  Fix<y>? yes
Inode 23855358, i_blocks is 1968, should be 1960.  Fix<y>? yes
Inode 23855668, i_blocks is 5712, should be 5704.  Fix<y>? yes
Inode 23865002, end of extent exceeds allowed value
        (logical block 6, physical block 47764987, len 3)
Clear<y>? yes
Inodes that were part of a corrupted orphan linked list found.  Fix<y>? yes
Inode 29097997 was part of the orphaned inode list.  FIXED.
Deleted inode 29098009 has zero dtime.  Fix<y>? yes
Pass 2: Checking directory structure
Entry 'job.cache.N' in /var/cache/cups (23855431) has deleted/unused inode 23855386.  Clear<y>? yes
Entry 'metrics.interim' in /var/lib/mongodb/diagnostic.data (23865008) has deleted/unused inode 23855125.  Clear<y>? yes
Pass 3: Checking directory connectivity
Unconnected directory inode 13238821 (/home/mtp/project/prod/webscraper/Logging/???)
Connect to /lost+found ('a' enables 'yes' to all) <y>? yes to all
Pass 4: Checking reference counts
Inode 12191776 ref count is 1305, should be 1306.  Fix? yes

Unattached inode 13238738
Connect to /lost+found? yes

Inode 13238738 ref count is 2, should be 1.  Fix? yes

Unattached inode 13238755
Connect to /lost+found? yes

Inode 13238755 ref count is 2, should be 1.  Fix? yes

Unattached inode 13238788
Connect to /lost+found? yes

Inode 13238788 ref count is 2, should be 1.  Fix? yes

Unattached inode 13238812
Connect to /lost+found? yes

Inode 13238812 ref count is 2, should be 1.  Fix? yes

Inode 13238821 ref count is 5, should be 4.  Fix? yes

Unattached inode 13238825
Connect to /lost+found? yes

Inode 13238825 ref count is 2, should be 1.  Fix? yes

Unattached inode 13238826
Connect to /lost+found? yes

Inode 13238826 ref count is 2, should be 1.  Fix? yes

Unattached inode 13238827
Connect to /lost+found? yes

Inode 13238827 ref count is 2, should be 1.  Fix? yes

Unattached inode 13238828
Connect to /lost+found? yes

Inode 13238828 ref count is 2, should be 1.  Fix? yes

Unattached inode 23855247
Connect to /lost+found? yes

Inode 23855247 ref count is 2, should be 1.  Fix? yes

Inode 23855279 ref count is 1, should be 2.  Fix? yes

Inode 23855334 ref count is 1, should be 2.  Fix? yes

Pass 5: Checking group summary information
Block bitmap differences:  -2060872 +(47755985--47755986) +(47755992--47756020) +47756082 -47756083 -47764445 -(47765295--47765296) -(47765515--47765528) -(47765575--47765589) +(52961514--52961515) -(52986152--52986154) +(75544092--75544097) -(95453700--95453703) -(95588851--95588855) -(95621149--95621161) +(95651808--95651825) +(95662080--95663552) -(95664128--95665599) -(95666112--95666143) -95713396 +(95775716--95775718)
Fix? yes

Free blocks count wrong for group #43 (17234, counted=17235).
Fix? yes

Free blocks count wrong for group #1457 (31366, counted=31367).
Fix? yes

Free blocks count wrong for group #1617 (32410, counted=32413).
Fix? yes

Free blocks count wrong for group #2305 (29062, counted=29056).
Fix? yes

Free blocks count wrong for group #2913 (14362, counted=14366).
Fix? yes

Free blocks count wrong for group #2917 (15501, counted=15506).
Fix? yes

Free blocks count wrong for group #2918 (11211, counted=11224).
Fix? yes

Free blocks count wrong for group #2920 (6686, counted=6687).
Fix? yes

Free blocks count wrong (112183187, counted=112183209).
Fix? yes

Inode bitmap differences:  +1312455 -1312466 -23855125 +23855247 -23855386 -29097997 -29098009
Fix? yes

Free inodes count wrong for group #160 (5233, counted=5234).
Fix? yes

Free inodes count wrong for group #2912 (1772, counted=1774).
Fix? yes

Free inodes count wrong for group #3552 (8162, counted=8164).
Fix? yes

Free inodes count wrong (29020534, counted=29020538).
Fix? yes


/dev/sda2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda2: ***** REBOOT SYSTEM *****
/dev/sda2: 257670/29278208 files (0.2% non-contiguous), 4898135/117081344 blocks
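Since the outage is only noticed once the root filesystem has already been remounted read-only, one possible mitigation is to detect that state early, for example from a periodic job that sends an alert. A minimal sketch, assuming a Linux system where `/proc/mounts` is available; the function names are illustrative:

```python
def readonly_mounts(mounts_text):
    """Return the mount points whose option list contains the 'ro' flag.

    Each line of /proc/mounts has the fields:
    device, mount point, filesystem type, options, dump, pass.
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3]
        # Compare whole options so that 'errors=remount-ro' is not mistaken for 'ro'.
        if "ro" in options.split(","):
            result.append(mount_point)
    return result


def root_is_readonly(mounts_path="/proc/mounts"):
    """Report whether '/' is currently mounted read-only."""
    with open(mounts_path, encoding="utf-8") as f:
        return "/" in readonly_mounts(f.read())
```

A check like this only shortens the time to detection; it does not explain why the journal checksums become invalid.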

Multiple smartctl runs finished without errors:

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.3.0-53-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     KINGSTON SA400S37480G
Serial Number:    50026B7682672C22
LU WWN Device Id: 5 0026b7 682672c22
Firmware Version: SBFKB1C2
User Capacity:    480,103,981,056 bytes [480 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Jul 10 05:29:42 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (65535) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  30) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   000   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       1163
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       25
148 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
149 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
167 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
168 Unknown_Attribute       0x0012   100   100   000    Old_age   Always       -       0
169 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       14
170 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       10
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
173 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       131087
181 Program_Fail_Cnt_Total  0x0032   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0000   100   100   000    Old_age   Offline      -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       14
194 Temperature_Celsius     0x0022   069   066   000    Old_age   Always       -       31 (Min/Max 22/34)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
218 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
231 Temperature_Celsius     0x0000   001   001   000    Old_age   Offline      -       99
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       586
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       108
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       36
244 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       2
245 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       15
246 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       38240

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1050         -
# 2  Extended offline    Completed without error       00%      1043         -
# 3  Short offline       Completed without error       00%      1042         -

Selective Self-tests/Logging not supported
