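SMART reports one pending sector on my server's hard drive. I have read many articles recommending hdparm to "easily" force the disk to relocate the bad sector, but I cannot find the correct way to use it.
Some information from my smartctl output: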
Error 95 occurred at disk power-on lifetime: 20184 hours (841 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 d7 55 dd 02 Error: UNC at LBA = 0x02dd55d7 = 48059863
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 d6 55 dd e2 00 18d+05:13:42.421 READ DMA
27 00 00 00 00 00 e0 00 18d+05:13:42.392 READ NATIVE MAX ADDRESS EXT
ec 00 00 00 00 00 a0 02 18d+05:13:42.378 IDENTIFY DEVICE
ef 03 46 00 00 00 a0 02 18d+05:13:42.355 SET FEATURES [Set transfer mode]
27 00 00 00 00 00 e0 00 18d+05:13:42.327 READ NATIVE MAX ADDRESS EXT
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 20194 48059863
# 2 Short offline Completed without error 00% 15161 -
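Given that "bad LBA" (48059863), how do I use hdparm? What kind of address should the --read-sector and --write-sector parameters take?
If I issue the command hdparm --read-sector 48059863 /dev/sda, it reads and dumps data. If this command is correct, I should expect an I/O error, right?
Instead, it dumps data: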
$ ./hdparm --read-sector 48059863 /dev/sda
/dev/sda:
reading sector 48059863: succeeded
4b50 5d1b 7563 a932 618d 1f81 4514 2343
8a16 3342 5e36 2591 3b4e 762a 4dd7 037f
6a32 6996 816f 573f eee1 bc24 eed4 206e
(...)
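Answer 1
If for some reason you want to try to clear these bad sectors, and you do not care about the existing contents of the drive, the shell snippet below may help. I tested it on an old Seagate Barracuda drive that was long out of warranty. It may not work with other drive models or manufacturers, but it should give you a starting point if you have to script something. It will destroy everything on the drive.
You may prefer to just run badblocks, an hdparm Secure Erase (SE) (https://wiki.archlinux.org/index.php/Securely_wipe_disk), or some other tool designed for this purpose. Or even a manufacturer-provided tool such as SeaTools (there is a 32-bit Linux "enterprise" version; search for it).
Before doing this, make sure the drive in question is completely unused and unmounted. Also, I know: a while loop, no excuse. This is a hack, and you can do better...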
baddrive=/dev/sdb
badsect=1
while true; do
    echo "Testing from LBA $badsect"
    # start a selective self-test; discard both stdout and stderr
    smartctl -t select,${badsect}-max ${baddrive} > /dev/null 2>&1
    echo "Waiting for test to stop (each dot is 5 sec)"
    while [ "$(smartctl -l selective ${baddrive} | awk '/^ *1/{print substr($4,1,9)}')" != "Completed" ]; do
        echo -n .
        sleep 5
    done
    echo
    # LBA of the first read failure, or "-" if the test found none
    badsect=$(smartctl -l selective ${baddrive} | awk '/# 1 Selective offline Completed: read failure/ {print $10}')
    [ "$badsect" = "-" ] && exit 0
    echo "Attempting to fix sector $badsect on $baddrive"
    hdparm --repair-sector ${badsect} --yes-i-know-what-i-am-doing $baddrive
    echo "Continuing test"
done
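One advantage of the "self-test" approach is that the load is handled by the drive firmware, so the PC it is attached to is not bogged down the way it would be with dd or badblocks.
Note: sorry, I made a mistake. The correct while condition is:
while [ "$(smartctl -l selective ${baddrive} | awk '/^ *1/{print $4}')" = "Self_test_in_progress" ]; do
And the script's exit condition becomes:
[ "$badsect" = "-" ] || [ "$badsect" = "" ] && exit 0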
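The corrected loop condition keys off the fourth column of span 1 in the selective self-test log. Here is a minimal sketch of that parsing step, pulled into a helper so it can be checked against captured output. The names `selective_status` and `wait_selective` are made up for illustration, and the column layout (SPAN, MIN_LBA, MAX_LBA, CURRENT_TEST_STATUS) is assumed to match what your smartctl version prints.

```shell
# selective_status: read `smartctl -l selective` output on stdin and print
# the CURRENT_TEST_STATUS word of span 1 (hypothetical helper name).
selective_status() {
    awk '$1 == "1" && NF >= 4 { print $4; exit }'
}

# wait_selective DEVICE: print a dot every 5 seconds while span 1 is still
# being tested (hypothetical name; needs root and a real drive).
wait_selective() {
    while [ "$(smartctl -l selective "$1" | selective_status)" = "Self_test_in_progress" ]; do
        printf .
        sleep 5
    done
    echo
}
```

Usage would be `wait_selective /dev/sdb` right after starting the test. Comparing against the full status word avoids the substring match of the first version of the loop.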
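Answer 2
I think it may read without error because that sector is not actually bad, and other tools fail to read it because of some other behavior. (Read-ahead reaching a sector that really is unreadable?)
I have found bad sectors where, if I repair only the ones that are unreadable with "hdparm --read-sector", the other "bad" sectors suddenly stop being unreadable, for example with dd. Interestingly, when looking at the "dmesg" output, only the sectors that hdparm fails to read are reported.
For example, my sectors 36589320 to 36589327 and 36589344 to 36589351 could not be read with dd, but only 36589326 and 36589345 could not be read with hdparm --read-sector. I then used hdparm --write-sector on those two sectors, and all 16 sectors became readable again.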
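Narrowing a range down to the sectors that really fail could be scripted roughly as follows. This is only a sketch: `read_sector` and `unreadable_sectors` are hypothetical names, and it assumes that hdparm exits with a non-zero status when `--read-sector` hits an I/O error.

```shell
# read_sector DEVICE LBA: hypothetical hook that reads one sector.
# Assumption: hdparm exits non-zero when the read fails.
read_sector() {
    hdparm --read-sector "$2" "$1" > /dev/null 2>&1
}

# unreadable_sectors DEVICE FIRST LAST
# Print every LBA in [FIRST, LAST] whose single-sector read fails.
unreadable_sectors() {
    local dev=$1 lba=$2 last=$3
    while [ "$lba" -le "$last" ]; do
        read_sector "$dev" "$lba" || echo "$lba"
        lba=$((lba + 1))
    done
}
```

For the ranges above this would be called as `unreadable_sectors /dev/sda 36589320 36589351` (as root). Reading is non-destructive, so the loop is safe to run before deciding which sectors to overwrite.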
Here is a small part of the dmesg output:
[30152036.527940] end_request: I/O error, dev sda, sector 36589326
[30152077.363710] end_request: I/O error, dev sda, sector 36589345
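To collect the failing sector numbers from kernel messages in this format, something like the following could be used. The helper name `failed_sectors` is made up for illustration, and the pattern assumes the "I/O error, dev <name>, sector <n>" wording shown above; other kernel versions may word this message differently.

```shell
# failed_sectors DEVNAME: read dmesg text on stdin and print the unique
# sector numbers reported in I/O errors for that device (e.g. "sda").
failed_sectors() {
    sed -n "s|.*I/O error, dev $1, sector \([0-9][0-9]*\).*|\1|p" | sort -n -u
}
```

A typical call would be `dmesg | failed_sectors sda`.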
Disk information:
# smartctl -i /dev/sda
...
=== START OF INFORMATION SECTION ===
Device Model: TOSHIBA MK2002TSKB
...
Firmware Version: MT2A
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
...
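Apparently, this disk's firmware either does not log reallocated sectors correctly, or the sectors were never actually reallocated and were merely corrupted (for example an uncorrectable ECC error on a surface that is still good, as if caused by bit rot rather than by an electronics fault or media damage):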
# smartctl -A /dev/sda | egrep "Reallocated|Pending|Uncorrectable"
5 Reallocated_Sector_Ct 0x0033 100 100 050 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
# smartctl -l error /dev/sda
...
SMART Error Log Version: 1
No Errors Logged
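For scripted before/after comparisons, the raw counters shown above can be extracted by attribute name. A small sketch, assuming the usual ten-column `smartctl -A` table with RAW_VALUE in the last column; `smart_raw_value` is a hypothetical helper name.

```shell
# smart_raw_value NAME: read `smartctl -A` output on stdin and print the
# RAW_VALUE column (10th field) of the attribute called NAME.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}
```

For example: `smartctl -A /dev/sda | smart_raw_value Current_Pending_Sector`.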
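Note that I ran both --read-sector and --write-sector. The read may be needed for the sector to be reallocated properly, not just the write. If you do not read first, the drive may not know that the sector is bad.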
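The read-then-write order can be captured in a small wrapper. The sketch below is an illustration, not a tested repair tool: `rewrite_sector` and the `DRY_RUN` switch are hypothetical, and it again assumes hdparm exits non-zero on a failed read. Overwriting a sector destroys whatever data it held.

```shell
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

# rewrite_sector DEVICE LBA
# Read the sector first; only if that read fails, overwrite it with zeros.
# In dry-run mode both commands are printed, since no read is attempted.
rewrite_sector() {
    local dev=$1 lba=$2
    if [ "$DRY_RUN" = 1 ]; then
        echo "hdparm --read-sector $lba $dev"
        echo "hdparm --write-sector $lba --yes-i-know-what-i-am-doing $dev"
        return 0
    fi
    if hdparm --read-sector "$lba" "$dev" > /dev/null 2>&1; then
        echo "sector $lba is readable, leaving it alone"
    else
        hdparm --write-sector "$lba" --yes-i-know-what-i-am-doing "$dev"
    fi
}
```

Running `rewrite_sector /dev/sda 36589326` with the default setting only shows what would be executed; setting `DRY_RUN=0` performs the read and, if it fails, the write.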
Answer 3
Building on @Glenn's answer, you will find a script for fixing the errors at
http://wiki.bitplan.com/index.php/Bad_Block_Howto
As of 2020-09-10, the script reads as follows:
#!/bin/bash
# see http://wiki.bitplan.com/index.php/Bad_Block_Howto
# see https://github.com/hradec/fix_smart_last_bad_sector/blob/master/fix_smart_last_bad_sector.sh
# see https://www.thomas-krenn.com/de/wiki/Analyse_einer_fehlerhaften_Festplatte_mit_smartctl
# WF 2020-10-04
disk=/dev/sdb
mode=short
# verbose
verbose=false
# should commands only be shown?
dry=false
# should write fixes be performed?
fix=false
# range of sectors to modify after bad sector
range=8
# set to sudo if sudo is needed
sudo=sudo
# serial number
serial="-?-"
#ansi colors
#http://www.csc.uvic.ca/~sae/seng265/fall04/tips/s265s047-tips/bash-using-colors.html
blue='\033[0;34m'
red='\033[0;31m'
green='\033[0;32m' # '\e[1;32m' is too bright for white bg.
endColor='\033[0m'
#
# a colored message
# params:
# 1: l_color - the color of the message
# 2: l_msg - the message to display
#
color_msg() {
local l_color="$1"
local l_msg="$2"
echo -e "${l_color}$l_msg${endColor}"
}
#
# error
#
# show an error message and exit
#
# params:
# 1: l_msg - the message to display
error() {
local l_msg="$1"
# use ansi red for error
color_msg $red "Error: $l_msg" 1>&2
exit 1
}
#
# show the usage
#
usage() {
echo "usage: $0 [disk]"
echo " [-c|--check]"
echo " [-d|--dry]"
echo " [-h|--help]"
echo " [-i|--info]"
echo " [[-m|--mode] mode]"
echo " [[-r|--range] range]"
echo " [[-s|--serial] [serial]]"
echo " [-t|--test]"
echo " [[-w|--wait] type]"
echo " [-v|--verbose]"
echo
echo " -h|--help: show this usage"
echo " -c|--check: check the disk"
echo " -d|--dry: dry run - show commands only"
echo " -i|--info: show info about the given disk"
echo " -m|--mode: set mode: default=short"
echo " -r|--range: range of sectors to modify after bad sector"
echo " -s|--serial: get serial number or confirm serial number"
echo " -t|--test: run test for the given type e.g. selective selftest"
echo " -w|--wait: wait for the result of the given test type e.g. selective selftest"
echo " -v|--verbose: set verbose mode"
echo ""
echo "example:"
echo " $0 /dev/sdb -i"
echo ""
echo "for any write operation you need to confirm the serial number"
echo "to get serial number: "
echo " $0 disk -s "
exit 1
}
#
# get a number range from 0 to the given n-1
#
# params
# 1: n
function getRange() {
local l_n="$1"
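# print() with join works with both Python 2 and Python 3;
# a local variable avoids overwriting the global "range" setting
local l_range
l_range=$(python -c "print(' '.join(str(i) for i in range($l_n)))")
echo $l_range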
}
#
# read the result of the smartctl test for the given disk
#
# params
# 1: l_disk: the disk under test e.g. /dev/sdb
# 2: l_type: the type of the test e.g. selective
function readResult() {
local l_disk="$1"
local l_type="$2"
$sudo smartctl -l $l_type $l_disk | egrep "^#?[[:space:]]*[0-9]"
}
#
# show the Result
#
function showResult() {
local l_logline="$1"
local l_logstatus="$2"
if [ "$verbose" == "true" ]
then
echo $l_logstatus:$l_logline
else
echo $l_logline | gawk '
/#/ {
print $0; exit
}
{
status=substr($4,1,9)
progress=$5;
gsub("\\[","",progress);
range=$7
printf("\r%s",progress);
}'
fi
}
#
# wait for the result of a running selftest
#
# param 1: l_disk: the disk under test e.g. /dev/sdb
# param 2: l_type: the type of the test e.g. selective
# param 3: l_wait: number of seconds to wait
#
function waitForResult() {
# example
#=== START OF READ SMART DATA SECTION ===
#SMART Selective self-test log data structure revision number 1
#SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
# 1 7814037167 Self_test_in_progress [90% left] (2564632-2630167)
local l_disk="$1"
local l_type="$2"
local l_wait="$3"
local l_logline=""
local l_logstatus=""
color_msg $blue "Waiting for $l_type test of $l_disk to stop (each dot is $l_wait sec)"
while [ "$l_logstatus" != "Completed" ]; do
l_logline=$(readResult "$l_disk" "$l_type" | egrep "^#?[[:space:]]*1")
l_logstatus=$(echo $l_logline | gawk ' /Completed/ { print "Completed"; }')
showResult "$l_logline" "$l_logstatus"
sleep $l_wait
done
}
#
# get the serial number of the device
#
function getSerialNumber() {
local l_disk="$1"
serial=$($sudo smartctl -i $l_disk | grep "Serial Number" | cut -f 2 -d':')
echo $serial
}
#
# get the blocksize of the given file system
#
function getBlockSize() {
local l_fs="$1"
blocksize=$($sudo tune2fs -l $l_fs | grep "Block size:" | cut -f2 -d':')
echo $blocksize
}
#
# get the partition for the given disk
#
function getPartition() {
local l_disk="$1"
fs=$(mount | grep $l_disk | cut -f1 -d' ')
echo $fs
}
#
# get the start sector for the given disk
#
function getStartSector() {
local l_disk="$1"
local l_fs="$2"
startsector=$($sudo fdisk -l $l_disk | grep $l_fs | cut -f4 -d' ')
echo $startsector
}
#
# get Info about the given disk
#
function getInfo() {
local l_disk="$1"
$sudo smartctl -i $l_disk | egrep "(Model|Serial|Rotation|Sector|Capacity)"
$sudo hdparm -I $l_disk | egrep "(Serial Number|Model)"
fs=$(getPartition $l_disk)
if [ "$fs" != "" ]
then
color_msg $blue "Partition: $fs"
blocksize=$(getBlockSize $fs)
color_msg $blue "Blocksize: $blocksize"
else
color_msg $red "couldn't find mounted partition for $l_disk"
fi
}
#
# get the current pending sector for the given disk
#
function getCurrentPendingSector() {
local l_disk="$1"
# if msg is empty don't show message but only return the current pending sector count
local l_msg="$2"
psector=0
# run grep last so that $? below reflects whether the attribute was found
psectorline=$($sudo smartctl -A $l_disk | grep Current_Pending_Sector)
if [ $? -eq 0 ]
then
if [ "$l_msg" != "" ]; then color_msg $green "$psectorline"; fi
psector=$(echo $psectorline | cut -f 10 -d ' ')
if [ $psector -gt 0 ]
then
if [ "$l_msg" != "" ]; then color_msg $red "Current_Pending_Sector is not zero but $psector"; fi
else
if [ "$l_msg" != "" ]; then color_msg $green "Current_Pending_Sector is zero!"; fi
fi
else
if [ "$l_msg" != "" ]; then color_msg $red "smartctl -A did not output Current_Pending_Sector"; fi
psector=-1
fi
if [ "$l_msg" == "" ]; then echo $psector; fi
}
#
# fix the given bad sector on the given disk with the given range of sectors to fix
#
# param 1: disk e.g. /dev/sdb1
# param 2: defect sector to repair
# param 3: range - range of sectors to repair e.g. 8
#
fixBad() {
local l_disk="$1"
local l_sector="$2"
local l_range="$3"
color_msg $blue "repairing sector $l_sector to $l_sector+$l_range on $l_disk ..."
r=$(getRange $l_range)
for i in $r ; do
let b1=$l_sector+$i
if [ "$dry" == "true" ]
then
echo hdparm --repair-sector $b1 --yes-i-know-what-i-am-doing $l_disk
else
$sudo hdparm --repair-sector $b1 --yes-i-know-what-i-am-doing $l_disk >> /tmp/smart_repaired.log
fi
done
#tail -n 60 /tmp/smart_repaired.log | grep writing | tail -n 20
#grep '#' /tmp/smart | head -5
#hdparm -I $disk > /tmp/hdparm
}
#
# check the needed software
#
checkSoftware() {
for sw in gawk debugfs fdisk hdparm smartctl tune2fs python $sudo
do
bin=$(which $sw)
if [ $? -eq 0 ]
then
if [ "$verbose" == "true" ]
then
color_msg $green "will use $bin as $sw"
fi
else
error "$0 needs $sw to work please install it"
fi
done
}
#
# run a test for the given disk in the given mode
#
# params
# 1: l_disk: the disk under test e.g. /dev/sdb
# 2: l_mode: the mode of the self test e.g. short/long
function runTest() {
local l_disk="$1"
local l_mode="$2"
color_msg $blue "running $l_mode smartctl test for $l_disk ..."
$sudo smartctl -t $l_mode $l_disk > /dev/null
}
#
# check the given disk in the given mode
#
function checkDisk() {
local l_disk="$1"
local l_mode="$2"
local l_serial="$3"
fs=$(getPartition $l_disk)
blocksize=$(getBlockSize $fs)
startsector=$(getStartSector $l_disk $fs)
color_msg $blue "checking Current_Pending_Sector count for $l_disk partition $fs blocksize $blocksize startsector $startsector"
getCurrentPendingSector "$l_disk" show
psector=$(getCurrentPendingSector "$l_disk")
if [ $psector -gt 0 ]
then
runTest $l_disk $l_mode
fi
}
#
# check the lba block
#
function lbaCheck() {
local l_disk="$1"
fs=$(getPartition $l_disk)
blocksize=$(getBlockSize $fs)
startsector=$(getStartSector $l_disk $fs)
diskserial=$(getSerialNumber $l_disk)
readResult "$l_disk" selftest | while read line
do
echo $line | grep "read failure" > /dev/null
if [ $? -eq 0 ]
then
if [ "$verbose" == "true" ]
then
echo $line
fi
index=$(echo $line | cut -f2 -d' ')
state=$(echo $line | cut -f3-4 -d ' ')
progress=$(echo $line | cut -f8 -d ' ')
lba=$(echo $line | cut -f10 -d ' ')
if [ "$lba" == "" ]
then
lba=0
fi
if [ "$lba" -gt 0 ]
then
echo $index $state
echo "progress: $progress"
echo "lba: $lba"
# calculate the file system block
fsb=$(gawk -v L=$lba -v S=$startsector -v B=$blocksize 'BEGIN {printf ("%.0f",((L-S)*512/B))}')
echo "file system block: $fsb"
if [ "$fix" == "true" ]
then
if [ "$serial" != "$diskserial" ]
then
color_msg $red "you need to provide the serial number of $l_disk to perform fix operations"
else
fixBad $l_disk $lba $range
fi
fi
fi
fi
done
}
#
# try Fixing bad sectors
#
function tryFix() {
local l_disk="$1"
badsect=$($sudo smartctl -l selective "$l_disk" | gawk '/# 1 Selective offline Completed: read failure/ {print $10}')
# stop when no failing LBA is reported (empty result or "-")
[ -z "$badsect" ] || [ "$badsect" = "-" ] && exit 0
echo Attempting to fix sector $badsect on $l_disk
# only prints the repair command; it is not executed here
echo hdparm --repair-sector ${badsect} --yes-i-know-what-i-am-doing $l_disk
}
#
# start a check loop on the given drive
#
function checkLoop() {
local baddrive="$1"
badsect=1
while true; do
color_msg $blue "Testing $baddrive from LBA $badsect"
$sudo smartctl -t select,${badsect}-max ${baddrive} > /dev/null 2>&1
waitForResult $baddrive selective 5
tryFix $baddrive
color_msg $blue "running next test"
done
}
# make sure the needed software is available
checkSoftware
# commandline option
while [ "$1" != "" ]
do
option=$1
shift
case $option in
-h|--help)
usage
;;
-i|--info)
getInfo $disk
;;
-m|--mode)
if [ $# -lt 1 ]
then
usage
else
mode=$1
shift
fi
;;
-c|--check)
checkDisk $disk $mode $serial
;;
-d|--dry)
dry=true
;;
-l|--loop)
checkLoop $disk
;;
-f|--fix)
fix=true
;;
-r|--range)
if [ $# -lt 1 ]
then
usage
else
range=$1
shift
fi
;;
-s|--serial)
if [ $# -lt 1 ]
then
getSerialNumber $disk
exit 1
else
serial=$1
shift
fi
;;
-t|--test)
runTest $disk $mode
;;
-v|--verbose)
verbose=true
;;
-w|--wait)
if [ $# -lt 1 ]
then
usage
else
type=$1
shift
waitForResult $disk $type 5
fi
;;
-x)
lbaCheck $disk $serial;;
*)
disk=$option
;;
esac
done
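Personally, I was not able to get any meaningful result with this toolkit, that is, to actually "fix" a disk. The script and its parts have nevertheless been helpful for analyzing the problem and attempting repairs. Use the script with caution if you are hoping for "full automation". You may well lose data instead of repairing it.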
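The lbaCheck step of the script converts a failing LBA into a file-system block as (LBA - start sector) * 512 / block size. A stand-alone sketch of that arithmetic follows. The name `lba_to_fs_block` is hypothetical, the partition start (2048) and block size (4096) in the usage line are made-up values, and 512-byte logical sectors are assumed. Unlike the script's printf "%.0f", which rounds to the nearest integer, this version truncates, which seems the more natural choice when asking which block contains a given sector.

```shell
# lba_to_fs_block LBA START_SECTOR BLOCK_SIZE
# Map an absolute 512-byte LBA to a block number inside the file system
# that starts at START_SECTOR. Integer division rounds down.
lba_to_fs_block() {
    echo $(( ($1 - $2) * 512 / $3 ))
}
```

With the LBA from the question, `lba_to_fs_block 48059863 2048 4096` prints 6007226.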