How do I use hdparm to fix a pending sector?

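SMART reports one pending sector on my server's hard drive. I have read many articles that recommend hdparm to "easily" force the disk to relocate the bad sector, but I cannot find the correct way to use it.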

Some output from my smartctl:

Error 95 occurred at disk power-on lifetime: 20184 hours (841 days + 0 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 d7 55 dd 02  Error: UNC at LBA = 0x02dd55d7 = 48059863

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 d6 55 dd e2 00  18d+05:13:42.421  READ DMA
  27 00 00 00 00 00 e0 00  18d+05:13:42.392  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 02  18d+05:13:42.378  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 02  18d+05:13:42.355  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00  18d+05:13:42.327  READ NATIVE MAX ADDRESS EXT

 SMART Self-test log structure revision number 1
 Num  Test_Description    Status                  Remaining  LifeTime(hours)        LBA_of_first_error
 # 1  Extended offline    Completed: read failure       90%     20194         48059863
 # 2  Short offline       Completed without error       00%     15161         -

Given that "bad LBA" (48059863), how do I use hdparm? What kind of address should the --read-sector and --write-sector parameters take?

If I issue the command hdparm --read-sector 48059863 /dev/sda, it reads and dumps data. If that command were correct, I should expect an I/O error, right?

Instead, it dumps data:

$ ./hdparm --read-sector 48059863 /dev/sda

/dev/sda:
reading sector 48059863: succeeded
4b50 5d1b 7563 a932 618d 1f81 4514 2343
8a16 3342 5e36 2591 3b4e 762a 4dd7 037f
6a32 6996 816f 573f eee1 bc24 eed4 206e
(...)
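
For reference, the decimal LBA that smartctl prints can be cross-checked against the register dump in the error log. The sketch below assumes the drive logged the error in 28-bit LBA mode, where SN, CL and CH carry LBA bits 0 to 23 and the low nibble of DH carries bits 24 to 27. `lba_from_registers` is a hypothetical helper name, not part of smartctl or hdparm. As far as I know, --read-sector and --write-sector take this plain decimal sector number.

```shell
#!/bin/sh
# Hypothetical helper: rebuild a 28-bit LBA from the SN, CL, CH and DH
# register bytes (hex, without 0x) shown in the smartctl error log.
lba_from_registers() {
  sn=$1; cl=$2; ch=$3; dh=$4
  # The high nibble of DH holds mode/device flags, so mask it off.
  echo $(( ((0x$dh & 0x0f) << 24) | (0x$ch << 16) | (0x$cl << 8) | 0x$sn ))
}

# Registers after the failed command: SN=d7 CL=55 CH=dd DH=02
lba_from_registers d7 55 dd 02    # prints 48059863

# Registers of the READ DMA command itself: SN=d6 CL=55 CH=dd DH=e2
lba_from_registers d6 55 dd e2    # prints 48059862
```

The READ DMA request started at LBA 48059862 with a sector count of 8, so the failing LBA 48059863 is the second sector of that request.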

Answer 1

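If, for whatever reason, you want to try clearing these bad sectors and you do not care about the drive's existing contents, the shell snippet below may help. I tested it on an old Seagate Barracuda drive that was long out of warranty. It may not work with other drive models or manufacturers, but it should give you a starting point if you have to script something. It WILL destroy everything on the drive.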

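You may prefer to just run badblocks, the hdparm Secure Erase (SE) feature (https://wiki.archlinux.org/index.php/Securely_wipe_disk), or another tool designed for this purpose. Or even a manufacturer-provided tool such as SeaTools (there is a 32-bit Linux "enterprise" version; search for it).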

Before doing this, make sure the drive in question is completely unused and unmounted. Also, yes, I know: while loops, no excuse. It is a hack, and you can do better...

baddrive=/dev/sdb
badsect=1
while true; do
  echo "Testing from LBA $badsect"
  # start a selective self-test; discard both stdout and stderr
  smartctl -t select,"${badsect}"-max "${baddrive}" > /dev/null 2>&1

  echo "Waiting for test to stop (each dot is 5 sec)"
  # poll the status of span 1 (see the corrected loop and exit conditions in the note below)
  while [ "$(smartctl -l selective "${baddrive}" | awk '/^ *1/{print substr($4,1,9)}')" != "Completed" ]; do
    echo -n .
    sleep 5
  done
  echo

  # take field 10 (the failing LBA) from the newest "Selective offline" log entry
  badsect=$(smartctl -l selective "${baddrive}" | awk '/# 1  Selective offline   Completed: read failure/ {print $10}')
  [ "$badsect" = "-" ] && exit 0

  echo "Attempting to fix sector $badsect on $baddrive"
  hdparm --repair-sector "${badsect}" --yes-i-know-what-i-am-doing "$baddrive"
  echo "Continuing test"
done

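One advantage of the "self-test" approach is that the load is handled by the drive firmware, so the PC it is attached to is not bogged down the way it would be with dd or badblocks.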

Note: sorry, I made a mistake. The correct while condition is:

while [ "$(smartctl -l selective ${baddrive} | awk '/^ *1/{print $4}')" = "Self_test_in_progress" ]; do

and the script's exit condition becomes:

[ "$badsect" = "-" ] || [ "$badsect" = "" ] && exit 0
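
Put together, the corrected polling condition could be wrapped as in the sketch below. It is only an illustration, not the script I tested: `span1_status` and `wait_for_selective` are hypothetical helper names, the status is assumed to be the fourth column of the span 1 line, and the poll limit is an addition so the loop cannot spin forever.

```shell
#!/bin/sh
# Print the status column of span 1 from `smartctl -l selective` output on stdin.
span1_status() {
  awk '$1 == "1" && NF >= 4 { print $4; exit }'
}

# Poll until span 1 is no longer in progress, or give up after max_polls polls.
# Returns 0 when the test has stopped, 1 on timeout.
wait_for_selective() {
  drive=$1
  max_polls=${2:-720}   # assumption: 720 polls x 5 s = 1 hour
  polls=0
  while [ "$(smartctl -l selective "$drive" | span1_status)" = "Self_test_in_progress" ]; do
    polls=$((polls + 1))
    [ "$polls" -ge "$max_polls" ] && return 1
    printf .
    sleep 5
  done
  echo
}

# Example (needs root): wait_for_selective /dev/sdb
```

Because the loop only continues while the status is exactly Self_test_in_progress, an aborted or interrupted test ends the wait instead of hanging.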

Answer 2

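I think it may have read without error because that sector is not actually bad, and other tools fail to read it because of some other behavior. (Read-ahead reaching a sector that really is unreadable?)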

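I have found bad sectors where, if I fix only the sectors that are unreadable with "hdparm --read-sector", the other "bad" sectors suddenly become readable again, e.g. with dd. Interestingly, in the "dmesg" output only the sectors that hdparm cannot read are reported.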

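For example, my sectors 36589320 to 36589327 and 36589344 to 36589351 were unreadable with dd, but only 36589326 and 36589345 were unreadable with hdparm --read-sector. I then used hdparm --write-sector on those two sectors, after which all 16 sectors were readable again.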
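
A rough sketch of how such a range could be probed one sector at a time. It assumes hdparm exits with a non-zero status when --read-sector fails (check this on your version); `unreadable_in_range` and the `HDPARM` override variable are hypothetical names used only for illustration.

```shell
#!/bin/sh
# HDPARM is a hypothetical override hook so the loop can be tried without a disk.
HDPARM=${HDPARM:-hdparm}

# Print every sector in [first, last] that hdparm cannot read from dev.
# Assumption: hdparm exits non-zero when --read-sector fails.
unreadable_in_range() {
  first=$1; last=$2; dev=$3
  s=$first
  while [ "$s" -le "$last" ]; do
    if ! $HDPARM --read-sector "$s" "$dev" > /dev/null 2>&1; then
      echo "$s"
    fi
    s=$((s + 1))
  done
}

# Example (needs root, read-only): unreadable_in_range 36589320 36589327 /dev/sda
```

Reading sector by sector avoids the read-ahead effect described above, so only the sectors the drive really cannot return are listed.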

Here is a small part of the dmesg output:

[30152036.527940] end_request: I/O error, dev sda, sector 36589326
[30152077.363710] end_request: I/O error, dev sda, sector 36589345

Disk information:

# smartctl -i /dev/sda
...
=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA MK2002TSKB
...
Firmware Version: MT2A
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
...

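Apparently this disk's firmware either does not log reallocated sectors correctly, or the sectors were never actually reallocated and were merely corrupted (for example an uncorrectable ECC error on a surface that is still good, as if caused by bit rot rather than by an electronics fault or media damage):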

# smartctl -A /dev/sda | egrep "Reallocated|Pending|Uncorrectable"
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0

# smartctl -l error /dev/sda
...
SMART Error Log Version: 1
No Errors Logged
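
If you want those counters in a script rather than on screen, the attribute table can be reduced to name and raw value. A minimal sketch, assuming the ten-column layout shown above, where field 2 is the attribute name and field 10 the raw value; `realloc_raw_values` is a hypothetical helper name.

```shell
#!/bin/sh
# Print "NAME RAW_VALUE" for the reallocation-related attributes found in
# `smartctl -A` output on stdin. Field 2 is the name, field 10 the raw value.
realloc_raw_values() {
  awk '$2 ~ /Reallocated|Pending|Uncorrectable/ { print $2, $10 }'
}

# Example: smartctl -A /dev/sda | realloc_raw_values
```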

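Note that I ran both --read-sector and --write-sector. The read may be needed for the sector to be reallocated properly, rather than the write alone. If you do not read first, the drive may not know the sector is bad.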
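
The read-then-write order could be scripted roughly as follows. This is a cautious sketch, not what I actually ran: `rewrite_if_unreadable`, `HDPARM` and `DRY_RUN` are hypothetical names, it assumes hdparm exits non-zero on a failed read, and by default it only prints the write command. Keep in mind that --write-sector overwrites the sector with zeros.

```shell
#!/bin/sh
HDPARM=${HDPARM:-hdparm}   # hypothetical override hook for testing
DRY_RUN=${DRY_RUN:-1}      # assumption: default to only printing the write

# Read the sector first; zero it with --write-sector only if the read fails.
rewrite_if_unreadable() {
  sector=$1; dev=$2
  if $HDPARM --read-sector "$sector" "$dev" > /dev/null 2>&1; then
    echo "sector $sector is readable, leaving it alone"
    return 0
  fi
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: hdparm --write-sector $sector --yes-i-know-what-i-am-doing $dev"
  else
    $HDPARM --write-sector "$sector" --yes-i-know-what-i-am-doing "$dev"
  fi
}

# Example (needs root): DRY_RUN=0 rewrite_if_unreadable 36589326 /dev/sda
```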

Answer 3

Based on @Glenn's answer, you will find a script for fixing the errors at:

http://wiki.bitplan.com/index.php/Bad_Block_Howto

As of 2020-09-10, the script reads as follows:

#!/bin/bash 
# see http://wiki.bitplan.com/index.php/Bad_Block_Howto
# see https://github.com/hradec/fix_smart_last_bad_sector/blob/master/fix_smart_last_bad_sector.sh
# see https://www.thomas-krenn.com/de/wiki/Analyse_einer_fehlerhaften_Festplatte_mit_smartctl
# WF 2020-10-04 
disk=/dev/sdb
mode=short
# verbose
verbose=false
# should commands only be shown?
dry=false
# should write fixes be performed?
fix=false
# range of sectors to modify after bad sector
range=8
# set to sudo if sudo is needed
sudo=sudo
# serial number
serial="-?-"

#ansi colors
#http://www.csc.uvic.ca/~sae/seng265/fall04/tips/s265s047-tips/bash-using-colors.html
blue='\033[0;34m'  
red='\033[0;31m'  
green='\033[0;32m' # '\e[1;32m' is too bright for white bg.
endColor='\033[0m'

#
# a colored message 
#   params:
#     1: l_color - the color of the message
#     2: l_msg - the message to display
#
color_msg() {
  local l_color="$1"
  local l_msg="$2"
  echo -e "${l_color}$l_msg${endColor}"
}

#
# error
#
#   show an error message and exit
#
#   params:
#     1: l_msg - the message to display
error() {
  local l_msg="$1"
  # use ansi red for error
  color_msg $red "Error: $l_msg" 1>&2
  exit 1
}

#
# show the usage
#
usage() {
  echo "usage: $0 [disk]"
  echo "   [-c|--check]"
  echo "   [-d|--dry]"
  echo "   [-f|--fix]"
  echo "   [-h|--help]"
  echo "   [-i|--info]"
  echo "   [-l|--loop]"
  echo "   [[-m|--mode] mode]"
  echo "   [[-r|--range] range]"
  echo "   [[-s|--serial] [serial]]"
  echo "   [-t|--test]"
  echo "   [[-w|--wait] type]"
  echo "   [-v|--verbose]"
  echo "   [-x]"
  echo
  echo "  -h|--help: show this usage"
  echo "  -c|--check: check the disk"
  echo "  -d|--dry:  dry run - show commands only"
  echo "  -f|--fix: perform write fixes (needs a confirmed serial number)"
  echo "  -i|--info: show info about the given disk"
  echo "  -l|--loop: run the selective self-test check loop"
  echo "  -m|--mode: set mode: default=short"
  echo "  -r|--range: range of sectors to modify after bad sector"
  echo "  -s|--serial: get serial number or confirm serial number"
  echo "  -t|--test: run test for the given type e.g. selective selftest"
  echo "  -w|--wait: wait for the result of the given test type e.g. selective selftest"
  echo "  -v|--verbose: set verbose mode"
  echo "  -x: check the LBAs of failed self-tests"
  echo ""
  echo "example:"
  echo "   $0 /dev/sdb -i"
  echo ""
  echo "for any write operation you need to confirm the serial number"
  echo "to get serial number: "
  echo "   $0 disk -s "
  exit 1
}

#
# get a number range from 0 to the given n-1
#
# params 
#   1: n 
function getRange() {
  local l_n="$1"
  # print() with a joined string works with both Python 2 and Python 3
  local l_range=$(python -c "print(' '.join(str(i) for i in range($l_n)))")
  echo $l_range
}

#
# read the result of the smartctl test for the given disk
#
# params
#   1: l_disk: the disk under test e.g. /dev/sdb
#   2: l_type: the type of the test e.g. selective
function readResult() {
   local l_disk="$1"
   local l_type="$2"
     $sudo smartctl -l $l_type $l_disk  | egrep "^#?[[:space:]]*[0-9]"
}

#
# show the Result
#
function showResult() {
  local l_logline="$1"
  local l_logstatus="$2"
  if [ "$verbose" == "true" ]
  then
    echo $l_logstatus:$l_logline  
  else
    echo $l_logline | gawk '
    /#/ {
      print $0; exit
    }
    { 
       status=substr($4,1,9)
       progress=$5;
       gsub("\\[","",progress);
       range=$7 
       printf("\r%s",progress);
     }'
  fi
}

#
# wait for the result of a running selftest
#
# param 1: l_disk: the disk under test e.g. /dev/sdb
# param 2: l_type: the type of the test e.g. selective
# param 3: l_wait: number of seconds to wait 
#
function waitForResult() {
   # example
   #=== START OF READ SMART DATA SECTION ===
   #SMART Selective self-test log data structure revision number 1
   #SPAN  MIN_LBA     MAX_LBA  CURRENT_TEST_STATUS
   #         1  7814037167  Self_test_in_progress [90% left] (2564632-2630167)
   local l_disk="$1"
   local l_type="$2"
   local l_wait="$3"
   local l_logline=""
   local l_logstatus=""
   color_msg $blue "Waiting for $l_type test of $l_disk to stop (each dot is $l_wait sec)"
   while [ "$l_logstatus" != "Completed" ]; do
     l_logline=$(readResult "$l_disk" "$l_type"  | egrep "^#?[[:space:]]*1")
     l_logstatus=$(echo $l_logline | gawk ' /Completed/ { print "Completed"; }')
     showResult "$l_logline" "$l_logstatus"
     sleep $l_wait 
   done
}

#
# get the serial number of the device
#
function getSerialNumber() {
  local l_disk="$1"
  serial=$($sudo smartctl -i  $l_disk  | grep "Serial Number" | cut -f 2 -d':')
  echo $serial
}

#
# get the blocksize of the given file system
#
function getBlockSize() {
  local l_fs="$1"
  blocksize=$($sudo tune2fs -l $l_fs | grep "Block size:" | cut -f2 -d':')
  echo $blocksize
}

#
# get the partition for the given disk
#
function getPartition() {
  local l_disk="$1"
  fs=$(mount | grep $l_disk | cut -f1 -d' ')
  echo $fs
}

#
# get the start sector for the given disk
#
function getStartSector() {
  local l_disk="$1"
  local l_fs="$2"
  startsector=$($sudo fdisk -l $l_disk | grep $l_fs | cut -f4 -d' ')
  echo $startsector
}

#
# get Info about the given disk
#
function getInfo() {
  local l_disk="$1"
  $sudo smartctl -i $l_disk | egrep "(Model|Serial|Rotation|Sector|Capacity)"
  $sudo hdparm -I $l_disk | egrep "(Serial Number|Model)"
  fs=$(getPartition $l_disk)
  if [ "$fs" != "" ]
  then
    color_msg $blue "Partition:        $fs"
    blocksize=$(getBlockSize $fs)
    color_msg $blue "Blocksize:        $blocksize"
  else
    color_msg $red "couldn't find mounted partition for $l_disk"
  fi
}

#
# get the current pending sector for the given disk
#
function getCurrentPendingSector() {
   local l_disk="$1"
   # if msg is empty don't show message but only return the current pending sector count
   local l_msg="$2"
   psector=0
   psectorline=$($sudo smartctl -A $l_disk | grep Current_Pending_Sector)
   # $? must be checked right after the pipeline, before any other command resets it
   if [ $? -eq 0 ]
   then
     if [ "$l_msg" != "" ]; then color_msg $green "$psectorline"; fi
     psector=$(echo $psectorline | cut -f 10 -d ' ')
     if  [ $psector -gt 0 ]
     then
        if [ "$l_msg" != "" ]; then color_msg $red "Current_Pending_Sector is not zero but $psector"; fi
     else
        if [ "$l_msg" != "" ]; then color_msg $green "Current_Pending_Sector is zero!"; fi
     fi
   else
     if [ "$l_msg" != "" ]; then color_msg $red "smartctl -A did not output Current_Pending_Sector"; fi
     psector=-1
   fi
   if [ "$l_msg" == "" ]; then echo $psector; fi
}

#
# fix the given bad sector on the given disk with the given range of sectors to fix
#
# param 1: disk e.g. /dev/sdb1
# param 2: defect sector to repair
# param 3: range - range of sectors to repair e.g. 8
# 
fixBad() {
  local l_disk="$1"
  local l_sector="$2"
  local l_range="$3"
  color_msg $blue "repairing sector $l_sector to $l_sector+$l_range on $l_disk ..."
  r=$(getRange $l_range)
  for i in $r ; do
    let b1=$l_sector+$i
    if [ "$dry" == "true" ]
    then
      echo hdparm --repair-sector $b1  --yes-i-know-what-i-am-doing  $l_disk
    else
      # write to the disk passed to this function, not to the global $disk
      $sudo hdparm --repair-sector $b1  --yes-i-know-what-i-am-doing  $l_disk >> /tmp/smart_repaired.log
    fi
  done
    #tail -n 60 /tmp/smart_repaired.log | grep writing | tail -n 20
    #grep '#' /tmp/smart | head -5
    #hdparm -I $disk > /tmp/hdparm
}

#
# check the needed software
#
checkSoftware() {
  for sw in gawk debugfs fdisk hdparm smartctl tune2fs python $sudo
  do
    bin=$(which $sw)
    if [ $? -eq 0 ]
    then
      if [ "$verbose" == "true" ]
      then
        color_msg $green "will use $bin as $sw"
      fi
    else
      error "$0 needs $sw to work please install it"
    fi
  done
}

#
# run a test for the given disk in the given mode
# 
# params
#   1: l_disk: the disk under test e.g. /dev/sdb
#   2: l_mode: the mode of the self test e.g. short/long 
function runTest() {
   local l_disk="$1"
   local l_mode="$2"
   color_msg $blue  "running $l_mode smartctl test for $l_disk ..."
     $sudo smartctl -t $l_mode $l_disk > /dev/null
}

#
# check the given disk in the given mode
#
function checkDisk() {
   local l_disk="$1"
   local l_mode="$2"
   local l_serial="$3"
   fs=$(getPartition $l_disk)
   blocksize=$(getBlockSize $fs)
   startsector=$(getStartSector $l_disk $fs)
   color_msg $blue "checking Current_Pending_Sector count for $l_disk partition $fs blocksize $blocksize startsector $startsector"
   getCurrentPendingSector "$l_disk" show
   psector=$(getCurrentPendingSector "$l_disk")
   if [ $psector -gt 0 ]
   then
     runTest $l_disk $l_mode
   fi
}

#
# check the lba block
#
function lbaCheck() {
  local l_disk="$1"
  fs=$(getPartition $l_disk)
  blocksize=$(getBlockSize $fs)
  startsector=$(getStartSector $l_disk $fs)
  diskserial=$(getSerialNumber $l_disk)
  readResult "$l_disk" selftest | while read line
  do
    echo $line | grep "read failure" > /dev/null
    if [ $? -eq 0 ]
    then
      if [ "$verbose" == "true" ]
      then
        echo $line
      fi
      index=$(echo $line | cut -f2 -d' ')
      state=$(echo $line | cut -f3-4 -d ' ')
      progress=$(echo $line | cut -f8 -d ' ')
      lba=$(echo $line | cut -f10 -d ' ')
      if [ "$lba" == "" ]
      then 
        lba=0
      fi
      if  [ "$lba" -gt 0 ]
      then
        echo $index $state 
        echo "progress:  $progress"
        echo "lba: $lba"
        # calculate the file system block
        fsb=$(gawk -v L=$lba -v S=$startsector -v B=$blocksize 'BEGIN {printf ("%d",int((L-S)*512/B))}')
        echo "file system block: $fsb"
        if [ "$fix" == "true" ]
        then
          if [ "$serial" != "$diskserial" ]
          then
            color_msg $red "you need to provide the serial number of $l_disk to perform fix operations"
          else
            fixBad $l_disk $lba $range
          fi
        fi
      fi
    fi
  done
}

#
# try Fixing bad sectors
#
function tryFix() {
   local l_disk="$1"
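   badsect=$($sudo smartctl -l selective "$l_disk" | gawk '/# 1  Selective offline   Completed: read failure/ {print $10}')
   # stop when the log reports no failing LBA ("-") or has no matching entry at all
   if [ "$badsect" = "-" ] || [ -z "$badsect" ]; then exit 0; fi

   echo "Attempting to fix sector $badsect on $l_disk"
   # only prints the command; remove the leading echo to really write the sector
   echo hdparm --repair-sector "$badsect" --yes-i-know-what-i-am-doing "$l_disk"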
}

#
# start a check loop on the given drive
#
function checkLoop() {
   local baddrive="$1"
   badsect=1
   while true; do
      color_msg $blue "Testing $baddrive from LBA $badsect"
      $sudo smartctl -t select,${badsect}-max ${baddrive} > /dev/null 2>&1
      waitForResult $baddrive selective 5
      tryFix $baddrive
      color_msg $blue "running next test" 
  done
}
  
# make sure the needed software is available
checkSoftware
# commandline option
while [  "$1" != ""  ]
do
  option=$1
  shift
  case $option in
    -h|--help)
       usage
       ;;
    -i|--info)
      getInfo $disk
      ;;
    -m|--mode)
      if [ $# -lt 1 ]
      then
        usage
      else
        mode=$1
        shift
      fi
      ;;
    -c|--check)
      checkDisk $disk $mode $serial
      ;;
    -d|--dry)
      dry=true
      ;;
    -l|--loop)
      checkLoop $disk
      ;;
    -f|--fix)
      fix=true
      ;;
    -r|--range)
      if [ $# -lt 1 ]
      then
        usage
      else
        range=$1
        shift
      fi
      ;;
    -s|--serial)
      if [ $# -lt 1 ]
      then
        getSerialNumber $disk
        exit 1
      else
        serial=$1
        shift
      fi
      ;;
    -t|--test)
      runTest $disk $mode
      ;;
    -v|--verbose)
      verbose=true
      ;;
    -w|--wait)
      if [ $# -lt 1 ]
      then
        usage
      else
        type=$1
        shift
        waitForResult $disk $type 5
      fi
      ;;
    -x)
      lbaCheck $disk $serial;;
    *)
      disk=$option
      ;;
  esac
done
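
The file system block that lbaCheck reports follows from fsb = (lba - startsector) * 512 / blocksize. A minimal sketch of that arithmetic in plain shell, assuming 512-byte logical sectors as the script does; `fs_block` is a hypothetical helper, and the start sector 2048 and the 4096-byte block size are made-up values for illustration only.

```shell
#!/bin/sh
# Map a failing LBA to a file system block number.
#   $1 lba        failing LBA on the whole disk
#   $2 start      first sector of the partition (e.g. from fdisk -l)
#   $3 blocksize  file system block size in bytes (e.g. from tune2fs -l)
# Assumes 512-byte logical sectors. Integer division truncates, which is
# intended: a block number is a floor, not a rounded value.
fs_block() {
  echo $(( ($1 - $2) * 512 / $3 ))
}

# Hypothetical layout: partition starts at sector 2048, 4096-byte blocks.
fs_block 48059863 2048 4096    # prints 6007226
```

On ext2/3/4, the resulting block number can then be looked up with the debugfs commands icheck and ncheck to see which file, if any, owns it.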

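Personally, I could not get any meaningful result with this toolkit, i.e. an actually "fixed" disk. The script and its parts are still helpful for analysis and for repair attempts. Use the script with caution if you are hoping for "full automation": you may lose data rather than repair it.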
