Downloading with wget without knowing the end index

Answer 1

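From man wget, you can see that it uses the usual Unix return-value convention: 0 means no error, and any other value is an error. If you don't expect other kinds of errors (a network failure or the like), i.e. you are happy to read "nothing was downloaded" as "there is no such file", you can use something like this: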

#!/bin/bash

# Simulated download: test t (for t < 3) pretends to hold files 1 .. 2*t - 1.
get_tf_simulated() {
  t=$1
  if [ "$t" -lt 3 ]; then
    f=$3
    s=$((2 * t))
    if [ "$f" -lt "$s" ]; then
      return 0
    fi
  fi
  return 1
}

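# Real download; wget's exit status becomes the function's return value.
get_tf_real() {
  tp=$2
  fp=$4
  inf=$5
  ext=$6
  # Get http://example.com/test<test number>/<image or file><file number>.<jpg or txt>
  # The braces are required: $tp_file would expand the (unset) variable tp_file.
  wget -O "test${tp}_file${fp}_${inf}.${ext}" "http://example.com/test${tp}/${inf}${fp}.${ext}"
}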

get_tf() {
  echo "--- Getting $*"
  # Swap the comments on the next two lines to download for real.
  get_tf_simulated "$@"
  #get_tf_real "$@"
}

# Fetch the image and then the text file; fail as soon as either is missing.
get_all() {
  get_tf "$@" image jpg
  ret_val=$?
  if [ $ret_val -ne 0 ]; then
    return $ret_val
  fi
  get_tf "$@" file txt
}

for t in {1..999}; do
  tp=$(printf %03d "$t")
  got_one=no
  for f in {1..9999}; do
    fp=$(printf %04d "$f")
    if ! get_all "$t" "$tp" "$f" "$fp"; then
      echo "Failed, going next"
      break
    fi
    got_one=yes
  done
  # A test with no files at all means there are no more tests.
  if [ "$got_one" = no ]; then
    echo "Nothing more"
    break
  fi
done

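Uncomment the appropriate line in the function get_tf. As written, it only simulates the downloads, and the output looks like this (assuming you saved the above as mkt.sh):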

$ ./mkt.sh 
--- Getting 1 001 1 0001 image jpg
--- Getting 1 001 1 0001 file txt
--- Getting 1 001 2 0002 image jpg
Failed, going next
--- Getting 2 002 1 0001 image jpg
--- Getting 2 002 1 0001 file txt
--- Getting 2 002 2 0002 image jpg
--- Getting 2 002 2 0002 file txt
--- Getting 2 002 3 0003 image jpg
--- Getting 2 002 3 0003 file txt
--- Getting 2 002 4 0004 image jpg
Failed, going next
--- Getting 3 003 1 0001 image jpg
Failed, going next
Nothing more
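
For comparison, the same stop-at-first-failure logic can be sketched in Python. This is only an illustration of the control flow, not a port of the script: probe is a hypothetical callback standing in for the two downloads, and simulated_probe copies the rule used by get_tf_simulated.

```python
def crawl(probe, max_dirs=999, max_files=9999):
    """Return the (dir, file) pairs fetched before the first empty directory.

    probe(d, f) is a hypothetical callback: True if the files for
    directory d, index f were downloaded, False otherwise.
    """
    fetched = []
    for d in range(1, max_dirs + 1):
        got_one = False
        for f in range(1, max_files + 1):
            if not probe(d, f):
                break  # the first miss ends this directory
            fetched.append((d, f))
            got_one = True
        if not got_one:
            break  # an empty directory is taken to mean "nothing more"
    return fetched


def simulated_probe(d, f):
    # Same rule as get_tf_simulated: directory d (d < 3) holds 2*d - 1 files.
    return d < 3 and f < 2 * d


if __name__ == "__main__":
    print(crawl(simulated_probe))
```

With the simulated rule this collects one file pair from directory 1 and three from directory 2, then stops at directory 3, which matches the output above.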

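Note that I haven't tested this with wget itself, but you can try it on a few files with:

wget -O "test${tp}_file${fp}_${inf}.${ext}" "http://example.com/test${tp}/${inf}${fp}.${ext}"; echo $?

Just substitute $tp, $fp, $inf and $ext as needed, e.g. along the lines of the example you gave: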

wget -Otest052_file0001_file.txt http://www.example.com/sub-somewhere052/file0001.txt; echo $?

This should print 8 on a 404. From man wget:

8   Server issued an error response.

If that works, the script should work too, provided there are no typos in that line. :)
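
Because the approach above reads any non-zero status as "no such file", it may be worth telling status 8 apart from the rest. The Python sketch below lists the exit statuses given in man wget and reduces them to three outcomes. The names WGET_EXIT_STATUS and classify are made up for illustration, and reading 8 as "missing" is an assumption: wget reports 8 for any error response from the server, not only 404.

```python
# Exit statuses as listed in the EXIT STATUS section of man wget.
WGET_EXIT_STATUS = {
    0: "No problems occurred.",
    1: "Generic error code.",
    2: "Parse error.",
    3: "File I/O error.",
    4: "Network failure.",
    5: "SSL verification failure.",
    6: "Username/password authentication failure.",
    7: "Protocol errors.",
    8: "Server issued an error response.",
}


def classify(status):
    """Reduce a wget exit status to 'ok', 'missing' or 'error'.

    Assumption: status 8 is read as "file not there", although wget
    returns 8 for any server error response, not only 404.
    """
    if status == 0:
        return "ok"
    if status == 8:
        return "missing"
    return "error"


if __name__ == "__main__":
    for status in (0, 4, 8):
        print(status, classify(status), WGET_EXIT_STATUS[status])
```

A loop built on this could move to the next directory on "missing" but abort on "error", so that a network failure is not mistaken for the end of the files.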

Answer 2

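If the site returns a 404 response, wget sets $? to a non-zero value (8, specifically, but who cares). You can test for that.

I find bash rather confusing, so here is a Python (2.7.2) version. It should work, but without a suitable site at hand I couldn't test it directly. It depends on the server returning a proper 404 response.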

#! /usr/bin/python

import os
import urllib2

basepath = "http://www.somewhere.com/sub-somewhere"
imgpre = "/image"
imgpost = ".jpg"
txtpre = "/txt"
txtpost = ".txt"

directorynum = 1
filenum = 1

while True:
    pathdir = basepath + str(directorynum).zfill(3)

    # Local paths mirror the URL with the leading "http://" (7 characters) removed.
    if filenum == 1:
        try:
            os.makedirs(pathdir[7:])
        except OSError as e:
            print "Error creating directory: " + e.strerror

    pathimg = pathdir + imgpre + str(filenum).zfill(4) + imgpost
    pathtxt = pathdir + txtpre + str(filenum).zfill(4) + txtpost
    try:
        print "Getting " + pathimg
        respimg = urllib2.urlopen(pathimg)
        with open(pathimg[7:], "wb") as f:
            f.write(respimg.read())

        print "Getting " + pathtxt
        resptxt = urllib2.urlopen(pathtxt)
        # Binary mode keeps the bytes exactly as served (matters on Windows).
        with open(pathtxt[7:], "wb") as f:
            f.write(resptxt.read())

        filenum += 1
    except urllib2.HTTPError as e:
        if e.code == 404:
            print "Error: 404"
            print "Got " + str(filenum - 1) + " from directory " + str(directorynum) + ", incrementing directory."
            if filenum == 1:
                # Assumption: a directory with no files at all means there are
                # no more directories; without this the loop never terminates.
                break
            directorynum += 1
            filenum = 1
        else:
            # Report the error that was actually raised, not the last good response.
            print "An unexpected error (" + str(e.code) + " " + e.msg + ") has occurred."
            break

It also runs fine on Windows (just remove the #! /usr/bin/python line and save it as a .py file; a Python interpreter must be installed).
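
In Python 3, urllib2 was split into urllib.request and urllib.error. A minimal sketch of the same 404 check under Python 3 follows. The function name fetch_or_none and its opener parameter are hypothetical, introduced here so the check can be exercised without a live server; this is not a full port of the script above.

```python
import urllib.error
import urllib.request


def fetch_or_none(url, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the server answers 404.

    Any other HTTP error is re-raised so that it is not mistaken for a
    missing file. `opener` is a hypothetical hook for substituting a fake
    in place of urllib.request.urlopen.
    """
    try:
        with opener(url) as resp:
            return resp.read()
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return None
        raise
```

A download loop could call this for each URL, write the bytes to disk when it gets a body, and move on to the next directory when it gets None.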
