Mirror a server and ignore already processed files

Question

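I have now written a short script that mirrors the server and saves the file names in a database.

The MD5 hash is stored as well, so you can also query it, for example in case file names can be duplicated.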

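import hashlib
import os
import re
import shelve
import time
import urllib.request as urll

# NOTE: `url` (the base URL of the server listing, ending in "/") must be
# defined before this line.
res = urll.urlopen(url)

# Decode the response; str() on a bytes object would yield its "b'...'" repr.
html = res.read().decode('utf-8', errors='replace')

# Extract all link targets, skipping the first one.
files = re.findall('<a href="([^"]+)">', html)[1:]

# Make sure the download directory exists before writing into it.
os.makedirs('dl', exist_ok=True)
db = shelve.open('dl.shelve')

print(files)

for file in files:
    # Only download files that are not yet recorded in the database.
    if file not in db:
        print("Downloading %s..." % file)
        res = urll.urlopen(url + file)
        data = res.read()
        md5 = hashlib.md5(data).hexdigest()

        with open(os.path.join('dl', file), 'wb') as f:
            f.write(data)

        # Store download time, size in bytes, and MD5 hash under the file name.
        record = (time.time(), len(data), md5)
        print(record)
        db[file] = record

db.close()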
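
The remark about querying the MD5 hash could be sketched roughly as follows. This is only an illustration, not part of the script above: the helper names `find_by_md5` and `record_download` are hypothetical, and a plain dict stands in for the shelve database (a shelf offers the same mapping interface). It assumes records have the `(timestamp, size, md5)` layout used by the script.

```python
import hashlib
import time


def find_by_md5(db, md5):
    """Return the names of all stored files whose recorded MD5 equals `md5`."""
    return [name for name, (_, _, stored) in db.items() if stored == md5]


def record_download(db, name, data):
    """Store (timestamp, size, md5) for `name`.

    Returns the names of other entries that already hold identical content.
    """
    md5 = hashlib.md5(data).hexdigest()
    duplicates = [n for n in find_by_md5(db, md5) if n != name]
    db[name] = (time.time(), len(data), md5)
    return duplicates


if __name__ == "__main__":
    db = {}  # stand-in for shelve.open('dl.shelve')
    record_download(db, "a.txt", b"hello")
    # Same bytes under a different name: the earlier entry is reported.
    print(record_download(db, "b.txt", b"hello"))
```

Scanning every record is linear in the number of stored files; for a large mirror, a second mapping from hash to file names would avoid the full scan.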
