Below is part of a script I use every day to download files from a website. Recently, however, they added a rate limit on downloads. I increased the sleep time, but now everything takes far too long; there are a lot of files to download, and some of them are very small.
I would like to remove the sleep, or set it very low, and modify the script so that it waits for each file to finish downloading.
Edit:
I found the reason the larger files were not finishing: Failure when receiving data from the peer
How can I fix this? I have read that switching to wget is the best option, but how would this script work with wget?
#check directories are empty, not empty if there was a problem last time
cd /home/user/upload
if [ "$(ls -A /home/user/upload)" ]; then
# echo 'Directory not empty error for csv manipulation' | /bin/mailx -s "Server scrapeandcleandomains error" use
echo "$(date) Directory /home/user/upload not empty for csv manipulation" >> /home/user/logfile
exit 1
else
echo $(date) starting normal >> /home/user/logfile
fi
#create yesterday variable
yesterday=$(date --date="$1 - 2 days" +"%Y_%m_%d")
#$(date --date="-2 day" +"%Y_%m_%d")
#download .csv.gz files (old wget command) OBSOLETE!!!!!
#cd /home/user/upload
#wget -R html,"index.*" -A "$yesterday*.csv.gz" -N -r -c -l1 -nd --no-check-certificate --user USERNAME --password PASSWORD -np http://www.websitedownloadfrom.com/sub/
#exit 1
#download index and sanitize > index2.tmp
cd /home/user
curl -u "USERNAME:PASSWORD" -k http://www.websitedownloadfrom.com/sub/ -o index.html.tmp
links -dump index.html.tmp > /home/user/index.tmp
#this will work until 2049 ONLY!!
sed -i '/20[1-4][0-9]/!d' index.tmp
sed -i '/\[DIR\]/d' index.tmp
#collapse repeated spaces so awk can split the index into consistent columns
for i in {1..50} ; do
sed -i 's/  / /' index.tmp
done
awk -F" " '{ print $3 }' index.tmp > index2.tmp
sed -i "/^${yesterday}/!d" index2.tmp
#download .csv.gz files according to index2.tmp
while read F ; do
cd /home/user/upload
curl -u "USERNAME:PASSWORD" -k http://www.websitedownloadfrom.com/sub/$F -o $F &
sleep 80
done < /home/user/index2.tmp
sleep 60
#check that we downloaded something
cd /home/user/upload
if ! [ "$(ls -A /home/user/upload)" ]; then
echo 'nothing downloaded from upload' >> /home/user/logfile
rm -f /home/user/upload/*
rm -f /home/user/index.html.tmp
rm -f /home/user/index.tmp
rm -f /home/user/index2.tmp
exit 1
fi
Answer 1
Remove the sleep 80 command and the & at the end of the curl command immediately before it.
Removing the & makes the script wait for the curl download to finish before moving on to the next iteration of the loop.
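For illustration, here is a minimal sketch of the download loop with the & and the sleep 80 removed: each curl now runs in the foreground, so the loop waits for it to finish before reading the next filename. The --retry 3 option is an addition of mine, not part of the original script or answer; it simply asks curl to retry transient transfer failures such as "Failure when receiving data from the peer".
#download .csv.gz files according to index2.tmp, one at a time
cd /home/user/upload
while read F ; do
#foreground curl: the loop blocks here until this download finishes
#--retry 3 (not in the original) retries transient transfer failures
curl --retry 3 -u "USERNAME:PASSWORD" -k "http://www.websitedownloadfrom.com/sub/$F" -o "$F"
done < /home/user/index2.tmp
If you would rather switch to wget, as the question mentions, a roughly equivalent sketch is below. It assumes each line of index2.tmp is exactly the filename at the end of the URL, so wget saves the file under that name by default; -c resumes a partially downloaded file and --tries limits the number of retries:
cd /home/user/upload
while read F ; do
#foreground wget; -c continues a partial download, --tries=3 limits retries
wget -c --tries=3 --no-check-certificate --user USERNAME --password PASSWORD "http://www.websitedownloadfrom.com/sub/$F"
done < /home/user/index2.tmp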