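I am running my shell script on machineA, which copies files from machineB and machineC to machineA.
If a file is not there on machineB, then it should definitely be on machineC. So I will try to copy from machineB first, and if the file is not there, I will go to machineC to copy the same file.
On machineB and machineC there will be a folder named like YYYYMMDD inside this folder -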
/data/pe_t1_snapshot
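So whatever the latest date in YYYYMMDD format is inside the above folder, I will pick that folder as the full path from which I need to start copying the files.
Suppose 20140317 is the latest date folder inside /data/pe_t1_snapshot; then this will be my full path -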
/data/pe_t1_snapshot/20140317
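That is the path I need to start copying the files from on machineB and machineC. I need to copy 400 files to machineA from machineB and machineC, and each file is 1.5 GB in size.
Currently I have the shell script below, which works fine using scp (not rsync), but it somehow takes 5 hours to copy the 400 files to machineA, which I guess is too long for me. :(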
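The "latest YYYYMMDD folder" rule described above can be sketched as a small function. Note that `ls -dt1 ... | head -n1` picks the most recently modified directory, which is not necessarily the one with the latest date in its name. This is only a minimal sketch that selects by name instead; `latest_snapshot_dir` is a hypothetical helper name, and the demonstration uses throwaway local directories rather than the real hosts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the newest YYYYMMDD directory under a base path,
# chosen by directory name rather than by modification time.
latest_snapshot_dir() {
    local base=$1 d latest=
    for d in "$base"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]; do
        [ -d "$d" ] || continue   # skips the unexpanded pattern when nothing matches
        latest=$d                 # glob results are sorted, so the last match wins
    done
    [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Local demonstration with temporary directories (no network needed).
demo=$(mktemp -d)
mkdir "$demo/20140315" "$demo/20140317" "$demo/20140316" "$demo/notadate"
latest=$(latest_snapshot_dir "$demo")
echo "${latest##*/}"   # prints 20140317
rm -rf "$demo"
```

Assuming the login shell on the remote side is bash, the same function could probably be shipped over ssh with something like `ssh david@machineB "$(declare -f latest_snapshot_dir); latest_snapshot_dir /data/pe_t1_snapshot"`, but I have not tested that against these machines.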
Below is my shell script -
#!/bin/bash
readonly PRIMARY=/export/home/david/dist/primary
readonly SECONDARY=/export/home/david/dist/secondary
readonly FILERS_LOCATION=(machineB machineC)
readonly MEMORY_MAPPED_LOCATION=/data/pe_t1_snapshot
PRIMARY_PARTITION=(0 3 5 7 9)
SECONDARY_PARTITION=(1 2 4 6 8)
dir1=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[0]} ls -dt1 "$MEMORY_MAPPED_LOCATION"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -n1)
dir2=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[1]} ls -dt1 "$MEMORY_MAPPED_LOCATION"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -n1)
echo $dir1
echo $dir2
if [ "$dir1" = "$dir2" ]
then
    # delete all the files first
    rm -rf $PRIMARY/*
    # below for-loop copies one file at a time in PRIMARY folder
    for el in "${PRIMARY_PARTITION[@]}"
    do
        scp david@${FILERS_LOCATION[0]}:$dir1/t1_weekly_1680_"$el"_200003_5.data $PRIMARY/. || scp david@${FILERS_LOCATION[1]}:$dir2/t1_weekly_1680_"$el"_200003_5.data $PRIMARY/.
    done
    # delete all the files first
    rm -rf $SECONDARY/*
    # below for-loop copies one file at a time in SECONDARY folder
    for sl in "${SECONDARY_PARTITION[@]}"
    do
        scp david@${FILERS_LOCATION[0]}:$dir1/t1_weekly_1680_"$sl"_200003_5.data $SECONDARY/. || scp david@${FILERS_LOCATION[1]}:$dir2/t1_weekly_1680_"$sl"_200003_5.data $SECONDARY/.
    done
fi
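I am copying the PRIMARY_PARTITION files into the PRIMARY folder and the SECONDARY_PARTITION files into the SECONDARY folder on machineA.
Now my question is: how would I use rsync here instead of scp to copy the files? From what I have read, rsync is much faster than scp for copying files.
I would like to keep the same logic as in my shell script, but with rsync. I have never worked with rsync before, so I am running into some problems.
Can anyone provide an example?
Given my use case, would rsync be faster than scp? If not, what other options can I try to speed up the file transfer?
Update:-
To clarify terdon's question -
In this question I am showing only 10 files, just as an example -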
PRIMARY_PARTITION=(0 3 5 7 9)
SECONDARY_PARTITION=(1 2 4 6 8)
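In general, the PRIMARY_PARTITION array will hold around 150 file numbers, and the SECONDARY_PARTITION array will hold another 200 file numbers.
What I need to do is this: for every file number in PRIMARY_PARTITION, I need to look for that file in the machineB directory. If the file is there, copy it into the PRIMARY folder on machineA; if it is not on machineB, then it should be on machineC, so copy it from machineC and put it into the PRIMARY folder on machineA.
Similarly, I need to do the same for SECONDARY_PARTITION: I will look for those files in the machineB directory and, if they are there, copy them into the secondary directory on machineA; if a file is not on machineB, then it should be on machineC, so copy it from machineC and put it into the secondary directory on machineA.
So all the file numbers we have are in PRIMARY_PARTITION and SECONDARY_PARTITION.
In general, my PRIMARY_PARTITION and SECONDARY_PARTITION will look like this -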
PRIMARY_PARTITION=(0 548 272 4 544 276 8 556 280 12 552 284 16 256 564 20 260 560 24 264 572 28 268 568 516 304 32 512 308 36 524 312 40 520 316 44 288 532 48 292 528 52 296 540 56 300 536 60 68 608 340 64 336 76 348 72 344 84 324 80 320 92 332 88 328 576 372 100 580 368 96 584 380 108 588 376 104 356 592 116 352 596 112 364 600 124 360 604 120 136 408 140 412 128 400 132 404 152 392 156 396 144 384 148 388 440 168 444 172 432 160 436 164 424 184 428 188 416 176 420 180 204 476 200 472 196 468 192 464 220 460 216 456 212 452 208 448 508 236 504 232 500 228 496 224 492 252 488 248 484 244 480 240)
SECONDARY_PARTITION=(1101 1374 1641 1371 1647 1098 1635 1365 1095 1638 1089 1362 1659 1359 1119 1113 1662 1353 1350 1650 1110 1347 1653 1107 1134 1407 1611 1401 1131 1614 1602 1125 1398 1122 1605 1395 1389 1149 1626 1629 1146 1386 1617 1143 1383 1377 1623 1137 1305 1581 1578 1311 1299 1575 1302 1569 1599 1290 1593 1293 1590 1281 1587 1287 1551 1338 1341 1545 1071 1329 1542 1335 1539 1083 1566 1323 1086 1563 1326 1557 1074 1314 1317 1077 1554 1221 1494 1491 1218 1503 1230 1227 1497 1479 1239 1233 1473 1245 1485 1482 1242 1254 1527 1251 1521 1263 1533 1530 1257 1509 1269 1266 1506 1278 1518 1275 1515 1155 1425 1431 1158 1434 1161 1167 1437 1410 1170 1173 1413 1419 1179 1422 1182 1671 1458 1185 1665 1191 1461 1677 1194 1467 1470 1197 1674 1203 1443 1206 1446 1449 1209 1215 1455)
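The "take it from machineB if it is there, otherwise from machineC" rule above could also be decided up front from a single directory listing, instead of probing each file with its own connection. The following is only a sketch under that assumption: `split_by_availability`, `from_first` and `from_second` are hypothetical names, and the listing is a hard-coded stand-in for what something like `ssh david@machineB ls "$dir1"` would return.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split the wanted file names into those present on the
# first host and those that must come from the second host.
# $1 = newline-separated listing of the first host; remaining args = wanted names.
# Results are stored in the global arrays from_first and from_second.
split_by_availability() {
    local listing=$1 name
    shift
    from_first=()
    from_second=()
    for name in "$@"; do
        # -x: match the whole line, -F: treat the name as a fixed string
        if grep -qxF -- "$name" <<<"$listing"; then
            from_first+=("$name")
        else
            from_second+=("$name")
        fi
    done
}

PRIMARY_PARTITION=(0 3 5 7 9)
wanted=()
for n in "${PRIMARY_PARTITION[@]}"; do
    wanted+=("t1_weekly_1680_${n}_200003_5.data")
done

# Stand-in for the real listing of machineB (assumed data, for illustration only).
listing_b=$'t1_weekly_1680_0_200003_5.data\nt1_weekly_1680_5_200003_5.data'

split_by_availability "$listing_b" "${wanted[@]}"
echo "from machineB: ${from_first[*]}"
echo "from machineC: ${from_second[*]}"
```

Each of the two resulting lists can then be handed to a single transfer command per host, which keeps the number of connections independent of the number of files.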
Another update:-
After removing 2>/dev/null, I ran the script again but got the errors below -
ssh: Could not resolve hostname machineB : Name or service not known
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(605) [Receiver=3.0.9]
ssh: Could not resolve hostname machineC : Name or service not known
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(605) [Receiver=3.0.9]
ssh: Could not resolve hostname machineB : Name or service not known
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(605) [Receiver=3.0.9]
ssh: Could not resolve hostname machineC : Name or service not known
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(605) [Receiver=3.0.9]
Any idea what is happening? I replaced machineB and machineC with the actual names before running the shell script. My system is -
root@machineA:/home/david# uname -a
Linux machineA 3.2.0-24-generic #37-Ubuntu SMP Wed Apr 25 08:43:22 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
Below is the shell script I am running -
#!/usr/bin/env bash
readonly PRIMARY=/export/home/david/dist/primary
readonly SECONDARY=/export/home/david/dist/secondary
readonly FILERS_LOCATION=(machineB machineC)
readonly MEMORY_MAPPED_LOCATION=/data/pe_t1_snapshot
PRIMARY_PARTITION=(0 3 5 7 9)
SECONDARY_PARTITION=(1 2 4 6 8)
dir1=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[0]} ls -dt1 "$MEMORY_MAPPED_LOCATION"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -n1)
dir2=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[1]} ls -dt1 "$MEMORY_MAPPED_LOCATION"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -n1)
echo $dir1
echo $dir2
## Build your list of filenames before the loop.
for n in "${PRIMARY_PARTITION[@]}"
do
    primary_files="$primary_files :$dir1"/t1_weekly_1680_"$n"_200003_5.data
done
## Repeat for $SECONDARY_PARTITION
for n in "${SECONDARY_PARTITION[@]}"
do
    secondary_files="$secondary_files :$dir2"/t1_weekly_1680_"$n"_200003_5.data
done
echo $primary_files
echo $secondary_files
if [ "$dir1" = "$dir2" ]
then
    find "$PRIMARY" -mindepth 1 -delete
    find "$SECONDARY" -mindepth 1 -delete
    rsync -avz david@${FILERS_LOCATION[0]}"${primary_files}" $PRIMARY/
    rsync -avz david@${FILERS_LOCATION[1]}"${primary_files}" $PRIMARY/
    ## Do the same for $secondary_partition files
    rsync -avz david@${FILERS_LOCATION[0]}"${secondary_files}" $SECONDARY/
    rsync -avz david@${FILERS_LOCATION[1]}"${secondary_files}" $SECONDARY/
fi
I suspect the rsync syntax may be incorrect, because if I run a single command like this, it works fine -
rsync -avz david@machineB":/data/pe_t1_snapshot/20140317/t1_weekly_1680_0_200003_5.data" /export/home/david/dist/primary
Another small update:-
If I run it like this -
root@machineA:/export/home/david# rsync -avz david@machineB':/data/pe_t1_snapshot/20140317/t1_weekly_1680_0_200003_5.data :/data/pe_t1_snapshot/20140317/t1_weekly_1680_1_200003_5.data' /data01/primary
receiving incremental file list
rsync: change_dir "/home/david/:/data/pe_t1_snapshot/20140317" failed: No such file or directory (2)
t1_weekly_1680_0_200003_5.data
sent 30 bytes received 504982813 bytes 6196108.50 bytes/sec
total size is 1761988281 speedup is 3.49
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1536) [generator=3.0.9]
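The above command should copy the files to the /data01/primary directory, but it copies only one file and does not copy the second one.
But this works fine, and the one file gets copied -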
root@machineA:/export/home/david# rsync -avz david@machineB':/data/pe_t1_snapshot/20140317/t1_weekly_1680_0_200003_5.data' /data01/primary
receiving incremental file list
t1_weekly_1680_0_200003_5.data
sent 30 bytes received 504982698 bytes 6351984.00 bytes/sec
total size is 1761988281 speedup is 3.49
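Answer 1
The main problem with your script is that you open a separate scp connection for each file, which adds a lot of unnecessary overhead. You could try something like this: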
#!/usr/bin/env bash
readonly PRIMARY=/export/home/david/dist/primary
readonly SECONDARY=/export/home/david/dist/secondary
readonly FILERS_LOCATION=(machineB machineC)
readonly MEMORY_MAPPED_LOCATION=/data/pe_t1_snapshot
PRIMARY_PARTITION=(0 548 272 4 544 276 8 556 280 12 552 284 16 256 564 20 260 560 24 264 572)
SECONDARY_PARTITION=(1101 1374 1641 1371 1647 1098 1635 1365 1095 1638 1089 1362 1659 1359)
dir1=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[0]} ls -dt1 "$MEMORY_MAPPED_LOCATION"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -n1)
dir2=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[1]} ls -dt1 "$MEMORY_MAPPED_LOCATION"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -n1)
## Build your list of filenames before the loop.
for n in "${PRIMARY_PARTITION[@]}"
do
    primary_files="$primary_files :$dir1"/t1_weekly_1680_"$n"_200003_5.data
done
## Repeat for $SECONDARY_PARTITION
for n in "${SECONDARY_PARTITION[@]}"
do
    secondary_files="$secondary_files :$dir2"/t1_weekly_1680_"$n"_200003_5.data
done
if [ "$dir1" = "$dir2" ]
then
    ## I am using find largely because the *
    ## in rm -rf "$PRIMARY"/* screws up the syntax
    ## highlighting on the site and it is a good habit to
    ## get into anyway. Feel free to use rm -rf in your script.
    find "$PRIMARY" -mindepth 1 -delete
    find "$SECONDARY" -mindepth 1 -delete
    ## rsync can be run with this format:
    ## rsync user@dest:/target/path1 :/target/path2 :/target/pathN /dest/path
    #
    ## which is why I added the : in the loop above. So, these commands will
    ## open only 2 connections per file list. First you will try to copy all $primary_partition
    ## files from machineB, then all $primary_partition files from machineC.
    ## rsync will complain about files not found (which is why I'm redirecting standard
    ## error to /dev/null) but will continue. You then repeat the process for the
    ## $secondary_partition files.
    rsync -avz david@${FILERS_LOCATION[0]}"${primary_files}" $PRIMARY/ 2>/dev/null
    rsync -avz david@${FILERS_LOCATION[1]}"${primary_files}" $PRIMARY/ 2>/dev/null
    ## Do the same for $secondary_partition files
    rsync -avz david@${FILERS_LOCATION[0]}"${secondary_files}" $SECONDARY/ 2>/dev/null
    rsync -avz david@${FILERS_LOCATION[1]}"${secondary_files}" $SECONDARY/ 2>/dev/null
fi
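One caveat with the script above: `$primary_files` begins with a space and is expanded inside double quotes, so the host and every path reach rsync as one single word such as `david@machineB :/path1 :/path2`. That appears to match the "Could not resolve hostname machineB :" error reported in the question. The `host:/path1 :/path2` form needs each source to be its own argument, and a bash array is one way to get that. This is a minimal sketch, not a tested replacement; `build_rsync_sources` and `rsync_sources` are hypothetical names, and the block only prints the arguments instead of contacting any host.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fill the global array rsync_sources with
# host:/dir/file for the first file and :/dir/file for the rest,
# one array element per file.
build_rsync_sources() {
    local host=$1 dir=$2 n prefix
    shift 2
    rsync_sources=()
    prefix=$host
    for n in "$@"; do
        rsync_sources+=("${prefix}:${dir}/t1_weekly_1680_${n}_200003_5.data")
        prefix=   # later sources reuse the host named in the first one
    done
}

build_rsync_sources david@machineB /data/pe_t1_snapshot/20140317 0 3 5

# Print one argument per line to check how the words are split.
printf '<%s>\n' "${rsync_sources[@]}"

# The transfer itself (not executed in this sketch) would then look like:
# rsync -avz "${rsync_sources[@]}" "$PRIMARY"/
```

Quoting `"${rsync_sources[@]}"` expands to exactly one word per file, which is what the multi-source syntax described in the comments above expects.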
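Answer 2
rsync takes care of: copying only files that have changed, ignoring files you don't want to copy (the -C switch, for example, which excludes the same files CVS would exclude from its repositories, though you can specify anything), and recursively copying an entire directory structure (only the needed changes, of course, not everything). It can optionally compress the stream, which speeds up the transfer. It is also faster because it does the whole copy in a single connection.
Since you are only copying single files, most of these features will not be used. You would use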
rsync -avz "$firstfile" "$secondfile"
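which is exactly the same as with scp, apart from the flags (a - archive, preserving permissions and timestamps; v for verbose; z for compression).
However, you can also compress with scp: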
scp -p -C …
I think this is the simplest solution here. Just add one flag and you are done.
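The single-connection advantage mentioned above only applies when many files go through one rsync invocation. One way to do that without listing every path on the command line is rsync's `--files-from` option, which reads names relative to the source directory from a file. The sketch below only builds and prints the command; the host, user and paths are taken from the question, and `list` and `cmd` are names chosen here for illustration.

```shell
#!/usr/bin/env bash
# Write the wanted file names, one per line, to a temporary list file.
PRIMARY_PARTITION=(0 3 5 7 9)
list=$(mktemp)
trap 'rm -f "$list"' EXIT
for n in "${PRIMARY_PARTITION[@]}"; do
    printf 't1_weekly_1680_%s_200003_5.data\n' "$n"
done > "$list"

# Assemble the command in an array so it can be inspected before running.
# Names in the list are resolved relative to the source directory given here.
cmd=(rsync -av --files-from="$list"
     david@machineB:/data/pe_t1_snapshot/20140317/
     /export/home/david/dist/primary/)

# Show the command with shell quoting; nothing is transferred here.
printf '%q ' "${cmd[@]}"; echo

# To run it for real: "${cmd[@]}"
```

As far as I know, rsync reports listed files that are missing on that host as errors and still transfers the rest. A second pass against machineC with `--ignore-existing` added should then fetch only the files that were not already copied, but treat that as an assumption to verify on a small set of files first.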
Answer 3
readonly TGT=/export/home/david/dist
readonly TGT1=${TGT}/primary
readonly TGT2=${TGT}/secondary
readonly MMAP_LOC=/data/pe_t1_snapshot
readonly PART1='t1_weekly_1680_[03579]_200003_5.data' # shell globbing does
readonly PART2='t1_weekly_1680_[12468]_200003_5.data' # the bulk of the work
readonly F_LOC=BC
readonly SSH="david@machine"
#hoping the = works - I don't know
SSH1='ssh -o "StrictHostKeyChecking=no" '"${SSH}${F_LOC%?}"
SSH2="${SSH1%?}${F_LOC#?}"
DIR="${MMAP_LOC}/"'[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'
DIR1="$($SSH1 'cd ${d='"$DIR"'} && echo $d')" #shell glob
DIR2="$($SSH2 'cd ${d='"$DIR"'} && echo $d')" #shell glob
${DIR1:?FAIL} [ -n "${DIR1#"$DIR2"}" ] && exit 1 #tests if d1=d2 or dies
F1="$($SSH1 'printf "%s\n" '"${DIR1}/${PART1}")" #prefers primary
F1="${F1}$(echo ; $SSH1 'for f in '"${DIR2}/${PART1}"'\ #shell glob in
do { case "'"$F1"'" in "${f#'"$DIR2"'}") continue ;;\ # favor
*) printf "%s\n" "$f" ;;\ #of files found in primary
esac ; } ; done')" #with secondary as backup
F2="$($SSH2 'printf "%s\n" '"${DIR2}/${PART2}")" #secondary
rsync -avzt -e "${SSH1}:/" "${TGT1}"/. \ #if it works, based on your
--exclude=* $(printf --include=%s\\n $F1) #file sizes, should
rsync -avzt -e "${SSH2}:/" "${TGT2}"/. \ #dramatically decrease
--exclude=* $(printf --include=%s\\n $F2) #transfer times
Does this work?