I have a text file (over 1 GB in size) containing lines like these:
1083021106e581c71003b987a75f18543cf5858b9fcfc5e04c0dddd79cd18764a865ba86d027de6d1900dc171e4d90a0564abbce99b812b821bd0d7d37aad72ead19c17
10840110dbd43121ef0c51a8ba62193eac247f57f1909e270eeb53d68da60ad61519f19cfb0511ec2431ca54e2fcabf6fa985615ec06def5ba1b753e8ad96d0564aa4c
1084011028375c62fd132d5a4e41ffef2419da345b6595fba8a49b5136de59a884d878fc9789009843c49866a0dc97889242b9fb0b8c112f1423e3b220bc04a2d7dfbdff
10880221005f0e261be654e4c52034d8d05b5c4dc0456b7868763367ab998b7d5886d64fbb24efd14cea668d00bfe8048eb8f096c3306bbb31aaea3e06710fa8c0bb8fca71
108501103461fca7077fc2f0d895048606b828818047a64611ec94443e52cc2d39c968363359de5fc76df48e0bf3676b73b1f8fea5780c2af22c507f83331cc0fbfe6ea9
1085022100a4ce8a09d1f28e78530ce940d6fcbd3c1fe2cb00e7b212b893ce78f8839a11868281179b4f2c812b8318f8d3f9a598b4da750a0ba6054d7e1b743bb67896ee62
1086022100638681ade4b306295815221c5b445ba017943ae59c4c742f0b1442dae4902a56d173a6f859dc6088b6364224ec17c4e2213d9d3c96bd9992b696d7c13b234b50
Every line starts with one of the following strings:
10830110
1083021
10840110
10840110
1088022100
10850110
1085022100
1086022100
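I need to split it into 8 separate files. How can I do that with the sed command?
Answer 1
You can use sed to turn the file of prefixes into a file of sed commands, then pass that to a second sed that processes the big file. This is almost certainly more efficient than using a shell loop to run sed (or grep) over the same (large) file multiple times. For example, given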
$ cat file2
10830110
1083021
10840110
10840110
1088022100
10850110
1085022100
1086022100
then
$ sed 's:.*:/^&/w&.txt:' file2
/^10830110/w10830110.txt
/^1083021/w1083021.txt
/^10840110/w10840110.txt
/^10840110/w10840110.txt
/^1088022100/w1088022100.txt
/^10850110/w10850110.txt
/^1085022100/w1085022100.txt
/^1086022100/w1086022100.txt
so that
$ sed 's:.*:/^&/w&.txt:' file2 | sed -n -f - file1
produces
$ head 108*.txt
==> 10830110.txt <==
==> 1083021.txt <==
1083021106e581c71003b987a75f18543cf5858b9fcfc5e04c0dddd79cd18764a865ba86d027de6d1900dc171e4d90a0564abbce99b812b821bd0d7d37aad72ead19c17
==> 10840110.txt <==
10840110dbd43121ef0c51a8ba62193eac247f57f1909e270eeb53d68da60ad61519f19cfb0511ec2431ca54e2fcabf6fa985615ec06def5ba1b753e8ad96d0564aa4c
10840110dbd43121ef0c51a8ba62193eac247f57f1909e270eeb53d68da60ad61519f19cfb0511ec2431ca54e2fcabf6fa985615ec06def5ba1b753e8ad96d0564aa4c
1084011028375c62fd132d5a4e41ffef2419da345b6595fba8a49b5136de59a884d878fc9789009843c49866a0dc97889242b9fb0b8c112f1423e3b220bc04a2d7dfbdff
1084011028375c62fd132d5a4e41ffef2419da345b6595fba8a49b5136de59a884d878fc9789009843c49866a0dc97889242b9fb0b8c112f1423e3b220bc04a2d7dfbdff
==> 10850110.txt <==
108501103461fca7077fc2f0d895048606b828818047a64611ec94443e52cc2d39c968363359de5fc76df48e0bf3676b73b1f8fea5780c2af22c507f83331cc0fbfe6ea9
==> 1085022100.txt <==
1085022100a4ce8a09d1f28e78530ce940d6fcbd3c1fe2cb00e7b212b893ce78f8839a11868281179b4f2c812b8318f8d3f9a598b4da750a0ba6054d7e1b743bb67896ee62
==> 1086022100.txt <==
1086022100638681ade4b306295815221c5b445ba017943ae59c4c742f0b1442dae4902a56d173a6f859dc6088b6364224ec17c4e2213d9d3c96bd9992b696d7c13b234b50
==> 1088022100.txt <==
10880221005f0e261be654e4c52034d8d05b5c4dc0456b7868763367ab998b7d5886d64fbb24efd14cea668d00bfe8048eb8f096c3306bbb31aaea3e06710fa8c0bb8fca71
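The two stages can be tried on a small scale first. The following is a minimal sketch, assuming GNU sed. The file names prefixes.txt, data.txt and split.sed and the sample lines are made up for illustration, and the generated script is saved to a file instead of being piped through -f - so that it can be inspected.

```shell
#!/bin/sh
# Work in a scratch directory so the generated output files are easy to inspect.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical sample data: two prefixes and three input lines.
printf '%s\n' 10840110 1085022100 > prefixes.txt
printf '%s\n' 10840110aaaa 1085022100bbbb 10840110cccc > data.txt

# Stage 1: turn every prefix into a sed command of the form /^PREFIX/wPREFIX.txt
sed 's:.*:/^&/w&.txt:' prefixes.txt > split.sed

# Stage 2: one pass over the data; -n suppresses the default printing,
# so lines only go to the files named by the w commands.
sed -n -f split.sed data.txt

# Show how many lines ended up in each output file.
wc -l 10840110.txt 1085022100.txt
```

Because sed creates every w file before it reads any input, a prefix that matches nothing still leaves an empty file behind, which is why 10830110.txt is empty in the output above.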
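You may want to deduplicate the pattern file first. You could also sort it in reverse numerical order and change the generated sed commands so that they branch to the end of the script after the first match, so that each line is written only to the file of its longest matching prefix: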
$ sort -nru file2 | sed 's:.*:/^&/{w&.txt\nb\n}:' | sed -n -f - file1
giving
$ head 108*.txt
==> 10830110.txt <==
==> 1083021.txt <==
1083021106e581c71003b987a75f18543cf5858b9fcfc5e04c0dddd79cd18764a865ba86d027de6d1900dc171e4d90a0564abbce99b812b821bd0d7d37aad72ead19c17
==> 10840110.txt <==
10840110dbd43121ef0c51a8ba62193eac247f57f1909e270eeb53d68da60ad61519f19cfb0511ec2431ca54e2fcabf6fa985615ec06def5ba1b753e8ad96d0564aa4c
1084011028375c62fd132d5a4e41ffef2419da345b6595fba8a49b5136de59a884d878fc9789009843c49866a0dc97889242b9fb0b8c112f1423e3b220bc04a2d7dfbdff
==> 10850110.txt <==
108501103461fca7077fc2f0d895048606b828818047a64611ec94443e52cc2d39c968363359de5fc76df48e0bf3676b73b1f8fea5780c2af22c507f83331cc0fbfe6ea9
==> 1085022100.txt <==
1085022100a4ce8a09d1f28e78530ce940d6fcbd3c1fe2cb00e7b212b893ce78f8839a11868281179b4f2c812b8318f8d3f9a598b4da750a0ba6054d7e1b743bb67896ee62
==> 1086022100.txt <==
1086022100638681ade4b306295815221c5b445ba017943ae59c4c742f0b1442dae4902a56d173a6f859dc6088b6364224ec17c4e2213d9d3c96bd9992b696d7c13b234b50
==> 1088022100.txt <==
10880221005f0e261be654e4c52034d8d05b5c4dc0456b7868763367ab998b7d5886d64fbb24efd14cea668d00bfe8048eb8f096c3306bbb31aaea3e06710fa8c0bb8fca71
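The effect of the sort and the b command is easier to see when one prefix is itself a prefix of another, which is not the case in the sample above. A small sketch with hypothetical prefixes 1085 and 108502 follows. It assumes GNU sed, which expands \n in the replacement to a newline. All file names and data lines are illustrative only.

```shell
#!/bin/sh
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical prefixes: 1085 is a prefix of 108502, and 1085 is listed twice.
printf '%s\n' 1085 108502 1085 > prefixes.txt
printf '%s\n' 108502aaaa 1085bbbb 108502cccc > data.txt

# sort -nru: numeric, reversed, unique. The longer (numerically larger)
# prefix 108502 comes first and the duplicate 1085 is dropped.
# Each generated block writes the line and then branches to the end of
# the script (b), so later, shorter prefixes are not tried for that line.
sort -nru prefixes.txt | sed 's:.*:/^&/{w&.txt\nb\n}:' > split.sed

sed -n -f split.sed data.txt

# Print each output file under a header, as in the listings above.
head 1085*.txt
```

Without the b command, the two 108502 lines would also be written to 1085.txt.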
Answer 2
prefix.text (containing 8 prefixes)
1prefix
2prefix
3prefix
4prefix
x1prefix
x2prefix
x3prefix
x4prefix
input.text (e.g. a 1 GB text file)
1prefix90956666
3prefix26588388
1prefix49080634
x3prefix59162307
x1prefix86437679
x4prefix77832956
x3prefix56458412
2prefix37484977
x2prefix73879936
x1prefix44005273
2prefix57156422
x1prefix67751608
4prefix25566629
x2prefix93657051
x3prefix40897616
4prefix93222501
3prefix35680804
x4prefix42979833
x2prefix08229240
1prefix42071365
4prefix67857600
2prefix66384962
x4prefix21482824
3prefix59616880
Use a grep loop that writes one output file per prefix
while IFS= read -r prefix
do
    grep "^${prefix}" input.text > "output_${prefix}.text"
done < prefix.text
output_x1prefix.text (sample output)
x1prefix86437679
x1prefix44005273
x1prefix67751608
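Answer 3
This will create, in the current working directory, a new file with the extension .splt for each matched pattern and write all matching lines to that file:
sed in a shell for loop: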
for i in "10830110" "1083021" "10840110" "1088022100" "10850110" "1085022100" "1086022100" # Patterns to match
do
    # Change "FileName" to your file name
    sed -n "/^$i/p" FileName > "$i.splt"
done
You can also do the same with awk in a shell for loop:
for i in "10830110" "1083021" "10840110" "1088022100" "10850110" "1085022100" "1086022100" # Patterns to match
do
    # Change "FileName" to your file name
    awk -v pat="^$i" -v pat2="$i" '$0 ~ pat { print $0 > (pat2 ".splt") }' FileName
done
awk with an array of patterns:
awk '{
    # This block has no condition, so the array is re-assigned for every input line
    pat["0"] = "10830110";
    pat["1"] = "1083021";
    pat["2"] = "10840110";
    pat["3"] = "1088022100";
    pat["4"] = "10850110";
    pat["5"] = "1085022100";
    pat["6"] = "1086022100";
}
{ for (i in pat) { if ($0 ~ "^"pat[i]) print $0 > (pat[i] ".splt") } }' YourFile
Or save the patterns, one per line, in a pat.txt file and have awk build the pattern array, like so:
awk 'FILENAME=="pat.txt" { pat[$0]=$0; next } { for (i in pat) { if ($0 ~ "^"pat[i]) print $0 > (pat[i] ".splt") } }' pat.txt YourFile
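As a variation on the pat.txt approach, here is a minimal sketch that tests the prefix with index() instead of a dynamic regular expression, so the prefixes are compared as literal strings. The sample data is made up, and this variant was not part of the timings below.

```shell
#!/bin/sh
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical sample data; the last line matches no prefix.
printf '%s\n' 10840110 1085022100 > pat.txt
printf '%s\n' 10840110aaaa 1085022100bbbb 10840110cccc 9999dddd > YourFile

awk '
    # While reading the first file (pat.txt), store each prefix as an array key.
    NR == FNR { pat[$0]; next }
    # For the data file: index() returns 1 when the line starts with the prefix.
    {
        for (p in pat)
            if (index($0, p) == 1)
                print > (p ".splt")
    }
' pat.txt YourFile

ls
```

Unlike the sed w command, awk only creates an output file when a line is actually printed to it, so prefixes without matches leave no empty files.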
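Speed test (for science)
I tested the solutions in my answer as well as those in the answers by @steeldriver and @Sheldon (three runs each, taking the rounded average). All tests ran on the same averagely specced PC with the same patterns. pat.txt contains: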
$ cat pat.txt
10830110
1083021
10840110
1088022100
10850110
1085022100
1086022100
and the data file file.dat (1.1G) contains 8,484,000 lines, generated by duplicating the lines from the example provided by the OP:
1083021106e581c71003b987a75f18543cf5858b9fcfc5e04c0dddd79cd18764a865ba86d027de6d1900dc171e4d90a0564abbce99b812b821bd0d7d37aad72ead19c17
10840110dbd43121ef0c51a8ba62193eac247f57f1909e270eeb53d68da60ad61519f19cfb0511ec2431ca54e2fcabf6fa985615ec06def5ba1b753e8ad96d0564aa4c
1084011028375c62fd132d5a4e41ffef2419da345b6595fba8a49b5136de59a884d878fc9789009843c49866a0dc97889242b9fb0b8c112f1423e3b220bc04a2d7dfbdff
10880221005f0e261be654e4c52034d8d05b5c4dc0456b7868763367ab998b7d5886d64fbb24efd14cea668d00bfe8048eb8f096c3306bbb31aaea3e06710fa8c0bb8fca71
108501103461fca7077fc2f0d895048606b828818047a64611ec94443e52cc2d39c968363359de5fc76df48e0bf3676b73b1f8fea5780c2af22c507f83331cc0fbfe6ea9
1085022100a4ce8a09d1f28e78530ce940d6fcbd3c1fe2cb00e7b212b893ce78f8839a11868281179b4f2c812b8318f8d3f9a598b4da750a0ba6054d7e1b743bb67896ee62
1086022100638681ade4b306295815221c5b445ba017943ae59c4c742f0b1442dae4902a56d173a6f859dc6088b6364224ec17c4e2213d9d3c96bd9992b696d7c13b234b50
The results are listed from fastest to slowest, and the code I used for timing is given under each one:
#1 grep in a shell loop, @Sheldon (18 seconds)
s=$(date +%s); while read prefix
do
grep "^${prefix}" file.dat > ${prefix}.splt
done < pat.txt; e=$(date +%s); echo $(($e-$s))
More accurate timing:
$ time (while read prefix
do
grep "^${prefix}" file.dat > ${prefix}.splt
done < pat.txt)
real 0m17.969s
user 0m4.437s
sys 0m2.176s
#2 sed, @steeldriver (20 seconds)
s=$(date +%s); sed 's:.*:/&/w&.splt:' pat.txt | sed -n -f - file.dat; e=$(date +%s); echo $(($e-$s))
More accurate timing, with ^ added in response to @terdon's comment:
$ time (sed 's:.*:/^&/w&.splt:' pat.txt | sed -n -f - file.dat)
real 0m18.748s
user 0m10.408s
sys 0m1.546s
#3 sed in a shell loop, @Raffa (21 seconds)
s=$(date +%s); for i in "10830110" "1083021" "10840110" "1088022100" "10850110" "1085022100" "1086022100" # Patterns to match
do
sed -n "/^$i/p" file.dat > "$i.splt" # Change "FileName" to your file name
done; e=$(date +%s); echo $(($e-$s))
#4 awk in a shell loop, @Raffa (35 seconds)
s=$(date +%s); for i in "10830110" "1083021" "10840110" "1088022100" "10850110" "1085022100" "1086022100" # Patterns to match
do
awk -v pat="^$i" -v pat2="$i" '$0 ~ pat { print $0 > pat2".splt"}' file.dat # Change "FileName" to your file name
done; e=$(date +%s); echo $(($e-$s))
#5 awk, @Raffa (414 seconds) <-- this is shocking
s=$(date +%s); awk '{pat["0"] = "10830110";
pat["1"] = "1083021";
pat["2"] = "10840110";
pat["3"] = "1088022100";
pat["4"] = "10850110";
pat["5"] = "1085022100";
pat["6"] = "1086022100";} {for (i in pat) { if ($0 ~ "^"pat[i]) print $0 > pat[i]".splt"}}' file.dat; e=$(date +%s); echo $(($e-$s))
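One thing that stands out in #5 is that the seven array assignments sit in a main action block, so they are repeated for each of the 8,484,000 input lines. Whether that explains much of the 414 seconds was not measured. As a sketch only, the array could be built once in a BEGIN block. The three data lines below are made up, and this variant was not timed.

```shell
#!/bin/sh
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Hypothetical three-line stand-in for file.dat.
printf '%s\n' 10830211aaaa 10840110bbbb 1088022100cccc > file.dat

awk '
    BEGIN {
        # Runs once, before any input is read; n is the number of prefixes.
        n = split("10830110 1083021 10840110 1088022100 10850110 1085022100 1086022100", pat, " ")
    }
    {
        # Same per-line test as in #5: anchored regex match against each prefix.
        for (i = 1; i <= n; i++)
            if ($0 ~ ("^" pat[i]))
                print > (pat[i] ".splt")
    }
' file.dat

ls
```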