How do I split the contents of a file into separate files?

I have a text file (larger than 1GB) that contains lines like these:

1083021106e581c71003b987a75f18543cf5858b9fcfc5e04c0dddd79cd18764a865ba86d027de6d1900dc171e4d90a0564abbce99b812b821bd0d7d37aad72ead19c17
10840110dbd43121ef0c51a8ba62193eac247f57f1909e270eeb53d68da60ad61519f19cfb0511ec2431ca54e2fcabf6fa985615ec06def5ba1b753e8ad96d0564aa4c
1084011028375c62fd132d5a4e41ffef2419da345b6595fba8a49b5136de59a884d878fc9789009843c49866a0dc97889242b9fb0b8c112f1423e3b220bc04a2d7dfbdff
10880221005f0e261be654e4c52034d8d05b5c4dc0456b7868763367ab998b7d5886d64fbb24efd14cea668d00bfe8048eb8f096c3306bbb31aaea3e06710fa8c0bb8fca71
108501103461fca7077fc2f0d895048606b828818047a64611ec94443e52cc2d39c968363359de5fc76df48e0bf3676b73b1f8fea5780c2af22c507f83331cc0fbfe6ea9
1085022100a4ce8a09d1f28e78530ce940d6fcbd3c1fe2cb00e7b212b893ce78f8839a11868281179b4f2c812b8318f8d3f9a598b4da750a0ba6054d7e1b743bb67896ee62
1086022100638681ade4b306295815221c5b445ba017943ae59c4c742f0b1442dae4902a56d173a6f859dc6088b6364224ec17c4e2213d9d3c96bd9992b696d7c13b234b50

Every line starts with one of the following strings:

10830110
1083021
10840110
10840110
1088022100
10850110
1085022100
1086022100

I need to split the lines into 8 separate files. How can I do this with sed?

Answer 1

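You can use sed to turn the prefix file into a file of sed commands, then pass that to a second sed that processes the large file. This is almost certainly more efficient than a shell loop that runs sed (or grep) over the same (large) file multiple times. For example, given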

$ cat file2
10830110
1083021
10840110
10840110
1088022100
10850110
1085022100
1086022100

then

$ sed 's:.*:/^&/w&.txt:' file2
/^10830110/w10830110.txt
/^1083021/w1083021.txt
/^10840110/w10840110.txt
/^10840110/w10840110.txt
/^1088022100/w1088022100.txt
/^10850110/w10850110.txt
/^1085022100/w1085022100.txt
/^1086022100/w1086022100.txt

so that

$ sed 's:.*:/^&/w&.txt:' file2 | sed -n -f - file1

produces

$ head 108*.txt
==> 10830110.txt <==

==> 1083021.txt <==
1083021106e581c71003b987a75f18543cf5858b9fcfc5e04c0dddd79cd18764a865ba86d027de6d1900dc171e4d90a0564abbce99b812b821bd0d7d37aad72ead19c17

==> 10840110.txt <==
10840110dbd43121ef0c51a8ba62193eac247f57f1909e270eeb53d68da60ad61519f19cfb0511ec2431ca54e2fcabf6fa985615ec06def5ba1b753e8ad96d0564aa4c
10840110dbd43121ef0c51a8ba62193eac247f57f1909e270eeb53d68da60ad61519f19cfb0511ec2431ca54e2fcabf6fa985615ec06def5ba1b753e8ad96d0564aa4c
1084011028375c62fd132d5a4e41ffef2419da345b6595fba8a49b5136de59a884d878fc9789009843c49866a0dc97889242b9fb0b8c112f1423e3b220bc04a2d7dfbdff
1084011028375c62fd132d5a4e41ffef2419da345b6595fba8a49b5136de59a884d878fc9789009843c49866a0dc97889242b9fb0b8c112f1423e3b220bc04a2d7dfbdff

==> 10850110.txt <==
108501103461fca7077fc2f0d895048606b828818047a64611ec94443e52cc2d39c968363359de5fc76df48e0bf3676b73b1f8fea5780c2af22c507f83331cc0fbfe6ea9

==> 1085022100.txt <==
1085022100a4ce8a09d1f28e78530ce940d6fcbd3c1fe2cb00e7b212b893ce78f8839a11868281179b4f2c812b8318f8d3f9a598b4da750a0ba6054d7e1b743bb67896ee62

==> 1086022100.txt <==
1086022100638681ade4b306295815221c5b445ba017943ae59c4c742f0b1442dae4902a56d173a6f859dc6088b6364224ec17c4e2213d9d3c96bd9992b696d7c13b234b50

==> 1088022100.txt <==
10880221005f0e261be654e4c52034d8d05b5c4dc0456b7868763367ab998b7d5886d64fbb24efd14cea668d00bfe8048eb8f096c3306bbb31aaea3e06710fa8c0bb8fca71

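You may want to deduplicate the pattern file first. You could also sort it in reverse numeric order and have the generated sed script branch to its end after the first match, so that each line is written only to the file for its longest matching prefix: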

$ sort -nru file2 | sed 's:.*:/^&/{w&.txt\nb\n}:' | sed -n -f - file1

giving

$ head 108*.txt
==> 10830110.txt <==

==> 1083021.txt <==
1083021106e581c71003b987a75f18543cf5858b9fcfc5e04c0dddd79cd18764a865ba86d027de6d1900dc171e4d90a0564abbce99b812b821bd0d7d37aad72ead19c17

==> 10840110.txt <==
10840110dbd43121ef0c51a8ba62193eac247f57f1909e270eeb53d68da60ad61519f19cfb0511ec2431ca54e2fcabf6fa985615ec06def5ba1b753e8ad96d0564aa4c
1084011028375c62fd132d5a4e41ffef2419da345b6595fba8a49b5136de59a884d878fc9789009843c49866a0dc97889242b9fb0b8c112f1423e3b220bc04a2d7dfbdff

==> 10850110.txt <==
108501103461fca7077fc2f0d895048606b828818047a64611ec94443e52cc2d39c968363359de5fc76df48e0bf3676b73b1f8fea5780c2af22c507f83331cc0fbfe6ea9

==> 1085022100.txt <==
1085022100a4ce8a09d1f28e78530ce940d6fcbd3c1fe2cb00e7b212b893ce78f8839a11868281179b4f2c812b8318f8d3f9a598b4da750a0ba6054d7e1b743bb67896ee62

==> 1086022100.txt <==
1086022100638681ade4b306295815221c5b445ba017943ae59c4c742f0b1442dae4902a56d173a6f859dc6088b6364224ec17c4e2213d9d3c96bd9992b696d7c13b234b50

==> 1088022100.txt <==
10880221005f0e261be654e4c52034d8d05b5c4dc0456b7868763367ab998b7d5886d64fbb24efd14cea668d00bfe8048eb8f096c3306bbb31aaea3e06710fa8c0bb8fca71
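
The one-liner above appears to rely on GNU sed behaviour (a `\n` in the replacement text, and `-f -` to read the script from standard input). A minimal sketch of the same longest-prefix idea that writes the generated script to a file instead is shown below. The file names prefixes.txt, data.txt and split.sed, the extra short prefix 1084, and the shortened sample lines are all assumptions made for illustration. The reverse numeric sort only puts longer prefixes first when the prefixes are plain digit strings without leading zeros.

```shell
#!/bin/sh
# Work in a scratch directory so the demo does not touch existing files.
cd "$(mktemp -d)" || exit 1

# Hypothetical prefixes (note the duplicate and the short prefix 1084).
printf '%s\n' 10830110 1083021 10840110 10840110 1084 > prefixes.txt
# Shortened sample lines standing in for the real data.
printf '%s\n' 1083021106e5 10840110dbd4 1084ffff > data.txt

# Largest (here: longest) prefix first, duplicates dropped. For each prefix
# emit a block that writes the line to <prefix>.txt and then branches to the
# end of the script, so shorter prefixes are not tried for that line.
sort -nru prefixes.txt | while IFS= read -r p; do
    printf '/^%s/{\nw %s.txt\nb\n}\n' "$p" "$p"
done > split.sed

sed -n -f split.sed data.txt
```

sed creates every file named by a `w` command before it reads any input, which is why 10830110.txt exists but stays empty when no line matches it.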

Answer 2

prefix.text (contains the 8 prefixes)

1prefix
2prefix
3prefix
4prefix
x1prefix
x2prefix
x3prefix
x4prefix

input.text (for example, a 1 GB text file)

1prefix90956666
3prefix26588388
1prefix49080634
x3prefix59162307
x1prefix86437679
x4prefix77832956
x3prefix56458412
2prefix37484977
x2prefix73879936
x1prefix44005273
2prefix57156422
x1prefix67751608
4prefix25566629
x2prefix93657051
x3prefix40897616
4prefix93222501
3prefix35680804
x4prefix42979833
x2prefix08229240
1prefix42071365
4prefix67857600
2prefix66384962
x4prefix21482824
3prefix59616880

Use a grep loop that writes one output file per prefix

while IFS= read -r prefix
do
    grep "^${prefix}" input.text > "output_${prefix}.text"
done < prefix.text

output_x1prefix.text (sample output)

x1prefix86437679
x1prefix44005273
x1prefix67751608

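Answer 3

This creates a new file with the extension .splt in the current working directory for each pattern and writes all matching lines to that file.

sed in a shell for loop: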

for i in "10830110" "1083021" "10840110" "1088022100" "10850110" "1085022100" "1086022100" # Patterns to match
    do
    sed -n "/^$i/p" FileName > "$i.splt" # Change "FileName" to your file name
    done

You can do the same with awk in a shell for loop:

for i in "10830110" "1083021" "10840110" "1088022100" "10850110" "1085022100" "1086022100" # Patterns to match
    do
    awk -v pat="^$i" -v pat2="$i" '$0 ~ pat { print $0 > pat2".splt"}' FileName # Change "FileName" to your file name
    done

awk with an array of patterns:

awk '{pat["0"] = "10830110";
    pat["1"] = "1083021";
    pat["2"] = "10840110";
    pat["3"] = "1088022100";
    pat["4"] = "10850110";
    pat["5"] = "1085022100";
    pat["6"] = "1086022100";} {for (i in pat) { if ($0 ~ "^"pat[i]) print $0 > pat[i]".splt"}}' YourFile

Or save the patterns, one per line, in a pat.txt file and have awk build the pattern array from it, like so:

awk 'FILENAME=="pat.txt" { pat[$0]=$0; next } { for (i in pat) { if ($0 ~ "^"pat[i]) print $0 > pat[i]".splt"}}' pat.txt YourFile
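
The array-based variants above build a regular expression from a string for every prefix on every line. A possible single-pass alternative is sketched below, assuming the prefixes are literal strings rather than regular expressions: it loads pat.txt once and uses index() for the prefix test. The shortened sample lines are made up for the demo, and this variant is not one of the solutions timed in this answer.

```shell
#!/bin/sh
# Work in a scratch directory so the demo does not touch existing files.
cd "$(mktemp -d)" || exit 1

# Assumed sample data: three prefixes and four shortened lines.
printf '%s\n' 10830110 1083021 10840110 > pat.txt
printf '%s\n' 1083021106e5 10840110dbd4 1084011028 9999 > file.dat

awk '
    # First file (pat.txt): remember each non-empty prefix, then skip on.
    NR == FNR { if ($0 != "") pat[$0] = 1; next }
    # Second file: index() returns 1 when the line starts with the prefix.
    {
        for (p in pat)
            if (index($0, p) == 1)
                print > (p ".splt")
    }
' pat.txt file.dat
```

Unlike the sed `w` approach, awk only creates an output file when a line is actually written to it, so a prefix with no matching lines produces no .splt file.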

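Speed test (for science)

I tested the solutions in my answer, as well as those in the answers by @steeldriver and @Sheldon (each test run 3 times, taking the rounded average). The results are below. All tests ran on the same average-spec PC with the same patterns; pat.txt contains: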

$ cat pat.txt 
10830110
1083021
10840110
1088022100
10850110
1085022100
1086022100

The data file file.dat is 1.1G and contains 8,484,000 lines, generated by duplicating the lines from the example provided by the OP:

1083021106e581c71003b987a75f18543cf5858b9fcfc5e04c0dddd79cd18764a865ba86d027de6d1900dc171e4d90a0564abbce99b812b821bd0d7d37aad72ead19c17
10840110dbd43121ef0c51a8ba62193eac247f57f1909e270eeb53d68da60ad61519f19cfb0511ec2431ca54e2fcabf6fa985615ec06def5ba1b753e8ad96d0564aa4c
1084011028375c62fd132d5a4e41ffef2419da345b6595fba8a49b5136de59a884d878fc9789009843c49866a0dc97889242b9fb0b8c112f1423e3b220bc04a2d7dfbdff
10880221005f0e261be654e4c52034d8d05b5c4dc0456b7868763367ab998b7d5886d64fbb24efd14cea668d00bfe8048eb8f096c3306bbb31aaea3e06710fa8c0bb8fca71
108501103461fca7077fc2f0d895048606b828818047a64611ec94443e52cc2d39c968363359de5fc76df48e0bf3676b73b1f8fea5780c2af22c507f83331cc0fbfe6ea9
1085022100a4ce8a09d1f28e78530ce940d6fcbd3c1fe2cb00e7b212b893ce78f8839a11868281179b4f2c812b8318f8d3f9a598b4da750a0ba6054d7e1b743bb67896ee62
1086022100638681ade4b306295815221c5b445ba017943ae59c4c742f0b1442dae4902a56d173a6f859dc6088b6364224ec17c4e2213d9d3c96bd9992b696d7c13b234b50

The results are listed from fastest to slowest, and the code I used for timing is given under each result:

#1 grep in a shell loop by @Sheldon (18 seconds)

s=$(date +%s); while read prefix
do
    grep "^${prefix}" file.dat > ${prefix}.splt
done < pat.txt; e=$(date +%s); echo $(($e-$s))

More accurate timing:

$ time (while read prefix
do
    grep "^${prefix}" file.dat > ${prefix}.splt
done < pat.txt)

real    0m17.969s
user    0m4.437s
sys     0m2.176s

#2 sed by @steeldriver (20 seconds)

s=$(date +%s); sed 's:.*:/&/w&.splt:' pat.txt | sed -n -f - file.dat; e=$(date +%s); echo $(($e-$s))

More accurate timing, with ^ added in response to @terdon's comment:

$ time (sed 's:.*:/^&/w&.splt:' pat.txt | sed -n -f - file.dat)

real    0m18.748s
user    0m10.408s
sys     0m1.546s

#3 sed in a shell loop by @Raffa (21 seconds)

s=$(date +%s); for i in "10830110" "1083021" "10840110" "1088022100" "10850110" "1085022100" "1086022100" # Patterns to match
    do
    sed -n "/^$i/p" file.dat > "$i.splt" # Change "FileName" to your file name
    done; e=$(date +%s); echo $(($e-$s))

#4 awk in a shell loop by @Raffa (35 seconds)

s=$(date +%s); for i in "10830110" "1083021" "10840110" "1088022100" "10850110" "1085022100" "1086022100" # Patterns to match
    do
    awk -v pat="^$i" -v pat2="$i" '$0 ~ pat { print $0 > pat2".splt"}' file.dat # Change "FileName" to your file name
    done; e=$(date +%s); echo $(($e-$s))

#5 awk by @Raffa (414 seconds) <-- this is shocking

s=$(date +%s); awk '{pat["0"] = "10830110";
    pat["1"] = "1083021";
    pat["2"] = "10840110";
    pat["3"] = "1088022100";
    pat["4"] = "10850110";
    pat["5"] = "1085022100";
    pat["6"] = "1086022100";} {for (i in pat) { if ($0 ~ "^"pat[i]) print $0 > pat[i]".splt"}}' file.dat; e=$(date +%s); echo $(($e-$s))

Answer 4

If the file is already split into lines that start with these strings (as in the example), you can use awk as follows (reference):

awk '{file="file."(++i)".txt"; print > file; close(file)}' input-file.txt

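This generates a new file for each line.

If we assume the starting strings have a fixed length of 7 characters (which is not the case in the example), we can split the input file into a separate file for each starting string, for example (reference):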

awk '{file="file."(substr($1,1,7))".txt"}{print >> file;}' input-file.txt
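
As a rough illustration of the fixed-length idea, the sketch below runs the same substr() split on shortened, made-up lines in a scratch directory. The 7-character length is the assumption stated above, and the sample lines are not the real data.

```shell
#!/bin/sh
# Work in a scratch directory so the demo does not touch existing files.
cd "$(mktemp -d)" || exit 1

# Shortened sample lines (assumed data).
printf '%s\n' 1083021106e5 10840110dbd4 1084011028 10850110346 > input-file.txt

# Name each output file after the first 7 characters of the line.
awk '{ out = "file." substr($0, 1, 7) ".txt"; print > out }' input-file.txt

# List the files that were produced.
ls file.*.txt
```

Inside awk, `>` truncates a file only the first time it is opened during the run, so later lines with the same key are still added to it. Using `>` here rather than `>>` simply avoids appending to files left over from a previous run.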
