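How can I remove duplicate lines from a dataset in a .txt file? About half of the content of each of my lines is duplicated, and I only want the first RESULT line. I tried the following command, but it did not work well: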
sort myfile.txt uniq -u | newfile.txt
These are the contents of my file, which has 299873 lines:
ligand_06278/out.pdbqt:REMARK Name = 22626427
ligand_06278/out.pdbqt:REMARK VINA RESULT: -8.3 2.094 2.612
ligand_06278/out.pdbqt:REMARK VINA RESULT: -8.3 2.821 8.000
ligand_06278/out.pdbqt:REMARK VINA RESULT: -8.4 3.333 6.628
ligand_06278/out.pdbqt:REMARK VINA RESULT: -8.4 4.526 7.557
ligand_06278/out.pdbqt:REMARK VINA RESULT: -8.5 2.500 4.835
ligand_06278/out.pdbqt:REMARK VINA RESULT: -8.5 2.516 7.135
ligand_06278/out.pdbqt:REMARK VINA RESULT: -8.6 2.660 7.148
ligand_06278/out.pdbqt:REMARK VINA RESULT: -8.8 3.141 6.023
ligand_06278/out.pdbqt:REMARK VINA RESULT: -8.9 0.000 0.000
ligand_06279/out.pdbqt:REMARK Name = 22629712
ligand_06279/out.pdbqt:REMARK VINA RESULT: -6.1 9.841 13.115
ligand_06279/out.pdbqt:REMARK VINA RESULT: -6.3 15.483 18.543
ligand_06279/out.pdbqt:REMARK VINA RESULT: -6.3 1.944 5.962
ligand_06279/out.pdbqt:REMARK VINA RESULT: -6.3 8.946 12.260
ligand_06279/out.pdbqt:REMARK VINA RESULT: -6.5 14.453 17.240
ligand_06279/out.pdbqt:REMARK VINA RESULT: -6.8 10.330 14.145
ligand_06279/out.pdbqt:REMARK VINA RESULT: -6.8 1.727 5.848
ligand_06279/out.pdbqt:REMARK VINA RESULT: -7.1 7.429 11.509
ligand_06279/out.pdbqt:REMARK VINA RESULT: -7.3 0.000 0.000
ligand_06280/out.pdbqt:REMARK Name = 22631372
ligand_06280/out.pdbqt:REMARK VINA RESULT: -10.0 3.811 7.264
ligand_06280/out.pdbqt:REMARK VINA RESULT: -10.1 0.000 0.000
ligand_06280/out.pdbqt:REMARK VINA RESULT: -9.3 5.006 9.020
ligand_06280/out.pdbqt:REMARK VINA RESULT: -9.4 2.195 8.687
ligand_06280/out.pdbqt:REMARK VINA RESULT: -9.4 2.712 9.301
ligand_06280/out.pdbqt:REMARK VINA RESULT: -9.6 2.186 8.354
ligand_06280/out.pdbqt:REMARK VINA RESULT: -9.7 5.168 7.981
ligand_06280/out.pdbqt:REMARK VINA RESULT: -9.8 1.961 2.580
ligand_06280/out.pdbqt:REMARK VINA RESULT: -9.8 2.311 8.341
Answer 1
$ awk -F: '$1 != p && /RESULT/ { print; p = $1 }' file
ligand_06278/out.pdbqt:REMARK VINA RESULT: -8.3 2.094 2.612
ligand_06279/out.pdbqt:REMARK VINA RESULT: -6.1 9.841 13.115
ligand_06280/out.pdbqt:REMARK VINA RESULT: -10.0 3.811 7.264
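This outputs the first `RESULT` line for each file mentioned in the input file. It does so by comparing the first column (the file name) with the file name saved in `p` and testing whether the current line contains the word `RESULT`. When a `RESULT` line is found whose file name differs from `p`, the line is printed as is and `p` is updated to that file name.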
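The command above assumes that all lines for one file sit next to each other, as they do in the question. If that were not guaranteed, one possible variant (a sketch, not part of the original answer) is to remember every file name already printed in an awk array. The sample data and the names `sample`, `seen` and `first_results` are made up for illustration.

```shell
# Hypothetical, interleaved sample input (not the asker's real data).
sample=$(mktemp)
cat > "$sample" <<'EOF'
ligand_A/out.pdbqt:REMARK Name = 1
ligand_A/out.pdbqt:REMARK VINA RESULT: -8.3 2.094 2.612
ligand_B/out.pdbqt:REMARK Name = 2
ligand_B/out.pdbqt:REMARK VINA RESULT: -6.1 9.841 13.115
ligand_A/out.pdbqt:REMARK VINA RESULT: -8.9 0.000 0.000
ligand_B/out.pdbqt:REMARK VINA RESULT: -7.3 0.000 0.000
EOF

# seen[$1]++ evaluates to 0 the first time a file name appears on a
# RESULT line, so !seen[$1]++ is true exactly once per file name.
first_results=$(awk -F: '/RESULT/ && !seen[$1]++' "$sample")
printf '%s\n' "$first_results"

rm -f "$sample"
```

On input like this sample, the `$1 != p` test would print a file's `RESULT` line again each time the file name switches back, whereas the array keeps one line per file at the cost of holding every file name in memory.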
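I notice that the input file looks very much like the result of running `grep` over multiple files, possibly searching for `REMARK`.
To find all the files and get the first line matching `REMARK VINA RESULT` from each one: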
find . -type f -path './ligand_*' -name 'out.pdbqt' -exec sed -n '/REMARK VINA RESULT/{p;q;}' {} ';'
Or, as a simple loop:
for name in ligand_*/out.pdbqt; do
grep -F 'REMARK VINA RESULT' "$name" | head -n 1
done
I have used different approaches here; pick whichever feels most natural to you.
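As a further variation on the loop, `grep` can stop at the first match itself when it supports `-m`, and `-H` adds the file name prefix that the asker's file has. Both options exist in GNU and BSD grep but are not specified by POSIX, so treat this as a sketch under that assumption. The directory layout and file contents below are invented for the example.

```shell
# Hypothetical directory layout mimicking ligand_*/out.pdbqt.
workdir=$(mktemp -d)
mkdir "$workdir/ligand_00001" "$workdir/ligand_00002"
printf '%s\n' 'REMARK VINA RESULT: -8.9 0.000 0.000' \
              'REMARK Name = 1' \
              'REMARK VINA RESULT: -8.8 3.141 6.023' \
    > "$workdir/ligand_00001/out.pdbqt"
printf '%s\n' 'REMARK VINA RESULT: -7.3 0.000 0.000' \
              'REMARK Name = 2' \
              'REMARK VINA RESULT: -7.1 7.429 11.509' \
    > "$workdir/ligand_00002/out.pdbqt"

# -m 1 stops reading each file after its first matching line;
# -H prints the file name in front of the match.
first_hits=$(
    cd "$workdir" || exit 1
    for name in ligand_*/out.pdbqt; do
        grep -H -m 1 -F 'REMARK VINA RESULT' "$name"
    done
)
printf '%s\n' "$first_hits"

rm -rf "$workdir"
```

Because this reads the original files rather than a sorted listing, "first" here means the first `RESULT` line in each file's own order.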
Answer 2
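You can simply `grep` for the lines containing `Name =` together with the line that follows each of them (`-A1`), then pipe the output to `grep RESULT` to drop the `Name =` lines: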
$ grep -A1 'Name =' file | grep RESULT
ligand_06278/out.pdbqt:REMARK VINA RESULT: -8.3 2.094 2.612
ligand_06279/out.pdbqt:REMARK VINA RESULT: -6.1 9.841 13.115
ligand_06280/out.pdbqt:REMARK VINA RESULT: -10.0 3.811 7.264
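The `grep -A1` pipeline relies on the first `RESULT` line coming directly after each `Name =` line. If other lines could appear in between, a single awk command with a flag is one way to relax that requirement. This is only a sketch: the sample data, the extra torsion line, and the names `sample`, `want` and `after_name` are assumptions made for illustration.

```shell
# Hypothetical sample where another REMARK line separates the Name line
# from the first RESULT line of ligand_A.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ligand_A/out.pdbqt:REMARK Name = 1
ligand_A/out.pdbqt:REMARK 2 active torsions
ligand_A/out.pdbqt:REMARK VINA RESULT: -8.3 2.094 2.612
ligand_A/out.pdbqt:REMARK VINA RESULT: -8.9 0.000 0.000
ligand_B/out.pdbqt:REMARK Name = 2
ligand_B/out.pdbqt:REMARK VINA RESULT: -6.1 9.841 13.115
EOF

# A Name line arms the flag; the next RESULT line is printed and
# disarms it, so later RESULT lines of the same block are skipped.
after_name=$(awk '/Name =/ { want = 1; next }
                  want && /RESULT/ { print; want = 0 }' "$sample")
printf '%s\n' "$after_name"

rm -f "$sample"
```

On this sample, `grep -A1 'Name =' | grep RESULT` would return only the `ligand_B` line, because the line after the `ligand_A` name is not a `RESULT` line.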