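I want to do something fairly complex, but I'll try to explain it as simply as possible. On my Linux machine I have many directories with different names that all follow the same format (e.g. 1trg_A -> ????_?). Each folder contains one or more files that are named the same way (they differ in a reference code and are tied to the folder name), e.g. Pocket_001_1trg_A.pdb_OUTPUT.txt. So in each ????_? folder I have one or more files named Pocket_***_????_?.pdb_OUTPUT.txt.
Each file looks like this: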
input PDB-File: /home/tommaso/Desktop/E3-ligase/CHAINS-approach/chains/1trg_A/1b47_A.pdb
Pocket File: /home/tommaso/Desktop/E3-ligase/CHAINS-approach/chains/1trg_A/Pocket_001_1trg_A.pdb
Pocket Surface: 460.7
Hydrophobic Surface: 54.6 (11.8%)
Polar Surface: 291.4 (63.2%)
Acceptor Surface: 226.7 (49.2%)
Donnor Surface: 163.7 (35.5%)
Exposed To Solvent: 133.3 (28.9%)
Pocket Volume: 1044.6
Hydrophobic Volume: 11.0 ( 1.1%)
Hydrophilic Volume: 199.1 (19.1%)
Flexible Volume: 203.3 (19.5%)
Rigid Volume: 51.9 ( 5.0%)
Buried Volume(B): 32.5 ( 3.1%)
Buried Volume(A): 0.0 ( 0.0%)
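For each directory, I want to extract from each file the name of the pocket (e.g. Pocket_001_1trg_A.pdb) and the "Exposed To Solvent" values (Pocket_001_1trg_A.pdb 133.3 28.9%).
This has to be done for every file, with all the data collected into a single output file. For example, suppose there are only two directories (2ert_B and 6yus_1) containing two different files, one in each (Pocket_001_2ert_B.pdb_OUTPUT.txt and Pocket_003_6yus_1.pdb_OUTPUT.txt).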
Pocket_001_2ert_B.pdb_OUTPUT.txt:
input PDB-File: /home/tommaso/Desktop/E3-ligase/CHAINS-approach/chains/2ert_B/2ert_B.pdb
Pocket File: /home/tommaso/Desktop/E3-ligase/CHAINS-approach/chains/2ert_B/Pocket_001_2ert_B.pdb
Pocket Surface: 460.7
Hydrophobic Surface: 54.6 (11.8%)
Polar Surface: 291.4 (63.2%)
Acceptor Surface: 226.7 (49.2%)
Donnor Surface: 163.7 (35.5%)
Exposed To Solvent: 125.4 (49.9%)
Pocket Volume: 1044.6
Hydrophobic Volume: 11.0 ( 1.1%)
Hydrophilic Volume: 199.1 (19.1%)
Flexible Volume: 203.3 (19.5%)
Rigid Volume: 51.9 ( 5.0%)
Buried Volume(B): 32.5 ( 3.1%)
Buried Volume(A): 0.0 ( 0.0%)
Pocket_003_6yus_1.pdb_OUTPUT.txt:
input PDB-File: /home/tommaso/Desktop/E3-ligase/CHAINS-approach/chains/6yus_1/26yus_1.pdb
Pocket File: /home/tommaso/Desktop/E3-ligase/CHAINS-approach/chains/6yus_1/Pocket_003_6yus_1.pdb
Pocket Surface: 460.7
Hydrophobic Surface: 54.6 (11.8%)
Polar Surface: 291.4 (63.2%)
Acceptor Surface: 226.7 (49.2%)
Donnor Surface: 163.7 (35.5%)
Exposed To Solvent: 45.3 (22.4%)
Pocket Volume: 1044.6
Hydrophobic Volume: 11.0 ( 1.1%)
Hydrophilic Volume: 199.1 (19.1%)
Flexible Volume: 203.3 (19.5%)
Rigid Volume: 51.9 ( 5.0%)
Buried Volume(B): 32.5 ( 3.1%)
Buried Volume(A): 0.0 ( 0.0%)
The file "output.txt" would then look like this:
Pocket_001_2ert_B.pdb 125.4 49.9%
Pocket_003_6yus_1.pdb 45.3 22.4%
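I don't know how to do this. I hope the question is clear and that someone more experienced than me can help. Thanks.
Answer 1
Assuming you are working in bash, with GNU grep and sed, and 3 directories: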
$ ls
1trg_A 2ert_B 6yus_1
You can use bash's globstar feature (**), which is enabled with shopt -s globstar:
$ ls **/Pocket_*.pdb_OUTPUT.txt
1trg_A/Pocket_001_1trg_A.pdb_OUTPUT.txt 2ert_B/Pocket_001_2ert_B.pdb_OUTPUT.txt 6yus_1/Pocket_003_6yus_1.pdb_OUTPUT.txt
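To see what globstar changes compared with a plain `*`, here is a small sketch. It assumes bash 4 or later; the temporary directory and the extra `nested` subdirectory are made up for illustration and are not part of the question's layout.

```shell
#!/usr/bin/env bash
# Sketch: compare '*' and '**' globbing in a throwaway directory tree.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical layout: one file directly in a folder, one a level deeper.
mkdir -p 1trg_A 2ert_B/nested
touch 1trg_A/Pocket_001_1trg_A.pdb_OUTPUT.txt
touch 2ert_B/nested/Pocket_001_2ert_B.pdb_OUTPUT.txt

# globstar makes '**' match any number of directory levels;
# nullglob makes a pattern without matches expand to nothing.
shopt -s globstar nullglob

one_level=( */Pocket_*.pdb_OUTPUT.txt )    # exactly one directory deep
any_depth=( **/Pocket_*.pdb_OUTPUT.txt )   # any depth below the current directory

echo "one level: ${#one_level[@]} file(s)"
echo "any depth: ${#any_depth[@]} file(s)"
```

For the layout in the question, where every file sits exactly one directory down, `*/Pocket_*.pdb_OUTPUT.txt` would find the same files as the `**` form.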
So now all you need to do is use grep to find the line you want:
$ grep -e '^Exposed To Solvent:' **/Pocket_*.pdb_OUTPUT.txt
1trg_A/Pocket_001_1trg_A.pdb_OUTPUT.txt:Exposed To Solvent: 133.3 (28.9%)
2ert_B/Pocket_001_2ert_B.pdb_OUTPUT.txt:Exposed To Solvent: 125.4 (49.9%)
6yus_1/Pocket_003_6yus_1.pdb_OUTPUT.txt:Exposed To Solvent: 45.3 (22.4%)
Then you have to modify the extracted lines with sed. The full command looks like this:
$ grep -e '^Exposed To Solvent:' **/Pocket_*.pdb_OUTPUT.txt | sed -e 's/^.*\(Pocket.*\.pdb\).*:/\1/;s/[()]//g' >myfile
$ cat myfile
Pocket_001_1trg_A.pdb 133.3 28.9%
Pocket_001_2ert_B.pdb 125.4 49.9%
Pocket_003_6yus_1.pdb 45.3 22.4%
Note: I assume that the 2ert_B folder does not contain a file named Pocket_001_1trg_A.pdb_OUTPUT.txt.
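To try the pipeline end to end without touching real data, a minimal sketch follows. It assumes bash 4 or later with GNU grep and sed; the temporary directory and the two shortened sample files are made up for illustration, and the result is written to output.txt, the name used in the question.

```shell
#!/usr/bin/env bash
# Sketch: rebuild two tiny sample files, then run the grep | sed pipeline.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Shortened stand-ins for the real files (values taken from the question).
mkdir 2ert_B 6yus_1
printf 'Pocket Surface: 460.7\nExposed To Solvent: 125.4 (49.9%%)\n' \
    > 2ert_B/Pocket_001_2ert_B.pdb_OUTPUT.txt
printf 'Pocket Surface: 460.7\nExposed To Solvent: 45.3 (22.4%%)\n' \
    > 6yus_1/Pocket_003_6yus_1.pdb_OUTPUT.txt

shopt -s globstar   # bash 4+; without it '**' behaves like '*'

# grep prefixes each hit with the file name; the first sed expression keeps
# only the Pocket_*.pdb part of that name, the second removes the parentheses.
grep -e '^Exposed To Solvent:' **/Pocket_*.pdb_OUTPUT.txt |
    sed -e 's/^.*\(Pocket.*\.pdb\).*:/\1/;s/[()]//g' > output.txt

cat output.txt
```

Because grep only prints the file name prefix when it is given more than one file, adding `-H` would be a reasonable safeguard for the case where the glob matches a single file.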
Answer 2
I instinctively reach for find and awk for this:
find . -type f -name "Pocket*.pdb" -exec awk -F: '$1~"Solvent"{last = split(FILENAME, bits, "/"); print bits[last],$2}' {} \;
Pocket_003_6yus_1.pdb 45.3 (22.4%)
Pocket_001_2ert_B.pdb 125.4 (49.9%)
Walkthrough
Starting from ., find all files (-type f) whose name (-name) matches the pattern "Pocket*.pdb", and run (-exec) the following code on them:
find . -type f -name "Pocket*.pdb" -exec
awk, with : as the field separator (-F), goes through each file line by line until it finds the text you are interested in ($1~"Solvent"):
awk -F: '$1~"Solvent"{
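Once a match is found, split breaks FILENAME on / into the array bits[] and returns the number of pieces, which is stored in last. awk then prints the file name (in bits[last]) together with the data you want ($2), and then exits, since no further processing is needed. find runs awk separately on each file it finds ({} followed by \;):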
last = split(FILENAME, bits, "/"); print bits[last],$2; exit 1}' {} \;
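The pattern "Pocket*.pdb" matches names that end in .pdb, while the files described in the question end in .pdb_OUTPUT.txt, and the requested output has no parentheses. The sketch below is one way the walkthrough could be adapted under those assumptions; the temporary directory and shortened sample files are made up for illustration, and the `sort` is only there because find does not guarantee any order.

```shell
#!/usr/bin/env bash
# Sketch: find + awk, adapted to the *_OUTPUT.txt names from the question.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Shortened stand-ins for the real files (values taken from the question).
mkdir 2ert_B 6yus_1
printf 'Pocket Surface: 460.7\nExposed To Solvent: 125.4 (49.9%%)\n' \
    > 2ert_B/Pocket_001_2ert_B.pdb_OUTPUT.txt
printf 'Pocket Surface: 460.7\nExposed To Solvent: 45.3 (22.4%%)\n' \
    > 6yus_1/Pocket_003_6yus_1.pdb_OUTPUT.txt

# The field separator is a colon plus any following spaces, so $1 is the
# label and $2 is the value part, e.g. "125.4 (49.9%)".
find . -type f -name 'Pocket_*.pdb_OUTPUT.txt' -exec awk -F': *' '
    $1 == "Exposed To Solvent" {
        n = split(FILENAME, parts, "/")   # last piece is the bare file name
        name = parts[n]
        sub(/_OUTPUT\.txt$/, "", name)    # keep only the Pocket_*.pdb part
        split($2, v, /[ ()]+/)            # v[1] = area, v[2] = percentage
        print name, v[1], v[2]
        exit                              # one match per file is enough
    }' {} \; | sort > output.txt

cat output.txt
```

Splitting the value on runs of spaces and parentheses also copes with padded percentages such as "( 5.0%)", which appear on other lines of the sample files.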
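But after typing all that out, @renaud's globstar can be combined with awk. Because a single awk now reads every file, use nextfile instead of exit so that awk moves on to the next file rather than stopping after the first match:
awk -F: '$1~"Solvent"{last = split(FILENAME, bits, "/"); print bits[last],$2; nextfile}' **/Pocket*.pdb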
Pocket_001_2ert_B.pdb 125.4 (49.9%)
Pocket_003_6yus_1.pdb 45.3 (22.4%)
Only the order differs.
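One detail worth checking when a single awk process is handed several files: `exit` ends the whole run, whereas `nextfile` only skips the rest of the current file. The sketch below counts the matches each variant reports. It assumes bash 4 or later and an awk that supports `nextfile` (GNU awk does); the temporary directory and shortened sample files are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: compare 'exit' and 'nextfile' when one awk reads several files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Shortened stand-ins for the real files (values taken from the question).
mkdir 2ert_B 6yus_1
printf 'Exposed To Solvent: 125.4 (49.9%%)\n' \
    > 2ert_B/Pocket_001_2ert_B.pdb_OUTPUT.txt
printf 'Exposed To Solvent: 45.3 (22.4%%)\n' \
    > 6yus_1/Pocket_003_6yus_1.pdb_OUTPUT.txt

shopt -s globstar   # bash 4+

# 'exit' stops awk entirely after the first matching line in the first file.
with_exit=$(awk -F: '$1~"Solvent"{print FILENAME; exit}' \
    **/Pocket_*.pdb_OUTPUT.txt | wc -l | tr -d ' ')

# 'nextfile' abandons only the current file and continues with the next one.
with_nextfile=$(awk -F: '$1~"Solvent"{print FILENAME; nextfile}' \
    **/Pocket_*.pdb_OUTPUT.txt | wc -l | tr -d ' ')

echo "exit: $with_exit line(s), nextfile: $with_nextfile line(s)"
```

With `find ... -exec awk ... {} \;` the distinction does not matter, because each awk invocation sees exactly one file.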