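First things first: no knowledge of slurm or Infiniband is required. This is purely a text-processing problem.
Second, I know about ib2slurm, but the code is somewhat broken and most likely outdated: it core dumps on every run, regardless of whether the mapping file exists or how it is formatted.
I can reduce the output of ibnetdiscover to 37-line blocks of this form: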
Switch 36 "S-0002c90200423e70" # "MF0;ibsw20:SX6036/U1" enhanced port 0 lid 3 lmc 0
[1] "H-0002c903000c26f2"[1](2c903000c26f3) # "compute061 HCA-1" lid 49 4xQDR
[2] "H-0002c903000bf36e"[1](2c903000bf36f) # "compute060 HCA-1" lid 1 4xQDR
[3] "H-0002c903000bf35a"[1](2c903000bf35b) # "compute063 HCA-1" lid 28 4xQDR
[4] "H-0002c903000c2646"[1](2c903000c2647) # "compute062 HCA-1" lid 25 4xQDR
[5] "H-0002c903000bf35e"[1](2c903000bf35f) # "compute064 HCA-1" lid 31 4xQDR
[6] "H-0002c903000c26de"[1](2c903000c26df) # "compute065 HCA-1" lid 47 4xQDR
[7] "S-0002c90200423e80"[31] # "Infiniscale-IV Mellanox Technologies" lid 6 4xQDR
[8] "S-0002c90200423e80"[32] # "Infiniscale-IV Mellanox Technologies" lid 6 4xQDR
[9] "S-0002c90200423e80"[33] # "Infiniscale-IV Mellanox Technologies" lid 6 4xQDR
[10] "S-0002c90200423e80"[34] # "Infiniscale-IV Mellanox Technologies" lid 6 4xQDR
[11] "S-0002c90200423e80"[35] # "Infiniscale-IV Mellanox Technologies" lid 6 4xQDR
[12] "S-0002c90200423e80"[36] # "Infiniscale-IV Mellanox Technologies" lid 6 4xQDR
[13] "S-0002c90200423eb8"[35] # "Infiniscale-IV Mellanox Technologies" lid 11 4xQDR
[14] "S-0002c90200423eb8"[36] # "Infiniscale-IV Mellanox Technologies" lid 11 4xQDR
[15] "S-0002c90200423eb8"[33] # "Infiniscale-IV Mellanox Technologies" lid 11 4xQDR
[16] "S-0002c90200423eb8"[34] # "Infiniscale-IV Mellanox Technologies" lid 11 4xQDR
[17] "S-0002c90200423eb8"[31] # "Infiniscale-IV Mellanox Technologies" lid 11 4xQDR
[18] "S-0002c90200423eb8"[32] # "Infiniscale-IV Mellanox Technologies" lid 11 4xQDR
[19] "S-0002c90200423ee0"[31] # "Infiniscale-IV Mellanox Technologies" lid 15 4xQDR
[20] "S-0002c90200423ee0"[32] # "Infiniscale-IV Mellanox Technologies" lid 15 4xQDR
[21] "S-0002c90200423ee0"[33] # "Infiniscale-IV Mellanox Technologies" lid 15 4xQDR
[22] "S-0002c90200423ee0"[34] # "Infiniscale-IV Mellanox Technologies" lid 15 4xQDR
[23] "S-0002c90200423ee0"[35] # "Infiniscale-IV Mellanox Technologies" lid 15 4xQDR
[24] "S-0002c90200423ee0"[36] # "Infiniscale-IV Mellanox Technologies" lid 15 4xQDR
[25] "H-0002c903000c26fa"[1](2c903000c26fb) # "compute046 HCA-1" lid 112 4xQDR
[26] "H-0002c903000c26e2"[1](2c903000c26e3) # "compute047 HCA-1" lid 63 4xQDR
[27] "H-0002c903000c263a"[1](2c903000c263b) # "compute048 HCA-1" lid 59 4xQDR
[28] "H-0002c903000c27c2"[1](2c903000c27c3) # "compute049 HCA-1" lid 117 4xQDR
[29] "H-0002c903000c27a6"[1](2c903000c27a7) # "compute051 HCA-1" lid 34 4xQDR
[30] "H-0002c903000c2732"[1](2c903000c2733) # "compute050 HCA-1" lid 22 4xQDR
[31] "H-0002c903000c265e"[1](2c903000c265f) # "compute052 HCA-1" lid 29 4xQDR
[32] "H-0002c903000c266a"[1](2c903000c266b) # "compute055 HCA-1" lid 32 4xQDR
[33] "H-0002c903000c264e"[1](2c903000c264f) # "compute054 HCA-1" lid 26 4xQDR
[34] "H-0002c903000c26ee"[1](2c903000c26ef) # "compute056 HCA-1" lid 48 4xQDR
[35] "H-0002c903000bf246"[1](2c903000bf247) # "compute057 HCA-1" lid 33 4xQDR
[36] "H-0002c903000c27ca"[1](2c903000c27cb) # "compute053 HCA-1" lid 44 4xQDR
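and I can extract the node names, e.g. compute061, with awk or sed.
I want to get one line per block, starting with the switch name and followed by the node names, i.e.: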
ibsw20 compute061 compute060 compute063 compute062 compute064 compute065 compute046 compute047 compute048 compute049 compute051 compute050 compute052 compute055 compute054 compute056 compute057 compute053
I plan to use slurm's scontrol show hostlist "<nodename> <nodename> ..."
to compress the nodes into a single entry to push into slurm's topology.conf file, which must have the form:
SwitchName=ibsw20 Nodes=compute[046-057,060-061]
Any ideas?
I should mention that after all the switch mappings, the ibnetdiscover output continues with the reverse mapping, node by node to switch, in the form:
vendid=0x2c9
devid=0x673c
sysimgguid=0x2c903000bf371
caguid=0x2c903000bf36e
Ca 1 "H-0002c903000bf36e" # "compute060 HCA-1"
[1](2c903000bf36f) "S-0002c90200423e70"[2] # lid 1 lmc 0 "MF0;ibsw20:SX6036/U1" lid 3 4xQDR
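Each block is separated by a blank line.
A simplified question that would get me started: how do I parse multiple lines of text into a single line, extracting different parts of each line (handling header lines and body lines differently) and discarding lines that contain no relevant data?
Edit: the blocks may not be full. If some ports on a switch have nothing connected, the output skips those lines, which can produce something like this: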
Switch 36 "S-0002c90200423e70" # "MF0;ibsw20:SX6036/U1" enhanced port 0 lid 3 lmc 0
[2] "H-0002c903000bf36e"[1](2c903000bf36f) # "compute060 HCA-1" lid 1 4xQDR
[3] "H-0002c903000bf35a"[1](2c903000bf35b) # "compute063 HCA-1" lid 28 4xQDR
[4] "H-0002c903000c2646"[1](2c903000c2647) # "compute062 HCA-1" lid 25 4xQDR
[15] "S-0002c90200423eb8"[33] # "Infiniscale-IV Mellanox Technologies" lid 11 4xQDR
[33] "H-0002c903000c264e"[1](2c903000c264f) # "compute074 HCA-1" lid 26 4xQDR
[34] "H-0002c903000c26ee"[1](2c903000c26ef) # "compute076 HCA-1" lid 48 4xQDR
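So I cannot simply rely on each Switch line being followed by 36 lines, or on [36] always being the last line of a switch block.
Answer 1
Q1
This awk command extracts a sorted list of unique machine names from the file, assuming:
The source file is much longer, with one block of lines per switch.
The script below sorts each switch block and removes duplicate nodes, assuming the Switch line is always the first line of each switch's contiguous set of lines. It needs GNU awk, because asorti is a gawk extension: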
awk -v FS='[#"]' '
BEGIN{c=0}
$1~/Switch/ {c++; j=0; split($5,arr,"[;:]" ); sw[c,0]=arr[2] }
$1~/\[[0-9]+\]/ { j++; split($5,arr," " ); sw[c,j]=arr[1] }
END {
    print("final count of switches=" c)
    for (i=1; i<=c; i++) {
        print( "switch=" i, sw[i,0] )      # show switch number.
        split("", out , ":" )              # delete array "out".
        split("", indices , ":" )          # delete array "indices".
        j=0
        while (sw[i,++j]) {                # for all array elements.
            if (out[sw[i,j]]++ < 1) {      # Is it a new value?
                indices[sw[i,j]]=j         # add to array "indices".
            }
        }
        n=asorti(indices)                  # sort the keys of indices
        printf( "%s ", sw[i,0] )
        for (k=1; k<=n; k++) {             # all values for a switch.
            printf( "%s ", indices[k] )
        }
        printf( "%s\n", "" )
    }
}
' infile
Result:
final count of switches=3
switch=1 ibsw20
ibsw20 Infiniscale-IV compute060 compute061 compute062 compute063
compute064 compute065 compute066 compute067 compute068 compute069
compute070 compute071 compute072 compute073 compute074 compute075
compute076 compute077
switch=2 ibsw21
ibsw21 Infiniscale-IV compute060 compute061 compute062 compute063
compute064 compute065 compute066 compute067 compute068 compute069
compute070 compute071 compute072 compute073 compute074 compute075
compute076 compute077
switch=3 ibsw22
ibsw22 Infiniscale-IV compute060 compute062 compute063 compute074
compute076
I'm not sure whether Infiniscale-IV should be removed, or whether you also want further processing to get:
SwitchName=ibsw20 Nodes=compute[060-077]
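If scontrol is not at hand, the range compression can be approximated in plain awk. The following is only a minimal sketch, not a replacement for scontrol show hostlist: compress_nodes is a hypothetical helper name, and it assumes every node on a line shares one prefix and one zero-padded number width (as in compute046). Names with no trailing digits, such as Infiniscale-IV, are skipped.

```shell
# Hypothetical helper: reads lines of the form "switch node node ..." on stdin
# and prints one "SwitchName=... Nodes=..." line per input line.
compress_nodes() {
  awk '
  # Format one range, zero-padded to width w (fmt is a local variable).
  function range(lo, hi, w,    fmt) {
    fmt = "%0" w "d"
    return (lo == hi) ? sprintf(fmt, lo) : sprintf(fmt "-" fmt, lo, hi)
  }
  {
    n = 0; prefix = ""; width = 0
    split("", seen)                          # reset the per-line set of names
    for (f = 2; f <= NF; f++) {
      # skip names without a numeric suffix, and duplicates
      if (!match($f, /[0-9]+$/) || ($f in seen)) continue
      seen[$f] = 1
      prefix = substr($f, 1, RSTART - 1)     # assumed identical for all nodes
      width  = RLENGTH                       # assumed identical for all nodes
      num[++n] = substr($f, RSTART) + 0
    }
    if (n == 0) next                         # no hosts on this switch
    # insertion sort on the numeric suffixes
    for (a = 2; a <= n; a++) {
      v = num[a]
      for (b = a - 1; b >= 1 && num[b] > v; b--) num[b + 1] = num[b]
      num[b + 1] = v
    }
    # merge consecutive numbers into ranges
    out = ""; lo = hi = num[1]
    for (a = 2; a <= n + 1; a++) {
      if (a <= n && num[a] == hi + 1) { hi = num[a]; continue }
      out = out (out == "" ? "" : ",") range(lo, hi, width)
      if (a <= n) lo = hi = num[a]
    }
    nodes = (n == 1) ? prefix out : prefix "[" out "]"
    printf "SwitchName=%s Nodes=%s\n", $1, nodes
  }'
}
```

Fed the example line from the question (ibsw20 followed by compute061 ... compute053), this prints SwitchName=ibsw20 Nodes=compute[046-057,060-065]. The sort is done by hand so the sketch does not depend on gawk's asorti.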
Q2
From "man awk":
If RS is set to the null string, then records are separated by blank lines.
That is, set the record separator (RS) to the empty string:
awk -v RS='' 'script to process lines' file
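As an illustration of that paragraph mode, here is a minimal sketch applied to the node-to-switch blocks shown in the question. The function name ca_to_switch is hypothetical, and the sketch assumes each line starts at column 0 exactly as in the sample (real ibnetdiscover output may be indented differently). With FS set to a newline, every line of a block becomes one field, so header lines and port lines can be handled separately and everything else is ignored.

```shell
# Hypothetical helper: reads blank-line-separated blocks on stdin and prints
# "node switch" for every block that contains a Ca header line.
ca_to_switch() {
  awk -v RS='' -F'\n' '
  {
    node = ""; sw = ""
    for (i = 1; i <= NF; i++) {            # one field per line of the block
      if ($i ~ /^Ca[ \t]/) {
        split($i, q, "\"")                 # q[4] is the quoted node description
        split(q[4], w, " "); node = w[1]   # first word, e.g. compute060
      } else if ($i ~ /^\[[0-9]+\]\(/) {
        split($i, q, "\"")                 # q[4] is the quoted switch description
        split(q[4], w, /[;:]/); sw = w[2]  # text between ";" and ":", e.g. ibsw20
      }
    }
    if (node != "") print node, sw         # blocks without a Ca line are dropped
  }'
}
```

For the sample block in the question this prints compute060 ibsw20. If a node has several connected ports, only the switch of the last port line is reported.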
Answer 2
This is basically BinaryZebra's answer, modified to create the slurm topology.conf file:
ibnetdiscover | awk -v FS='[#"]' '
BEGIN{c=0}
$1~/Switch/ {c++; j=0; split($5,arr,"[;:]" ); sw[c,0]=arr[2] }
$1~/\[[0-9]+\]/ && $2~/^H-/ { j++; split($5,arr," " ); sw[c,j]=arr[1] }   # hosts only, skip switch-to-switch links
END {
    # print("final count of switches=" c)
    for (i=1; i<=c; i++) {
        printf( "SwitchName=s%d", i )      # show switch number.
        split("", out , ":" )              # delete array "out".
        split("", indices , ":" )          # delete array "indices".
        j=0
        while (sw[i,++j]) {                # for all array elements.
            if (out[sw[i,j]]++ < 1) {      # Is it a new value?
                indices[sw[i,j]]=j         # add to array "indices".
            }
        }
        n=asorti(indices)                  # sort the keys of indices
        # printf( "%s ", sw[i,0] )
        printf ( " Nodes=" )
        for (k=1; k<n; k++) {              # all values for a switch.
            printf( "%s,", indices[k] )
        }
        printf( "%s\n", indices[n] )       # last value, without a trailing comma.
    }
}
' | sed -r '/Nodes=$/d' | awk '{sub(/[0-9]+/, ++i)}1; END{printf( "SwitchName=s%s Switches=s[1-%s]\n", NR+1, NR )}'
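If you need a compressed host list, just pass the value of each Nodes= line through scontrol show hostlist. With that change, the final stage of the pipeline looks like this: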
| awk -F= '{sub(/[[:digit:]]+/, ++i) ; cmd= "scontrol show hostlist " $3 ; cmd | getline line ; close(cmd) ; printf( "%s=%s=%s\n" , $1, $2, line ) } END{printf( "SwitchName=s%s Switches=s[1-%s]\n", NR+1, NR )}'