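I'm new to Unix and have a question about subsetting data; I'd appreciate any help. I have a 23 GB input file with millions of lines, but I only want to keep the lines where the first and fourth columns (the scaffold names) are identical. Here are the first few lines of my dataset: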
tscaffold94_798049_802097 999 NA tscaffold94_798049_802097 999 NA 1
tscaffold94_798049_802097 999 NA tscaffold94_798049_802097 1029 NA 1
tscaffold94_798049_802097 999 NA tscaffold94_798049_802097 1044 NA -0.0463767871013283
tscaffold94_798049_802097 999 NA tscaffold94_798049_802097 1045 NA -0.939576278422824
tscaffold94_798049_802097 999 NA tscaffold94_798049_802097 1130 NA -0.0831304705346077
tscaffold94_798049_802097 999 NA tscaffold94_798049_802097 1180 NA -0.931681175211672
tscaffold94_798049_802097 999 NA tscaffold94_798049_802097 1187 NA -0.940010336852543
tscaffold94_798049_802097 999 NA tscaffold94_798049_802097 1202 NA 1
tscaffold94_798049_802097 999 NA tscaffold94_798049_802097 1224 NA 1
tscaffold94_798049_802097 999 NA tscaffold94_798049_802097 1269 NA 1
tscaffold94_798049_802097 999 NA tscaffold94_798049_802097 1313 NA -0.201478578143067
tscaffold94_798049_802097 999 NA tscaffold94_798049_802097 1384 NA 1
tscaffold94_798049_802097 999 NA tscaffold94_878564_884314 3259 NA -0.595441932439136
tscaffold94_798049_802097 999 NA tscaffold94_878564_884314 3304 NA 0.745699172241005
tscaffold94_798049_802097 999 NA tscaffold94_878564_884314 3319 NA -0.570318634275133
tscaffold94_798049_802097 999 NA tscaffold94_878564_884314 3588 NA -0.60363963711489
Answer 1
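awk is your friend in this case. Each whitespace-separated column becomes a variable in the awk script ($1, $2, and so on), so it is easy to test two columns for equality and then perform an action such as print (with no argument, print outputs the current line):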
awk '{if($1 == $4) print}' < input
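The same filter can also be written as a bare pattern, because awk prints the current line by default when a pattern has no action. The sketch below shows this on a three-line sample taken from the question's data. The file names sample_input.txt and sample_filtered.txt are placeholders chosen for illustration, not names from the question.

```shell
# Build a tiny sample in the same whitespace-separated layout as the question's data.
# (sample_input.txt is a hypothetical file name used only for this sketch.)
cat > sample_input.txt <<'EOF'
tscaffold94_798049_802097 999 NA tscaffold94_798049_802097 999 NA 1
tscaffold94_798049_802097 999 NA tscaffold94_798049_802097 1029 NA 1
tscaffold94_798049_802097 999 NA tscaffold94_878564_884314 3259 NA -0.595441932439136
EOF

# A pattern with no action block prints every line for which the pattern is true,
# so this keeps only the lines whose first and fourth fields match.
awk '$1 == $4' sample_input.txt > sample_filtered.txt

# Show the kept lines (the third sample line is dropped).
cat sample_filtered.txt
```

awk reads its input one line at a time, so this approach should not need to hold the whole 23 GB file in memory. Redirecting the output to a new file, as above, leaves the original input untouched.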