I have a large file with 27 columns and nearly 6 million rows. Here is a small example of my file:
head data
0.65 0.722222 1.0 0.75 0
0.35 0.277778 0.0 0.25 0
0 0.666667 0.75 0.5 0.5625
0 0.333333 0.25 0.5 0.4375
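The rows are samples, and I have 2 rows per sample (one for observation "a" and the other for observation "b"). In the example above I show data for 2 samples (rows 1 and 2 correspond to sample 1, rows 3 and 4 to sample 2). I want to check, column by column, whether both observations of a sample are 0 and, if so, replace them with 9. This is my desired output: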
head desired
0.65 0.722222 1.0 0.75 9
0.35 0.277778 0.0 0.25 9
9 0.666667 0.75 0.5 0.5625
9 0.333333 0.25 0.5 0.4375
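Is there a perl, python, or bash solution (if bash is reliable for a file this large) for doing this? In the past I simply split the file by sample and ran the following code on each file: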
awk 'NR==1 { split($0,a);next;} NR==2 {split($0,b);for(i=1;i<= NF;i++) printf("%s%s",(i==1?"":"\t"),a[i]==0 && b[i]==0?9:a[i]);
printf("\n");;for(i=1;i<= NF;i++) printf("%s%s",(i==1?"":"\t"),a[i]==0 && b[i]==0?9:b[i]);printf("\n");} '
But now I want to do this on the whole file, without splitting it.
Thanks.
Answer 1
Here is how I would do it in Python:
#!/usr/bin/env python3
# Note: this script only checks the last column of each line.
firstLineZero = False
prevLine = None
# Open the file for reading
with open("biodata2", "r") as inFile:
    for line in inFile:
        # Check if last value in line is 0
        if not firstLineZero and line.split()[-1] == "0":
            # Save this line, and set a boolean
            firstLineZero = True
            prevLine = line
        elif firstLineZero and line.split()[-1] == "0":
            # Now we know that both lines end with 0.
            # Change the final value to 9 in both lines...
            prevLineSplit = prevLine.split()
            thisLineSplit = line.split()
            prevLineSplit[-1] = "9"
            thisLineSplit[-1] = "9"
            prevLine = "\t".join(prevLineSplit)
            thisLine = "\t".join(thisLineSplit)
            print(prevLine)
            print(thisLine)
            # Reset boolean
            firstLineZero = False
            # Reset prevLine
            prevLine = None
        else:
            # A saved line whose partner does not end with 0 is printed unchanged
            if prevLine is not None:
                print(prevLine, end="")
                firstLineZero = False
                prevLine = None
            print(line, end="")
# If we have a 'trailing' saved line, print that
if prevLine is not None:
    print(prevLine, end="")
Example run, using a few lines of data as a proof of concept.
Data:
cat biodata2
0.65 0.722222 1.0 0.75 0
0.35 0.277778 0.0 0.25 0
0 0.666667 0.75 0.5 0.5625
0 0.333333 0.25 0.5 0.4375
0 0.333333 0.25 0.5 1
0 0.333333 0.25 0.5 0
Execution:
./readBioData.py
0.65 0.722222 1.0 0.75 9
0.35 0.277778 0.0 0.25 9
0 0.666667 0.75 0.5 0.5625
0 0.333333 0.25 0.5 0.4375
0 0.333333 0.25 0.5 1
0 0.333333 0.25 0.5 0
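Obviously, if you want to save the result to a file rather than print it to stdout, you need to change the print statements to write calls and open a file for writing, like this: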
#!/usr/bin/env python3
# Note: this script only checks the last column of each line.
firstLineZero = False
prevLine = None
# Open the input file for reading and the output file for writing
with open("biodata2", "r") as inFile, open("bioDataOut.txt", "w") as outFile:
    for line in inFile:
        # Check if last value in line is 0
        if not firstLineZero and line.split()[-1] == "0":
            # Save this line, and set a boolean
            firstLineZero = True
            prevLine = line
        elif firstLineZero and line.split()[-1] == "0":
            # Now we know that both lines end with 0.
            # Change the final value to 9 in both lines...
            prevLineSplit = prevLine.split()
            thisLineSplit = line.split()
            prevLineSplit[-1] = "9"
            thisLineSplit[-1] = "9"
            prevLine = "\t".join(prevLineSplit)
            thisLine = "\t".join(thisLineSplit)
            outFile.write(prevLine + "\n")
            outFile.write(thisLine + "\n")
            # Reset boolean
            firstLineZero = False
            # Reset prevLine
            prevLine = None
        else:
            # A saved line whose partner does not end with 0 is written unchanged
            if prevLine is not None:
                outFile.write(prevLine)
                firstLineZero = False
                prevLine = None
            outFile.write(line)
    # If we have a 'trailing' saved line, write that
    if prevLine is not None:
        outFile.write(prevLine)
Then you can run:
./readBioDataSaveToFile.py
cat bioDataOut.txt
0.65 0.722222 1.0 0.75 9
0.35 0.277778 0.0 0.25 9
0 0.666667 0.75 0.5 0.5625
0 0.333333 0.25 0.5 0.4375
0 0.333333 0.25 0.5 1
0 0.333333 0.25 0.5 0
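The script above only looks at the last column, whereas the question asks for the check in every column and treats lines strictly as consecutive pairs (1 and 2, 3 and 4, and so on). A minimal sketch of that variant follows. It assumes every field is numeric and compares values as numbers, the way the awk one-liner in the question does; the names `fix_pair` and `process` are made up for illustration.

```python
#!/usr/bin/env python3
import io
import sys


def fix_pair(first, second, marker="9"):
    """Return both rows with `marker` in every column where both rows are zero."""
    a, b = first.split(), second.split()
    for i, (x, y) in enumerate(zip(a, b)):
        # Numeric comparison, so "0", "0.0" and "0.000" all count as zero
        if float(x) == 0.0 and float(y) == 0.0:
            a[i] = b[i] = marker
    return "\t".join(a), "\t".join(b)


def process(in_stream, out_stream):
    """Read the input two lines at a time, so memory use stays constant."""
    lines = iter(in_stream)
    for first in lines:
        second = next(lines, None)
        if second is None:
            # Odd number of lines: pass the unpaired last line through unchanged
            out_stream.write(first)
            break
        for row in fix_pair(first, second):
            out_stream.write(row + "\n")


if __name__ == "__main__":
    # Sample data from the question, held in memory for the demonstration
    sample = io.StringIO(
        "0.65 0.722222 1.0 0.75 0\n"
        "0.35 0.277778 0.0 0.25 0\n"
        "0 0.666667 0.75 0.5 0.5625\n"
        "0 0.333333 0.25 0.5 0.4375\n"
    )
    process(sample, sys.stdout)
```

For the real file, the same function can be given two open file objects, for example `with open("data") as src, open("desired", "w") as dst: process(src, dst)`. Because only two lines are held at any time, the 6 million rows should not be a memory concern.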
Answer 2
The trick to processing pairs of lines is to join them:
paste - - < paired_file
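You can then test and manipulate the fields with awk ($1==0 && $6==0, and so on). In the 5-column example, column i of the first line of a pair is field $i of the joined line and column i of the second line is field $(i+5); for the 27-column file the offset is 27.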
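The field arithmetic behind the joined-line approach can be sketched in Python as well. This is only an illustration of the indexing, not a replacement for the `paste` and awk pipeline: `merged_pairs` and `fix_merged` are hypothetical helper names, all fields are assumed to be numeric, and indices are 0-based here while awk fields are 1-based.

```python
def merged_pairs(lines):
    """Yield each pair of lines as one list of fields, much like `paste - -`."""
    it = iter(lines)
    # zip(it, it) pulls two lines per step; a trailing unpaired line is dropped
    for first, second in zip(it, it):
        yield first.split() + second.split()


def fix_merged(fields, marker="9"):
    """Set field i and field i+n to `marker` when both are zero."""
    n = len(fields) // 2  # number of columns in each original line
    for i in range(n):
        if float(fields[i]) == 0.0 and float(fields[i + n]) == 0.0:
            fields[i] = fields[i + n] = marker
    # Split the joined record back into its two original rows
    return fields[:n], fields[n:]


if __name__ == "__main__":
    sample = [
        "0.65 0.722222 1.0 0.75 0\n",
        "0.35 0.277778 0.0 0.25 0\n",
        "0 0.666667 0.75 0.5 0.5625\n",
        "0 0.333333 0.25 0.5 0.4375\n",
    ]
    for fields in merged_pairs(sample):
        for row in fix_merged(fields):
            print("\t".join(row))
```

Here `n` plays the role of the offset between the two halves of a joined line, which is 5 for the example and 27 for the full file.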