Conditionally replacing values in rows with a number

I have a large file with 27 columns and nearly 6 million rows. Here is a small example of my file:

head data
0.65   0.722222   1.0      0.75     0
0.35   0.277778   0.0      0.25     0
0      0.666667   0.75     0.5      0.5625
0      0.333333   0.25     0.5      0.4375 

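The rows are samples, and I have "two rows per sample" (one for observation "a" and one for observation "b"). In the example above I show data for 2 samples (rows 1 and 2 belong to sample 1, rows 3 and 4 to sample 2). For each sample, I want to check whether both observations in a column are 0 and, if so, replace them with 9. This is the output I want: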

head desired
0.65   0.722222   1.0      0.75     9
0.35   0.277778   0.0      0.25     9
9      0.666667   0.75     0.5      0.5625
9      0.333333   0.25     0.5      0.4375 

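Is there a Perl, Python, or bash solution (if bash is reliable for a file this large) for doing this? Until now I have simply split the file by sample and run the following code on each piece: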

awk 'NR==1 { split($0,a); next }
     NR==2 { split($0,b)
             for(i=1;i<=NF;i++) printf("%s%s",(i==1?"":"\t"),a[i]==0 && b[i]==0?9:a[i])
             printf("\n")
             for(i=1;i<=NF;i++) printf("%s%s",(i==1?"":"\t"),a[i]==0 && b[i]==0?9:b[i])
             printf("\n") }'

But now I want to do this on the whole file, without splitting it.

Thanks.


Answer 1

Here is how I would do it in Python:

#!/usr/bin/env python3

firstLineZero = False
prevLine = None  # Saved line whose last value is 0, if any

# Open the file for reading
with open("biodata2", "r") as inFile:
    for line in inFile:
        fields = line.split()
        # Check if last value in line is 0 (blank lines never match)
        lastIsZero = bool(fields) and fields[-1] == "0"
        if not firstLineZero and lastIsZero:
            # Save this line, and set a boolean
            firstLineZero = True
            prevLine = line
        elif firstLineZero and lastIsZero:
            # Now we know that both lines end with 0.
            # Change the final value to 9 in both lines...
            prevLineSplit = prevLine.split()
            prevLineSplit[-1] = "9"
            fields[-1] = "9"
            print("\t".join(prevLineSplit))
            print("\t".join(fields))
            # Reset boolean and saved line
            firstLineZero = False
            prevLine = None
        else:
            # The saved line has no partner: print it unchanged first
            if prevLine is not None:
                print(prevLine, end="")
                firstLineZero = False
                prevLine = None
            print(line, end="")

# If we have a 'trailing' saved line, print that
if prevLine is not None:
    print(prevLine, end="")

Example run, with a few lines of data as a proof of concept.

Data:

cat biodata2 
0.65    0.722222    1.0     0.75    0
0.35    0.277778    0.0     0.25    0
0       0.666667    0.75    0.5     0.5625
0       0.333333    0.25    0.5     0.4375
0       0.333333    0.25    0.5     1
0       0.333333    0.25    0.5     0

Run:

./readBioData.py
0.65    0.722222    1.0     0.75    9
0.35    0.277778    0.0     0.25    9
0       0.666667    0.75    0.5     0.5625
0       0.333333    0.25    0.5     0.4375
0       0.333333    0.25    0.5     1
0       0.333333    0.25    0.5     0

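Obviously, if you want to save the result to a file rather than print it to stdout, you have to change the print statements to write calls and open a file for writing.

Like this: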

#!/usr/bin/env python3

firstLineZero = False
prevLine = None  # Saved line whose last value is 0, if any

# Open the input for reading and the output for writing
with open("biodata2", "r") as inFile, open("bioDataOut.txt", "w") as outFile:
    for line in inFile:
        fields = line.split()
        # Check if last value in line is 0 (blank lines never match)
        lastIsZero = bool(fields) and fields[-1] == "0"
        if not firstLineZero and lastIsZero:
            # Save this line, and set a boolean
            firstLineZero = True
            prevLine = line
        elif firstLineZero and lastIsZero:
            # Now we know that both lines end with 0.
            # Change the final value to 9 in both lines...
            prevLineSplit = prevLine.split()
            prevLineSplit[-1] = "9"
            fields[-1] = "9"
            outFile.write("\t".join(prevLineSplit) + "\n")
            outFile.write("\t".join(fields) + "\n")
            # Reset boolean and saved line
            firstLineZero = False
            prevLine = None
        else:
            # The saved line has no partner: write it unchanged first
            if prevLine is not None:
                outFile.write(prevLine)
                firstLineZero = False
                prevLine = None
            outFile.write(line)

    # If we have a 'trailing' saved line, write that
    if prevLine is not None:
        outFile.write(prevLine)

Then you can do:

./readBioDataSaveToFile.py
cat bioDataOut.txt 
0.65    0.722222    1.0     0.75    9
0.35    0.277778    0.0     0.25    9
0       0.666667    0.75    0.5     0.5625
0       0.333333    0.25    0.5     0.4375
0       0.333333    0.25    0.5     1
0       0.333333    0.25    0.5     0
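Note that the script above only looks at the last column, which is why the zeros in the first column of rows 3 and 4 are left alone. As a rough sketch of the per-column replacement the question describes, one could read the file two lines at a time and compare every column of the pair. This assumes whitespace-separated numeric fields and that each sample occupies exactly two consecutive lines; the function names `replace_paired_zeros` and `process` and the script name in the usage comment are made up for illustration.

```python
#!/usr/bin/env python3
import sys


def replace_paired_zeros(row_a, row_b, marker="9"):
    """Return copies of both rows with `marker` wherever both rows hold zero."""
    out_a, out_b = list(row_a), list(row_b)
    for i, (a, b) in enumerate(zip(row_a, row_b)):
        # Compare numerically so "0" and "0.0" both count as zero.
        if float(a) == 0 and float(b) == 0:
            out_a[i] = out_b[i] = marker
    return out_a, out_b


def process(lines, out=sys.stdout):
    """Stream over an iterable of lines, two lines (one sample) at a time."""
    it = iter(lines)
    for line_a in it:
        line_b = next(it, None)
        if line_b is None:
            # Odd number of lines: pass the unpaired line through unchanged.
            out.write(line_a if line_a.endswith("\n") else line_a + "\n")
            break
        new_a, new_b = replace_paired_zeros(line_a.split(), line_b.split())
        out.write("\t".join(new_a) + "\n")
        out.write("\t".join(new_b) + "\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: ./pairs.py biodata2 > bioDataOut.txt
    with open(sys.argv[1]) as in_file:
        process(in_file)
```

Because only two lines are held in memory at any time, this approach should cope with a file of several million rows without splitting it first.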

Answer 2

The trick to working with pairs of lines is to join them:

paste - - < paired_file

Then you can use awk to test/manipulate the fields ($1==0 && $6==0, etc.)
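With the 5-column example, joining a pair gives a 10-field row, so field 1 belongs with field 6, field 2 with field 7, and so on. The same idea can be sketched in Python, used here instead of awk purely for illustration: `zip(it, it)` plays the role of `paste - -`, and field `i` is compared with field `i + n`, where `n` is the number of columns per line. The helper names `paste_pairs` and `fix_joined_row` are hypothetical, and the sketch assumes numeric, whitespace-separated fields.

```python
def paste_pairs(lines):
    """Mimic `paste - -`: join every two consecutive lines into one row of fields."""
    it = iter(lines)
    # zip(it, it) pulls two items from the same iterator on each step;
    # a trailing unpaired line is silently dropped.
    for line_a, line_b in zip(it, it):
        yield line_a.split() + line_b.split()


def fix_joined_row(fields, marker="9"):
    """In a joined row, field i pairs with field i + n (n = columns per line)."""
    n = len(fields) // 2
    fixed = list(fields)
    for i in range(n):
        if float(fields[i]) == 0 and float(fields[i + n]) == 0:
            fixed[i] = fixed[i + n] = marker
    # Split the joined row back into its two original lines.
    return fixed[:n], fixed[n:]


sample = [
    "0 0.666667 0.75 0.5 0.5625",
    "0 0.333333 0.25 0.5 0.4375",
]
for joined in paste_pairs(sample):
    row_a, row_b = fix_joined_row(joined)
    print("\t".join(row_a))
    print("\t".join(row_b))
```

Splitting the joined row back in two at the end restores the original two-lines-per-sample layout.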
