在块中插入缺失的数据

在块中插入缺失的数据

我是一名生物学家,我正在努力解决这个问题,请指导我从哪里开始。我不知道该发布到哪个论坛,如果这个地方不合适,请告诉我。

我有一些值块,这些值可以严格是两个来源或多个来源的混合。

'source1', 'source2' and 'mixture'是真实数据中的关键词。

源值限制在集合内{ AA, TT, GG , CC }

混合值被限制在集合内{ AA , TT , GG , CC , AT , AG, AC , TG , TC , GC } ,但混合值取决于它们在同一块内的源。所以如果在 blockN 内,

 source1 =XX  where X~{A,T,G,C}



source2 =YY  where Y~{A,T,G,C}

那么混合值必须在{ XX, YY, XY }

有时,我的数据中缺少一个或两个源,在这种情况下,我想插入缺少的源值,

如果在一个块内, Source1 缺失, Source2 缺失XX,并且混合值之一是YY,那么我们知道 Source1 是YY。另一个示例是,如果在块中, Source1 缺失, Source2 缺失XX,且混合值之一为XY,则 Source1 为YY。正如您所看到的,有以上两种方法可以根据混合物集中存在的内容来了解​​来源。

可能存在两个源都不存在的情况,但块中存在混合值 XY。这告诉我们 Source1 和 Source2 是 XX 和 YY(或者YY and XX,顺序无关紧要)。

如果我的示例数据是

block1 source1 AA 
block1 source2 TT
block1 mixture AT
block1 mixture AA
block1 mixture TT

block2 source1 GG
block2 source2 TT
block2 mixture TG
block2 mixture TG
block2 mixture TT

block3 source1 AA
block3 source2 TT
block3 mixture AT
block3 mixture AA
block3 mixture TT

block4 mixture AT
block4 mixture AA
block4 mixture TT

block5 source2 TT
block5 mixture TG
block5 mixture TG

我想要的输出是

block1 source1 AA 
block1 source2 TT
block1 mixture AT
block1 mixture AA
block1 mixture TT

block2 source1 GG
block2 source2 TT
block2 mixture TG
block2 mixture TG
block2 mixture TT

block3 source1 AA
block3 source2 TT
block3 mixture AT
block3 mixture AA
block3 mixture TT

block4 source1 AA
block4 source2 TT
block4 mixture AT
block4 mixture AA
block4 mixture TT


block5 source1 GG
block5 source2 TT
block5 mixture TG
block5 mixture TG

请注意第 4 块和第 5 块中的插入内容。为了便于理解,我将这些块分开;在实际数据中,它们不是用空行分隔的。

答案1

感觉可以用更简单的方式来完成,但经过一个小时的绞尽脑汁后我能想到的最好的方法是这个Python脚本:

#! /usr/bin/env python3

import sys, os

class Block:
    block_id = ''
    source1 = ''
    source2 = ''
    mixtures = []
    def __init__(self, block_id = '', source1 = '', source2 = '', mixtures = []):
        self.block_id = block_id
        self.mixtures = mixtures
        self.source1 = source1
        self.source2 = source2

        # Convert mixtures to a set of characters. For example, 
        #     ''.join(["AT", "TT"]) 
        # creates the string "ATTT". set() then converts that string to 
        # a set of characters {'A', 'T'}
        sources = set(''.join(mixtures))

        # If a source is empty, we take from the set the first element (pop()) 
        # after removing the other source (difference()). Since the set 
        # contains single characters, we double it to get "AA", "TT", etc.
        if self.source1 == '':
            self.source1 = sources.difference(set(self.source2)).pop()*2
        sources.remove (self.source1[0])
        if self.source2 == '':
            self.source2 = sources.pop()*2

    def print (self):
        print (self.block_id, "source1", self.source1)
        print (self.block_id, "source2", self.source2)
        for mix in self.mixtures:
            print (self.block_id, "mixture", mix)

if len(sys.argv) == 1:
    files = [os.stdin]
else:
    files = (open(f) for f in sys.argv[1:])

for f in files:
    # Read in all the lines
    data = [line for rawline in f for line in [rawline.strip().split(' ')]]
    # Get the unique block IDs
    blocks = set (lines[0] for line in data)
    # For each block ID
    for b in blocks:
        # The corresponding mixtures
        mix = [line[2] for line in data if line[0] == b and "mixture" == line[1]]

        # If "source1 XX" is present, we will get the list ['XX'], and [] if 
        # source1 is not present. ''.join() allows us to flatten ['XX']  to 
        # just 'XX' (and doesn't affect []). Similarly for source2.
        source1 = ''.join(d[2] for line in data if line[0] == b and "source1" == line[1])
        source2 = ''.join(d[2] for line in data if line[0] == b and "source2" == line[1])

        # Create an object of the class defined above, and print it.
        # Initialization fills up the blank values.
        Block(b, source1, source2, mix).print()

即使这样,这也将提供随机的、无序的输出(即block3数据可能先于block1等)。

将其保存在脚本中(例如,insert.py)并运行:

python3 insert.py inputfile

我重写了这个awk:

#! /usr/bin/awk -f

function build (block, source1, source2, sources, mixtures)
{       
    if (! source1)
    {
        for (char in sources)
        {
            if (source2 != char char)
            {
                source1 = char char
                delete sources[char]
                break
            }
        }
    }
    if (! source2)
    {   
        for (char in sources)
        {
            if (source1 != char char)
            {
                source2 = char char
                delete sources[char]
                break
            }
        }
    }
    printf "%s %s %s\n", block, "source1", source1
    printf "%s %s %s\n", block, "source2", source2
    for (m in mixtures)
    {
        for (i = 0; i < mixtures[m]; i++)
        {
            printf "%s %s %s\n", block, "mixture", m
        }
    }

}

{
    if (prev != $1)
    {
        if (prev in data)
        {
            build(prev, source1, source2, sources, mixtures)
        }

        prev = $1
        source1 = ""
        source2 = ""
        delete sources
        delete mixtures
    }

    data[$1]++
    if ($2 == "source1") {source1 = $3; next}
    if ($2 == "source2") {source2 = $3; next}
    if ($2 == "mixture")
    {
        mixtures[$3]++ 
        split ($3, chars, "")
        for (i=1; i <= length($3); i++)
        {
            sources[chars[i]]++
        }
    }
}

END { build(prev, source1, source2, sources, mixtures) }

将其保存在脚本中(例如insert.awk),chmod +x然后运行它:

./insert.awk inputfile

现在它也应该保留订单。请注意,我使用了delete,它可能不存在于某些 awks 中(但应该存在于 GNU awk 和 mawk 中)。

相关内容