Extracting start and end coordinates based on defined lengths at irregular intervals

Question 1

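Loosely speaking, the problem is one of merging lines. A line can be "merged" with the previous line if its start coordinate is the same as the end coordinate of the previous line.

The lines presumably correspond to genomic features. The goal is to merge features that are adjacent in the genomic sequence.

Here is an awk script that does this: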

$2 == end {
    # This line merges with the previous line.
    # Update end and continue with next line.

    end = $3;
    next;
}

{
    # This is an unmergeable line (start doesn't correspond to end on
    # previous line).

    # If we've processed at least the header line, print the data collected.
    # The if statement avoids printing an empty output line at the 
    # start of the output.

    if (NR > 1) {
        print chr, start, end, score, len;
    }

    # Get data from this line.

    chr = $1;
    start = $2;
    end = $3;
    score = $4;
    len = $5;
}

END {
    # At the end of input, print the data as above to output last line.
    print chr, start, end, score, len;
}

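The script assumes that the input is sorted and that every start coordinate is strictly less than its end coordinate (i.e. all features lie on the positive strand).

Testing it: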

$ awk -f script.awk data
chr start end score length
chr1 237592 237912 176 320
chr1 521409 521729 150 320
chr1 714026 714346 83 320
chr1 805100 805440 323 340
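
The input file itself is not shown here, so the following is only a sketch on made-up data to show the merge step in isolation. The function name merge_adjacent and all coordinates in the sample input are hypothetical; the awk program is a condensed form of the same logic as the script above.

```shell
#!/bin/sh
# Hypothetical helper: condensed version of the merging logic, reading stdin.
merge_adjacent() {
    awk '
        # Start equals previous end: extend the current feature.
        $2 == end { end = $3; next }
        # Otherwise flush the feature collected so far (skipped on line 1).
        NR > 1    { print chr, start, end, score, len }
        # Begin a new feature from this line.
                  { chr = $1; start = $2; end = $3; score = $4; len = $5 }
        # Flush the last feature.
        END       { print chr, start, end, score, len }
    '
}

# Made-up input: the first two data rows are adjacent (100-200, 200-300).
printf '%s\n' \
    'chr start end score length' \
    'chr1 100 200 10 100' \
    'chr1 200 300 20 100' \
    'chr1 500 600 30 100' | merge_adjacent
# Prints:
#   chr start end score length
#   chr1 100 300 10 100
#   chr1 500 600 30 100
```

Note that score and length are carried over from the first row of each merged group; only the end coordinate is updated.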

Question 2

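To me this looks like a text file whose columns are separated by spaces. It can be handled elegantly in R, but a shell script can do it too. What you need is a for loop that reads the file line by line. Inside the loop, one simple approach is to assign each column value to a variable (you can use cut for that) and then print the variables in the order you want. The second and fifth column variables are added together to produce the third column of the output. You can use echo inside the for loop to print each output line to the screen. Once the lines printed on screen look the way you want, just redirect the script's output to a file, like your_script.sh > your new output.txt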
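
A minimal sketch of such a loop, assuming the input has five space-separated columns (chr, start, end, score, length) and a header line, as in the sample output earlier on this page. The function name add_end and the sample row are made up for illustration. It uses a while read loop rather than for, because for iterates over words rather than lines, and it lets read split the columns instead of calling cut.

```shell
#!/bin/sh
# Hypothetical sketch: recompute the third column as start + length.
add_end() {
    # read splits each line on whitespace into the five named variables.
    while read -r chr start end score len; do
        if [ "$start" = "start" ]; then
            # Header line: pass it through unchanged.
            echo "$chr $start $end $score $len"
            continue
        fi
        # Column 2 plus column 5 becomes the new column 3.
        echo "$chr $start $((start + len)) $score $len"
    done
}

# Made-up input row; the original end value (0 here) is ignored.
printf '%s\n' \
    'chr start end score length' \
    'chr1 237592 0 176 320' | add_end
```

To process a file, the function can read from it and write to another, for example add_end < data > new_output.txt (both file names are placeholders).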
