寻找不同的可能组合

寻找不同的可能组合

文件 A 有几行基因:

A、B、C、D、E
P、Q、R
G、D、V、KL
、Q、X、I、U、G 等。

一次获取每一行,如何获得以下类型的输出:

对于第一行:

A、B、C
B、C、D
C、D、E

对于第二行:

P、Q、R

对于第三行:

G、D、V
D、V、K

本质上,我想要的是从每一行中找到基因的“三联体”。第一个三联体将具有前三个基因。第二个三联体将具有第二、第三、第四基因。最后一个三联体将以行中的最后一个基因结束。
手动实现这一目标将是一项艰巨的任务。由于我尚未掌握 Linux、Perl 或 Python 脚本,因此无法为此编写脚本,因此将不胜感激来自该社区的帮助!

答案1

使用awk

function wprint() {
    print w[1], w[2], w[3];
}

function wshift(e) {
    w[1] = w[2]; w[2] = w[3]; w[3] = e;
}

BEGIN { FS = OFS = "," }

{
    wshift($1);
    wshift($2);
    wshift($3);
    wprint();

    for (i = 4; i <= NF; ++i) {
        wshift($i);
        wprint();
    }
}

然后:

$ awk -f script data.in
A,B,C
B,C,D
C,D,E
P,Q,R
G,D,V
D,V,K
L,Q,X
Q,X,I
X,I,U
I,U,G

awk脚本使用三元素移动窗口w.对于每个输入行,它使用前三个字段填充窗口的三个元素,并将它们打印为逗号分隔的列表(后跟换行符)。然后,它迭代该行上的剩余字段,将它们移入窗口并打印每个元素的窗口。

如果输入数据中的任何行包含少于两个字段,您将得到类似的结果

A,,

或者

A,B,

在输出中。

如果您确定每个输入行至少具有三个字段(或者如果您想忽略任何没有的行),那么您可以awk稍微缩短脚本:

function wprint() {
    print w[1], w[2], w[3];
}

function wshift(e) {
    w[1] = w[2]; w[2] = w[3]; w[3] = e;
}

BEGIN { FS = OFS = "," }

{
    for (i = 1; i <= NF; ++i) {
        wshift($i);
        if (i >= 3) {
            wprint();
        }
    }
}

具有可变窗口大小的脚本第一个变体的概括:

function wprint(i) {
    for (i = 1; i < n; ++i) {
        printf("%s%s", w[i], OFS);
    }
    print w[n]
}

function wshift(e,i) {
    for (i = 1; i < n; ++i) {
        w[i] = w[i + 1];
    }
    w[n] = e;
}

BEGIN { FS = OFS = "," }

{
    for (i = 1; i <= n; ++i) {
        wshift($i);
    }
    wprint();

    for (i = n + 1; i <= NF; ++i) {
        wshift($i);
        wprint();
    }
}

使用它:

$ awk -v n=4 -f script data.in
A,B,C,D
B,C,D,E
P,Q,R,
G,D,V,K
L,Q,X,I
Q,X,I,U
X,I,U,G

答案2

perl

perl -F, -le 'BEGIN { $, = "," } while(@F >= 3) { print @F[0..2]; shift @F }' file

awk

awk -F, -v OFS=, 'NF>=3 { for(i=1; i<=NF-2; i++) print $i, $(i+1), $(i+2) }' file

答案3

使用 Perl,我们可以将其处理为:

perl -lne '/(?:([^,]+)(?=((?:,[^,]+){2}))(?{ print $1,$2 }))*$/' yourfile
perl -F, -lne '$,=","; print shift @F, @F[0..1] while @F >= 3' 
perl -F, -lne '$,=","; print splice @F, 0, 3, @F[1,2] while @F >= 3'

可以写成如下的扩展形式:

perl -lne '
   m/
      (?:                       # set up a do-while loop
         ([^,]+)                # first field which shall be deleted after printing
         (?=((?:,[^,]+){2}))    # lookahead and remember the next 2 fields
         (?{ print $1,$2 })     # print the first field + next 2 fields
      )*                        # loop back for more
      $                         # till we hit the end of line
   /x;
' yourfile

通过 sed,我们可以使用它的各种命令来做到这一点:

sed -e '
   /,$/!s/$/,/     # add a dummy comma at the EOL

   s/,/\n&/3;ta    # while there still are 3 elements in the line jump to label "a"
   d               # else quit processing this line any further

   :a              # main action
   P               # print the leading portion, i.e., that which is left of the first newline in the pattern space
   s/\n//          # take away the marker

   s/,/\n/;tb      # get ready to delete the first field
   :b

   D               # delete the first field, and apply the sed code all over from the beginning to what remains in the pattern space
' yourfile

DC 也可以这样做:

sed -e 's/[^,]*/[&]/g;y/,/ /' gene_data.in |
dc -e '
[q]sq                            # macro for quitting
[SM z0<a]sa                      # macro to store stack -> register "M"
[LMd SS zlk>b c]sb               # macro to put register "M" -> register "S"
[LS zlk>c]sc                     # macro to put register "S" -> stack
[n44an dn44an rdn10anr z3!>d]sd  # macro to print 1st three stack elements
[zsk lax lbx lcx ldx c]se        # macro that initializes & calls all other macros
[?z3>q lex z0=?]s?               # while loop to read in file line by line and run macro "e" on each line
l?x                              # main()
'

结果

A,B,C
B,C,D
C,D,E
D,E,F
E,F,G
P,Q,R
G,D,V
D,V,K
L,Q,X
Q,X,I
X,I,U
I,U,G

相关内容