Extract numbers of length n from a field and return a string

Answer 1

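If I understand correctly, you want the fifth column to become the space-separated concatenation of all the 6-digit numbers it contains.

Maybe: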

perl -F'\t' -lape '
   $F[4] = join " ", grep {length == 6} ($F[4] =~ /\d+/g);
   $_ = join "\t", @F' < file

Or, reusing your negative lookaround operators:

perl -F'\t' -lape '
   $F[4] = join " ", ($F[4] =~ /(?<!\d)\d{6}(?!\d)/g);
   $_ = join "\t", @F' < file

With awk:

awk -F'\t' -v OFS='\t' '
  {
    repl = sep = ""
    while (match($5, /[0-9]+/)) {
      if (RLENGTH == 6) {
        repl = repl sep substr($5, RSTART, RLENGTH)
        sep = " "
      }
      $5 = substr($5, RSTART+RLENGTH)
    }
    $5 = repl
    print
  }' < file
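
As a quick sanity check, the same awk logic can be wrapped in a shell function and run on a made-up line. The function name six_digit_field5 and the sample data are hypothetical, chosen only for illustration. This sketch scans a copy of the field (rest) rather than shortening $5 inside the loop, which is a stylistic change only.

```shell
# Hypothetical helper: keep only the standalone 6-digit numbers in field 5
# of tab-separated input read from stdin.
six_digit_field5() {
  awk -F'\t' -v OFS='\t' '
    {
      repl = sep = ""
      rest = $5                       # scan a copy of the field
      while (match(rest, /[0-9]+/)) { # next run of digits, of any length
        if (RLENGTH == 6) {           # keep it only if exactly 6 digits long
          repl = repl sep substr(rest, RSTART, RLENGTH)
          sep = " "
        }
        rest = substr(rest, RSTART + RLENGTH)
      }
      $5 = repl
      print
    }'
}

# Made-up sample line: 1234567 (7 digits) and 42 (2 digits) are skipped.
printf 'a\tb\tc\td\tx 192315, y 1234567 z 225750 42\te\n' | six_digit_field5
```

Field 5 of the sample line comes out as "192315 225750", and the other fields are left untouched.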

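grep by itself is not really suited to this task. grep is meant to print the lines that match a pattern. Some implementations, such as GNU or ast-open grep, or pcregrep, can extract strings from the matching lines, but that ability is very limited.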
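
To illustrate that limitation, here is a small sketch that assumes a grep supporting the -o option (not specified by POSIX, but provided by GNU and BSD grep). The sample data is made up. Each match is printed on its own line, so the association with the input line is lost, and a plain extended regular expression cannot reject a 6-digit match that is part of a longer number.

```shell
# Assumes a grep with -o (GNU or BSD); the input is made-up sample data.
# Prints 192315, 225750 and 123456, one per line: the first two came from
# the same input line, and 123456 is only the start of the 7-digit 1234567.
printf 'a 192315 b 225750\nc 1234567\n' | grep -oE '[0-9]{6}'
```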

The only approach I can think of that could work with grep + cut + paste, under some restrictions, uses the pcregrep implementation of grep:

n='(?:.*?((?1)))?'
paste <(< file cut -f1-4) <(< file cut -f5 |
  pcregrep --om-separator=" " -o1 -o2 -o3 -o4 -o5 -o6 -o7 -o8 -o9 \
    "((?<!\d)\d{6}(?!\d))$n$n$n$n$n$n$n$n"
  ) <(< file cut -f6-)

That assumes every line of input has at least 6 fields, and that the 5th field of each line contains between 1 and 9 six-digit numbers.


Answer 2

awk '
BEGIN {
    FS = "\t";
    OFS = "\t";
}
{
    cnt = patsplit($5, arr, /[0-9]{6}/);
    $5 = arr[1];
    for(i = 2; i <= cnt; i++) {
        $5 = $5 " " arr[i];
    }
    print;
}' input.txt

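patsplit(s, a [, r [, seps] ]) - Split the string s into the array a and the separators array seps on the regular expression r, and return the number of fields. Element values are the portions of s that matched r. patsplit is a GNU awk (gawk) extension. As written, /[0-9]{6}/ also matches the first six digits of a longer run of digits.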

Input:

gene1   NM_033629   598G>A  P912    syndrome 1, 192315 syndrome 2, 225750 syndrome 3 610448 score   AD  hom user    123456  Source
gene2   NM_000459   613G>A  V115I   syndrome 1 600195   score   AD  rec user    234567  Source

Output:

gene1   NM_033629   598G>A  P912    192315 225750 610448    score   AD  hom user    123456  Source
gene2   NM_000459   613G>A  V115I   600195  score   AD  rec user    234567  Source
