Filtering out by length

Answer 1

You can use sed, awk, or grep. For example, to keep only the lines longer than 200 characters:

awk 'length($0)>200' file > newfile

or

grep '^.\{201\}' file > newfile
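The two commands above should select the same lines: a line is longer than 200 characters exactly when its first 201 characters can be matched. A minimal Python sketch of that equivalence follows; the function names `longer_than` and `matches_prefix` are hypothetical and only serve the illustration.

```python
import re


def longer_than(line, threshold=200):
    """Mirror of awk 'length($0)>200': compare the line length directly."""
    return len(line) > threshold


def matches_prefix(line, threshold=200):
    """Mirror of grep '^.\\{201\\}': require threshold + 1 leading characters."""
    return re.match(r"^.{%d}" % (threshold + 1), line) is not None


if __name__ == "__main__":
    for n in (199, 200, 201, 500):
        line = "A" * n
        print(n, longer_than(line), matches_prefix(line))
```

Both checks assume the trailing newline has already been removed from the line, as awk and grep do.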

Answer 2

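With awk, you first need to set > as the record separator, so that each record (header plus sequence lines) is read as a whole: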

awk 'BEGIN{RS=">";ORS=""}length($0)>200{print ">"$0}' input > output

Another option is pcregrep:

pcregrep -M '^>[^>]{201,}' input > output

Or, to count only the DNA sequence and not the characters in the header:

pcregrep -M '^>[^>]*\n[^>]{201,}' input > output
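For comparison, here is a minimal Python sketch of the same idea: group the lines into records at each > header and keep a record only if its sequence is longer than 200. Unlike the patterns above, which also count the newline characters between sequence lines, this sketch counts only the sequence characters themselves. The names `read_records` and `filter_by_sequence_length` are hypothetical, and the sketch assumes every record starts with a header line beginning with >.

```python
def read_records(lines):
    """Group lines into (header, sequence_lines) pairs, splitting on '>' headers."""
    header, seq = None, []
    for line in lines:
        if line.startswith(">"):
            if header is not None:
                yield header, seq
            header, seq = line, []
        elif header is not None:
            seq.append(line)
    if header is not None:
        # The last record has no following header, so emit it here.
        yield header, seq


def filter_by_sequence_length(lines, threshold=200):
    """Yield the lines of every record whose sequence exceeds threshold characters."""
    for header, seq in read_records(lines):
        # Strip line endings so only sequence characters are counted.
        if sum(len(s.strip()) for s in seq) > threshold:
            yield header
            yield from seq
```

It could be used as `sys.stdout.writelines(filter_by_sequence_length(open("input")))`, with the output redirected to a new file.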

Answer 3

Python (split.py):

import sys

# Call with the input file as parameter.

base = 0   # length of the current ">..." header line
line = ''  # current record: header line plus its sequence lines
with open(sys.argv[-1]) as fp, \
        open('shorter', 'w') as fps, \
        open('longer', 'w') as fpl:
    for x in fp:
        if line and x.startswith('>'):
            # A new header closes the previous record.
            print(len(line), base)  # debug: record length, header length
            if (len(line) - base) >= 200:
                fpl.write(line)
            else:
                fps.write(line)
            line = x
            base = len(x)  # length of the ">..." line
            continue
        if x.startswith('>'):  # very first one
            base = len(x)
        line += x
    if line:
        # Flush the last record, excluding the header length here as well.
        if (len(line) - base) >= 200:
            fpl.write(line)
        else:
            fps.write(line)

Call it with python split.py inputfile, then mv shorter inputfile (after checking that the files are OK).

Answer 4

while IFS= read -r line; do
  if [ "${#line}" -gt 200 ]; then
    printf '%s\n' "$line"
  fi
done < file

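Edit: the question has been updated. What is needed is not the length of a single line but the length of a group of lines.

In the script below I echo an extra >TCONS line at the end of the input; otherwise the script would skip the last group.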

multiline=""
# The extra ">TCONS" line appended after the input flushes the last group.
(cat input; echo ">TCONS string for last token") | while read -r line; do
        if [[ "$line" == ">TCONS"* ]]; then
                # A header line ends the previous group: print it if long enough.
                if [ "${#multiline}" -gt 200 ]; then
                        echo "${multiline}"
                fi
                multiline=""
        else
                multiline="${multiline}${line}"
        fi
done
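The sentinel trick used in the script (appending one extra header so that the final group gets flushed) can be sketched in Python as well. This is only an illustration under the same assumptions as the shell version: headers start with >TCONS, and only the concatenated sequence is printed, not the header. The function name `long_groups` is hypothetical.

```python
from itertools import chain


def long_groups(lines, marker=">TCONS", threshold=200):
    """Yield concatenated groups longer than threshold characters."""
    group = ""
    # The marker appended after the real input closes the final group.
    for line in chain(lines, [marker]):
        line = line.rstrip("\n")
        if line.startswith(marker):
            if len(group) > threshold:
                yield group
            group = ""
        else:
            group += line


if __name__ == "__main__":
    import sys

    with open(sys.argv[-1]) as fp:
        for group in long_groups(fp):
            print(group)
```

Without the appended marker, the loop would end while the last group is still being accumulated, which is the case the extra echo guards against in the shell script.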
