Select only the first line containing a repeated string

Answer 1

You might consider using sort -u as an alternative to uniq, specifying the first whitespace-separated field as the key:

$ sort -uk1,1 file
2_00003 R034671 31.25   96  55  2   100 195 77  161 7e-07   47.8
2_00004 R014991 31.90   232 141 5   2   232 4   219 5e-28    111

Alternatively, you could do something like this with awk:

awk '$1!=last {last=$1; print}' file

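It compares the first field of each line ($1) with the stored value last; whenever the field changes, it prints the line and updates last to $1. Note that this only catches duplicates on adjacent lines, so it assumes the input is already grouped by the first field.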
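
For comparison, the same adjacent-comparison logic can be sketched in Python. This is only an illustration and not part of the original answer: the function name first_of_each_run and the sample rows are made up, and the sketch assumes, like the awk one-liner, that lines sharing a first field are adjacent and that no line is blank.

```python
from itertools import groupby


def first_of_each_run(lines):
    """Return the first line of every run of consecutive lines sharing a first field.

    Hypothetical helper mirroring awk '$1!=last {last=$1; print}'.
    Assumes no line is blank (split()[0] would fail on one).
    """
    # groupby starts a new group each time the key changes, like $1 != last
    return [next(group) for _, group in groupby(lines, key=lambda line: line.split()[0])]


if __name__ == "__main__":
    # Made-up sample rows, not taken from the question's file
    sample = ["a 1\n", "a 2\n", "b 3\n", "a 4\n"]
    print("".join(first_of_each_run(sample)), end="")
```

With this sample the line "a 4" is printed again, because the second run of "a" is not adjacent to the first; the awk version behaves the same way on ungrouped input.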

Answer 2

Another approach, in Python:

Read the file
Collect the unique values of the first column
For each value, take the first line in which it occurs

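#!/usr/bin/env python3
import sys

path = sys.argv[1]

with open(path) as src:
    lines = src.readlines()

# Unique values of the first column (a set, so output order is not guaranteed)
keys = set(line.split()[0] for line in lines)

# For each key, print the first line that starts with it
for key in keys:
    print([line for line in lines if line.startswith(key)][0], end="")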

Run it with the text file as an argument:

python3 <script> <text_file>

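Note

Although the option above turned out to be fast (tested on a file of over 1,000,000 lines), it can be faster still (about 15% in the tests I ran) if we can assume that the string in the first column does not occur anywhere else in a record (probably a safe assumption). In that case, we can skip the startswith() function and use a plain substring test: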

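#!/usr/bin/env python3
import sys

path = sys.argv[1]

with open(path) as src:
    lines = src.readlines()

# Unique values of the first column (a set, so output order is not guaranteed)
keys = set(line.split()[0] for line in lines)

# For each key, print the first line that contains it anywhere
for key in keys:
    print([line for line in lines if key in line][0], end="")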
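
Both scripts above rescan the whole list once per unique key, and iterating over a set does not keep the lines in input order. A single-pass variant that remembers the keys it has already seen avoids both. The sketch below is only an illustration under stated assumptions, not the author's method: the name first_occurrences is hypothetical, blank lines are simply skipped, the sample rows are made up, and it has not been benchmarked against the scripts above.

```python
def first_occurrences(lines):
    """Yield each line whose first whitespace-separated field has not been seen yet.

    Hypothetical helper: one pass over the input, output in input order.
    """
    seen = set()
    for line in lines:
        fields = line.split()
        if not fields:
            # Skip blank lines (an assumption of this sketch)
            continue
        if fields[0] not in seen:
            seen.add(fields[0])
            yield line


if __name__ == "__main__":
    # Made-up sample rows; duplicates do not need to be adjacent
    sample = ["a 1\n", "b 2\n", "a 3\n", "c 4\n", "b 5\n"]
    print("".join(first_occurrences(sample)), end="")
```

To run it on a real file, an open file object can be passed in place of the sample list, since the function only iterates over its argument.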

Answer 3

You can do this in a script like the following:

first_occurence.sh (make it executable)

#!/bin/bash

# Set bash to separate words by newlines only, not spaces
IFS=$'\n'
# read input
input=("$(cat)")

# get a list of unique keys - split input by space with awk for any length
unique_values=($(printf "%s\n" "${input[*]}" | awk -F' ' '{ print $1 }' | uniq))

cur=0

# check each line of input for the key
for line in ${input[@]};
do  
    # wildcard matching
    if [[ "$line" == "${unique_values[$cur]}"* ]]
    then
        # print line if match, and move on to checking the next key
        printf "%s\n" "$line"
        cur=$((cur + 1))
    fi  
    # break the loop if we have used up all of our unique keys (only duplicates remain)
    if [ $cur -ge ${#unique_values[@]} ]
    then
        break
    fi  

done

Run it by feeding the file to its standard input:

./first_occurence.sh < filename

Answer 4

I think steeldriver's solution using sort is the best, but if you want to try something else, have a look at the following Python script:

#!/usr/bin/env python3
import re

with open('/path/to/the/file') as f:
    list_of_lines = f.readlines()

result = []
for line in list_of_lines:
    # The key is a digit, an underscore and more digits at the start of the line
    match = re.search('^[0-9]_[0-9]+', line)
    if match is None:
        continue
    # Lines already kept whose first field equals this key
    check_list = [x for x in result if x.split()[0] == match.group()]
    if not check_list:
        result.append(line)
print(''.join(result), end='')
