Bash programming: compare files line by line and create a new file

Answer 1

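join + sort

If you are trying to find the IPs that are present in both files, you can use the join command, but you need to pre-sort the files with sort before joining them.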

$ join -o 2.2 <(sort file1) <(sort file2)

Example

$ join -o 2.2 <(sort file1) <(sort file2)
1.765
0.326
4.754
3.673
6.334

Another example

file1a:

$ cat file1a
34.123.21.32
45.231.43.21
21.34.67.98
1.2.3.4
5.6.7.8
9.10.11.12

file2a:

$ cat file2a
34.123.21.32 0.326 - [30/Oct/2013:06:00:06 +0200]
45.231.43.21 6.334 - [30/Oct/2013:06:00:06 +0200]
45.231.43.21  3.673 - [30/Oct/2013:06:00:06 +0200]
34.123.21.32 4.754 - [30/Oct/2013:06:00:06 +0200]
21.34.67.98 1.765 - [30/Oct/2013:06:00:06 +0200]
1.2.3.4 1.234 - [30/Oct/2013:06:00:06 +0200]
4.3.2.1 4.321 - [30/Oct/2013:06:00:06 +0200]

Running the join command:

$ join -o 2.2 <(sort file1a) <(sort file2a)
1.234
1.765
0.326
4.754
3.673
6.334

Note: Because we sort file2 first, this method loses file2's original order. The upside is that file2 now only needs to be scanned once.
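One practical detail worth sketching: join expects both inputs to be ordered under the same collation that sort used, so pinning the locale for both commands avoids "not in sorted order" surprises. The snippet below is only an illustration under assumptions: the scratch directory and the .sorted file names are made up, and temporary files stand in for the process substitution used above. With the sample data it should print the same six values as the example.

```shell
# Scratch copies of the sample files shown above (the temp directory is an assumption).
tmp=$(mktemp -d)
printf '%s\n' 34.123.21.32 45.231.43.21 21.34.67.98 1.2.3.4 5.6.7.8 9.10.11.12 > "$tmp/file1a"
cat > "$tmp/file2a" <<'EOF'
34.123.21.32 0.326 - [30/Oct/2013:06:00:06 +0200]
45.231.43.21 6.334 - [30/Oct/2013:06:00:06 +0200]
45.231.43.21  3.673 - [30/Oct/2013:06:00:06 +0200]
34.123.21.32 4.754 - [30/Oct/2013:06:00:06 +0200]
21.34.67.98 1.765 - [30/Oct/2013:06:00:06 +0200]
1.2.3.4 1.234 - [30/Oct/2013:06:00:06 +0200]
4.3.2.1 4.321 - [30/Oct/2013:06:00:06 +0200]
EOF

# Use one collation for both sort and join so they agree on the ordering.
export LC_ALL=C
sort "$tmp/file1a" > "$tmp/file1a.sorted"
sort "$tmp/file2a" > "$tmp/file2a.sorted"

# -o 2.2 prints only field 2 of the second file for each joined line.
join_out=$(join -o 2.2 "$tmp/file1a.sorted" "$tmp/file2a.sorted")
printf '%s\n' "$join_out"
rm -rf "$tmp"
```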

grep

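You can use grep to search file2 for matches using the lines in file1, but this method is not as efficient as the first method I showed you. It scans file2 looking for each line in file1.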

$ grep -f file1 file2 | awk '{print $2}'

Example

$ grep -f file1 file2 | awk '{print $2}'
0.326
6.334
3.673
4.754
1.765
1.234

Improving grep's performance

You can speed up grep by using this form:

$ LC_ALL=C grep -f file1 file2 | awk '{print $2}'

You can also tell grep that the strings in file1 are fixed strings (-F) rather than regular expressions, which also helps performance.

$ LC_ALL=C grep -Ff file1 file2 | awk '{print $2}'
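Besides speed, -F also changes what matches, because without it every "." in an IP is a regex wildcard. The following is a small sketch of that difference; the pattern file, the log file, and the bogus "1a2b3c4" line are made up for the demo.

```shell
# Hypothetical one-line pattern file and two-line log used only for this demo.
tmp=$(mktemp -d)
printf '%s\n' '1.2.3.4' > "$tmp/patterns"
printf '%s\n' '1.2.3.4 1.234 -' '1a2b3c4 9.999 -' > "$tmp/log"

# Without -F each "." is a regex wildcard, so both lines match.
regex_hits=$(LC_ALL=C grep -c -f "$tmp/patterns" "$tmp/log")
# With -F the pattern is a literal string, so only the real IP matches.
fixed_hits=$(LC_ALL=C grep -c -F -f "$tmp/patterns" "$tmp/log")

printf 'regex: %s, fixed: %s\n' "$regex_hits" "$fixed_hits"
rm -rf "$tmp"
```

Note that even with -F the match is still a substring match anywhere on the line, so it is not an exact comparison against the first field.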

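Generally, in software, you try to avoid this kind of approach, since it is basically a loop-within-a-loop type of solution. But sometimes it is the best that can be achieved with a given computer + software.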
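For comparison, here is one way to avoid the loop-within-a-loop pattern while keeping file2's original order. This is not from the answer above but a swapped-in technique: a single awk pass that stores the IPs from file1 as array keys and then looks up the first field of each file2 line. The file contents are shortened, made-up samples.

```shell
# Shortened sample files in a scratch directory (names and data are assumptions).
tmp=$(mktemp -d)
printf '%s\n' 34.123.21.32 45.231.43.21 1.2.3.4 > "$tmp/file1"
cat > "$tmp/file2" <<'EOF'
34.123.21.32 0.326 -
45.231.43.21 6.334 -
4.3.2.1 4.321 -
1.2.3.4 1.234 -
EOF

# While reading the first file (NR == FNR), remember each IP as an array key.
# For the second file, print field 2 whenever field 1 is a remembered key.
awk_out=$(awk 'NR == FNR { ips[$1]; next } $1 in ips { print $2 }' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$awk_out"
rm -rf "$tmp"
```

Unlike the grep variants, this compares the whole first field, so an IP only matches itself. The NR == FNR idiom assumes the first file is not empty.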

References

Linux tools to treat files as sets and perform set operations on them

Answer 2

You can use grep with the -f switch (which is in the POSIX standard):

sort file1 | uniq |            # Avoid duplicate entries in file1
  grep -f /dev/stdin file2 |   # Search in file2 for patterns piped on stdin
  awk '{print $2}' > new_file  # Print the second field (time) of each match to a new file

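Note that if an IP address appears multiple times in file2, all of its time entries will be printed.

On my system this did the job on files with 5 million lines in under 2 seconds.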
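To make the note about repeated addresses concrete, here is a small run of the same pipeline on made-up data. The file contents are assumptions, and sort -u is used as a shorthand for sort | uniq.

```shell
# Tiny sample files in a scratch directory (contents are assumptions).
tmp=$(mktemp -d)
printf '%s\n' 1.2.3.4 1.2.3.4 21.34.67.98 > "$tmp/file1"
printf '%s\n' '1.2.3.4 1.234 -' '21.34.67.98 1.765 -' \
              '1.2.3.4 2.000 -' '4.3.2.1 4.321 -' > "$tmp/file2"

# Deduplicate the patterns, feed them to grep on stdin, keep the time column.
sort -u "$tmp/file1" |
  grep -f /dev/stdin "$tmp/file2" |
  awk '{print $2}' > "$tmp/new_file"

new_file_content=$(cat "$tmp/new_file")
printf '%s\n' "$new_file_content"
rm -rf "$tmp"
```

The address 1.2.3.4 occurs twice in the sample file2, so both of its times end up in the output, in file2's order.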

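Answer 3

Since you titled the question bash programming, I'll submit a semi-bash example.

Pure bash:

You could read the IP filter file and then check the data line by line, matching each line against the filter entries. But at this volume that is really slow.

You could fairly easily implement bubble, selection, insertion, merge sort, etc. But, again, for this kind of volume it would be a goner, and most likely worse than a line-by-line compare. (It depends a lot on the volume of the filter file.)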
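As a rough sketch of that naive approach (the array names and the sample values are assumptions, and the data is held in arrays instead of being read from files to keep it short):

```shell
#!/bin/bash
# Assumed sample data: filter entries and data lines held in arrays.
filter=(34.123.21.32 1.2.3.4)
data=("34.123.21.32 0.326 -" "4.3.2.1 4.321 -" "1.2.3.4 1.234 -")

matches=()
for line in "${data[@]}"; do
    # Split the data line into IP, time, and the rest.
    read -r ip time _ <<< "$line"
    # Inner loop: compare the IP against every filter entry.
    for f in "${filter[@]}"; do
        if [[ $ip == "$f" ]]; then
            matches+=("$time")
            break
        fi
    done
done
printf '%s\n' "${matches[@]}"
```

Every data line is compared against every filter entry until one matches, which is why the cost grows with the product of the two sizes.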

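sort + bash:

Another option would be to sort the file with sort and process the input in-house with e.g. a binary search. This, too, would be a lot slower than the other suggestions posted here, but let's give it a try.

First off, it is a question of bash version. As of version 4 (?) we have mapfile, which reads a file into an array. This is a lot faster than the traditional read -ra …. Combined with sort it could be scripted by something like this (for this task):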

mapfile arr <<< "$(sort -bk1,1 "$file_in")"
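A variation that may be worth knowing, shown here only as a sketch on made-up data: mapfile -t strips the trailing newline from each element, and reading from a process substitution avoids holding the whole sorted output in a string first. The temporary file and its contents are assumptions.

```shell
#!/bin/bash
# Made-up data file with three unsorted lines.
tmp=$(mktemp)
printf '%s\n' '45.231.43.21 6.334 -' '1.2.3.4 1.234 -' '34.123.21.32 0.326 -' > "$tmp"

# Read the file, sorted on its first field, into arr; -t drops the newlines.
mapfile -t arr < <(sort -b -k1,1 "$tmp")
arr_len=${#arr[@]}

printf '%d lines, first: %s\n' "$arr_len" "${arr[0]}"
rm -f "$tmp"
```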

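Then it is a question of having a search algorithm to find matches in that array. A simple way is to use a binary search. It is efficient and, on e.g. an array of 1.000.000 elements, gives a fairly quick lookup.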

declare -i match_index
function in_array_bs()
{
    local needle="$1"
    local -i max=$arr_len
    local -i min=0
    local -i mid
    while ((min < max)); do
        (( (mid = ((min + max) >> 1)) < max )) || break
        if [[ "${arr[mid]// *}" < "$needle" ]]; then
            ((min = mid + 1))
        else
            max=$mid
        fi
    done
    if [[ "$min" == "$max" && "${arr[min]// *}" == "$needle" ]]; then
        match_index=$min
        return 0
    fi
    return 1
}

Then you would say:

for x in "${filter[@]}"; do
    if in_array_bs "$x"; then
        : # … check match_index+0,+1,+2 etc. to cover duplicates.
    fi
done
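To see the lookup and the duplicate handling work together, here is a self-contained toy run. It is a sketch under assumptions: the array contents are made up, times_for is a hypothetical helper name, and the first field is extracted with ${var%% *}, which gives the same result as the ${var// *} form used above.

```shell
#!/bin/bash
# Toy data standing in for the sorted data file (assumed values).
arr=(
    "1.2.3.4 1.234 -"
    "21.34.67.98 1.765 -"
    "34.123.21.32 0.326 -"
    "34.123.21.32 4.754 -"
    "45.231.43.21 3.673 -"
)
arr_len=${#arr[@]}
declare -i match_index

# Lower-bound binary search on the first field of each element.
in_array_bs() {
    local needle="$1"
    local -i min=0 max=$arr_len mid
    while ((min < max)); do
        mid=$(( (min + max) >> 1 ))
        if [[ "${arr[mid]%% *}" < "$needle" ]]; then
            min=$((mid + 1))
        else
            max=$mid
        fi
    done
    # min is now the first index whose key is not below the needle.
    if ((min < arr_len)) && [[ "${arr[min]%% *}" == "$needle" ]]; then
        match_index=$min
        return 0
    fi
    return 1
}

# Hypothetical helper: print field 2 of every element whose key equals $1.
times_for() {
    local -a fields
    in_array_bs "$1" || return 1
    local -i i=$match_index
    while ((i < arr_len)) && [[ "${arr[i]%% *}" == "$1" ]]; do
        read -r -a fields <<< "${arr[i]}"
        printf '%s\n' "${fields[1]}"
        i=$((i + 1))
    done
}

times_for 34.123.21.32
```

Because the search lands on the first of several equal keys, walking forward from match_index is enough to cover duplicates.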

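Sample script. (Not debugged), but merely as a starter. For smaller volumes, where one would want to depend only on sort, it could serve as a template. But again, it is slower by a lot: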

#!/bin/bash

file_in="file_data"
file_srch="file_filter"

declare -a arr       # The entire data file as array.
declare -i arr_len   # The length of "arr".
declare -i index     # Matching index, if any.

# Time print helper function for debug.
function prnt_ts() { date +"%H:%M:%S.%N"; }

# Binary search.
function in_array_bs()
{
    local needle="$1"
    local -i max=$arr_len
    local -i min=0
    local -i mid
    while ((min < max)); do
        (( (mid = ((min + max) >> 1)) < max )) || break
        if [[ "${arr[mid]// *}" < "$needle" ]]; then
            ((min = mid + 1))
        else
            max=$mid
        fi
    done
    if [[ "$min" == "$max" && "${arr[min]// *}" == "$needle" ]]; then
        index=$min
        return 0
    fi
    return 1
}

# Search.
# "index" is set to matching index in "arr" by `in_array_bs()`.
re='^[^ ]+ +([^ ]+)'
function search()
{
    if in_array_bs "$1"; then
        while [[ "${arr[index]// *}" == "$1" ]]; do
            [[ "${arr[index]}" =~ $re ]]
            printf "%s\n" "${BASH_REMATCH[1]}"
            ((++index))
        done
    fi
}

sep="--------------------------------------------"
# Timestamp start
ts1=$(date +%s.%N)

# Print debug information
printf "%s\n%s MAP: %s\n%s\n" \
    "$sep" "$(prnt_ts)" "$file_in" "$sep" >&2

# Read sorted file to array.
mapfile arr <<< "$(sort -bk1,1 "$file_in")"

# Print debug information.
printf "%s\n%s MAP DONE\n%s\n" \
    "$sep" "$(prnt_ts)" "$sep" >&2

# Define length of array.
arr_len=${#arr[@]}

# Print time start search
printf "%s\n%s SEARCH BY INPUT: %s\n%s\n" \
    "$sep" "$(prnt_ts)" "$file_srch" "$sep" >&2

# Read filter file.
re_neg_srch='^[ '$'\t'$'\n'']*$'
debug=0
while IFS=$'\n'$'\t'-" " read -r ip time trash; do
    if ! [[ "$ip" =~ $re_neg_srch ]]; then
        ((debug)) && printf "%s\n%s SEARCH: %s\n%s\n" \
            "$sep" "$(prnt_ts)" "$ip" "$sep" >&2
        # Do the search
        search "$ip"
    fi
done < "$file_srch"

# Print time end search
printf "%s\n%s SEARCH DONE\n%s\n" \
    "$sep" "$(prnt_ts)" "$sep" >&2

# Print total time
ts2=$(date +%s.%N)
echo $ts1 $ts2 | awk '{printf "TIME: %f\n", $2 - $1}' >&2
