Parse a csv file to filter rows based on a matching set of characters in the column values

Consider the following csv file:

A,3300   
B,8440   
B,8443   
B,8444 
C,304
C,404  
M,5502   
M,5511

The actual csv file is large (about 60,000 lines). I have only provided a small version to describe the problem.

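I need to create a script that filters rows based on the second field, grouping rows that share a matching set of characters into a single row (with the second field replaced by the matching set of characters).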

In other words, for the csv file given above I expect the following output:

A,3300   
B,844  
C,304
C,404 
M,55   

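Note that only the contents of the second csv field are relevant to the script. Any matches or mismatches that occur in other fields therefore need to be kept in the file as they are.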

Would awk be useful for this task? Or any other built-in tool? Any help is much appreciated.

Answer 1

I wrote a small awk function to find the common leading characters of two strings:

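awk '
BEGIN{OFS=FS=","}
# Return the longest common leading substring of a and b.
# asplit, bsplit, n and o are local variables.
function common_chars(a,b,    asplit,bsplit,n,o){
    split(a,asplit,"")
    split(b,bsplit,"")
    n=1
    # Stop at the end of either string; otherwise two identical
    # strings would compare empty elements forever.
    while ((n in asplit) && (n in bsplit) && asplit[n]==bsplit[n]){
        o=o""asplit[n]
        n++
    }
    return o
}
s[$1] {v[$1]=common_chars(v[$1],$2)}
!s[$1] {v[$1]=$2;s[$1]=1 }
END {for(a in v){print a,v[a]}}
' file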

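If $1 has not been seen before (the state is kept in s[$1]), $2 is saved in the array with v[$1]=$2. If it has been seen, v[$1] is set to the return value of the function applied to v[$1] itself and $2. The function simply runs a while loop over the individual characters until it finds the first mismatch.

For C,404 and C,304 it will print C, because the two values share no leading characters.

Output: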

A,3300   
B,844
C,
M,55
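
The character-by-character comparison described above can also be written with substr() and length() only. As far as I know, splitting on an empty separator is an extension (gawk and mawk support it) rather than something POSIX awk specifies, so this variant may be more portable. It is only a sketch, and the helper name common_prefix is made up for illustration:

```shell
#!/bin/sh
# Sketch: longest common leading substring of two strings.
# common_prefix is a hypothetical helper, not part of the answer above.
common_prefix() {
    awk -v a="$1" -v b="$2" '
    BEGIN {
        # Compare only up to the length of the shorter string.
        max = (length(a) < length(b)) ? length(a) : length(b)
        n = 0
        while (n < max && substr(a, n + 1, 1) == substr(b, n + 1, 1))
            n++
        print substr(a, 1, n)
    }'
}

common_prefix 8440 8443   # prints 844
common_prefix 304 404     # prints an empty line
common_prefix 8440 8440   # prints 8440
```

Bounding the loop by the shorter length means identical strings terminate normally instead of running past the end of both.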

Answer 2

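This may be a bit slow for 60,000 lines, but it seems to work. Do not quote $line here.

I still have an odd feeling that there is a bug somewhere in the script, which might show up with more data to process...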

$ sort -u testfile | datamash -t, -g1 collapse 2  \
| tr ',' ' ' | while read line ; do ./my_filter $line ; done
A,3300
B,844
C,304
C,404
M,55
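
The reason $line is left unquoted in the loop above is that my_filter expects the key and each value as separate arguments, and it is the shell's word splitting that turns the line into several words. A small demonstration, assuming the default IFS (count_args is a hypothetical helper used only here):

```shell
#!/bin/bash
# count_args is a hypothetical helper that reports how many
# arguments it received.
count_args() { echo "$#"; }

line='B 8440 8443 8444'
count_args $line     # unquoted: split into words, prints 4
count_args "$line"   # quoted: passed as a single word, prints 1
```

With quotes, my_filter would receive the whole line as its key and no values at all.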

Preprocessing the data with datamash gives me sorted, grouped data that I can feed to my_filter line by line:

$ sort -u testfile | datamash -t, -g1 collapse 2 
A,3300
B,8440,8443,8444
C,304,404
M,5502,5511
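
If datamash is not available, the grouping step can probably be approximated in awk. This is a minimal sketch rather than a drop-in replacement: the function name collapse is made up, and it assumes exactly two comma-separated fields per line.

```shell
#!/bin/sh
# Sketch: join all second-field values that share the same first field.
# collapse is a hypothetical name; keys are printed in first-seen order.
collapse() {
    awk -F, '
    !($1 in vals) { order[++n] = $1; vals[$1] = $2; next }
    { vals[$1] = vals[$1] "," $2 }
    END { for (i = 1; i <= n; i++) print order[i] "," vals[order[i]] }
    '
}

printf '%s\n' A,3300 B,8440 B,8443 B,8444 C,304 C,404 M,5502 M,5511 | collapse
```

Unlike datamash -g, this version does not need equal keys to be adjacent, so it would be used as sort -u testfile | collapse only to remove duplicates and get sorted output.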

Now for my_filter:

$ cat my_filter
#!/bin/bash
_longest_match () {
  if ((${#1}>${#2})); then
    long="$1" short="$2"
  else
    long="$2" short="$1"
  fi

  lshort=${#short}
  score=0
  for ((l=score+1;l<=lshort;++l)); do
    sub="${short:0:l}"

    [[ $long != $sub* ]] && break
    subfound="$sub" score="$l"
  done

  if ((score)); then
    printf '%s\n' "$subfound"
  fi
} # ----------  end of function _longest_match  ----------


_output () {
  for item in $(echo "$@"|tr ' ' '\n' | sort -u) ; do
    printf '%s,%s\n' "$key" "$item"
  done
} # ----------  end of function _output  ----------

declare -A matches
declare -A no_matches

key=$1
shift

for item in $( printf '%s\n' "$@"| sort -nr ); do
  if [ -z "$one" ]; then
    one=$1
    two=${2:-$1}
    shift 2
  else
    two=$1
    shift
  fi

  three=$(_longest_match $one $two)

  [ ${#three} -gt 0 ] && matches[$key]+="$three " || no_matches[$key]+="$one $two "
  [ ${#three} -gt 0 ] && one="$three" || one="$two"
done

  _output "${matches[@]} ${no_matches[@]}" | sort -u

_longest_match was inspired by https://stackoverflow.com/a/23297950

I ran some additional tests with duplicate entries in the test file:

$ cat testfile.new 
A,3300
B,8440
B,8440
U,3
U,7
U,7
U,73
B,8440
B,8443
B,8444
B,976
C,304
C,404
M,5502
M,5511

The result is:

$ sort -u testfile.new | datamash -t, -g1 collapse 2  \
| tr ',' ' ' | while read line ; do ./my_filter $line ; done
A,3300
B,844
B,976
C,304
C,404
M,55
U,3
U,7

Does this look like the result you expected?

Answer 3

Using awk:

BEGIN { OFS=FS="," }

prefixlength[$1] == "" {
        # First time seeing this label.
        # Remember the full string and its length.

        prefix[$1] = $2
        prefixlength[$1] = length($2)
        next
}

{
        # Compare the current string to the (current) longest
        # prefix related to this label. Update the prefix length
        # to the longest common prefix length.

        for (i = 1; i <= prefixlength[$1]; ++i)
                if (substr(prefix[$1],i,1) != substr($2,i,1)) {
                        prefixlength[$1] = i-1
                        break
                }
}

END {
        # Output labels and their longest prefix.

        for (i in prefix)
                print i, substr(prefix[i],1,prefixlength[i])
}

For the given input, this produces the following:

$ awk -f script file
A,3300
B,844
C,
M,55

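Since this shows an empty prefix when the computed longest prefix length is zero, you may want to modify the code slightly if you need all the strings to be shown in that special case: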

BEGIN { OFS=FS="," }

prefixlength[$1] == "" {
        # First time seeing this label.
        # Remember the full string and its length.

        prefix[$1] = $2
        prefixlength[$1] = length($2)
        next
}

{
        # Remember all found strings in a sort of "array". The
        # strings added after the first will only ever be used
        # if the prefix length ends up as zero.

        prefix[$1] = prefix[$1] SUBSEP $2
}

{
        # Compare the current string to the (current) longest
        # prefix related to this label. Update the prefix length
        # to the longest common prefix length.

        for (i = 1; i <= prefixlength[$1]; ++i)
                if (substr(prefix[$1],i,1) != substr($2,i,1)) {
                        prefixlength[$1] = i-1
                        break
                }
}

END {
        # Output labels and their longest prefix. If the prefix length
        # is zero for a label, output all collected strings as separate
        # lines.

        for (i in prefix)
                if (prefixlength[i] > 0)
                        print i, substr(prefix[i],1,prefixlength[i])
                else {
                        n = split(prefix[i],a,SUBSEP)
                        for (j = 1; j <= n; ++j)
                                print i, a[j]
                }
}

Testing:

$ awk -f script file
A,3300
B,844
C,304
C,404
M,55
