awk script to count occurrences of negative values in columns and list the associated row names

Question 1

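This reads the column names from the first line rather than hard-coding them. If you can remove the extra spaces in the first line, the output will look nicer.

Edit: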

#!/usr/bin/awk -f
# The arrays are
# name, indexed by column number, the names of the columns taken from the first line.
# cl, indexed by the column name, the list of countries for which
#    this column is negative.
# cnt, indexed by column name, the count of the number of countries.
BEGIN { FS="," }
NR==1 { for(i=2;i<=NF;i++) { name[i]=$i } ; next }
{
    # loop over the columns
    for(i=2;i<=NF;i++) {
        # get the value of the column as a number
        v=$i+0
        # move on to the next column if the value is non negative.
        if (v>=0) continue;
        # get the name of the column
        n=name[i]
        # increment the count and add the country onto the list,
        # putting the separator only between names (no trailing ", ")
        cnt[n]++
        cl[n] = (cnt[n] > 1 ? cl[n] ", " : "") $1
    }
}
END { # At the end, loop over the results.
      for (i in name) {
        # get the column name
        n=name[i]
        # print out the saved data
        printf("%d %s, %s\n",cnt[n]+0, n, cl[n]); }}

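Note that the order of the output is not well defined, because awk does not specify the order in which for (i in name) visits the array.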
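If a fixed order matters, one possible variant is to index everything by column number and loop from 2 to the number of columns seen on the header line. The following is only a sketch under a few assumptions: the function name count_negatives is made up for this illustration, the field separator ' *, *' is used to swallow the spaces around the commas, and the sample data is reconstructed from the transposed listing in the second answer (the spacing after the commas is a guess).

```shell
#!/bin/sh
# Hypothetical helper: read CSV on standard input, count the negative
# values in each column and list the row names, in column order.
count_negatives() {
    awk -F ' *, *' '
        # header line: remember the column names and how many there are
        NR == 1 { ncols = NF; for (i = 2; i <= NF; i++) name[i] = $i; next }
        {
            for (i = 2; i <= NF; i++)
                if ($i + 0 < 0)
                    # cnt[i]++ yields the old count, so the separator
                    # is only added from the second name onwards
                    cl[i] = cl[i] (cnt[i]++ ? ", " : "") $1
        }
        END {
            # a numeric loop gives a defined order, unlike for (i in name)
            for (i = 2; i <= ncols; i++)
                printf "%d %s: %s\n", cnt[i] + 0, name[i], cl[i]
        }'
}

# Sample data (reconstructed, spacing assumed)
count_negatives <<'EOF'
Country, COL2, COL3, COL4, COL5
Poland, -0.3, 0, 2, -0.5
Canada, -1, 1, 1, -0.4
Italy, 7, -5, 3, -0.1
France, 1, 2, -0.5, 7
Portugal, 1, NULL, 4, 1
EOF
```

With that sample the sketch should print one line per column, in header order, starting with 2 COL2: Poland, Canada. Non-numeric values such as NULL evaluate to 0 in $i + 0 and are therefore not counted.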

In general, if someone asks for clarification of a question, it is helpful to provide it.

Question 2

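The following first removes all spaces around the commas using sed (this could be done more carefully using, e.g., csvformat -S if there were header fields containing embedded spaces, but it is enough for the data presented in the question). The pipeline then transposes the data using datamash, and finally outputs, for each row, the countries that have negative values.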

#!/bin/sh

sed 's/ *, */,/g' file |
datamash -t, transpose |
awk -F, '
    BEGIN { OFS = FS }
    NR == 1 { for (i = 2; i <= NF; ++i) h[i] = $i; next }
    {
        nf = split($0,a)
        $0 = a[1]

        for (i = 2; i <= nf; ++i)
            if (a[i] < 0) $(NF+1) = h[i]

        if (NF > 1) print NF-1, $0
    }'

The NR == 1 block in the awk code is executed only for the first line of input coming from datamash. The output of datamash will be

Country,Poland,Canada,Italy,France,Portugal
COL2,-0.3,-1,7,1,1
COL3,0,1,-5,2,NULL
COL4,2,1,3,-0.5,4
COL5,-0.5,-0.4,-0.1,7,1

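This means that the array h will contain the headers from the first line, which after the transposition are the country names.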
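In case datamash is not installed, the transposition step could also be sketched in plain awk. This is not part of the original pipeline: the function name transpose_csv is made up for this illustration, the input is assumed to be simple CSV without quoted fields, and the sample data is reconstructed from the transposed listing above (the spacing after the commas is a guess).

```shell
#!/bin/sh
# Hypothetical stand-in for: sed ... | datamash -t, transpose
# Reads simple comma-separated data on standard input.
transpose_csv() {
    sed 's/ *, */,/g' |
    awk -F, '
        # store every cell, keyed by (row, column)
        {
            for (i = 1; i <= NF; i++) cell[NR, i] = $i
            if (NF > maxnf) maxnf = NF
        }
        END {
            # each input column becomes one output line
            for (i = 1; i <= maxnf; i++) {
                line = cell[1, i]
                for (r = 2; r <= NR; r++) line = line "," cell[r, i]
                print line
            }
        }'
}

# Sample data (reconstructed, spacing assumed)
transpose_csv <<'EOF'
Country, COL2, COL3, COL4, COL5
Poland, -0.3, 0, 2, -0.5
Canada, -1, 1, 1, -0.4
Italy, 7, -5, 3, -0.1
France, 1, 2, -0.5, 7
Portugal, 1, NULL, 4, 1
EOF
```

Unlike datamash, this sketch holds the whole input in memory and does not complain about rows of differing length; missing cells simply come out empty.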

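For all other lines of input from datamash, we create an output record consisting of the countries for which the corresponding number is negative. To do this, we split the input line on commas into the array a, and then reset the current record, $0, to a[1], which is one of the COL strings. We then loop over the other entries of a, and whenever we find a number less than zero, we append the corresponding header from h to the current record.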

We then print the number of fields in the current record (minus one, to account for the COL string) together with the record itself.
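The record-rebuilding trick described above can be seen in isolation in the following small sketch. The function name demo_append and the fixed value Poland are only illustrative; the point is how NF and $0 change when $0 is reassigned and when a field past NF is assigned.

```shell
#!/bin/sh
# Hypothetical demo of resetting $0 and appending a field with $(NF+1).
demo_append() {
    awk -F, '
        BEGIN { OFS = FS }
        {
            nf = split($0, a)    # keep a copy of the original fields in a
            $0 = a[1]            # the record is now only the first field, so NF is 1
            printf "after reset: NF=%d record=%s\n", NF, $0
            $(NF+1) = "Poland"   # assigning past NF adds a field and rebuilds $0 using OFS
            printf "after append: NF=%d record=%s\n", NF, $0
        }'
}

echo 'COL2,-0.3,-1,7,1,1' | demo_append
```

Because OFS is set to a comma in the BEGIN block, the rebuilt record stays comma-separated, which is why the real script can print $0 directly.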

Running the script on the data in the question gives:

2,COL2,Poland,Canada
1,COL3,Italy
1,COL4,France
3,COL5,Poland,Canada,Italy

You could change the print at the end into

printf "%d\t%s\n", NF-1, $0

...if you want the first column to be separated from the others by a tab:

2       COL2,Poland,Canada
1       COL3,Italy
1       COL4,France
3       COL5,Poland,Canada,Italy

With the following input,

COUNTRY NAME, SOCIAL SUPPORT, FREEDOM TO MAKE LIFE CHOICES, GENEROSITY, PERCEPTIONS OF CORRUPTION, POSITIVE AFFECT, NEGATIVE AFFECT, CONFIDENCE IN NATIONAL GOVERNMENT, DEMOCRATIC QUALITY, DELIVERY QUALITY
Afghanistan, 0.49, NULL, -0.11, 0.95, 0.49, 0.37, -0.26, -1.88, -1.43
Albania, 0.63, NULL, -0.03, 0.87, 0.66, 0.33, -0.45, 0.29, -0.13
Algeria, 0.80, NULL, -0.19, 0.69, 0.64, 0.34, 0.24, -0.92, -0.81
Argentina, 0.90, NULL, -0.18, 0.84, 0.80, 0.29, 0.30, 0.35, 0.15

...the script produces

4,GENEROSITY,Afghanistan,Albania,Algeria,Argentina
2,CONFIDENCE IN NATIONAL GOVERNMENT,Afghanistan,Albania
2,DEMOCRATIC QUALITY,Afghanistan,Algeria
3,DELIVERY QUALITY,Afghanistan,Albania,Algeria

Answer