How can I use awk to group a flat list into categories and create a tab-separated values file?

Cross-posted: https://www.nixcraft.com/t/converting-a-list-into-a-tab-separated-file-grouped-by-values/4517

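I have a text file containing a list of values. The goal is to create a tab-separated values file, which I have already managed. After that, I want to group the values by category. Here is a sample snippet of my list: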

|01BFRUITS|
^banana
^apple
^orange
^pear
|01AELECTRONICS|
^television
^radio
^dishwasher
^computer
|01AANIMAL|
^bear
^cat
^dog
^elephant
|01ASHAPE|
^circle
^square
^diamond
^star

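Lines that start with a pipe (|) can be treated as headers, and lines that start with a caret (^) are the values that belong to the header above them.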

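So, after a lot of googling, reading man pages, and asking for help on forums, I managed to come up with two commands. Each one produces different output.

First command: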

cat test.txt | awk -v OFS='\t' '/^\|/{ c1=$0; gsub(/\|/,"",c1) } /^\^/{ c2=$0; sub(/^\^/,"",c2); print c1,OFS,c2 }' | sed -z s/\\r\\t\\t//g

Which produces:

01BFRUITS       banana
01BFRUITS       apple
01BFRUITS       orange
01BFRUITS       pear
01AELECTRONICS  television
01AELECTRONICS  radio
01AELECTRONICS  dishwasher
01AELECTRONICS  computer
01AANIMAL       bear
01AANIMAL       cat
01AANIMAL       dog
01AANIMAL       elephant
01ASHAPE        circle
01ASHAPE        square
01ASHAPE        diamond
01ASHAPE        star

The second command is:

cat test.txt | sed -z 's/\r\n\^/\t/g' | tr -d '|'

Which produces:

01BFRUITS       banana  apple   orange  pear
01AELECTRONICS  television      radio   dishwasher      computer
01AANIMAL       bear    cat     dog     elephant
01ASHAPE        circle  square  diamond star

In that test run, my list contained only unique values. My new list has duplicates, like this:

|01BFRUITS|
^banana
^apple
^orange
^pear
^banana
^apple
^orange
^pear
|01AELECTRONICS|
^television
^radio
^dishwasher
^computer
^television
^radio
^dishwasher
^computer
^television
^radio
^dishwasher
^computer
|01AANIMAL|
^bear
^cat
^dog
^elephant
^bear
^cat
^dog
^elephant
^bear
^cat
^dog
^elephant
^bear
^cat
^dog
^elephant
|01ASHAPE|
^circle
^square
^diamond
^star
^circle
^square
^diamond
^star
^circle
^square
^diamond
^star
^circle
^square
^diamond
^star
^circle
^square
^diamond
^star

My expected output looks like this:

01BFRUITS       banana      banana
01BFRUITS       apple       apple
01BFRUITS       orange      orange
01BFRUITS       pear        pear
01AELECTRONICS  television  television  television
01AELECTRONICS  radio       radio       radio
01AELECTRONICS  dishwasher  dishwasher  dishwasher
01AELECTRONICS  computer    computer    computer
01AANIMAL       bear        bear        bear        bear
01AANIMAL       cat         cat         cat         cat
01AANIMAL       dog         dog         dog         dog
01AANIMAL       elephant    elephant    elephant    elephant
01ASHAPE        circle      circle      circle      circle      circle
01ASHAPE        square      square      square      square      square
01ASHAPE        diamond     diamond     diamond     diamond     diamond
01ASHAPE        star        star        star        star        star

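My goal is to group all identical values together on one line while keeping the header in the left-hand column. I don't know how to do this with awk, sed, or tr. I did find a way to do it in Excel, but it eats up the processing power of my old computer, which is annoying. I think the CLI would speed things up considerably.

So the question is: can this be done in a Linux shell? If so, how?

Answer 1

If you need to remove duplicate lines from a file formatted like this: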

01BFRUITS       banana
01BFRUITS       apple
01BFRUITS       orange
01BFRUITS       pear
01AELECTRONICS  television
01AELECTRONICS  radio
01AELECTRONICS  dishwasher
01AELECTRONICS  computer
01AANIMAL       bear
01AANIMAL       cat
01AANIMAL       dog
01AANIMAL       elephant
01ASHAPE        circle
01ASHAPE        square
01ASHAPE        diamond
01ASHAPE        star

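you can simply use cat list.txt | sort | uniq or, if the entries are already sorted, just cat list.txt | uniq (uniq only collapses duplicate lines that are adjacent, which is why unsorted input needs the sort first).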
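If the original line order matters, sorting first will scramble it. A common awk idiom removes duplicates while keeping the first occurrence of each line in place. The sketch below is only an illustration: the temporary file and its contents are made-up sample data, not the asker's real list.

```shell
# Hypothetical sample: "header<TAB>value" lines containing repeats.
list=$(mktemp)
printf '01BFRUITS\tbanana\n01BFRUITS\tapple\n01BFRUITS\tbanana\n01AANIMAL\tbear\n01AANIMAL\tbear\n' > "$list"

# seen[$0]++ evaluates to 0 (false) the first time a line appears, so
# !seen[$0]++ is true only for first occurrences; awk's default action
# then prints those lines in their original order.
deduped=$(awk '!seen[$0]++' "$list")
printf '%s\n' "$deduped"

rm -f "$list"
```

Unlike sort | uniq, this keeps banana ahead of apple, at the cost of holding every distinct line in memory.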

Answer 2

I was able to generate the following with ChatGPT:

cat test.txt | sed -z 's/\r\n\^/,/g' | tr -d '|' | awk -F, "{ key = \$1; for (i=2; i<=NF; i++) { values[key][\$i] = values[key][\$i]\",\"\$i } } END { for (key in values) { for (value in values[key]) { printf \"%s%s\n\", key, values[key][value] } } }"

The result is exactly what I expected. Impressive!

I did change the delimiter from a tab to a comma.

Edit:

Here is the breakdown ChatGPT provided:

# Set the field separator to a comma (,) for each line
awk -F, '{
    # Store the first field in a variable called "key"
    key = $1
    # For each subsequent field (starting from the 2nd), append its value to an array
    for (i=2; i<=NF; i++) {
        # Create an array called "values" that maps each key to an array of values
        # Concatenate the current value with any previously stored value(s) for this key, separated by a comma
        values[key][$i] = values[key][$i]","$i
    }
}
# After all lines are processed, iterate through the "values" array and output each key-value pair
END {
    # For each key in the "values" array
    for (key in values) {
        # For each value in the array associated with this key
        for (value in values[key]) {
            # Print the key and value (with any previously concatenated values)
            printf "%s%s\n", key, values[key][value]
        }
    }
}' input.txt
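
Two caveats about the script above: the nested values[key][$i] syntax is a GNU awk extension (arrays of arrays), so it will not run in other awk implementations, and the order in which for (key in values) visits elements is unspecified, so headers and values may come out in a different order than in the input. As a rough alternative, here is a sketch that reads the original pipe/caret list directly, works without arrays of arrays, keeps first-seen order, and writes tab-separated output. The function name group_values and the sample input are my own illustrative choices, and it assumes each header appears only once in the file.

```shell
# group_values: hypothetical helper name, not part of any tool.
# Reads the "|HEADER|" / "^value" list from files or stdin.
group_values() {
  awk -v OFS='\t' '
    # Emit one line per distinct value of the current header, with the
    # value repeated as many times as it occurred, then reset the state.
    function flush(   i, j, line) {
      for (i = 1; i <= n; i++) {
        line = hdr
        for (j = 1; j <= cnt[order[i]]; j++) line = line OFS order[i]
        print line
      }
      split("", cnt); split("", order); n = 0   # portable array clear
    }
    { sub(/\r$/, "") }                           # tolerate CRLF line endings
    /^\|/ { flush(); hdr = $0; gsub(/\|/, "", hdr); next }
    /^\^/ { v = substr($0, 2); if (!(v in cnt)) order[++n] = v; cnt[v]++ }
    END { flush() }
  ' "$@"
}

# Small made-up sample with CRLF endings, similar in shape to the question.
grouped=$(printf '|H1|\r\n^a\r\n^b\r\n^a\r\n|H2|\r\n^x\r\n' | group_values)
printf '%s\n' "$grouped"
```

With the file from the question, something like group_values test.txt > grouped.tsv should give one line per distinct value under each header.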
