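Cross-posted: https://www.nixcraft.com/t/converting-a-list-into-a-tab-separated-file-grouped-by-values/4517
I have a text file containing a list of values. The goal is to create a tab-separated values file, which I have managed to do. After that I want to group the values by category. Here is a sample snippet of my list: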
|01BFRUITS|
^banana
^apple
^orange
^pear
|01AELECTRONICS|
^television
^radio
^dishwasher
^computer
|01AANIMAL|
^bear
^cat
^dog
^elephant
|01ASHAPE|
^circle
^square
^diamond
^star
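Values that start with a pipe (|) can be treated as headers, and values that start with a caret (^) belong to the header above them.
So, after a lot of googling, reading man pages, and asking for help on forums, I managed to come up with two commands. The two commands produce different output.
The first command: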
cat test.txt | awk -v OFS='\t' '/^\|/{ c1=$0; gsub(/\|/,"",c1) } /^\^/{ c2=$0; sub(/^\^/,"",c2); print c1,OFS,c2 }' | sed -z s/\\r\\t\\t//g
Its output is:
01BFRUITS banana
01BFRUITS apple
01BFRUITS orange
01BFRUITS pear
01AELECTRONICS television
01AELECTRONICS radio
01AELECTRONICS dishwasher
01AELECTRONICS computer
01AANIMAL bear
01AANIMAL cat
01AANIMAL dog
01AANIMAL elephant
01ASHAPE circle
01ASHAPE square
01ASHAPE diamond
01ASHAPE star
The second command:
cat test.txt | sed -z 's/\r\n\^/\t/g' | tr -d '|'
Its output is:
01BFRUITS banana apple orange pear
01AELECTRONICS television radio dishwasher computer
01AANIMAL bear cat dog elephant
01ASHAPE circle square diamond star
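The \r in both commands suggests that test.txt has Windows-style (CRLF) line endings; that is an assumption, since the question does not say so. If it holds, removing the carriage returns up front makes the first command simpler, because the sed cleanup step is no longer needed. A minimal sketch, with inline sample data standing in for test.txt:

```shell
# Sample data with CRLF line endings (assumed; stands in for test.txt).
sample() { printf '|01BFRUITS|\r\n^banana\r\n^apple\r\n'; }

# Strip carriage returns, then pair every ^value with the latest |HEADER|.
pairs=$(sample | tr -d '\r' | awk -v OFS='\t' '
    /^\|/ { hdr = $0; gsub(/\|/, "", hdr); next }   # remember header, drop pipes
    /^\^/ { print hdr, substr($0, 2) }              # header <TAB> value
')
printf '%s\n' "$pairs"
```

On a real file, the sample function would be replaced by reading test.txt, for example tr -d '\r' < test.txt.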
My test list had only unique values. My new list has duplicates, like this:
|01BFRUITS|
^banana
^apple
^orange
^pear
^banana
^apple
^orange
^pear
|01AELECTRONICS|
^television
^radio
^dishwasher
^computer
^television
^radio
^dishwasher
^computer
^television
^radio
^dishwasher
^computer
|01AANIMAL|
^bear
^cat
^dog
^elephant
^bear
^cat
^dog
^elephant
^bear
^cat
^dog
^elephant
^bear
^cat
^dog
^elephant
|01ASHAPE|
^circle
^square
^diamond
^star
^circle
^square
^diamond
^star
^circle
^square
^diamond
^star
^circle
^square
^diamond
^star
^circle
^square
^diamond
^star
My desired output looks like this:
01BFRUITS banana banana
01BFRUITS apple apple
01BFRUITS orange orange
01BFRUITS pear pear
01AELECTRONICS television television television
01AELECTRONICS radio radio radio
01AELECTRONICS dishwasher dishwasher dishwasher
01AELECTRONICS computer computer computer
01AANIMAL bear bear bear bear
01AANIMAL cat cat cat cat
01AANIMAL dog dog dog dog
01AANIMAL elephant elephant elephant elephant
01ASHAPE circle circle circle circle circle
01ASHAPE square square square square square
01ASHAPE diamond diamond diamond diamond diamond
01ASHAPE star star star star star
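My aim is to group all identical values together on one row and keep the header on the left. I don't know how to do this with awk, sed, or tr. I did find a way to do it in Excel, but it eats up the processing power of my old computer, which is annoying. I think the CLI would speed things up considerably.
So the question is: can this be done in a Linux shell? If so, how?
Answer 1
If you need to remove duplicate lines from a file formatted like this: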
01BFRUITS banana
01BFRUITS apple
01BFRUITS orange
01BFRUITS pear
01AELECTRONICS television
01AELECTRONICS radio
01AELECTRONICS dishwasher
01AELECTRONICS computer
01AANIMAL bear
01AANIMAL cat
01AANIMAL dog
01AANIMAL elephant
01ASHAPE circle
01ASHAPE square
01ASHAPE diamond
01ASHAPE star
you can simply use
cat list.txt | sort | uniq
or, if the entries are already sorted, just
cat list.txt | uniq
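For reference, sort -u does the same job as sort | uniq in one step, and both reorder the lines. If the original order matters, the common awk idiom !seen[$0]++ keeps the first occurrence of each line without sorting. A small sketch on made-up input (not the asker's file):

```shell
# Made-up input with one repeated line.
demo_input() { printf 'b x\na y\nb x\n'; }

# Sorted and de-duplicated in one step (LC_ALL=C pins the sort order).
sorted=$(demo_input | LC_ALL=C sort -u)

# De-duplicated in input order: print a line only the first time it is seen.
kept=$(demo_input | awk '!seen[$0]++')

printf 'sort -u:\n%s\n' "$sorted"
printf 'awk:\n%s\n' "$kept"
```

Either way, repeats are dropped rather than collected onto one row.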
Answer 2
I was able to generate the following with ChatGPT:
cat test.txt | sed -z 's/\r\n\^/,/g' | tr -d '|' | awk -F, "{ key = \$1; for (i=2; i<=NF; i++) { values[key][\$i] = values[key][\$i]\",\"\$i } } END { for (key in values) { for (value in values[key]) { printf \"%s%s\n\", key, values[key][value] } } }"
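The output matches the expected result exactly. Impressive!
I did change the field separator from a tab to a comma.
Edit:
Here is the breakdown ChatGPT provided: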
# Set the field separator to a comma (,) for each line
awk -F, '{
    # Store the first field in a variable called "key"
    key = $1
    # For each subsequent field (starting from the 2nd), append its value to an array
    for (i=2; i<=NF; i++) {
        # Create an array called "values" that maps each key to an array of values
        # Concatenate the current value with any previously stored value(s) for this key, separated by a comma
        values[key][$i] = values[key][$i]","$i
    }
}
# After all lines are processed, iterate through the "values" array and output each key-value pair
END {
    # For each key in the "values" array
    for (key in values) {
        # For each value in the array associated with this key
        for (value in values[key]) {
            # Print the key and value (with any previously concatenated values)
            printf "%s%s\n", key, values[key][value]
        }
    }
}' input.txt
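Two caveats about the script above: the nested values[key][$i] syntax is a GNU awk extension (arrays of arrays), and for (key in values) visits keys in no guaranteed order, so rows may not come out in input order. A possible alternative, sketched below, should run in any POSIX awk, reads the original |HEADER| / ^value list directly, keeps first-seen order, and emits tabs. The names hdr, row, and order are illustrative, the inline sample stands in for test.txt, and CRLF line endings are assumed:

```shell
# Sample data with CRLF line endings (assumed; stands in for test.txt).
sample() {
    printf '|01BFRUITS|\r\n^banana\r\n^apple\r\n^banana\r\n'
    printf '|01ASHAPE|\r\n^star\r\n^star\r\n^star\r\n'
}

grouped=$(sample | awk -v OFS='\t' '
    { sub(/\r$/, "") }                              # drop a trailing CR, if any
    /^\|/ { hdr = $0; gsub(/\|/, "", hdr); next }   # remember header, drop pipes
    /^\^/ {
        val = substr($0, 2)                         # value without the caret
        key = hdr SUBSEP val                        # one row per header/value pair
        if (!(key in row)) {                        # first sighting: record order
            order[++n] = key
            row[key] = hdr
        }
        row[key] = row[key] OFS val                 # append one more copy
    }
    END { for (i = 1; i <= n; i++) print row[order[i]] }
')
printf '%s\n' "$grouped"
```

Each repeat of a value adds one more tab-separated copy to its row, which matches the layout of the desired output in the question.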