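I have a CSV file with the general format shown in the picture below.
In this CSV, several rows belong to a single item through one column (desc). I want to extract those sub-row items and put them into new columns named name, size, weight and glass, respectively. I have highlighted (in red) the sub-row items of each entry.
Original structure: [image]
Expected structure: [image]
Original CSV: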
page,item,id,desc,price
1,2,F1,Alpha,111.11
1,,,380 x 2280 mm (size),
1,,,Weight: 33.0 kg,
1,,,Glass: AAA,
1,,,XXX,
1,3,F2,Beta,222.22
1,,,880 x 2280 mm (size),
1,,,Weight: 68.4 kg,
1,,,Glass: BBB,
1,,,YYY,
1,4,F3,Gamma,333.33
1,,,1980 x 1580 mm (size),
1,,,Weight: 78.2 kg,
1,,,Glass: CCC,
1,,,XXX,
1,,,YYY,
1,,,ZZZ,
Expected resulting CSV:
page,item,id,name,size,weight,glass,price
1,2,F1,Alpha,380 x 2280,33.0,AAA,111.11
1,3,F2,Beta,880 x 2280,68.4,BBB,222.22
1,4,F3,Gamma,1980 x 1580,78.2,CCC,333.33
where name takes the place of the first desc row of each item.
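Update:
Under some conditions, some of the Awk solutions may work for the data above but fail when a fourth item is added. For a full test, consider appending this to the CSV above: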
1,7,F4,Delta,111.11
1,,,11 x 22 mm (size),
1,,,Weight: 33.0 kg,
1,,,Glass: DDD,
1,,,Random-1,
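So, 3 important points:
- The number of sub-rows in the desc column can vary.
- Any sub-rows after Glass:... should be ignored.
- There may be items without any sub-rows in the desc column; they should also be ignored.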
Q: How can I remap those sub-rows into new columns using awk?
(Or is there a more suitable tool for doing this in bash?)
Answer 1
awk 'BEGIN{ FS=OFS=","; print "page,item,id,name,size,weight,glass,price" }
$2!=""{ price=$5; data=$1 FS $2 FS $3 FS $4; desc=""; c=0; next }
{ gsub(/ ?(mm \(size\)|Weight:|kg|Glass:) ?/, "") }
++c<=3{ desc=(desc==""?"":desc OFS) $4; next }
data { print data, desc, price; data="" }
' infile
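To see what the gsub() rule in the command above does on its own, the following small sketch applies just that rule to three sub-rows taken from the question. The function name strip_labels is made up for illustration and is not part of the answer.

```shell
# strip_labels is a hypothetical helper that wraps only the gsub() rule.
strip_labels() {
  # gsub() without a third argument works on the whole line ($0);
  # the trailing 1 is an always-true pattern that prints the result.
  awk '{ gsub(/ ?(mm \(size\)|Weight:|kg|Glass:) ?/, "") } 1'
}

printf '%s\n' \
  '1,,,380 x 2280 mm (size),' \
  '1,,,Weight: 33.0 kg,' \
  '1,,,Glass: AAA,' | strip_labels
```

This prints 1,,,380 x 2280, then 1,,,33.0, then 1,,,AAA, (each with the trailing comma of the empty price column). Because the substitution is not limited to the label positions, a value that happens to contain kg, for example in a Glass: row, would be shortened as well.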
With explanation:
awk 'BEGIN{ FS=OFS=","; print "page,item,id,name,size,weight,glass,price" }
#this block is executed only once, before any line is read, and it:
#sets FS (Field Separator) and OFS (Output Field Separator) to a comma
#prints the new "header" line
$2!=""{ price=$5; data=$1 FS $2 FS $3 FS $4; desc=""; c=0; next }
#this block is executed only when column#2 is not empty (an item row), and it:
#saves column#5 in the "price" variable
#saves columns#1~4 in the "data" variable
#resets the "desc" variable and the counter variable "c"
#then skips the rest of the code and reads the next line
#(the input header row also matches here, but the first item row overwrites it before anything is printed)
{ gsub(/ ?(mm \(size\)|Weight:|kg|Glass:) ?/, "") }
#this block runs for every remaining line (the sub-rows) and deletes the strings above from the whole line
++c<=3{ desc=(desc==""?"":desc OFS) $4; next }
#this block runs only for the first 3 sub-rows of an item:
#it appends column#4 to "desc", comma-separated, and reads the next line
data { print data, desc, price; data="" }
#this block is reached from the 4th sub-row on: if "data" is not empty,
#print data, desc and price, then empty "data" so that the item is printed only once
' infile
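The command above prints an item only when it reaches a fourth sub-row, so an item with exactly three sub-rows (size, weight, glass and nothing after) would be dropped. The data in the question always has at least one extra sub-row, so this does not show up there. As a hedged alternative, here is a minimal sketch that matches the sub-rows by their labels and prints as soon as the Glass: row is seen. It assumes that every item to be kept has a Glass: row and that the labels look exactly like in the sample; the function name remap and the rows F5 and F6 are made up for illustration.

```shell
# remap is a hypothetical helper name; it reads the files given as
# arguments, or standard input when there are none.
remap() {
  awk '
    BEGIN { FS = OFS = ","; print "page,item,id,name,size,weight,glass,price" }
    NR == 1 { next }                 # skip the input header row
    $2 != "" {                       # item row: remember it and start collecting
      head = $1 OFS $2 OFS $3 OFS $4; price = $5
      size = weight = glass = ""; pending = 1
      next
    }
    !pending { next }                # sub-rows after Glass: are ignored
    $4 ~ /mm \(size\)$/ { size = $4; sub(/ *mm \(size\)$/, "", size); next }
    $4 ~ /^Weight:/ { weight = $4; sub(/^Weight: */, "", weight); sub(/ *kg$/, "", weight); next }
    $4 ~ /^Glass:/ {                 # last wanted sub-row: print the item once
      glass = $4; sub(/^Glass: */, "", glass)
      print head, size, weight, glass, price
      pending = 0
    }
  ' "$@"
}

# F5 (exactly three sub-rows) and F6 (no sub-rows) are made-up test rows.
sample='page,item,id,desc,price
1,2,F1,Alpha,111.11
1,,,380 x 2280 mm (size),
1,,,Weight: 33.0 kg,
1,,,Glass: AAA,
1,,,XXX,
1,5,F5,Epsilon,555.55
1,,,10 x 20 mm (size),
1,,,Weight: 1.5 kg,
1,,,Glass: EEE,
1,6,F6,Zeta,666.66
1,7,F4,Delta,111.11
1,,,11 x 22 mm (size),
1,,,Weight: 33.0 kg,
1,,,Glass: DDD,
1,,,Random-1,'

printf '%s\n' "$sample" | remap
```

With this sample the sketch prints the header followed by the F1, F5 and F4 rows; F6 is skipped because it has no sub-rows. On a real file it would be called as remap infile.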