我有一个.csv
这样的文件:
IMSI,MSISDN,SUBCATEGY,EPCMMEHOST,EPCMMEREALM,EPCROAMSCH,PSROAMSCH,OINR,MSCNUMBER,VLRNUMBER,SGSNNUMBER,EPCAPNNAME,EPCAPNAMBRUP,EPCAPNAMBRDOWN,EPCAMBRUP,EPCAMBRDOWN,CHARGE,EPCCHARGCHRT
NAN,NAN,2,,,0,,0,,,,OLAP$DOLAP,200000000$2048000,400000000$2048000,200000000,400000000,5,5
我想分割包含$
符号的所有字段,以便该符号之前的部分保留在原始行上,而之后的部分则转移到直接在下面的新创建的行中,其中文件的其他字段保持与以下相同的值在最初找到分割字符的行上。如果一行中的多个字段包含分割分隔符,则将所有字段的“第二”部分传输到同一新创建的行。
在上面的示例中,输出应如下所示:
IMSI,MSISDN,SUBCATEGY,EPCMMEHOST,EPCMMEREALM,EPCROAMSCH,PSROAMSCH,OINR,MSCNUMBER,VLRNUMBER,SGSNNUMBER,EPCAPNNAME,EPCAPNAMBRUP,EPCAPNAMBRDOWN,EPCAMBRUP,EPCAMBRDOWN,CHARGE,EPCCHARGCHRT
NAN,NAN,2,,,0,,0,,,,OLAP,200000000,400000000,200000000,400000000,5,5
NAN,NAN,2,,,0,,0,,,,DOLAP,2048000,2048000,200000000,400000000,5,5
答案1
我会假设@Paul_Pedant提到的观点确实是一个错字,并且该OLAP$DOLAP
字段也将被分割。在这种情况下,以下awk
程序(我们称之为splitfields.awk
)可能会起作用:
#/usr/bin/awk -f
{
copy=0;
for (i=1;i<=NF;i++)
{
j=index($i,"$");
if (j>0)
{
fields[i]=substr($i,j+1);
$i=substr($i,1,j-1);
copy=1
}
else fields[i]=$i;
}
print;
if (copy==1)
{
for (i=1;i<NF;i++) printf("%s%s",fields[i],OFS);
printf("%s%s",fields[NF],ORS)
}
}
你可以通过调用来使用它
user@host~$ awk -F, -v OFS="," -f splitfields.awk input.csv
该脚本检查每个字段中的$
符号,并在数组中复制该字段fields
以供进一步使用。如果$
找到,则相应的字段当前的线减少到部分前当数组$
中相应的条目fields
被填充时后这$
。
进行此检查后,无论如何都会打印(可能已修改的)当前行。此外,如果任何一个字段包含该符号,则会以 CSV 格式打印$
由数组内容组成的新行。fields
答案2
$ cat tst.awk
BEGIN { FS=OFS="," }
{
delete subFlds
numSubFlds = 0
for (i=1; i<=NF; i++) {
n = split($i,parts,/[$]/)
for (subFldNr=1; subFldNr<=n; subFldNr++) {
subFlds[i,subFldNr] = parts[subFldNr]
}
numSubFlds = (numSubFlds > n ? numSubFlds : n)
}
delete prev
for (subFldNr=1; subFldNr<=numSubFlds; subFldNr++) {
for (i=1; i<=NF; i++) {
curr = ( (i,subFldNr) in subFlds ? subFlds[i,subFldNr] : prev[i] )
printf "%s%s", curr, (i<NF ? OFS : ORS)
prev[i] = curr
}
}
}
。
$ awk -f tst.awk file
IMSI,MSISDN,SUBCATEGY,EPCMMEHOST,EPCMMEREALM,EPCROAMSCH,PSROAMSCH,OINR,MSCNUMBER,VLRNUMBER,SGSNNUMBER,EPCAPNNAME,EPCAPNAMBRUP,EPCAPNAMBRDOWN,EPCAMBRUP,EPCAMBRDOWN,CHARGE,EPCCHARGCHRT
NAN,NAN,2,,,0,,0,,,,OLAP,200000000,400000000,200000000,400000000,5,5
NAN,NAN,2,,,0,,0,,,,DOLAP,2048000,2048000,200000000,400000000,5,5
答案3
awk -F, -v OFS=, '{split("",q);for(i=j=1;i<=NF;i++){if(split($i,a,/\$/)>1){j++;$i=a[1];q[i]=a[2]}}print;if(j>1){for(i in q)$i=q[i];print}}' file
或者,扩展:
awk -F, -v OFS=, '{
split("", q) # clear the q array; delete q is non-standard
for(i = j = 1; i <= NF; i++){
if(split($i, a, /\$/) > 1){ # if the field contains a $
j++ # increase the cnt of fields with a $
$i = a[1] # set the field to the part before the $
q[i] = a[2] # save the part after the $ in the q array
}
}
print # print the line with the parts before $
if(j > 1){ # if any field had a $
for(i in q) # for each field which had a $
$i = q[i] # replace it with the part after $
print # re-print the line with the parts after $
}
}' file
答案4
如果我理解正确,某些字段是由 分隔的“双”变体字段$
,您只需要这些行的两个副本,其中第一行包含第一个变体,第二行包含第二个变体....
awk '{v1=$0; gsub(/\$[^,]+/,""); gsub (/,[^,]+\$/,",",v1); print $0; if (v1!=$0) print v1}' file1
IMSI,MSISDN,SUBCATEGY,EPCMMEHOST,EPCMMEREALM,EPCROAMSCH,PSROAMSCH,OINR,MSCNUMBER,VLRNUMBER,SGSNNUMBER,EPCAPNNAME,EPCAPNAMBRUP,EPCAPNAMBRDOWN,EPCAMBRUP,EPCAMBRDOWN,CHARGE,EPCCHARGCHRT
NAN,NAN,2,,,0,,0,,,,OLAP,200000000,400000000,200000000,400000000,5,5
NAN,NAN,2,,,0,,0,,,,DOLAP,2048000,2048000,200000000,400000000,5,5